Statistics Denmark and large health data: We don’t know what we’re doing

Luke W. Johnston

September 25, 2025

Who am I? 👋

History

  • MSc and PhD in Nutritional Science in Toronto, Canada

  • (Previous) Research in diabetes epidemiology

  • Team leader at SDCA for the Seedcase Project, an NNF-funded software project to simplify building FAIR data resources

My work on large data

  • ukbAid: R package and website

  • DARTER Project: Website with the application to, and documentation for, a DST project

Rationale for this talk

We are woefully behind on data engineering and programming practices

This limits how effectively and accurately we can do research

This is important because:

  • Validity of results

  • Speed impacts time to results

  • Ability to do more complex analysis with more data

  • Resources (which cost money)

Aim of this talk

  1. Highlight issues with Statistics Denmark and the research using the registers (and any other large data)

  2. Spread awareness of how critical programming skills are, especially for large data

  3. Showcase a few tools for doing research faster, so you can focus on doing science

General “roadmap” 🗺️

  • Issues with DST

  • Need for dedicated programming expertise

  • Questionable reproducibility and validity

Statistics Denmark, the good and the bad

The good: Amazing resource, gold mine of data

And the bad…

Everyone works on the same server

  • No queue system for analyses

  • Your analysis can crash if others use too much memory

Everyone works in the same folder in a project

project/
├── luke/
│   └── analysis/
└── omar/
    └── paper/

Collaboration is difficult:

  • Anyone can edit anything
  • Can’t know who changed what and when (no “version control”)
  • Not easy to review and improve others’ code

Data is stored in a proprietary SAS format

For example, BEF register:

bef2018.sas7bdat
bef2019.sas7bdat
bef2020.sas7bdat
bef2021.sas7bdat
bef2022.sas7bdat

Takes many minutes to load one year of data (in R)
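As a rough sketch of that load (assuming the haven package; the object name is hypothetical), reading one year of BEF from the SAS format looks like this:

```r
# Read one year of the BEF register from the proprietary SAS format.
# This single call can take several minutes for a full register year.
library(haven)

bef_2022 <- haven::read_sas("bef2022.sas7bdat")
```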

Data updates make more work for us

bef2021.sas7bdat
bef2022.sas7bdat
December_2023/bef2022.sas7bdat
December_2023/bef2023.sas7bdat

Can you see the issue?

Metadata is confusing and poorly documented

  • Variables are not consistent across years

  • Finding the metadata is difficult

DST is either unaware of or indifferent to improving things

Highlights lack of engineering and design expertise at DST

  • Puts tech burden onto researchers

Need programming expertise, especially for large data

Two tools as examples: Parquet and DuckDB

Parquet should be used to store large data

Most data formats are row-based, like CSV or SAS. Newer formats tend to be column-based, like Parquet.

Row-based

name,sex,age
Tim,M,30
Jenny,F,25

Column-based

name,Tim,Jenny
sex,M,F
age,30,25

Column-based storage has many advantages

Compression

name,Tim,Sam,Jenny
sex,M,F,F
age,30,30,25

…becomes…

name,Tim,Sam,Jenny
sex,M,F{2}
age,30{2},25

Loading

  • Computers read files line by line
  • In a column-based file, each line holds a single data type
  • Only the needed columns are read

sex,M,F
age,30,25

Parquet is 50-75% smaller than other formats

File sizes of the BEF register for 2017 in CSV, Parquet, Stata, and SAS formats.

File type           Size
SAS (.sas7bdat)     1.45 GB
CSV (.csv)          ~90% of SAS
Stata (.dta)        745 MB
Parquet (.parquet)  398 MB

Can partition data by a value (e.g. year) and load all at once

bef/year=2018/part-0.parquet
bef/year=2019/part-0.parquet
bef/year=2020/part-0.parquet
bef/year=2021/part-0.parquet
bef/year=2022/part-0.parquet

Load in R with arrow package:

bef <- arrow::open_dataset("bef")

Loads all years in < 1 second, compared to ~5 minutes to load one year from the SAS format
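Once opened, the dataset can be queried lazily. As a sketch (assuming the dplyr package and the `year` partition shown above; the `PNR` and `KOEN` column names are assumptions), only the matching files and columns are read:

```r
library(arrow)
library(dplyr)

# Open the partitioned Parquet dataset; nothing is read into memory yet.
bef <- arrow::open_dataset("bef")

# Only the 2022 partition and the selected columns are actually read
# from disk when collect() runs the query.
bef |>
  filter(year == 2022) |>
  select(PNR, KOEN) |>
  collect()
```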

DuckDB is a recent SQL engine designed for analytical queries

SQL, or Structured Query Language, is a language for managing and querying databases

DuckDB is impressively fast

  • Faster than almost all other tools

    • Relatively complex queries (joins, group by, aggregates) on 55 GB take < 7.5 min

    • Generally, simpler queries take < 10 seconds for massive datasets

  • Easily connects with Parquet datasets

Example in DST

  1. Load all 45 years of BEF
  2. Drop all missing PNR
  3. Group by year
  4. Count by sex

Takes < 6 seconds
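A minimal sketch of those four steps, assuming the DBI and duckdb packages and a Parquet-partitioned BEF dataset like the one shown earlier (the `PNR` and `KOEN` column names are assumptions):

```r
library(DBI)
library(duckdb)

# Connect to an in-memory DuckDB database.
con <- DBI::dbConnect(duckdb::duckdb())

# Read all years of BEF straight from the Parquet files, drop rows with
# a missing PNR, then count by year and sex.
counts <- DBI::dbGetQuery(con, "
  SELECT year, KOEN AS sex, COUNT(*) AS n
  FROM read_parquet('bef/*/*.parquet', hive_partitioning = true)
  WHERE PNR IS NOT NULL
  GROUP BY year, KOEN
  ORDER BY year
")

DBI::dbDisconnect(con, shutdown = TRUE)
```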

So… why is this important? Seems like easy stuff!

  • Because it highlights a lack of expertise in, and understanding of, some basic practices

Reproducibility and verification of research

Reproducibility: same data + same code = same results?

Non-reproducibility is a big, though mostly unknown, issue

DOI: 10.1093/gigascience/giad113

Analyzing, and reproducing, large data requires programming skills

  • With massive data, you really need to know how to code and program

  • But, researchers are not trained for this kind of skill

No one reviews code, so, no one knows if it’s correct

  • This problem is bigger than people realize.

“Science as amateur software development”

“But the code runs!”

Working in DST makes reproducibility harder

  • Hard to review code

  • Hard to collaborate

  • No version control

  • Proprietary data formats

  • No queue system

  • No training materials or courses

Science is not about trust, it’s about verification

It’s difficult to trust what researchers do in DST

What to do?

If reviewing papers, request or demand that the code be accessible

Recognize, value, and reward those with programming expertise

Pressure DST to improve things