Statistics Denmark and large health data: We don’t know what we’re doing

Luke W. Johnston

September 25, 2025

Who am I? 👋

History

  • MSc and PhD in Nutritional Science in Toronto, Canada

  • (Previous) Research in diabetes epidemiology

  • Team leader at SDCA for the Seedcase Project, an NNF-funded software project to simplify building FAIR data resources

My work on large data

  • ukbAid: R package and website

  • DARTER Project: Website with the application to, and documentation for, a DST project

Rationale for this talk

We are woefully behind on data engineering and programming practices

This limits how effectively and accurately we can do research

This is important because:

  • Validity of results

  • Speed impacts time to results

  • Ability to do more complex analysis with more data

  • Resources (which cost money)

Aim of this talk

  1. Highlight issues with Statistics Denmark and the research using the registers (and any other large data)

  2. Spread awareness of how critical programming skills are, especially for large data

  3. Showcase a few tools for doing research faster, so you can focus on doing science

General “roadmap” 🗺️

  • Issues with DST

  • Need for dedicated programming expertise

  • Questionable reproducibility and validity

Statistics Denmark, the good and the bad

The good: Amazing resource, gold mine of data

And the bad…

Everyone works on the same server

  • No queue system for analyses

  • Your analysis can crash if others use too much memory

Everyone works in the same folder in a project

project/
├── luke/
│   └── analysis/
└── omar/
    └── paper/

Collaboration is difficult:

  • Anyone can edit anything
  • Can’t know who changed what and when (no “version control”)
  • Not easy to review and improve others’ code

Data is stored in a proprietary SAS format

For example, BEF register:

bef2018.sas7bdat
bef2019.sas7bdat
bef2020.sas7bdat
bef2021.sas7bdat
bef2022.sas7bdat

Takes many minutes to load one year of data (in R)
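As a rough sketch of that load (assuming the haven package; the object name is hypothetical), reading one year of BEF from the SAS format looks like this:

```r
# Read one year of the BEF register from the proprietary SAS format.
# This single call can take several minutes for a full register year.
library(haven)

bef_2022 <- haven::read_sas("bef2022.sas7bdat")
```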

Data updates make more work for us

bef2021.sas7bdat
bef2022.sas7bdat
December_2023/bef2022.sas7bdat
December_2023/bef2023.sas7bdat

Can you see the issue?

Metadata is confusing and poorly documented

  • Variables are not consistent across years

  • Finding the metadata is difficult

DST is either unaware of or indifferent to improving things

Highlights lack of engineering and design expertise at DST

  • Puts tech burden onto researchers

Need programming expertise, especially for large data

Two tools as examples: Parquet and DuckDB

Parquet should be used to store large data

Most data formats are row-based, like CSV or SAS. Newer formats tend to be column-based, like Parquet.

Row-based

name,sex,age
Tim,M,30
Jenny,F,25

Column-based

name,Tim,Jenny
sex,M,F
age,30,25

Column-based storage has many advantages

Compression

name,Tim,Sam,Jenny
sex,M,F,F
age,30,30,25

…becomes…

name,Tim,Sam,Jenny
sex,M,F{2}
age,30{2},25

Loading

  • Computers read files line by line
  • In a column-based file, each line holds a single data type
  • Only the needed columns are read

sex,M,F
age,30,25

Parquet is 50-75% smaller than other formats

File sizes of the BEF register for 2017 in CSV, Parquet, Stata, and SAS formats.

File type           Size
SAS (.sas7bdat)     1.45 GB
CSV (.csv)          ~90% of SAS
Stata (.dta)        745 MB
Parquet (.parquet)  398 MB

Can partition data by a value (e.g. year) and load all at once

bef/year=2018/part-0.parquet
bef/year=2019/part-0.parquet
bef/year=2020/part-0.parquet
bef/year=2021/part-0.parquet
bef/year=2022/part-0.parquet

Load in R with arrow package:

bef <- arrow::open_dataset("bef")

Loads all years in < 1 second, compared to ~5 minutes to load one year from the SAS format
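Once opened, the dataset can be queried lazily. As a sketch (assuming the dplyr package and the `year` partition shown above; the `PNR` and `KOEN` column names are assumptions), only the matching files and columns are read:

```r
library(arrow)
library(dplyr)

# Open the partitioned Parquet dataset; nothing is read into memory yet.
bef <- arrow::open_dataset("bef")

# Only the 2022 partition and the selected columns are actually read
# from disk when collect() runs the query.
bef |>
  filter(year == 2022) |>
  select(PNR, KOEN) |>
  collect()
```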

DuckDB is a recent SQL engine designed for analytical queries

SQL, or Structured Query Language, is a language for managing and querying databases

DuckDB is impressively fast

  • Faster than almost all other tools

    • Relatively complex queries (joins, group by, aggregates) on 55 GB take < 7.5 min

    • Generally, simpler queries take < 10 seconds for massive datasets

  • Easily connects with Parquet datasets

Example in DST

  1. Load all 45 years of BEF
  2. Drop all missing PNR
  3. Group by year
  4. Count by sex

Takes < 6 seconds
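A minimal sketch of those four steps, assuming the DBI and duckdb packages and a Parquet-partitioned BEF dataset like the one shown earlier (the `PNR` and `KOEN` column names are assumptions):

```r
library(DBI)
library(duckdb)

# Connect to an in-memory DuckDB database.
con <- DBI::dbConnect(duckdb::duckdb())

# Read all years of BEF straight from the Parquet files, drop rows with
# a missing PNR, then count by year and sex.
counts <- DBI::dbGetQuery(con, "
  SELECT year, KOEN AS sex, COUNT(*) AS n
  FROM read_parquet('bef/*/*.parquet', hive_partitioning = true)
  WHERE PNR IS NOT NULL
  GROUP BY year, KOEN
  ORDER BY year
")

DBI::dbDisconnect(con, shutdown = TRUE)
```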

So… why is this important? Seems like easy stuff!

  • Because it highlights a lack of expertise in, and understanding of, some basic practices

Reproducibility and verification of research

Reproducibility: same data + same code = same results?

Non-reproducibility is a big, though mostly unknown, issue

DOI: 10.1093/gigascience/giad113

Analyzing, and reproducing, large data requires programming skills

  • With massive data, you really need to know how to code and program

  • But, researchers are not trained for this kind of skill

No one reviews code, so, no one knows if it’s correct

  • This problem is bigger than people realize.

“Science as amateur software development”

“But the code runs!”

Working in DST makes reproducibility harder

  • Hard to review code

  • Hard to collaborate

  • No version control

  • Proprietary data formats

  • No queue system

  • No training materials or courses

Science is not about trust, it’s about verification

It’s difficult to trust what researchers do in DST

What to do?

If reviewing papers, request or demand that the code be accessible

Recognize, value, and reward those with programming expertise

Pressure DST to improve things