Parquet, DuckDB, and fastreg for blazingly fast data analysis in the Danish Registers

Luke W. Johnston

June 23, 2026

Outline of talk

  • Rationale and aim
  • Parquet: A modern format for efficiently storing and loading data
  • DuckDB: Fast and powerful data processing and analysis
  • fastreg: Convert SAS to Parquet and quickly load registers

Rationale for this talk

Recent developments spark a push for better tooling

Big picture: Academia is woefully behind in using and knowing modern computing tools

This impacts our ability to do research effectively and productively

Working with massive data has its own challenges and needs

DST recently requires limiting RAM and resource usage

DST charges for storage space

Aim of this talk

  1. Spread knowledge of better practices in structuring and working with data.

  2. Showcase fastreg to improve how to work more effectively on DST.

Parquet file format

https://parquet.apache.org/

Parquet is a column-based data storage format

Most data formats are row-based, like CSV. Newer formats tend to be column-based.

Row-based

name,gender,age
Tim,Man,30
Sam,Woman,30
Jenny,Woman,25

Column-based

name,Tim,Sam,Jenny
gender,Man,Woman,Woman
age,30,30,25

Column-based storage has many advantages

Compression

name,Tim,Sam,Jenny
gender,Man,Woman,Woman
age,30,30,25

…becomes…

name,Tim,Jenny,Sam
gender,Man,Woman{2}
age,30{2},25

Loading

  • Computers read by lines
  • Per line = same data type
  • Only read needed columns
gender,Man,Woman,Woman
age,30,30,25

Parquet is 50-75% smaller than other formats

File size between CSV, Parquet, Stata, and SAS for bef register for 2017. Note: SAS files on DST are bigger than normal.
File type Size (MB)
SAS (.sas7bdat) 1.45 Gb
CSV (.csv) ~90% of SAS
Stata (.dta) 745 Mb
Parquet (.parquet) 398 Mb

Can partition data by a value (e.g. year) and load all at once

Called “Hive partitioning”:

bef/
├── year=2020/
│   └── part-0.parquet
├── year=2021/
│   └── part-0.parquet
└── year=2022/
    └── part-0.parquet

SAS and Python can load Parquet but not Stata

(StataNow can, but isn’t on DST…)

Should use R anyway 😉

DuckDB

DuckDB is a fairly recent SQL engine designed for analytical queries

SQL, or Structured Query Language, is a language for managing and querying databases

Is impressively fast

  • Faster than almost all other tools.

    • Relatively complex queries (joins, group by, aggregates) on 55 Gb takes < 7.5 min 1.

    • Generally, simpler queries take < 10 seconds for massive datasets.

  • Easily connects with Parquet datasets.

Can be easily used in R

library(arrow)
bef <- open_dataset("path/to/bef") |>
  to_duckdb() 

# Or can use duckplyr
library(duckplyr)
bef <- read_parquet_duckdb("path/to/bef/*/*.parquet")

Integrates with tidyverse and its syntax

library(tidyverse)
bef |> 
  filter(year = 2020) |>
  select(pnr, koen)

SAS recently added DuckDB support, Stata has community packages for it

(Not sure if either are on DST though.)

Should use R anyway 😉

fastreg

https://dp-next.github.io/fastreg/

Key features

  • Converts SAS into Hive partitioned Parquet datasets

  • Automatically handle data updates from DST and Externe registers

  • Can do it in parallel for faster conversion

  • Easily load all years of a register at once and connect to DuckDB, with minimal use of memory

Don’t use StatTransfer on DST, use fastreg

StatTransfer doesn’t create a Hive partition! 😲

Comes with targets pipeline template for easy setup

Recommended to do it this way, create with fastreg::use_template().

A simple demo (without targets)

Need to set location of SAS files and eventual Parquet output location:

options(
  # With a fake project ID and the temporary directory.
  fastreg.project_rawdata_dir = fs::path_temp("E/rawdata/701020/"),
  fastreg.project_workdata_dir = fs::path_temp(
    "E/workdata/701020/parquet-registers/"
  )
)

Converting these SAS files …

E
└── rawdata
    └── 701020
        ├── bef2020.sas7bdat
        ├── bef2021.sas7bdat
        ├── bef2022.sas7bdat
        ├── bef2023.sas7bdat
        ├── bef2024.sas7bdat
        ├── bef2025.sas7bdat
        ├── lmdb2020.sas7bdat
        ├── lmdb2021.sas7bdat
        ├── lmdb2022.sas7bdat
        ├── lmdb2023.sas7bdat
        ├── lmdb2024.sas7bdat
        └── lmdb2025.sas7bdat

… Using this fastreg code …

(note: targets template has better parallel processing)

library(tidyverse)
library(furrr)
# Where SAS files are.
rawdata_dir <- getOption("fastreg.project_rawdata_dir")
# Set where Parquet should be saved.
workdata_dir <- getOption("fastreg.project_workdata_dir")
plan(multisession)
conversion_log <- rawdata_dir |>
  fastreg::list_sas_files() |>
  future_map(\(path) fastreg::convert(path, output_dir = workdata_dir)) |>
  list_rbind()

… Into these Parquet datasets

workdata
└── 701020
    └── parquet-registers
        ├── bef
        │   ├── year=2020
        │   │   └── part-1d51e6.parquet
        │   ├── year=2021
        │   │   └── part-8fc8bb.parquet
        │   ├── year=2022
        │   │   └── part-fdf68f.parquet
        │   ├── year=2023
        │   │   └── part-50b423.parquet
        │   ├── year=2024
        │   │   └── part-d7f44c.parquet
        │   └── year=2025
        │       └── part-b8d709.parquet
        └── lmdb
            ├── year=2020
            │   └── part-bba52a.parquet
            ├── year=2021
            │   └── part-063582.parquet
            ├── year=2022
            │   └── part-59ec6d.parquet
            ├── year=2023
            │   └── part-b3e5e3.parquet
            ├── year=2024
            │   └── part-7de2cc.parquet
            └── year=2025
                └── part-1e4771.parquet

… And you can quickly list all registers as datasets

Requires the options() to be set.

fastreg::list_parquet_datasets()
[1] "E/workdata/701020/parquet-registers/bef" 
[2] "E/workdata/701020/parquet-registers/lmdb"

Parquet is much smaller than SAS, saving on storage costs

  • All SAS files: 452M bytes
  • All Parquet files: 286M bytes

Note: SAS files are larger than normal in DST, so this difference will be bigger.

Reading all years of a register at once

(available for anyone working in that DST project.)

library(fastreg)
options(
  fastreg.project_workdata_dir = "E:/workdata/701010/parquet/"
)
bef <- read_register("bef")
lmdb <- read_register("lmdb")
# A query:  ?? x 4
# Database: DuckDB 1.5.2 [luke@Linux 6.17.0-35-generic:R 4.6.0/:memory:]
   koen pnr          foed_dato   year
  <dbl> <chr>        <date>     <int>
1     2 121380373234 1963-01-29  2020
2     1 922851172404 1990-08-08  2020
3     2 603009341504 1980-11-02  2020
4     1 444957773304 1951-03-24  2020
# ℹ more rows

Tests comparing simple analysis fastreg vs direct from SAS

How long does it take to do a simple count of gender in each 5 years in bef (1 million rows)?

  1. Read in all the data for all the years
  2. Group by year
  3. Summarise number of people by gender
  4. Arrange by year, koen, number

Using fastreg (with Parquet and DuckDB) is blazing fast!

Approach Median time Memory used
From SAS 3.94s 453MB
From RDS 2.64s 453MB
fastreg 63.02ms 1006KB

(Code available here)

Limitations of DuckDB

  • Can’t do advanced statistical analysis

  • A few tidyverse functions don’t work

Solution: Process with DuckDB, collect into R, then do stats

Check out the fastreg website for more details on how to use it

https://dp-next.github.io/fastreg/