Parquet, DuckDB, and fastreg for blazingly fast data analysis in the Danish Registers

Luke W. Johnston

June 23, 2026

Outline of talk

Rationale and aim
Parquet: A modern format for efficiently storing and loading data
DuckDB: Fast and powerful data processing and analysis
fastreg: Convert SAS to Parquet and quickly load registers

Rationale for this talk

Recent developments spark a push for better tooling

Big picture: Academia is woefully behind in using and knowing modern computing tools

This impacts our ability to do research effectively and productively

Working with massive data has its own challenges and needs

DST recently requires limiting RAM and resource usage

DST charges for storage space

Aim of this talk

Spread knowledge of better practices in structuring and working with data.
Showcase fastreg to improve how to work more effectively on DST.

Parquet file format

https://parquet.apache.org/

Parquet is a column-based data storage format

Most data formats are row-based, like CSV. Newer formats tend to be column-based.

Row-based

name,gender,age
Tim,Man,30
Sam,Woman,30
Jenny,Woman,25

Column-based

name,Tim,Sam,Jenny
gender,Man,Woman,Woman
age,30,30,25

Column-based storage has many advantages

Compression

name,Tim,Sam,Jenny
gender,Man,Woman,Woman
age,30,30,25

…becomes…

name,Tim,Jenny,Sam
gender,Man,Woman{2}
age,30{2},25

Loading

Computers read by lines
Per line = same data type
Only read needed columns

gender,Man,Woman,Woman
age,30,30,25

Parquet is 50-75% smaller than other formats

File size between CSV, Parquet, Stata, and SAS for `bef` register for 2017. Note: SAS files on DST are bigger than normal.
File type	Size (MB)
SAS (`.sas7bdat`)	1.45 Gb
CSV (`.csv`)	~90% of SAS
Stata (`.dta`)	745 Mb
Parquet (`.parquet`)	398 Mb

Can partition data by a value (e.g. year) and load all at once

Called “Hive partitioning”:

bef/
├── year=2020/
│   └── part-0.parquet
├── year=2021/
│   └── part-0.parquet
└── year=2022/
    └── part-0.parquet

SAS and Python can load Parquet but not Stata

(StataNow can, but isn’t on DST…)

Should use R anyway 😉

DuckDB

DuckDB is a fairly recent SQL engine designed for analytical queries

SQL, or Structured Query Language, is a language for managing and querying databases

Is impressively fast

Faster than almost all other tools.
- Relatively complex queries (joins, group by, aggregates) on 55 Gb takes < 7.5 min ¹.
- Generally, simpler queries take < 10 seconds for massive datasets.
Easily connects with Parquet datasets.

Can be easily used in R

library(arrow)
bef <- open_dataset("path/to/bef") |>
  to_duckdb()

# Or can use duckplyr
library(duckplyr)
bef <- read_parquet_duckdb("path/to/bef/*/*.parquet")

Integrates with tidyverse and its syntax

library(tidyverse)
bef |>
  filter(year = 2020) |>
  select(pnr, koen)

SAS recently added DuckDB support, Stata has community packages for it

(Not sure if either are on DST though.)

Should use R anyway 😉

fastreg

https://dp-next.github.io/fastreg/

Key features

Converts SAS into Hive partitioned Parquet datasets
Handles data updates from DST and Externe registers
Can do it in parallel for faster conversion
Easily load all years of a register at once and connect to DuckDB, with minimal use of memory

Don’t use StatTransfer on DST, use fastreg

StatTransfer doesn’t create a Hive partition! 😲

Comes with targets pipeline template for easy setup

Recommended to do it this way, create with fastreg::use_template().

A simple demo (without targets)

Need to set location of SAS files and eventual Parquet output location:

options(
  # This is a random (fake?) project ID.
  fastreg.project_rawdata_dir = "E:/rawdata/701020/",
  fastreg.project_workdata_dir = "E:/workdata/701020/parquet/"
)

Converting these SAS files …

rawdata
└── 701020
    ├── bef2020.sas7bdat
    ├── bef2021.sas7bdat
    ├── bef2022.sas7bdat
    ├── bef2023.sas7bdat
    ├── bef2024.sas7bdat
    ├── bef2025.sas7bdat
    ├── lmdb2020.sas7bdat
    ├── lmdb2021.sas7bdat
    ├── lmdb2022.sas7bdat
    ├── lmdb2023.sas7bdat
    ├── lmdb2024.sas7bdat
    └── lmdb2025.sas7bdat

… Using this fastreg code …

(note: targets template has better parallel processing)

library(tidyverse)
library(furrr)
# Where SAS files are.
rawdata_dir <- getOption("fastreg.project_rawdata_dir")
# Set where Parquet should be saved.
workdata_dir <- getOption("fastreg.project_workdata_dir")
plan(multisession)
conversion_log <- rawdata_dir |>
  # Show *all* SAS files recursively in `rawdata/`.
  fastreg::list_sas_files() |>
  future_map(\(path) fastreg::convert(path, output_dir = workdata_dir)) |>
  list_rbind()

… Into these Parquet datasets

workdata
└── 701020
    └── parquet
        ├── bef
        │   ├── year=2020
        │   │   └── part-0b0c4a.parquet
        │   ├── year=2021
        │   │   └── part-ec92ba.parquet
        │   ├── year=2022
        │   │   └── part-3b4b7a.parquet
        │   ├── year=2023
        │   │   └── part-b5a82a.parquet
        │   ├── year=2024
        │   │   └── part-e8c3ba.parquet
        │   └── year=2025
        │       └── part-c78fd7.parquet
        └── lmdb
            ├── year=2020
            │   └── part-753a15.parquet
            ├── year=2021
            │   └── part-c5f56c.parquet
            ├── year=2022
            │   └── part-367cf8.parquet
            ├── year=2023
            │   └── part-03acdc.parquet
            ├── year=2024
            │   └── part-f50253.parquet
            └── year=2025
                └── part-f3fc6a.parquet

… And you can quickly list all registers as datasets

Requires the options() to be set.

fastreg::list_parquet_datasets()

[1] "E:/workdata/701020/parquet/bef"  "E:/workdata/701020/parquet/lmdb"

Parquet is much smaller than SAS, saving on storage costs

All SAS files: 452M bytes
All Parquet files: 286M bytes

Note: SAS files are larger than normal in DST, so this difference will be bigger.

Reading all years of a register at once

(available for anyone working in that DST project.)

library(fastreg)
options(
  fastreg.project_workdata_dir = "E:/workdata/701010/parquet/"
)
bef <- read_register("bef")
lmdb <- read_register("lmdb")

# A query:  ?? x 4
# Database: DuckDB 1.5.2 [luke@Linux 6.17.0-35-generic:R 4.6.0/:memory:]
   koen pnr          foed_dato   year
  <dbl> <chr>        <date>     <int>
1     2 121380373234 1963-01-29  2020
2     1 922851172404 1990-08-08  2020
3     2 603009341504 1980-11-02  2020
4     2 444957773304 1951-03-24  2020
# ℹ more rows

Tests comparing simple analysis fastreg vs direct from SAS or Rds

How long does it take to do a simple count of gender in each 5 years in bef (1 million rows)?

Read in all the data for all the years
Group by year
Summarise number of people by gender
Arrange by year, koen, number

Using fastreg (with Parquet and DuckDB) is blazing fast!

Approach	Median time	Memory used
From SAS	3.58s	453MB
From RDS	2.32s	453MB
fastreg	58.06ms	1006KB

(Code available here)

Limitations of DuckDB/fastreg

Can’t do advanced statistical analysis. Instead, process data with DuckDB, collect into R, then do stats.
A few tidyverse functions don’t work.
fastreg doesn’t fix problems within the data itself, only converts/reads.
Updates to registers of the same year maybe lead to duplicates in conversion.

Check out the fastreg website for more details on how to use it

https://dp-next.github.io/fastreg/