June 23, 2026
Recent developments spark a push for better tooling
Spread knowledge of better practices in structuring and working with data.
Showcase fastreg to improve how to work more effectively on DST.
Most data formats are row-based, like CSV. Newer formats tend to be column-based.
| File type | Size (MB) |
|---|---|
SAS (.sas7bdat) |
1.45 Gb |
CSV (.csv) |
~90% of SAS |
Stata (.dta) |
745 Mb |
Parquet (.parquet) |
398 Mb |
Called “Hive partitioning”:
(StataNow can, but isn’t on DST…)
Should use R anyway 😉
Faster than almost all other tools.
Relatively complex queries (joins, group by, aggregates) on 55 Gb takes < 7.5 min 1.
Generally, simpler queries take < 10 seconds for massive datasets.
Easily connects with Parquet datasets.
(Not sure if either are on DST though.)
Should use R anyway 😉
Converts SAS into Hive partitioned Parquet datasets
Automatically handle data updates from DST and Externe registers
Can do it in parallel for faster conversion
Easily load all years of a register at once and connect to DuckDB, with minimal use of memory
StatTransfer doesn’t create a Hive partition! 😲
Recommended to do it this way, create with fastreg::use_template().
Need to set location of SAS files and eventual Parquet output location:
E
└── rawdata
└── 701020
├── bef2020.sas7bdat
├── bef2021.sas7bdat
├── bef2022.sas7bdat
├── bef2023.sas7bdat
├── bef2024.sas7bdat
├── bef2025.sas7bdat
├── lmdb2020.sas7bdat
├── lmdb2021.sas7bdat
├── lmdb2022.sas7bdat
├── lmdb2023.sas7bdat
├── lmdb2024.sas7bdat
└── lmdb2025.sas7bdat
(note: targets template has better parallel processing)
library(tidyverse)
library(furrr)
# Where SAS files are.
rawdata_dir <- getOption("fastreg.project_rawdata_dir")
# Set where Parquet should be saved.
workdata_dir <- getOption("fastreg.project_workdata_dir")
plan(multisession)
conversion_log <- rawdata_dir |>
fastreg::list_sas_files() |>
future_map(\(path) fastreg::convert(path, output_dir = workdata_dir)) |>
list_rbind()workdata
└── 701020
└── parquet-registers
├── bef
│ ├── year=2020
│ │ └── part-1d51e6.parquet
│ ├── year=2021
│ │ └── part-8fc8bb.parquet
│ ├── year=2022
│ │ └── part-fdf68f.parquet
│ ├── year=2023
│ │ └── part-50b423.parquet
│ ├── year=2024
│ │ └── part-d7f44c.parquet
│ └── year=2025
│ └── part-b8d709.parquet
└── lmdb
├── year=2020
│ └── part-bba52a.parquet
├── year=2021
│ └── part-063582.parquet
├── year=2022
│ └── part-59ec6d.parquet
├── year=2023
│ └── part-b3e5e3.parquet
├── year=2024
│ └── part-7de2cc.parquet
└── year=2025
└── part-1e4771.parquet
Requires the options() to be set.
[1] "E/workdata/701020/parquet-registers/bef"
[2] "E/workdata/701020/parquet-registers/lmdb"
Note: SAS files are larger than normal in DST, so this difference will be bigger.
(available for anyone working in that DST project.)
# A query: ?? x 4
# Database: DuckDB 1.5.2 [luke@Linux 6.17.0-35-generic:R 4.6.0/:memory:]
koen pnr foed_dato year
<dbl> <chr> <date> <int>
1 2 121380373234 1963-01-29 2020
2 1 922851172404 1990-08-08 2020
3 2 603009341504 1980-11-02 2020
4 1 444957773304 1951-03-24 2020
# ℹ more rows
How long does it take to do a simple count of gender in each 5 years in bef (1 million rows)?
| Approach | Median time | Memory used |
|---|---|---|
| From SAS | 3.94s | 453MB |
| From RDS | 2.64s | 453MB |
| fastreg | 63.02ms | 1006KB |
Can’t do advanced statistical analysis
A few tidyverse functions don’t work
Solution: Process with DuckDB, collect into R, then do stats
Licensed under CC-BY 4.0.
Slides at slides.lwjohnst.com