Modern, fast, and powerful approaches to working with data on Denmark Statistics

Luke W. Johnston

April 4, 2024

Outline of talk
Rationale for this talk
We are woefully behind in the tools we use and in our knowledge of data engineering and programming
And this impacts our ability to do research effectively and productively
Why is this important?
Aim of this talk
Parquet file format
Parquet is a column-based data storage format
Column-based storage has many advantages
Parquet files are often 50-75% smaller than the same data in other formats
Data can be partitioned by a value (e.g. year) and still be loaded all at once
SAS and Python can load Parquet, but Stata cannot
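
To make the column-based idea concrete, here is a minimal sketch in plain Python. It is not Parquet itself and uses no Parquet library: the tiny table, the column names, and the helper functions are all made up for illustration. It only shows why a column layout lets a query read one column without touching the rest, and what partitioning by year looks like in principle.

```python
# Toy illustration only: made-up data, not a real Parquet reader or writer.

# Row-based layout: each record is kept together (as in a CSV file).
rows = [
    {"id": 1, "year": 2020, "income": 310},
    {"id": 2, "year": 2020, "income": 275},
    {"id": 3, "year": 2021, "income": 402},
]

# Column-based layout: each column is kept together (the idea behind Parquet).
columns = {
    "id": [1, 2, 3],
    "year": [2020, 2020, 2021],
    "income": [310, 275, 402],
}


def mean_income_rows(records):
    """Walk over every whole record even though only one field is needed."""
    return sum(record["income"] for record in records) / len(records)


def mean_income_columns(table):
    """Read only the 'income' column; the other columns are never touched."""
    values = table["income"]
    return sum(values) / len(values)


def partition_by_year(table):
    """Split a column-based table into one smaller table per year (hypothetical helper)."""
    parts = {}
    for i, year in enumerate(table["year"]):
        part = parts.setdefault(year, {name: [] for name in table})
        for name in table:
            part[name].append(table[name][i])
    return parts


parts = partition_by_year(columns)
```

Both mean functions give the same answer, but the column version only needs a single list. In partitioned Parquet datasets each partition is commonly written to its own folder (for example one per year), so a query for a single year can skip the other folders entirely.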
DuckDB
DuckDB is a recent SQL engine designed for analytical queries
SQL, or Structured Query Language, is a language for managing and querying databases
DuckDB is impressively fast
Example in DST
Can be easily used in R
Python can use DuckDB, but Stata and SAS cannot
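
For readers who have not seen SQL before, the following is a small sketch of an analytical query (a grouped count and average). It uses Python's built-in sqlite3 module as a stand-in so that it runs without installing anything; it is not DuckDB and says nothing about DuckDB's speed. The table name, column names, and values are invented for illustration.

```python
import sqlite3

# Stand-in engine: SQLite from the Python standard library, NOT DuckDB.
con = sqlite3.connect(":memory:")

# Hypothetical table with made-up values.
con.execute("CREATE TABLE income (id INTEGER, year INTEGER, amount REAL)")
con.executemany(
    "INSERT INTO income VALUES (?, ?, ?)",
    [(1, 2020, 310.0), (2, 2020, 275.0), (3, 2021, 402.0)],
)

# An analytical query: count the rows and average the amounts within each year.
query = """
    SELECT year, COUNT(*) AS n, AVG(amount) AS mean_amount
    FROM income
    GROUP BY year
    ORDER BY year
"""
result = con.execute(query).fetchall()
con.close()

print(result)
```

The query itself is ordinary SQL (SELECT, GROUP BY, ORDER BY), so the same kind of statement is what one would hand to an analytical engine such as DuckDB.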
Parallel processing
Running multiple sessions or cores at once
Incredibly easy to do in R with furrr
Greatly reduces time to results
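
As a rough sketch of the same idea outside R, the following uses Python's standard concurrent.futures module to run one task per year across several worker processes. It is not furrr; the function names and the placeholder workload are hypothetical, and the executor class is a parameter only so the sketch can be tried with threads as well as processes.

```python
from concurrent.futures import ProcessPoolExecutor


def summarise_year(year):
    """Hypothetical stand-in for a slow, independent per-year analysis step."""
    total = sum(i * i for i in range(year))  # placeholder CPU-bound work
    return year, total


def summarise_in_parallel(years, executor_cls=ProcessPoolExecutor, max_workers=2):
    """Apply summarise_year to each year using a pool of workers.

    Results come back in the same order as the input, like an ordinary map.
    """
    with executor_cls(max_workers=max_workers) as pool:
        return list(pool.map(summarise_year, years))
```

In a script, the call (for example summarise_in_parallel([2019, 2020, 2021])) would normally sit under an `if __name__ == "__main__":` guard, since worker processes may re-import the main module. The gain depends on the tasks being independent of each other and large enough to outweigh the cost of starting the workers.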
Inside DARTER Project?
