April 4, 2024
Speed impacts time to results
Ability to do more complex analysis with more data
Resources (which cost money)
Spread awareness of tools that let you do research faster and focus on the science and knowledge generation.
Spread knowledge of current practices in structuring and using data.
Strongly advocate for, and give a rationale for, using R, especially within DST.
Most data formats are row-based, like CSV. Newer formats tend to be column-based.
File type | Size |
---|---|
SAS (.sas7bdat) | 1.45 GB |
CSV (.csv) | ~90% of SAS |
Stata (.dta) | 745 MB |
Parquet (.parquet) | 398 MB |
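A minimal sketch of how such a size comparison could be reproduced in R, assuming the arrow package is installed. The data here is a hypothetical stand-in (the built-in `mtcars` dataset, repeated), not the data behind the table above, so the resulting sizes will differ.

```r
library(arrow)

# Hypothetical example data: a built-in dataset repeated 200 times
# so that the files are large enough to compare meaningfully.
df <- do.call(rbind, replicate(200, mtcars, simplify = FALSE))

# Temporary output paths (assumed names, for illustration only).
csv_path <- tempfile(fileext = ".csv")
parquet_path <- tempfile(fileext = ".parquet")

# Write the same data in a row-based and a column-based format.
write.csv(df, csv_path, row.names = FALSE)
write_parquet(df, parquet_path)

# Compare the file sizes in bytes.
sizes <- c(csv = file.size(csv_path), parquet = file.size(parquet_path))
print(sizes)
```

The same pattern applies to other formats: write the identical data frame to each format and compare `file.size()`.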
Faster than almost all other tools
Relatively complex queries (joins, group by, aggregates) on 55 GB take < 7.5 min
Generally, simpler queries take < 10 seconds for massive datasets
Easily connects with Parquet datasets
Takes < 6 seconds
Integrates with tidyverse
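The points above can be sketched in R. The tool is not named in this text, so as an assumption this example uses the arrow package together with dplyr, which is one way to query a Parquet dataset with tidyverse verbs. The dataset is a hypothetical stand-in built from `mtcars`.

```r
library(arrow)
library(dplyr)

# Hypothetical example: write a small partitioned Parquet dataset
# to a temporary directory (one folder per value of `cyl`).
dataset_dir <- tempfile("parquet_dataset_")
write_dataset(mtcars, dataset_dir, format = "parquet", partitioning = "cyl")

# Connect to the dataset without loading it into memory, then
# describe the query with dplyr verbs. Nothing is computed until collect().
result <- open_dataset(dataset_dir) |>
  filter(hp > 100) |>
  group_by(cyl) |>
  summarise(mean_mpg = mean(mpg), n = n()) |>
  arrange(cyl) |>
  collect()

print(result)
```

Because the query is only evaluated at `collect()`, only the columns and rows that are needed are read from disk, which is part of why column-based formats tend to be fast to query.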
furrr
Example: Converting 1800 SAS files to Parquet takes <12 hours, compared to >7 days
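A minimal sketch of parallel file conversion with furrr, assuming the furrr, future, and arrow packages are installed. This is not the actual conversion script: small CSV files stand in for the SAS files, and the helper `convert_to_parquet()` is a hypothetical name. For real `.sas7bdat` files, `haven::read_sas()` could take the place of `read.csv()`.

```r
library(furrr)
library(arrow)

# Run the work on 2 background R sessions (worker count is an assumption).
future::plan(future::multisession, workers = 2)

# Hypothetical input: a few small CSV files standing in for SAS files.
input_dir <- tempfile("input_")
output_dir <- tempfile("output_")
dir.create(input_dir)
dir.create(output_dir)
for (i in 1:4) {
  write.csv(mtcars, file.path(input_dir, paste0("file_", i, ".csv")), row.names = FALSE)
}

# Hypothetical helper: read one file and write it back out as Parquet.
convert_to_parquet <- function(path, output_dir) {
  data <- read.csv(path)
  out_name <- paste0(tools::file_path_sans_ext(basename(path)), ".parquet")
  out_path <- file.path(output_dir, out_name)
  arrow::write_parquet(data, out_path)
  out_path
}

# Convert all files in parallel; returns the output paths.
input_files <- list.files(input_dir, pattern = "\\.csv$", full.names = TRUE)
output_files <- future_map_chr(input_files, convert_to_parquet, output_dir = output_dir)

# Return to sequential processing when done.
future::plan(future::sequential)
```

Each file is converted independently, so the work splits cleanly across workers; the speed-up depends on the number of cores and on disk throughput.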
Install the dstDataPrep package, located in workspace/luke/dstDataPrep/
List all databases we have with dstDataPrep::list_databases()
Access and easily load our databases with dstDataPrep::load_databases() (e.g. "bef")
Licensed under CC-BY 4.0.
Slides at slides.lwjohnst.com