A framework for an open and scalable infrastructure for health data exemplified by the DD2 initiative

Purpose of talk and "take home messages"

Will cover:

Introduce the project
Bring awareness to the issue and its importance
How the project will help as a (potential) solution

Won't cover:

Any technical details of the project (it's more of a software/data engineering project)

Not cover...

Another caveat: this is a newly started project, and it is much more on the computational and technical side than maybe many of you are interested in hearing about.

I could easily talk about all the technical details of this project for many hours, but I will spare you that.

Because of this the talk will be on the shorter side, unless people have questions that get more into the details...

Background: Increasing need in science for...

Computational tools and technologies
Secure and reliable IT infrastructure
Greater openness and transparency
More reproducibility of studies
Highly technical skills and knowledge

... especially in relation to data management.

Questions like:

How do you store your data? In what file format?
Where do you store your data and how do you name the files?
How do you keep track of changes to the data?
(For multi-center studies) Who has which datasets and how do you combine them together?
How do you or your collaborators find out which variables are in the data and what they mean?
When there are errors or problems in your data, and you've already analyzed it or published with it, how can you easily determine which publications used the incorrect data, and how can you easily update those publications with the correct data?
How can you easily share your data with colleagues or reviewers to check your findings?

Current barriers to addressing these needs

Funding agencies don't fully recognize these challenges, so don't provide funding
Researchers aren't aware of or don't understand the issues, or don't have the skills to tackle them
People with needed technical skills leave for industry

Recent new funding: NNF Data Science Research Infrastructure 5-year grant

... which led to this current project

And getting the funding for it.

Development of new ... methods and technologies within data science, ..., data engineering, ...

Brief backstory:

Application process started by the DD2 study to update their own infrastructure and SDCA was asked to join

Not going to go into DD2.

Alisa, Annelli, and I were asked to join in the application process, and we ended up taking the lead on it and expanding it into a project that could be used by more than just the DD2 study.

Aims of the Framework

Primary aim: Create and implement an efficient, scalable, and open source data infrastructure framework that connects data collectors, researchers, clinicians, and other stakeholders, with the data, documentation, and findings (starting within the DD2 study)
Secondary aim: Create this framework so that other research groups and companies that are unable to build something similar can relatively easily implement it and modify it as needed for their own purposes.

In short: Make a software product that makes it easier to find, store, and use data for research projects, and make it so that it is easy and free to use for others.

These aims may seem vague, but bear with me.

Again, it might be hard to grasp what these aims actually mean in practice. That's where visuals are really powerful.

Detailed schematic of the framework for building a data infrastructure for scientific studies.

A few example uses of this Framework

A solo researcher or a small research group starts a study to have data for their PhD students, but has limited funds and technical expertise.
A small startup company that relies on data collection for business needs to get operational quickly, but doesn't yet have funds to hire technical personnel.
A large multi-center, multi-country study has an aim of widely disseminating their data for maximal, and cost-effective, use by their collaborators and others.

All of these could use the framework to abide by the best practices in FAIR data management.

Guiding principles of Framework

Follow and enable FAIR principles
Openly licensed and re-usable (e.g. CC-BY)
State-of-the-art principles and tools in software design
Friendly to beginner and non-technical users

Short-term plan

The full 5-year timeline can be found on the website.

Hire software/data engineers as soon as possible
Develop a "Minimum Viable Product" of the first component within ~2 years
Emphasize making training and documentation targeted to non-technical users throughout project
