Approaches to open, scalable, and reproducible data management and analysis: Training and software

Outline: My current projects

NNF-funded Data Infrastructure Framework (DIF) Project
Reproducible Research in R hands-on courses with Danish Diabetes Academy

We'll go through these in order, though I don't expect to cover everything.

Data Infrastructure Framework (DIF) Project

Setting the stage

Imagine that you are a new professor, just getting a group and research programme going... or a solo researcher or small research group starting a study to collect data for your PhD students, but with limited funds and technical expertise.

Or, you are a small startup company trying to get investment and build income quickly... You work in the research realm, so you need to follow best practices and requirements for data management, and your business relies on data collection. You need to get operational quickly, but don't yet have the funds to hire technical personnel.

Or, you are a large, multi-national, multi-center consortium that wants to keep better track of who's working on what and make it easier to discover and share data added to the project... or that aims to disseminate its data widely for maximal, cost-effective use by collaborators and others.

All of these could use the framework to abide by the best practices in FAIR data management.

Aims of the DIF Project

We're still working out a better name, but for now we're calling it DIF

These aims may seem vague, but bear with me.

Primary aim: Create and implement an efficient, scalable, and open source data infrastructure framework that connects data collectors, researchers, clinicians, and other stakeholders, with the data, documentation, and findings (starting within the DD2 study)

Check out DIF Project Website for more details.

Just for clarification, infrastructure here means the computational structure of the data and all its supporting structures: for instance, how the files and folders are organized, where the data files are saved and in what file format, and how to connect to the data. In many ways it is like the roads and buildings of a city, where the data are the people moving about.

"Framework", on the other hand, is the bundle or package that contains the instructions to create an infrastructure, which someone can take and use to build the infrastructure somewhere else. You can think of this as the blueprint for building a city.

Secondary aim: Create this framework so that other research groups and companies, who are unable to build something similar themselves, can relatively easily implement it and modify it as needed for their own purposes.

In short: Make a software product that makes it easier to find, store, and use data for research projects that abide by best practices, and make it so that it is easy and free to use for others.

Again, these aims might not make it tangible what this actually means in practice.

Why is this important? 🤔

Large trends across science in computing, data quantity, accountability, transparency

Increasing need in science for...

Computational tools and technologies
Secure and reliable IT infrastructure
Greater openness and transparency
More reproducibility of studies
Highly technical skills and knowledge ... especially in relation to data management.

Questions like:

How do you store your data? In what file format?
Where do you store your data and how do you name the files?
How do you keep track of changes to the data?
(For multi-center studies) Who has which datasets and how do you combine them together?
How do you or your collaborators find out what variables there are in the data and what they mean?
When there are errors or problems in your data, and you've already analyzed it or published with it, how can you easily determine which publications used the incorrect data, and how can you easily update them with the correct data?
How can you easily share your data with colleagues or reviewers to check your findings?
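
As a rough illustration of the "keep track of changes to the data" question, one low-tech approach is to record a checksum for every data file and compare the records later. The sketch below is in Python and is only an assumption-laden example: the function names, the flat folder of CSV files, and the made-up file contents are all hypothetical and are not part of the DIF Project.

```python
import hashlib
import tempfile
from pathlib import Path


def file_checksum(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def build_manifest(data_dir: Path) -> dict[str, str]:
    """Map each CSV file name in data_dir to its checksum (hypothetical layout)."""
    return {p.name: file_checksum(p) for p in sorted(data_dir.glob("*.csv"))}


def changed_files(old: dict[str, str], new: dict[str, str]) -> list[str]:
    """List files that were added, removed, or modified between two manifests."""
    names = sorted(set(old) | set(new))
    return [name for name in names if old.get(name) != new.get(name)]


def demo() -> list[str]:
    """Create two made-up data files, edit one, and report what changed."""
    with tempfile.TemporaryDirectory() as tmp:
        data_dir = Path(tmp)
        (data_dir / "participants.csv").write_text("id,age\n1,54\n")
        (data_dir / "visits.csv").write_text("id,visit\n1,baseline\n")
        before = build_manifest(data_dir)
        # Simulate a correction to one value in one file.
        (data_dir / "participants.csv").write_text("id,age\n1,45\n")
        after = build_manifest(data_dir)
        return changed_files(before, after)
```

Calling `demo()` reports only the file whose contents were edited. A manifest like this says that something changed, not what or why, so in practice it would sit alongside version control or a changelog.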

Past and current barriers: Lack of funding, awareness, understanding, skill, and knowledge

Funding agencies don't fully recognize these challenges, so don't provide funding
Researchers aren't aware of or don't understand the issues, or don't have the skills to tackle them
People with needed technical skills leave for industry

Recent new funding 💰: NNF Data Science Research Infrastructure 5-year grant

Which led to this DIF Project and the funding for it 🤩

Development of new ... methods and technologies within data science, ..., data engineering, ...

Guiding principles

Follow and enable FAIR principles
Openly licensed and re-usable (e.g. CC-BY, MIT)
State-of-the-art principles and tools in software and UI design
Built from software that may be more familiar to researchers/academia
Friendly to beginner and non-technical users

FAIR = Findable Accessible Interoperable Reusable

Interested in being involved or learning more? 🤓 Let me know! 🙋

Check out DIF Project Website.

Reproducible Research in R (r-cubed) courses

Reproducibility, a core principle of science, is rarely done

Reproducibility: Same data + same analysis = same results?
Replication: Same design + different data + same analysis = same results?
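
The two definitions above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions: the data are simulated, the seed stands in for "which dataset was collected", and the function names and the tolerance are hypothetical choices made only for this example.

```python
import random
import statistics


def simulate_data(seed: int, n: int = 100) -> list[float]:
    """Generate made-up measurements; the seed stands in for 'which dataset'."""
    rng = random.Random(seed)
    return [rng.gauss(5.0, 1.0) for _ in range(n)]


def analysis(data: list[float]) -> float:
    """The 'analysis': a deterministic summary (mean rounded to 3 decimals)."""
    return round(statistics.fmean(data), 3)


original_data = simulate_data(seed=1)
new_data = simulate_data(seed=2)

# Reproducibility: same data + same analysis should give exactly the same result.
reproduced = analysis(original_data) == analysis(original_data)

# Replication: different data + same analysis. Results are expected to be
# similar rather than identical, so compare within an illustrative tolerance.
replicated = abs(analysis(original_data) - analysis(new_data)) < 0.5
```

The point of the sketch is the difference in the comparison: reproducibility asks for an identical result from identical inputs, while replication asks whether a new dataset leads to a compatible result.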

Non-replication is a known major problem, but the extent of non-reproducible results is unknown. Barriers to addressing the problem include:
- Lack of incentives to be reproducible
- Emphasis on novelty and original work

We don't share as much as we should

There are few studies on the extent of code and data availability, and on whether study results can be reproduced. The figure shows results from some of them: 1) 10.1177/2515245920918872, 2) 10.1007/s11306-017-1299-3, 3) 10.1371/journal.pone.0251194.

Estimating the reproducibility of scientific studies is currently very difficult because of:
- Nearly non-existent publishing of code/data
- General lack of awareness of and training in it

We in research need more skills in data analysis 👩🏽‍💻

... and for more awareness and training on reproducibility and open science 🤯

While I've been teaching these general topics since my Master's, I started this specific course during my postdoc for two reasons: first, there was a need for more computational skills in my field, and second, awareness of reproducibility and open science was sorely lacking.

Reproducible Research in R (R3 or r-cubed) course/workshop for PhD students and postdocs doing biomedical research

Introduction course: r-cubed.rostools.org
- JOSE paper: 10.21105/jose.00122
Intermediate course: r-cubed-intermediate.rostools.org

The course teaches reproducible research in R to PhD students and postdocs who do biomedical research, largely diabetes research. Participants are working, full-time researchers (including PhD students) rather than undergraduates, and they come to learn data analysis and other practical skills.

This course is 3 full days, consisting of 5 code-along sessions where the instructor types and the learners follow along, a few lectures, and a final group project. For more info on the course, check out the course websites.

Key 🔑 features of course

Multiple in-class learning activities (reading, doing, listening, discussing, teaching, group, and solo)
Openly licensed and easily accessible online
Written not just for participants but also (future) instructors
Largely hands-on (code-along), with limited lectures and slides

Briefly discuss before showing website.

Try out the material and give us feedback on it! 🤓

And to end, please, if you try out the material, let us know! We'd love more feedback on it! Thanks for listening!
