class: center, middle, inverse, title-slide

# A framework for an open and scalable infrastructure for health data exemplified by the DD2 initiative

---
layout: true

<style type="text/css">
.footer-right {
    background-color: #FFFFFF;
    position: absolute;
    bottom: 10px;
    right: 8px;
    height: 60px;
    width: 30%;
    font-size: 11pt;
}
</style>

.footer-right[
Website: [steno-aarhus.github.io/dif-project](https://steno-aarhus.github.io/dif-project)

Slides: [slides.lwjohnst.com/steno/2022-04-26](https://slides.lwjohnst.com/steno/2022-04-26/)
]

---
???

## Setting the stage

Imagine that you are a new professor, just starting to get a group and research programme going... or a solo researcher or a small research group starting a study to have data for your PhD students, but with limited funds and technical expertise.

- Or, you are a small startup company trying to get investment and build income quickly... you work in the research realm, so you need to follow best practices/requirements for data management... and you rely on data collection for business. You need to get operational quickly, but don't yet have funds to hire technical personnel.
- Or, you are a large, multi-national/center consortium that wants to keep better track of who's working on what, and how to discover and share data added to the project... or that aims to widely disseminate its data for maximal, and cost-effective, use by its collaborators and others.

All of these could use the framework to abide by the best practices in FAIR data management.

---

## Data Infrastructure Framework (DIF) Project

???

We're still working out a better name, but for now we're calling it DIF.

These aims may seem vague, but bear with me.

--

1. **Primary aim**: Create and implement an efficient, scalable, and open source data infrastructure framework that connects data collectors, researchers, clinicians, and other stakeholders with the data, documentation, and findings (starting within the DD2 study)

--

2. **Secondary aim**: Create this framework so that *other research groups and companies*, who are unable to build something similar themselves, can relatively easily implement it and modify it as needed for their own purposes.

--

> In short: Make a software product that makes it easier to find, store, and use data for research projects while abiding by best practices, and make it easy and free for others to use.

???

Again, it might not be easy to grasp what this actually means.

---
class: middle

## Why is this important?

🤔 **Large trends across science in computing, data quantity, accountability, transparency**

???

Increasing need in science for...

- Computational tools and technologies
- Secure and reliable IT infrastructure
- Greater openness and transparency
- More reproducibility of studies
- Highly technical skills and knowledge

*... especially in relation to data management.*

Questions like:

- How do you store your data? In what file format?
- Where do you store your data and how do you name the files?
- How do you keep track of changes to the data?
- (For multi-center studies) Who has which datasets and how do you combine them?
- How do you or your collaborators find out what variables there are in the data and what they mean?
- When there are errors or problems in your data, and you've already analyzed it or published with it, how can you easily determine which publications used the incorrect data, and how can you easily update those publications with the correct data?
- How can you easily share your data with colleagues or reviewers to check your findings?

---
class: middle

## Past and current barriers
Lack of funding, awareness, understanding, skill, and knowledge

???

- Funding agencies don't fully recognize these challenges, so they don't provide funding
- Researchers aren't aware of or don't understand the issues, or don't have the skills to tackle them
- People with the needed technical skills leave for industry

---
class: middle

## Recent new funding

💰: NNF Data Science Research Infrastructure 5-year grant

.footnote[Which led to this DIF Project and getting the funding for it 🤩]

???

Development of new ... methods and technologies within data science, ..., data engineering, ...

---

## Guiding principles

1. Follow and enable FAIR principles
2. Openly licensed and re-usable (e.g. CC-BY, MIT)
3. State-of-the-art principles and tools in software and UI design
4. Friendly to beginner and non-technical users

???

---

<img src="images/detailed-schematic.png" width="58%" style="display: block; margin: auto;" />

---

<img src="images/layers.png" width="58%" style="display: block; margin: auto;" />

---

<img src="images/user-1.png" width="58%" style="display: block; margin: auto;" />

---

<img src="images/user-2.png" width="58%" style="display: block; margin: auto;" />

---

<img src="images/user-3.png" width="58%" style="display: block; margin: auto;" />

---

<img src="images/user-4.png" width="58%" style="display: block; margin: auto;" />

---
class: middle

## What similar infrastructures exist?

Found in most large companies and some research-based ones (UK Biobank)...

... but few have the infrastructure itself as the product

???

Show it off?

One plan is to search as thoroughly as possible for similar projects. Software projects are not as easy to find as scientific papers. We know of two similar projects, one in Oslo related to a brain mapping project and another in the US called gen3 that's managed by the University of Chicago. Depending on how they fit our needs and aims, we might "fork" their projects and contribute back to them. (Explain "forking").

---

## Short-term plan
> [Full 5-year timeline found on website.](https://steno-aarhus.github.io/dif-project/#deliverables-and-milestones)

- Hire software/data engineers and build the team as soon as possible
- Develop a "Minimum Viable Product" of the first component within ~2 years
- Emphasize training and documentation targeted to non-technical users throughout the project

---
class: middle

## Interested in being involved or learning more? 🤓

Let us know! 🙋