The forgotten child in research: Data engineering and infrastructure

Luke W. Johnston

September 13, 2023

Some questions before starting 🤔

🙋 How many of you have worked with or tried to discover data for projects?

Who am I? 👋

  • Team Leader at Steno Diabetes Center Aarhus and Aarhus University, Denmark
  • Research/work:
    • Teach how to do open and reproducible science
    • Build software to automate research
    • Do epidemiological research

Two main goals of this (informal) talk 🔈

Spreading awareness…

… on the vital importance of the foundation of our data-driven world

What is that foundation? The data itself and the engineering of and around it.

… that innovation and commercialization can come from anywhere

Don't limit thinking to traditional areas such as research results.

The stages and lifecycle of research are like a big family

Unmet and (often) unrecognized basic needs in health research

Personal past experiences: Data management tasks are often given to MSc/PhD students with no training in them

In research organizations: Focus is on the beginning and end of the lifecycle (collecting data and publishing), not the middle

In small- to medium-sized companies: Lack the maturity and/or funds for an internal data engineering team

Novo Nordisk Fonden: Increase impact of funding by expanding use of data from funded projects

Many substantial negative effects of this unmet need

Examples often distill down to wasted time and money

  • Retractions because of data processing errors (e.g. with Excel)

  • Wasted time looking for data or resolving issues with data

  • Time spent learning niche skills to understand or structure data

  • Unusable data because of lack of documentation (e.g. units of measure)

Limited options and solutions for data infrastructure within the research world

… they are often custom-built

… they are often designed for industry, expensive, or “over-engineered”

… they are often heavy on the tech jargon

Our solution: A framework for building a modern data infrastructure

Seedcase: Improving discoverability, structure, and management of research data

Designed for typical use cases of doing research

From Seedcase Project Design Docs

Central philosophies and values

  1. Follow FAIR (Findable, Accessible, Interoperable, and Reusable), open, and transparent principles

  2. Openly licensed and re-usable

  3. Use state-of-the-art principles and tools

  4. Friendly to beginners and non-tech people

Who are we: The team

Sia Kromann Nikolaisen
DD2 Data Manager

Kristiane Beicher
Database Administrator

Richard Ding
Research Software Engineer

Signe Kirk Brødbæk
Research Software Engineer

Future steps: Ensuring financial sustainability

Creating a company around Seedcase

Offer research software development and data engineering services, keeping the product free and open source

  • Consulting
  • Training and education
  • Adding features
  • Sponsorship from industry
  • Embedding it early in newly funded projects
  • Cloud-based products?