Open science and FAIR principles: Under-valued and under-appreciated

Luke Johnston

Learning objectives

  1. Accept that if you do any data management, you will struggle a lot and that's normal

  2. Explain the definition of open and what it ultimately means for research

  3. Describe at least some of the components of open science

  4. Explain what FAIR is and its importance to research

  5. Understand why creating open and FAIR data is hard... and why managing non-open and non-FAIR data is harder

  6. Know about some practical ways to make data FAIRer and more open

Slides are more text heavy, to use as reference later.

Not going to read these in detail, briefly summarize

What is open science?

Show of hands, who could explain it?

Aka: open research or open scholarship

I'll be explaining it anyway.

Brief history on open science: Initially about open access to papers

"Open": Usually defined by the copyright license assigned to it (e.g. Creative Commons).

Definition from Open Knowledge Foundation and Open Source Initiative.

Who could explain open access?

Depending on how you define it, "open" could simply mean accessible to more people. If that's the case, "open" science started when journals came into existence, because before that scientists were very secretive about sharing results.

But let's go with the more modern definition of open.

An example of a license is the one on these slides (point to footer).

But there are many components to open science... though most are neglected

[Diagram: Open Science linked to Open Materials/Methods, Open Infrastructure, Open Access, Open Data, Open Code (Reproducibility), and Open Source]

Focus on what gets excluded, e.g. open source, code, open metadata standards, open infrastructure, etc

Can anyone tell me the difference between Open Code and Open Source?

Has anyone ever heard of Open Materials/Methods or Open Infrastructure?

  • Reagents or lab protocols.

Open science isn't completely binary: It's a spectrum

+ Open access (like preprints)
+ Open protocol
+ Open data/data format

+ Open analysis plan/code
+ Open source (like software used)

A bit open is better than nothing open.

Even within each of these components, "openness" can be a spectrum.

I'll get into why this is important for data, but not for science in general.

What is open data?

Most basic level: Is publicly accessible with an open license.

Can be open and follow law, not mutually exclusive, just trickier.

A note on the previous talk: you can have openness and still follow the law on privacy and security. It takes a bit more work, but it's definitely possible in many cases (not all, e.g. DST).

Discussion time (~6 min) 💬

Is data with a closed license less useful? 🤔 ... it's complicated ... 😕

  1. Why is it complicated?
  2. How can "closed" data still be useful?

  1. Think on your own for ~1 min.

  2. Discuss with your neighbour for ~4 minutes.

  3. Share all together for ~1 min.

Data can be closed, but still "more easily accessible" and be useful to science

Example: UK Biobank as an "open access" data resource (really called "gated-access")

Part of UK Biobank's License

"UK Biobank hereby grants to the Applicant a revocable, worldwide, royalty-free, non-exclusive, non-transferable licence (but not any ownership rights) during the Term to use the Materials for the Permitted Purpose, subject to the terms and conditions of this Agreement."

Discussion time with everyone (~6 min) 💬

... but it is even more complicated than that... 😬. Why doesn't making a dataset accessible (with or without a closed license) make it useful or even usable?

For instance, UK Biobank is useful, usable, mostly-accessible, but not open.

More to openness than an open license or "open access"

... it needs to be FAIR too!

  1. Findable

  2. Accessible

  3. Interoperable

  4. Reusable

Emphasize that while open data is a thing, the practicalities of it, e.g. its "physical" infrastructure, aren't always so open or well defined.

Interoperable and Reusable: Can it be easily integrated with other tools or re-used?

Answer these questions:

  • If the software to import, export, or load the data is closed (e.g. Stata, SAS)... Yes or no?

  • If file format is closed or proprietary (e.g. Excel, SAS)... Yes or no?

  • If metadata isn't included... Yes or no?

  • If the variables in the dataset aren't in a standard location... Yes or no?

  • If a numeric value is used to indicate missingness (e.g. 99)... Yes or no?

Let's focus on the I and R of FAIR.

Keyword: easily.

For instance, why did Excel change their file ending from xls to xlsx?

Open software that loads closed formats usually has to be reverse engineered, since companies don't share how they encode things. E.g. LibreOffice for docx.

For instance, some are on the first line, others are not.
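
To make the last question concrete, here is a minimal sketch of what can go wrong when a numeric code such as 99 stands in for a missing value. The glucose values and the `MISSING_CODE` name are made up for illustration only.

```python
from statistics import mean

# Hypothetical fasting glucose values; 99 is a code meaning "missing".
MISSING_CODE = 99
glucose_0min = [5.1, 4.8, 99, 5.6, 99, 5.0]

# Naive analysis: the missing-value code is treated as a real measurement.
naive_mean = mean(glucose_0min)

# Safer: convert the code to None first, then drop missing values explicitly.
cleaned = [None if value == MISSING_CODE else value for value in glucose_0min]
observed = [value for value in cleaned if value is not None]
clean_mean = mean(observed)

print(f"Mean with 99 treated as data: {naive_mean:.2f}")
print(f"Mean of observed values only: {clean_mean:.2f}")
```

Anyone re-using the data has to know about the code in advance (for instance from the metadata) to get the second result rather than the first, which is why this makes the data harder to re-use easily.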

... making things FAIR is hard! 🫣

  • Even harder when researchers neither understand nor value the skills needed for it.
  • There are whole, highly technical careers dedicated to data/software.

A lot of considerations need to go into making a dataset usable and open.

We didn't really need to worry about this before, so why now?

A few decades ago this wasn't really an issue at all, or at least not that big of an issue.

Discussion time (~6 min) 💬

  • Why are these things important? And why now?

  • What are some challenges you've faced with data and how it is structured?

  1. Think on your own for ~1 min.

  2. Discuss with your neighbour for ~4 minutes.

  3. Share all together for ~1 min.

Take a minute to think about it before sharing.

Many reasons why it is important 🧐

  • Tightly tied to reproducibility

    • How to verify claim if data can't easily be shared?
  • Increase impact and visibility of your work and more re-use

  • Collection and volume are getting larger and more complex

    • Very cost-efficient to fund larger studies that share widely
  • Cross-country and institute collaboration continues to increase

    • Need easier ways to collaborate effectively

Especially for those of the older generation, data was different. We have SO much of it now. The skills needed are way more complicated and difficult.

More time spent considering these things = less time spent later on

More effort and more time spent for both users and data owners if...

  • More restrictions (example: closed copyright, gated-access)
    • Need lawyers for license
    • Need enforcement systems
    • Need gated access control systems
    • Need personnel to manage systems and deal with issues
  • Not well documented and not FAIR
    • Need personnel (likely grad students/postdocs) to deal with basic issues
    • Need in-person training/onboarding to work with data
    • Need software license to access/use data

UK Biobank Data Showcase: "Legal notice: Without a written licence, you may not copy, reproduce, republish, download, distribute, make available to the public or otherwise use any of the content displayed on this website in whole or in part or permit or assist any third party to do the same, except to the extent permitted at law."

Can't make an R package to help organize or display this information.

Need lawyers if someone violates copyright.

Need system to report violations.

Need people to check if it is a violation.

A simple switch from this license to CC-BY means that you don't have those issues any more.

So many benefits... Why isn't this normal then?

One reason: Not valuing software and data engineering in research environments.

  • Mention data engineering here.

Um, how is data and software engineering relevant?

Make computers do more work, so we humans can be creative and think.

Initially more of IT or software engineering work

Source: Google Trends

Data engineering is still a very new area

  • Industry > academia

  • Some companies > others

    • Many have manual/non-standardized processes, with data in e.g. Excel
  • Some research institutes (e.g. UK Biobank) > others

    • Overall, research is far behind best practices
  • Some fields (physics, ecology) > biomedical/health research

    • Likely because of funding, privacy, and IP issues

Industry tends to be better at it than academia.

There are a surprising number of cases where someone is hired as a data scientist or data engineer and spends their entire job fixing problems caused by processes living in Excel and doing manual work, none of which was described in the job ad.

How does engineering relate to open and FAIR?

By making use of some of their principles! 🤩

  • Composable

  • Modular

  • Single-purpose

  • Interoperable

    • (across ALL components of science)

Think of Lego!

All together ➡️ build ecosystems

Discussion time (~6 min) 💬

How do you think data and software engineering can help?

  1. Think on your own for ~1 min.

  2. Discuss with your neighbour for ~4 minutes.

  3. Share all together for ~1 min.

So I want to hear what you think might help you out? Whether you know a bit or nothing about engineering, what can you

Might be making theoretical sense, but how? Practically?

Data engineering can make data FAIRer 🤩. How exactly?

  • Building APIs to connect with other ecosystems (A, I, R)

  • Automatically implementing and checking standards for metadata, file formats (CSV, JSON), and file/folder organization (F, A, I, R)

  • Using modern, efficient, and open database software and file formats like Postgres, DuckDB, and Parquet (A, I, R)

  • Automatic versioning and publishing of data (F, A, R)

  • Using Git and GitHub to collaboratively build (and share) the infrastructure to manage data (F, I, R)

Example: I converted the DST SAS file formats into Parquet storage and can load them into DuckDB... for MUCH faster analysis.

We have versions for software, for papers/manuscripts. Why not data?
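
One small building block for versioning data is a content fingerprint: a hash that changes whenever the file changes. The sketch below uses only the Python standard library; the `data_fingerprint` function and the file contents are made up for illustration, and this is not how any particular versioning tool is implemented.

```python
import hashlib
import tempfile
from pathlib import Path


def data_fingerprint(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    digest = hashlib.sha256()
    with path.open("rb") as file:
        # Read in chunks so large data files don't need to fit in memory.
        for chunk in iter(lambda: file.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as folder:
    data_file = Path(folder) / "data.csv"

    data_file.write_text("id,glucose_0min\n1,5.1\n2,4.8\n")
    version_1 = data_fingerprint(data_file)

    # Re-saving identical content gives the same fingerprint.
    data_file.write_text("id,glucose_0min\n1,5.1\n2,4.8\n")
    unchanged = data_fingerprint(data_file)

    # Any change to the data, however small, gives a new fingerprint.
    data_file.write_text("id,glucose_0min\n1,5.1\n2,4.9\n")
    version_2 = data_fingerprint(data_file)

print("Same content, same fingerprint:", version_1 == unchanged)
print("Changed content, new fingerprint:", version_1 != version_2)
```

Recording a fingerprint like this alongside a version number lets anyone check that the file they analysed is the same one that was published.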

Example: We have a (recent NNF-funded) project on an open infrastructure for data:

Data Infrastructure Framework (DIF) Project (title WIP)

Not going to go into any detail here, just a description.
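
As a rough idea of what "automatically checking standards" from the list above could look like in practice, here is a minimal sketch that checks two things: that column names follow a naming rule, and that every column is described in the metadata. The naming rule, the `check_dataset` function, and the example data are all assumptions made for illustration, not part of any existing tool.

```python
import csv
import io
import re

# Hypothetical naming rule: lowercase letters, digits, and underscores only.
VARIABLE_NAME_PATTERN = re.compile(r"^[a-z][a-z0-9_]*$")


def check_dataset(data_csv: str, metadata_csv: str) -> list[str]:
    """Return a list of problems found; an empty list means all checks passed."""
    data_columns = next(csv.reader(io.StringIO(data_csv)))
    metadata_rows = csv.DictReader(io.StringIO(metadata_csv))
    documented = {row["variable"] for row in metadata_rows}

    problems = []
    for column in data_columns:
        if not VARIABLE_NAME_PATTERN.match(column):
            problems.append(f"'{column}' does not follow the naming rule")
        if column not in documented:
            problems.append(f"'{column}' is missing from the metadata")
    return problems


data = "glucose_0min,glucose_30min,Weight (kg)\n5.1,7.9,82\n"
metadata = (
    "variable,description\n"
    "glucose_0min,Fasting glucose measured before the glucose challenge.\n"
    "glucose_30min,Glucose measured 30 minutes after glucose challenge.\n"
)

for problem in check_dataset(data, metadata):
    print(problem)
```

A check like this could run every time the data is updated, so problems are caught by the computer rather than by the next person who tries to use the data.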

How can I make my data more open and FAIR?

Question: How many work with human data?

Before continuing, I want to know how many here work with human data?

As we learned in the previous section, human data has special considerations. If you don't work with personal data, you have it easy!

Simple first steps to making data FAIRer:

  1. Advocate for making data open and FAIR (push for funding, hiring, learning)

  2. Save as CSV (or other open file format)

  3. Use consistent and descriptive variable names, example:

    glucose_0min, glucose_30min, family_history_t2dm, gender, birth_year
  4. Include a metadata file (example name metadata.csv) with the data, example:

    variable, description
    glucose_0min, Fasting glucose measured before the glucose challenge.
    glucose_30min, Glucose measured 30 minutes after glucose challenge.
    weight, Weight measured on a standard scale, in underclothes.
    height, Height measured without shoes.
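
The steps above can be sketched in a few lines of Python using only the standard library. The records below are made-up values, and the file names `data.csv` and `metadata.csv` are just the example names used here.

```python
import csv
import tempfile
from pathlib import Path

# Made-up example records; the variable names follow the examples above.
records = [
    {"glucose_0min": 5.1, "glucose_30min": 7.9, "weight": 82.0, "height": 178.0},
    {"glucose_0min": 4.8, "glucose_30min": 7.2, "weight": 64.5, "height": 165.0},
]
descriptions = {
    "glucose_0min": "Fasting glucose measured before the glucose challenge.",
    "glucose_30min": "Glucose measured 30 minutes after glucose challenge.",
    "weight": "Weight measured on a standard scale, in underclothes.",
    "height": "Height measured without shoes.",
}

with tempfile.TemporaryDirectory() as folder:
    data_path = Path(folder) / "data.csv"
    metadata_path = Path(folder) / "metadata.csv"

    # Save the data itself as CSV.
    with data_path.open("w", newline="", encoding="utf-8") as file:
        writer = csv.DictWriter(file, fieldnames=list(descriptions))
        writer.writeheader()
        writer.writerows(records)

    # Save one row of metadata per variable, next to the data.
    with metadata_path.open("w", newline="", encoding="utf-8") as file:
        writer = csv.writer(file)
        writer.writerow(["variable", "description"])
        writer.writerows(descriptions.items())

    # Read both back to confirm every column is documented.
    with data_path.open(newline="", encoding="utf-8") as file:
        columns = next(csv.reader(file))
    with metadata_path.open(newline="", encoding="utf-8") as file:
        documented = [row["variable"] for row in csv.DictReader(file)]

print(columns == documented)
```

Writing the files with the `csv` module rather than by hand has a practical benefit: the description for `weight` contains a comma, and the module quotes that field automatically so the file still reads back as two columns.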

This is the biggest thing you can do to make it FAIRer. This is especially important if you personally don't own or have control over what happens to the data.

This is the second biggest thing you can do to make it FAIRer. Any type of data, either raw data or your results as a data file, save as a CSV.

Practical ways of licensing and sharing data

Licenses (if you have ownership), as easy as having license text in a file (LICENSE.md) with the data:

Sharing can also be easy! (depends on size of data though)

Discussion time (~6 min) 💬

How might you be able to make your data more open and FAIR? What are some challenges you foresee?

  1. Think on your own for ~1 min.

  2. Discuss with your neighbour for ~4 minutes.

  3. Share all together for ~1 min.

Key messages linked to learning objectives

  1. If you're struggling with data... that is totally normal!!

  2. Open = license used (CC-BY). Easier collaboration, re-use, and management.

  3. Open science has many components, like open data and open source.

  4. FAIR: Findable, Accessible, Interoperable, Re-usable. Includes file format, software used, metadata. Makes your own work easier.

  5. Creating open, FAIR data needs skill and knowledge. Saves time during management and analysis.

  6. Advocacy is the first step to making data FAIR and open. The second step is to use CSV or other open formats.
