Not going to read these in detail, briefly summarize
Slides: slides.lwjohnst.com/steno/2022-10-25 Licensed under CC-BY
Accept that if you do any data management, you will struggle a lot and that's normal
Explain the definition of open and what it ultimately means for research
Describe at least some of the components of open science
Explain what FAIR is and its importance to research
Understand why creating open and FAIR data is hard... and why managing non-open and non-FAIR data is harder
Know about some practical ways to make data FAIRer and more open
Slides are more text heavy, to use as reference later.
Aka: open research or open scholarship
I'll be explaining it anyway.
Who could explain open access?
Depending on how you define it, "open" could simply mean accessible to more people. In that case, "open" science started when journals came into existence, because before that scientists were very secretive about sharing results.
"Open": Usually defined by the copyright license assigned to it (e.g. Creative Commons).
Definition from Open Knowledge Foundation and Open Source Initiative.
But let's go with the more modern definition of open.
An example of a license: these slides (point to footer).
Focus on what gets excluded, e.g. open source, code, open metadata standards, open infrastructure, etc
Can anyone tell me the difference between Open Code and Open Source?
Has anyone ever heard of Open Materials/Methods or Open Infrastructure?
+ Open access (like preprints)
+ Open protocol
+ Open data/data format
+ Open analysis plan/code
+ Open source (like software used)
Even within each of these components, "openness" can be a spectrum.
A bit open is better than nothing open.
I'll get into why this is important for data, but not for science in general.
Most basic level: Is publicly accessible with an open license.
Can be open and follow law, not mutually exclusive, just trickier.
A note on the previous talk: you can be open and still follow the law on privacy and security. It takes a bit more work, but it's definitely possible in many cases (not all, e.g. DST).
Is data with a closed license less useful? 🤔 ... it's complicated ... 😕
- Why is it complicated?
- How can "closed" data still be useful?
Think on your own for ~1 min.
Discuss with your neighbour for ~4 minutes.
Share all together for ~1 min.
Example: UK Biobank as an "open access" data resource (really called "gated-access")
Part of UK Biobank's License
"UK Biobank hereby grants to the Applicant a revocable, worldwide, royalty-free, non-exclusive, non-transferable licence (but not any ownership rights) during the Term to use the Materials for the Permitted Purpose, subject to the terms and conditions of this Agreement."
... but it is even more complicated than that... 😬. Why doesn't making a dataset accessible (with or without a closed license) make it useful or even usable?
For instance, UK Biobank is useful, usable, mostly-accessible, but not open.
... it needs to be FAIR too!
Findable
Accessible
Interoperable
Reusable
Emphasize that while open data is a thing, its practicalities, e.g. the "physical" infrastructure behind it, aren't always so open or well defined.
Let's focus on the I and R of FAIR.
Answer these questions:
Keyword: easily.
If the software to import, export, or load the data is closed (e.g. Stata, SAS)... Yes or no?
If file format is closed or proprietary (e.g. Excel, SAS)... Yes or no?
For instance, why did Excel change their file ending from xls to xlsx?
Usually, open software that loads closed formats has to be reverse engineered, since companies don't share how they encode things, e.g. LibreOffice for docx.
If metadata isn't included... Yes or no?
If the variables in the dataset aren't in a standard location... Yes or no?
For instance, some are on the first line, others are not.
If a numeric value is used to indicate missingness (e.g. 99)... Yes or no?
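To make the missingness question concrete, here is a minimal Python sketch. The ages and the choice of 99 as the missing code are made up for illustration; the point is only that a numeric code silently enters calculations unless every user knows to recode it first.

```python
from statistics import mean

# Hypothetical ages where 99 was entered to mean "missing".
raw_ages = [34, 51, 99, 47, 99, 62]
MISSING_CODE = 99  # assumption: this code would have to be documented in the metadata


def recode_missing(values, missing_code):
    """Replace the numeric missing code with None so it cannot be mistaken for data."""
    return [None if value == missing_code else value for value in values]


def mean_ignoring_missing(values):
    """Average only the observed (non-None) values."""
    observed = [value for value in values if value is not None]
    return mean(observed)


naive_mean = mean(raw_ages)  # silently treats 99 as a real age
clean_ages = recode_missing(raw_ages, MISSING_CODE)
clean_mean = mean_ignoring_missing(clean_ages)

print(f"naive mean: {naive_mean:.1f}, mean after recoding: {clean_mean:.1f}")
```

Note that recoding cannot tell a genuine 99-year-old apart from a missing value, which is one reason an explicit empty cell is easier to reuse than a numeric code.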
A lot of considerations need to go into making a dataset usable and open.
A few decades ago this wasn't really an issue at all, or at least not that big of an issue.
Why are these things important? And why now?
What are some challenges you've faced with data and how it is structured?
Think on your own for ~1 min.
Discuss with your neighbour for ~4 minutes.
Share all together for ~1 min.
Time a minute to think about it before sharing.
Tightly tied to reproducibility
Increase impact and visibility of your work and more re-use
Collection and volume are getting larger and more complex
Cross-country and institute collaboration continues to increase
Especially for those of the older generation, data was different. We have SO much of it now, and the skills needed are far more complicated and difficult.
More effort and more time spent for both users and data owners if...
UK Biobank Data Showcase: "Legal notice: Without a written licence, you may not copy, reproduce, republish, download, distribute, make available to the public or otherwise use any of the content displayed on this website in whole or in part or permit or assist any third party to do the same, except to the extent permitted at law."
Can't make an R package to help organize or display this information.
Need lawyers if someone violates copyright.
Need system to report violations.
Need people to check if it is a violation.
A simple switch from this license to a CC-BY license means that you don't have those issues any more.
One reason: Not valuing software and data engineering in research environments.
Make computers do more work, so we humans can be creative and think.
Initially more of IT or software engineering work
Source: Google Trends
Industry tends to be better at it than academia.
Industry > academia
Some companies > others
There is a surprising number of cases where someone is hired as a data scientist or data engineer and then spends their entire job fixing issues caused by the process living in Excel and doing manual work, none of which was described in the job ad.
Some research institutes (e.g. UK Biobank) > others
Some fields (physics, ecology) > biomedical/health research
By making use of some of their principles! 🤩
Composable
Modular
Single-purpose
Interoperable
Think of Lego!
All together ➡️ build ecosystems
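As a rough illustration of the Lego idea, the Python sketch below builds a small cleaning pipeline out of single-purpose steps. The function names and the example rows are hypothetical, not taken from any particular tool; each step does one thing and can be reused or swapped independently.

```python
from functools import reduce


def strip_whitespace(rows):
    """Trim stray spaces from every variable name and value."""
    return [{key.strip(): value.strip() for key, value in row.items()} for row in rows]


def lowercase_names(rows):
    """Make every variable name lowercase."""
    return [{key.lower(): value for key, value in row.items()} for row in rows]


def drop_empty_rows(rows):
    """Remove rows where every value is empty."""
    return [row for row in rows if any(value != "" for value in row.values())]


def compose(*steps):
    """Chain single-purpose steps into one pipeline, applied left to right."""
    return lambda rows: reduce(lambda current, step: step(current), steps, rows)


clean = compose(strip_whitespace, lowercase_names, drop_empty_rows)

rows = [{" Glucose_0min ": " 5.4 "}, {" Glucose_0min ": "  "}]
result = clean(rows)
print(result)
```

Because every step takes rows and returns rows, the steps are interoperable by construction, and a different pipeline is just a different call to `compose`.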
How do you think data and software engineering can help?
Think on your own for ~1 min.
Discuss with your neighbour for ~4 minutes.
Share all together for ~1 min.
So I want to hear what you think might help you out? Whether you know a bit or nothing about engineering, what can you
Might be making theoretical sense, but how? Practically?
Building APIs to connect with other ecosystems (A, I, R)
Automatically implementing and checking standards for metadata, file formats (CSV, JSON), and file/folder organization (F, A, I, R)
Using modern, efficient, and open database software and storage formats like Postgres, DuckDB, and Parquet (A, I, R)
Example: I converted the DST SAS files into Parquet and can load them into DuckDB... for MUCH faster analysis.
Automatic versioning and publishing of data (F, A, R)
We have versions for software, for papers/manuscripts. Why not data?
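One way data versioning could be automated is sketched below in Python, using only the standard library. This is an assumption-laden illustration rather than an established tool: the manifest file name `versions.json`, the function names, and the rule "new version whenever the checksum changes" are all choices made for this example.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def file_checksum(path):
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def record_version(data_path, manifest_path):
    """Append a new version to the manifest only if the data changed.

    Returns the current version number.
    """
    manifest_path = Path(manifest_path)
    versions = json.loads(manifest_path.read_text()) if manifest_path.exists() else []
    checksum = file_checksum(data_path)
    if not versions or versions[-1]["sha256"] != checksum:
        versions.append({"version": len(versions) + 1, "sha256": checksum})
        manifest_path.write_text(json.dumps(versions, indent=2))
    return versions[-1]["version"]


# Usage with a throwaway folder and made-up data.
with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "data.csv"
    manifest = Path(tmp) / "versions.json"

    data.write_text("id,glucose_0min\n1,5.4\n")
    v1 = record_version(data, manifest)        # first version
    v1_again = record_version(data, manifest)  # unchanged data, same version

    data.write_text("id,glucose_0min\n1,5.4\n2,6.1\n")
    v2 = record_version(data, manifest)        # changed data, new version
```

The checksum also gives users a way to confirm that the file they downloaded is the version a paper refers to.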
Not going to go into any detail here, just a description.
Before continuing, I want to know how many here work with human data?
As we learned in the previous section, human data has special considerations. If you don't work with personal data, you have it easy!
This is the biggest thing you can do to make it FAIRer. This is especially important if you personally don't own or have control over what happens to the data.
Advocate for making data open and FAIR (push for funding, hiring, learning)
Save as CSV (or other open file format)
This is the second biggest thing you can do to make it FAIRer. Any type of data, either raw data or your results as a data file, save as a CSV.
Use consistent and descriptive variable names, example:
glucose_0min, glucose_30min, family_history_t2dm, gender, birth_year
Include a metadata file (example name: metadata.csv) with the data, example:
variable, description
glucose_0min, Fasting glucose measured before the glucose challenge.
glucose_30min, Glucose measured 30 minutes after glucose challenge.
weight, Weight measured on a standard scale, in underclothes.
height, Height measured without shoes.
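The CSV, variable-name, and metadata tips above can be combined into a small automated check. The Python sketch below is one possible approach, assuming a metadata file with `variable` and `description` columns as in the example; the data values and the function names are invented for illustration. It writes the metadata with the standard `csv` module, which quotes descriptions that contain commas, and then lists any data column that lacks a description.

```python
import csv
import io

# Hypothetical data file contents (values are made up).
data_csv = "glucose_0min,glucose_30min,weight,height\n5.4,8.1,70.2,172\n"

metadata_rows = [
    ("glucose_0min", "Fasting glucose measured before the glucose challenge."),
    ("glucose_30min", "Glucose measured 30 minutes after glucose challenge."),
    ("weight", "Weight measured on a standard scale, in underclothes."),
    ("height", "Height measured without shoes."),
]


def write_metadata(rows):
    """Serialize (variable, description) pairs as CSV text.

    csv.writer quotes any description containing a comma, so it stays one field.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["variable", "description"])
    writer.writerows(rows)
    return buffer.getvalue()


def undocumented_variables(data_text, metadata_text):
    """Return the data columns that have no entry in the metadata."""
    columns = next(csv.reader(io.StringIO(data_text)))
    described = {row["variable"] for row in csv.DictReader(io.StringIO(metadata_text))}
    return [name for name in columns if name not in described]


metadata_csv = write_metadata(metadata_rows)
missing = undocumented_variables(data_csv, metadata_csv)
print("Undocumented variables:", missing)
```

In practice the two strings would be read from data.csv and metadata.csv on disk; they are kept in memory here so the sketch runs on its own.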
Licenses (if you have ownership): as easy as having the license text in a file (LICENSE.md) with the data.
Sharing can also be easy! (depends on size of data though)
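As a sketch of how little is needed, the Python snippet below writes a LICENSE.md next to a data file and bundles the folder into a zip archive for sharing. The folder layout, file names other than LICENSE.md, and the choice of CC BY version 4.0 are assumptions for this example; the slides only say CC-BY. Check that you actually own the data before licensing it.

```python
import tempfile
import zipfile
from pathlib import Path

# Assumption: CC BY 4.0 is the license chosen for this example.
LICENSE_TEXT = (
    "This dataset is licensed under the Creative Commons Attribution 4.0 "
    "International License (CC BY 4.0).\n"
    "https://creativecommons.org/licenses/by/4.0/\n"
)


def bundle_with_license(folder, archive_path):
    """Write LICENSE.md into the data folder and zip the folder's files for sharing."""
    folder = Path(folder)
    (folder / "LICENSE.md").write_text(LICENSE_TEXT)
    with zipfile.ZipFile(archive_path, "w") as archive:
        for path in sorted(folder.iterdir()):
            if path.is_file():
                archive.write(path, arcname=path.name)
    return archive_path


# Usage with a throwaway folder and made-up data.
with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp) / "project"
    project.mkdir()
    (project / "data.csv").write_text("id,glucose_0min\n1,5.4\n")

    archive = bundle_with_license(project, Path(tmp) / "project.zip")
    with zipfile.ZipFile(archive) as bundle:
        bundled_files = sorted(bundle.namelist())

print(bundled_files)
```

For larger datasets a zip file stops being practical, which is the "depends on size" caveat above.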
How might you be able to make your data more open and FAIR? What are some challenges you foresee?
Think on your own for ~1 min.
Discuss with your neighbour for ~4 minutes.
Share all together for ~1 min.
If you're struggling with data... that is totally normal!!
Open = license used (CC-BY). Easier collaboration, re-use, and management.
Open science has many components, like open data and open source.
FAIR: Findable, Accessible, Interoperable, Re-usable. Includes file format, software used, metadata. Makes your own work easier.
Creating open, FAIR data needs skill and knowledge. Saves time during management and analysis.
Advocacy is the first step to making data FAIR and open. The second step is to use CSV or other open formats.