Making data freely available is a common practice within some scientific disciplines. This practice reinforces transparent reporting, supports open scientific inquiry, and encourages re-evaluation, which can lead to novel insights not described by the original researchers.
Increasingly, journals and funders are developing policies that require or promote open data availability. You can read more about existing funder policies on open data here. Within the Canadian context, the Tri-Agency has adopted a Statement of Principles on Digital Data Management, which touches on “the value of digital research data, the importance of fostering reuse of digital research data, and the need for policies to facilitate excellence in data stewardship”.
It appears likely that open data policies will become commonplace.
At the beginning of the COVID-19 pandemic, many organizations came together to commit to open data sharing, among other open science practices. Research funded by CIHR and other stakeholders that pertains to COVID-19 must make data openly available where possible. Read more here.
In 2016, the ‘FAIR Guiding Principles for scientific data management and stewardship’ were published in Scientific Data. The authors intended to provide guidelines to improve the findability, accessibility, interoperability, and reuse of digital assets. The principles emphasise machine-actionability (i.e., the capacity of computational systems to find, access, interoperate, and reuse data with none or minimal human intervention) because humans increasingly rely on computational support to deal with data as a result of the increase in volume, complexity, and creation speed of data.
Findable
The first step in (re)using data is to find them. Metadata and data should be easy to find for both humans and computers. Machine-readable metadata are essential for automatic discovery of datasets and services, so this is an essential component of the FAIRification process.
F1. (Meta)data are assigned a globally unique and persistent identifier
F2. Data are described with rich metadata (defined by R1 below)
F3. Metadata clearly and explicitly include the identifier of the data they describe
F4. (Meta)data are registered or indexed in a searchable resource
Accessible
Once users find the required data, they need to know how the data can be accessed, possibly including authentication and authorisation.
A1. (Meta)data are retrievable by their identifier using a standardised communications protocol
A1.1 The protocol is open, free, and universally implementable
A1.2 The protocol allows for an authentication and authorisation procedure, where necessary
A2. Metadata are accessible, even when the data are no longer available
Interoperable
The data usually need to be integrated with other data. In addition, the data need to interoperate with applications or workflows for analysis, storage, and processing.
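I1. (Meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation
I2. (Meta)data use vocabularies that follow FAIR principles
I3. (Meta)data include qualified references to other (meta)data
Reusable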
The ultimate goal of FAIR is to optimise the reuse of data. To achieve this, metadata and data should be well-described so that they can be replicated and/or combined in different settings.
R1. (Meta)data are richly described with a plurality of accurate and relevant attributes
R1.1. (Meta)data are released with a clear and accessible data usage license
R1.2. (Meta)data are associated with detailed provenance
R1.3. (Meta)data meet domain-relevant community standards
The principles refer to three types of entities: data (or any digital object), metadata (information about that digital object), and infrastructure. For instance, principle F4 defines that both metadata and data are registered or indexed in a searchable resource (the infrastructure component).
*Section content obtained from the GO FAIR website
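As a rough illustration of how a team might check its own metadata against a few of the principles above, the sketch below flags missing fields in a metadata record. The field names, the placeholder DOI, and the example URL are assumptions made for illustration; they are not a formal metadata schema and are not taken from the GO FAIR material.

```python
# Hypothetical field names mapped to the FAIR principle each one supports.
# This is an illustrative checklist, not a formal metadata standard.
REQUIRED_FIELDS = {
    "identifier": "F1: globally unique and persistent identifier",
    "description": "F2: rich descriptive metadata",
    "access_url": "A1: retrievable via a standardised protocol",
    "license": "R1.1: clear and accessible data usage license",
    "provenance": "R1.2: detailed provenance",
}


def missing_fair_fields(record: dict) -> list[str]:
    """Return the principle notes for fields that are absent or empty."""
    return [note for field, note in REQUIRED_FIELDS.items() if not record.get(field)]


# Example record with placeholder values (not a real data set).
record = {
    "identifier": "doi:10.0000/placeholder",
    "description": "De-identified outcome data from a hypothetical trial.",
    "access_url": "https://repository.example.org/datasets/placeholder",
    "license": "CC-BY-4.0",
    "provenance": "",  # left empty on purpose to show a flagged field
}

for note in missing_fair_fields(record):
    print("Missing:", note)
```

A check like this only confirms that fields are present; whether the metadata are actually rich or accurate still needs human review.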
Sharing data and materials is only valuable if the sharing occurs in a transparent way. Outputs shared need to be clearly labeled and organized. Research teams need to consider how they will share research data at the start of a research project.
Some important considerations include:
1. File names
File names should be easy to understand, allow for version control, and be consistently formatted between documents.
Research teams should agree upon standard file naming conventions. QUT has a great resource for suggested naming conventions.
2. Version control
Research is rarely done in isolation. Data and materials from a study are typically shared and revised by several team members. Version control is essential to track changes in documents over time and to keep track of the most current version.
Research teams should use a numbered version control system. It may also be valuable to incorporate information on the researcher making the change.
For example:
Study name_version number_date_team member editing
TAP trial_V3_2019_11_22_SJ
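The naming pattern above can also be generated and checked programmatically so that every team member produces identical names. The following is a minimal sketch, assuming the version is written as "V" plus a number, the date as year_month_day, and the editor as initials; the function names are hypothetical.

```python
import re
from datetime import date

# Pattern for names shaped like "TAP trial_V3_2019_11_22_SJ".
FILE_NAME_PATTERN = re.compile(
    r"^(?P<study>.+)_V(?P<version>\d+)"
    r"_(?P<year>\d{4})_(?P<month>\d{2})_(?P<day>\d{2})"
    r"_(?P<editor>[A-Za-z]+)$"
)


def build_file_name(study: str, version: int, edited_on: date, editor: str) -> str:
    """Compose study name, version number, date, and editor initials."""
    return f"{study.strip()}_V{version}_{edited_on:%Y_%m_%d}_{editor.upper()}"


def parse_file_name(name: str) -> dict:
    """Split a conforming file name into its parts, or raise ValueError."""
    match = FILE_NAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"File name does not follow the convention: {name!r}")
    return {
        "study": match["study"],
        "version": int(match["version"]),
        "edited_on": date(int(match["year"]), int(match["month"]), int(match["day"])),
        "editor": match["editor"],
    }


print(build_file_name("TAP trial", 3, date(2019, 11, 22), "SJ"))
# prints: TAP trial_V3_2019_11_22_SJ
```

Parsing names back into their parts makes it easy to spot files that drifted from the agreed convention, or to sort a folder by version number rather than alphabetically.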
3. Organizing data
Research data should be organized into folders that are clearly labelled by topic area. We suggest structuring folders hierarchically, moving from broad to specific topics. Naming and organization practices should be agreed upon by the team so that practices are consistent.
Some researchers find it easiest to have a folder of “current documents” which contains the files they are presently working on. Then, every few weeks, they take some time to archive these files appropriately. Calendar reminders are a helpful way to ensure archiving takes place regularly.
4. Backing up data and materials
Files should be regularly backed up (e.g. to your institution's network or a local drive) to ensure their security.
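One way to make a routine backup more trustworthy is to verify each copy against the original. The sketch below copies a file into a backup folder and compares SHA-256 checksums; the function names are hypothetical, and the backup location is whatever network or local path your institution provides.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def back_up(source: Path, backup_dir: Path) -> Path:
    """Copy source into backup_dir and confirm the copy matches the original."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / source.name
    shutil.copy2(source, target)  # copy2 also preserves file timestamps
    if sha256_of(source) != sha256_of(target):
        raise OSError(f"Backup of {source} does not match the original")
    return target
```

A call such as `back_up(Path("data.csv"), Path("backup"))` (both paths are placeholders) returns the location of the verified copy. Note that this overwrites an earlier backup with the same file name, so it complements version-numbered file names rather than replacing them.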
There are many options for data and materials sharing. Some common options include:
1. Share your data/materials via a journal/publisher at the time of publication of an article (check whether the journal you are submitting to has this option; data and materials are then linked directly to the published paper)
2. Share your data or materials via an institutional repository (e.g. at OHRI/uOttawa you can use RUOR)
3. Share your data/materials via externally available tools for general research or via tools for specific research areas/topics (for example see box below)
OpenDOAR is an online searchable directory of open access repositories. Use OpenDOAR to discover suitable repositories for your work.
The Open Science Framework (OSF) can be used to store any type of research data and related research documentation. Researchers can make all digital aspects of their research project (e.g., protocol, images, raw data, manuscript drafts) publicly accessible through this tool. OSF has a version control feature.
Dataverse at Scholars Portal is a general data repository that hosts a wide range of data types.
DRYAD is a general data repository that hosts a wide range of data types.
GenBank is a genetic sequence database. Options for submitting data can be viewed here.
Gene Expression Omnibus can be used to share gene expression data.
Research Collaboratory for Structural Bioinformatics (RCSB) Protein Data Bank should be used to share structure data.
The Single Nucleotide Polymorphism Database can be used to share SNP data.
International Molecular Exchange Consortium (IMEx) partners can be used to share molecular interaction data.
PeptideAtlas can be used to share proteomics data.
Global Proteome Machine can be used to share proteomics data.
The Center for Open Science maintains the Open Science Framework, which can be used to make one’s data openly available. Specifically, the Open Science Framework offers cloud-based management of your research projects. After creating a research project folder, you can control which documents within your project are publicly accessible.
It is recommended that research data collected be anonymous or anonymized. In some clinical research this is not feasible; as a consequence, data that are confidential, sensitive, or contain potentially identifying information should be protected. Sharing of patient data, even if de-identified, requires patient consent. In many instances a de-identified data set may be appropriately shared. However, there may be very specific instances where data from a project that is submitted, published, or externally funded cannot be shared and should be exempted from this guideline. An example of a research project to which this may apply is the study of a rare disease, wherein published individual patient data, even in de-identified form, may nonetheless be easily related to particular individuals. Even in these instances, publishing de-identified data may still be possible if patient consent to do so is given. Ultimately, the risks and benefits of data sharing should be carefully weighed. Data sharing must be in line with the data management plans described in the research ethics application and subsequently approved by the research ethics board.
Referencing data records in repositories
When referencing data records in repositories, it is important to give the data set's digital object identifier (DOI), the full web link to where the data are located, the repository name, the author(s) names, and the date of repository access.
Referencing your own data records in the text of a manuscript
Standard wording for you to use when referring to your data in your manuscript is found below. Some journals have dedicated spaces where this information should be reported (e.g., supplementary materials sections); be sure to carefully read the instructions for authors of the journal where you are submitting your manuscript.
“The research data relevant to this work is available at [insert repository name] and can be accessed at [insert direct link to research data].”
Referencing other researchers’ data records
We recommend that you cite both the original publication where the data are reported and the repository you accessed the data records from when using other researchers’ data. This information should be reported in both the text of your manuscript and the reference section.
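Teams that reference many data sets may find it useful to keep the required elements (DOI, web link, repository name, authors, access date) in one structured record and generate the wording from it. The sketch below is one possible approach; the class and function names are hypothetical, and the reference layout is illustrative only and does not follow any particular journal's citation style.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class DataRecord:
    """Elements needed to reference a data record held in a repository."""
    authors: list[str]
    repository: str
    doi: str
    url: str
    accessed_on: date


def availability_statement(record: DataRecord) -> str:
    """Fill in the standard manuscript wording for your own data."""
    return (
        f"The research data relevant to this work is available at "
        f"{record.repository} and can be accessed at {record.url}."
    )


def reference_entry(record: DataRecord) -> str:
    """List every required element; adapt the layout to the journal's style."""
    names = ", ".join(record.authors)
    return (
        f"{names}. {record.repository}. DOI: {record.doi}. "
        f"{record.url}. Accessed {record.accessed_on.isoformat()}."
    )


# Placeholder values for illustration only (not a real data set).
example = DataRecord(
    authors=["A. Researcher", "B. Colleague"],
    repository="Example Repository",
    doi="10.0000/placeholder",
    url="https://repository.example.org/datasets/placeholder",
    accessed_on=date(2020, 1, 15),
)
print(availability_statement(example))
print(reference_entry(example))
```

Always check the target journal's instructions for authors before reusing generated text, since required wording and placement differ between journals.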