*Note: This page is under development

**Note: All links embedded in the text are also available with full URLs at the end of this document.

What is Data Citation and Why Is It Important?

The current trend towards improved transparency and reproducibility in science is pushing researchers and institutions to develop new strategies for managing the data they produce. Publishers increasingly request access to the datasets underpinning submissions, and national funding agencies are establishing policies requiring the open sharing of data as a condition of receiving grants. These changes are driving the creation of new tools to ensure data is findable, accessible, and reusable and remains so into the future. 

Much like citations for published works, data citations based on persistent identifiers (PIDs) provide a way to find the exact source material used in a given piece of research. For digital objects, PIDs are needed to provide a single, consistent “location” where that object can be found on the web. One of the most commonly used PIDs is the Digital Object Identifier, or DOI, a permanent link which will always resolve to a particular resource (or a comprehensive record for that resource).

Dynamic Data Citation

Persistent identifiers are straightforward to create for finished objects such as a published paper or completed dataset. Dynamic data that changes over time, like ONC’s continuous data streams, is more difficult to assign identifiers to, as the content is constantly evolving. Citing dynamic data reliably and reproducibly requires more detailed information, such as the exact date and time the data was retrieved and any search parameters used to select a particular subset. However, it’s not feasible to mint a new DOI for every single change to an evolving dataset, and preserving complete previous versions of all data is beyond most institutions’ storage capacity.

In February 2015, the Research Data Alliance Working Group on Dynamic Data Citation released a set of 14 recommendations to guide best practices for persistently and reproducibly identifying these kinds of datasets. The recommendations rest on three pillars:

  • Versioning - Major changes to a dataset are marked with a new version number.
  • Timestamping - Queries made to the database are saved along with metadata about exactly when they were made.
  • Query Preservation - PIDs are assigned not only to the whole dataset, but also to each time-stamped query used to extract a particular data subset from the repository’s database.

Combining these strategies, we can narrow down the parameters of a dataset until it exactly matches the state it was in when it was previously retrieved. New version releases mark significant changes to the dataset, whether to the data itself or the ways in which it was processed. Timestamps for the date and time the dataset was accessed further refine recall within the context of the frequent, smaller changes that don’t necessitate a new version. Finally, assigning a persistent identifier to each individual query - an actual data request sent to the database - allows previously accessed subsets to be recreated with ease, eliminating the need to painstakingly replicate complex search parameters by hand.
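The way the three pillars combine can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `QueryRecord` class, its field names, the placeholder DOI, and the hash-based `query_fingerprint` function are all hypothetical and do not describe how ONC actually mints Query PIDs. The sketch stores the dataset version (versioning), the execution time (timestamping), and the search parameters (query preservation) together, then derives a stable fingerprint from them.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class QueryRecord:
    """Hypothetical record of one data request; not ONC's actual schema."""
    dataset_doi: str      # PID of the whole dataset
    dataset_version: int  # versioning: bumped on major changes
    executed_at: str      # timestamping: ISO 8601 time the query was run
    parameters: dict      # query preservation: the exact filters used


def query_fingerprint(record: QueryRecord) -> str:
    """Derive a stable identifier from a canonical form of the record.

    Sorting the keys means that two requests with the same content but a
    different parameter order produce the same fingerprint.
    """
    canonical = json.dumps(asdict(record), sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


# Placeholder values, for illustration only.
first = QueryRecord(
    dataset_doi="10.0000/hypothetical-dataset",
    dataset_version=1,
    executed_at="2020-02-13T12:00:00+00:00",
    parameters={"dateFrom": "2015-09-11", "dateTo": "2015-09-12", "format": "csv"},
)
reordered = QueryRecord(
    dataset_doi="10.0000/hypothetical-dataset",
    dataset_version=1,
    executed_at="2020-02-13T12:00:00+00:00",
    parameters={"format": "csv", "dateTo": "2015-09-12", "dateFrom": "2015-09-11"},
)
later = QueryRecord(
    dataset_doi="10.0000/hypothetical-dataset",
    dataset_version=1,
    executed_at="2020-03-01T08:30:00+00:00",
    parameters={"dateFrom": "2015-09-11", "dateTo": "2015-09-12", "format": "csv"},
)
```

Here `first` and `reordered` share a fingerprint because they describe the same request, while `later` receives a different one because it was executed at a different time. A production system would register a resolvable PID rather than use a truncated hash.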

With the MINTED Project, Ocean Networks Canada is proud to be leading the way in implementing the RDA Recommendations. In partnership with CANARIE, the DataCite Canada Consortium, and our own crack team of software engineers, ONC’s work will help establish best practices for good data stewardship now and into the future.

MINTED Project Overview

The MINTED project (Making Identifiers Necessary to Track Evolving Data) is funded by the CANARIE Network. It supports Ocean Networks Canada’s need to assign PIDs to datasets in order to renew its CoreTrustSeal, a certification for repositories that follow best practices.

MINTED aims to integrate dataset citation DOIs and RORs (Research Organization Registry identifiers) into ONC’s Oceans 2.0 digital infrastructure. ONC’s data are highly dynamic because of continually accumulating data streams, data reprocessing, and data product code versioning. While there has been growing recognition of the benefits of and need for data citations, made evident by the reception of the FAIR Principles, existing platforms and tools such as Dataverse and the Federated Research Data Repository (FRDR) are currently only able to serve the needs of static or infrequently updated datasets. There is now an opportunity to apply recommendations and lessons learned from the Research Data Alliance (RDA) Working Group (WG) on Dynamic Data Citation. Their 14 recommendations were detailed in a 2015 publication.

Benefits of MINTED

MINTED allows users to use and cite data in the repository from the beginning of a deployment. It provides traceability across the dataset life cycle, so that users can better assess data integrity, respect the terms and conditions under which the data were accessed, and strengthen the credibility of their work. Data citation also allows users to link datasets to publication DOIs, contributor ORCIDs, funder reference IDs, ARK IDs, and more.

Data citation also advances reproducibility, provides curated datasets to end users, gives credit for publishing and providing data, helps repositories and users track metrics of data usage and impact, and enhances metadata catalogues and products with citation content. Additionally, data citation stimulates related research, as published datasets that have been cited lend themselves more readily to further analysis.

Lastly, the MINTED project allows ONC to renew our certification with the World Data System (WDS) CoreTrustSeal (CTS), identifying ONC as a repository that has implemented and supports FAIR (Findable, Accessible, Interoperable, and Reusable) data principles and best practices, and encouraging confidence in the content it holds.

Landing Page

The landing page describes the high-level information (‘metadata’) associated with a dataset. ONC defines a dataset as one device from one deployment, e.g. Aanderaa Optode 3830 (S/N 911), device ID 20116, deployed at Folger Passage on 11-Sep-2015, recovered 02-May-2017. To see this page, users can look up any dataset by its DOI.

Example landing page for a DOI:

Landing Page Metadata Sections:

  • The Title is composed of several pieces of information describing the dataset: Deployment Area, Deployment Location, Device Category, Date of Deployment. 
  • The DOI pertains to the entire dataset. It consists of the prefix 10.34943/ from DataCite, followed by a specific identifier for the dataset. The DOI is included in the provided data citation.
  • The Abstract provides the user with a high-level description of the device deployment from which the data was collected, to help them determine whether this dataset is appropriate for their purposes. The abstract consists of the device name, the date and site of deployment, a description of the site, the device category, a description of the device category, the type of platform it was deployed on (fixed, mobile, profiler), and a note about where the data was archived. A link is also provided that takes the user directly to Data Search and the instrument data they have identified.
  • The Creators are listed in this section. ONC has data partners who should be attributed appropriately within the landing page and citation. Each Creator is linked to its ROR, which resolves to the corresponding entry on the ROR portal.
  • The Date Created section gives the date the data became publicly available.
  • The Funding References section includes any/all funding organizations that contributed to the data collection and archival.
  • The Publisher section specifies the organization that made the data available to the public. Each Publisher is linked to its ROR, which resolves to the corresponding entry on the ROR portal.
  • The Publication Year is the year that the dataset became publicly available. 
  • The Resource Type will always be ‘One Deployment’, specifying that the dataset only covers the length of a single deployment of the instrument.
  • The Rights section points towards ONC’s Data Usage Policy, helping users understand the constraints and licensing of the associated dataset. 
  • The Formats section provides a list of the types of formats the data is available in. 
  • The Geolocations section provides the user with the longitude and latitude of the area where the data was collected. Note: for mobile platforms, multiple points are provided to create a polygon or a bounding box.
  • The Contributors section lists the organization(s) involved with data collection and archival, along with their role(s). Each Contributor is linked to its ROR, which resolves to the corresponding entry on the ROR portal.
  • The Citation section provides the actual citation that ONC and our data partners would like users to use in outputs utilizing the data associated with the particular DOI. More and more publishers are requiring datasets to be referenced with DOIs as support for the COPDESS statement grows. Users can simply copy and paste the citation directly into their resources, bibliography, citation list, etc. The citation text was created using the ESIP Data Citation Guidelines for Earth Sciences following their best practices. 
  • The Data Links section provides the user with the following direct links:
    • Download data using Data Search: opens the instrument and location in the Data Search page on Oceans 2.0.
    • View device details for ____: shows additional metadata and details about the instrument associated with the DOI, in the Device Listing page on Oceans 2.0.
    • Download latest ISO 19115 XML metadata: provides the machine-readable metadata underlying the information presented on the Landing Page.
  • The Version History section provides details about the general provenance of the dataset after it was initially archived with a DOI. This can include reprocessing, DOI minting, and more, along with the date of each activity.

Query PID

The landing page for a Query PID has all the same dataset metadata as a DOI landing page, plus some additional information. Query Details describe the specific filters and parameters used in a particular query/search. To see this page, users can enter any Query PID into the resolver search field. 

Example landing page for a Query:

→ Query-PID Specific Metadata Sections (yellow box):

  • The Data Product section lists the specific type of output(s)/product(s) selected for download in the query parameters. 
  • The Query Date Created section gives the date the query was first run.
  • The Query Date From/To section defines the time period within which data was selected from the complete dataset.
  • The Properties section includes the variables the user selected from the list of available recorded measurements.
  • The Format section lists the file format(s) the user selected to download.
  • The Citation section provides the user with the preferred Query Citation to use, including the Query ID and the Access date.

Web Services

In addition to the DOI/Query PID Search tool, Ocean Networks Canada offers several web services as an alternative way to interact with data and metadata through our Oceans 2.0 API.

If you are unfamiliar with APIs, or Application Programming Interfaces, there are many easy-to-understand introductions available on the web. Try Codenewbie, What Is an API? or APIs 101. APIs are especially useful for automating computer-to-computer processes, such as batch citation downloads or scheduled data retrievals.

The MINTED project has added two new Ocean Networks Canada web service offerings: a citation text service, and a metadata retrieval service.

Data Citation Text Service

The API citation service returns a citation for a given DOI or Query PID, formatted according to the ESIP guidelines established by the earth science community. When you call our API service using a DOI or Query PID, it automatically generates a ready-to-use citation for that dataset in JSON format, like this:

{"citationText":["Ocean Networks Canada Society. 2015. Central_Strait of Georgia VENUS Instrument Platform_Conductivity Temperature Depth_20-Sep-2014. Ocean Networks Canada Society. https://doi.org/10.21383/5cxhry6t2x. Accessed 2020-02-13."]}
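A response shaped like the sample above can be unpacked with Python's standard `json` module. The following is a minimal sketch that works on the sample text only: it makes no network call, and the function name `extract_citations` is an illustrative choice, not part of the ONC API. It assumes the `citationText` field holds a list of citation strings, as in the sample.

```python
import json

# The sample response shown above, stored as a string for illustration.
SAMPLE_RESPONSE = (
    '{"citationText":["Ocean Networks Canada Society. 2015. '
    'Central_Strait of Georgia VENUS Instrument Platform_'
    'Conductivity Temperature Depth_20-Sep-2014. '
    'Ocean Networks Canada Society. '
    'https://doi.org/10.21383/5cxhry6t2x. Accessed 2020-02-13."]}'
)


def extract_citations(response_body: str) -> list:
    """Return the citation strings found in a citation-service response."""
    payload = json.loads(response_body)
    citations = payload.get("citationText", [])
    if not isinstance(citations, list):
        raise ValueError("expected 'citationText' to hold a list of strings")
    return [str(citation) for citation in citations]


citations = extract_citations(SAMPLE_RESPONSE)
for citation in citations:
    print(citation)
```

Because the field is a list, the loop also handles a response that carries more than one citation.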

To learn more about using this service, go here.

Dataset Metadata Service

Given a DOI or a Query PID, you can use the API to retrieve detailed metadata about the associated dataset or subset. This isn’t a new concept for data citation in general, but the ability to specify a precise subset of data via the Query PID and retrieve exact details about the timestamp of the query, the time series of the data, and other subset parameters is a feature unique to ONC.

To learn more about using this service, go here.

Future Plans

This CANARIE-funded project ends March 31, 2020, but ONC intends to build on the work done under the initial project proposal to enhance its value for our community.

ONC has done extensive research on schema.org and is working to implement its conventions to extend our reach to users outside of the Oceans community, allowing data to be findable through Google Dataset Search. We have a representative in both the ESIP and RDA schema.org Working Groups, who is immersed in the community conversation on how best to implement this in the geosciences.

ONC maintains partnerships with many other research organizations and has therefore already implemented RORs in MINTED outputs. We also aim to include ORCIDs for individual researchers who are interested in partnering with ONC or preserving their data within our Oceans 2.0 database. ORCIDs will allow contributors to be credited appropriately and universally for their intellectual work in the production of data in our system. ORCIDs streamline the process of accurately identifying a specific researcher, since a personal name can be represented in several ways and may belong to more than one individual, e.g., it is not always clear whether Jane Smith is J. Smith or J. M. Smith.

Another important feature planned for development is the implementation of W3C’s PROV model within our system. Our existing provenance model covers only activities undertaken after data is ingested into Oceans 2.0, such as reprocessing data and assigning a new DOI for reprocessed data. However, it is also important for users to understand how data is received from the instrument and parsed in the first place, as well as how derived data is produced. This was outside the scope of the project within the CANARIE funding period, but ONC intends to develop this feature in the future.

ONC hosts data collected by some of our data partners, while for others, such as IRIS, we collect the data and they host it in their repositories. In instances like this, we host ancillary data associated with the data we sent to IRIS, and we are planning to tag the XML document with a RelatedIdentifier and its associated RelationType. This will allow users to better identify and access related data held outside of Oceans 2.0. Linked Data is a best practice for publishing structured data on the web, and allows organizations to leverage their relationships in structured ways.
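As an illustration of what such tagging could look like, the sketch below builds a related-identifier fragment using the element and attribute names from the DataCite Metadata Schema (`relatedIdentifiers`, `relatedIdentifier`, `relatedIdentifierType`, `relationType`). The source does not say which XML schema ONC will use, so this choice is an assumption. The identifier value and the `IsSupplementTo` relation are placeholders, not real ONC or IRIS records.

```python
import xml.etree.ElementTree as ET


def related_identifier_element(identifier, identifier_type, relation_type):
    """Build a <relatedIdentifiers> fragment holding one related identifier.

    Element and attribute names follow the DataCite Metadata Schema; the
    values passed in are supplied by the caller.
    """
    container = ET.Element("relatedIdentifiers")
    item = ET.SubElement(
        container,
        "relatedIdentifier",
        {
            "relatedIdentifierType": identifier_type,
            "relationType": relation_type,
        },
    )
    item.text = identifier
    return container


# Hypothetical identifier for a dataset hosted in a partner repository.
fragment = related_identifier_element(
    identifier="10.0000/hypothetical-partner-dataset",
    identifier_type="DOI",
    relation_type="IsSupplementTo",
)
xml_text = ET.tostring(fragment, encoding="unicode")
print(xml_text)
```

The relation type should be chosen to match the real relationship between the two holdings. `IsSupplementTo` is used here only because it is one of the controlled values DataCite defines.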

Additionally, ONC is engaged with the Make Data Count Project, which focuses on identifying how best to track metrics related to data. The project is examining whether metrics currently applied to other research outputs, such as journal articles, are also appropriate for datasets. ONC is interested in tracking the use of our data in ways that are meaningful for our data partners and funders, and in improving our supports and services.

Finally, ONC is considering the feasibility of adding an aggregation option for datasets. This functionality would best serve users who download several datasets, across deployments, manually or using ONC’s API. It would allow users to cite the aggregated dataset, instead of several singular datasets. This feature is still in preliminary discussion. 

Recommended Resources

Below is a list of resources that users can investigate at their own pace: