MINTED User Documentation

*Note: This page is under development

**Note: All links embedded in the text are also available with full URLs at the end of this document.

What is Data Citation and Why Is It Important?

The current trend towards improved transparency and reproducibility in science is pushing researchers and institutions to develop new strategies for managing the data they produce. Publishers increasingly request access to the datasets underpinning submissions, and national funding agencies are establishing policies requiring the open sharing of data as a condition of receiving grants. These changes are driving the creation of new tools to ensure data is findable, accessible, and reusable and remains so into the future. 

Much like citations for published works, data citations based on persistent identifiers (PIDs) provide a way to find the exact source material used in a given piece of research. For digital objects, PIDs are needed to provide a single, consistent “location” where that object can be found on the web. One of the most commonly used PIDs is the Digital Object Identifier, or DOI, a permanent link which will always resolve to a particular resource (or a comprehensive record for that resource).

Dynamic Data Citation

Persistent identifiers are straightforward to create for finished objects such as a published paper or completed dataset. Dynamic data that changes over time, like ONC’s continuous data streams, is more difficult to assign identifiers to, because the content is constantly evolving. Reliably and reproducibly citing dynamic data requires more detailed information, such as the exact date and time the data was retrieved and any search parameters used to select a particular subset. However, it’s not feasible to mint a new DOI for every single change to an evolving dataset, and preserving complete previous versions of all data is beyond any institution’s storage capacity.

In February 2015, the Research Data Alliance Working Group on Dynamic Data Citation released a set of 14 recommendations to guide best practices for persistently and reproducibly identifying these kinds of datasets. The recommendations rest on three pillars:

  • Versioning - Major changes to a dataset are marked with a new version number.
  • Timestamping - Queries made to the database are saved along with metadata about exactly when they were made.
  • Query Preservation - PIDs are assigned not only to the whole dataset, but also to each time-stamped query used to extract a particular data subset from the repository’s database.

Combining these strategies, we can narrow down the parameters of a dataset until it exactly matches the state it was in when it was previously retrieved. New version releases mark significant changes to the dataset, whether to the data itself or the ways in which it was processed. Timestamps for the date and time the dataset was accessed further refine recall within the context of the frequent, smaller changes that don’t necessitate a new version. Finally, assigning a persistent identifier to each individual query - an actual data request sent to the database - allows previously accessed subsets to be recreated with ease, eliminating the need to painstakingly replicate complex search parameters by hand.
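The three pillars can be pictured as fields on a stored query record. The following is a minimal sketch, not ONC's implementation: the `StoredQuery` class, its field names, the example parameters, and the hash-based identifier are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
import hashlib
import json


@dataclass
class StoredQuery:
    """One preserved query: what was asked, of which version, and when."""

    dataset_doi: str       # PID of the whole dataset
    dataset_version: int   # versioning pillar
    parameters: dict       # filters and options the user selected
    executed_at: datetime  # timestamping pillar

    def query_pid(self) -> str:
        """Derive a hypothetical identifier from the normalized query."""
        normalized = json.dumps(
            {
                "doi": self.dataset_doi,
                "version": self.dataset_version,
                "parameters": self.parameters,
                "executed_at": self.executed_at.isoformat(),
            },
            sort_keys=True,  # key order must not change the identifier
        )
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:16]


when = datetime(2020, 2, 13, 18, 30, tzinfo=timezone.utc)
first = StoredQuery("10.21383/5cxhry6t2x", 1,
                    {"property": "temperature", "format": "csv"}, when)
# Same request, with the parameters written in a different order.
again = StoredQuery("10.21383/5cxhry6t2x", 1,
                    {"format": "csv", "property": "temperature"}, when)
# Same request against a later version of the dataset.
later = StoredQuery("10.21383/5cxhry6t2x", 2,
                    {"property": "temperature", "format": "csv"}, when)

print(first.query_pid() == again.query_pid())  # identical queries match
print(first.query_pid() == later.query_pid())  # a new version does not
```

Storing the version and timestamp alongside the parameters is what lets a subset be recreated later without the user rebuilding the search by hand.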

With the MINTED Project, Ocean Networks Canada is proud to be leading the way in implementing the RDA Recommendations. In partnership with CANARIE, the DataCite Canada Consortium, and our own crack team of software engineers, ONC’s work will help establish best practices for good data stewardship now and into the future.

MINTED Project Overview

The MINTED project (Making Identifiers Necessary to Track Evolving Data) is funded by the CANARIE Network, and supports Ocean Networks Canada’s need to assign PIDs to datasets in order to renew its CoreTrustSeal, a certification for repositories that follow best practices.

MINTED aims to integrate Digital Object Identifiers (DOIs) for dataset citation and Research Organization Registry (ROR) identifiers for research organizations into ONC’s Oceans 2.0 digital infrastructure. At ONC, the data are very dynamic due to continually accumulating data streams, data reprocessing, and data product code versioning. While recognition of the benefits of and need for data citation has grown, as evidenced by the reception of the FAIR Principles, existing platforms and tools such as DataVerse and the Federated Research Data Repository (FRDR) currently serve only static or infrequently updated datasets. There is now an opportunity to apply recommendations and lessons learned from the Research Data Alliance (RDA) Working Group (WG) on Dynamic Data Citation. Their 14 recommendations were detailed in a 2015 publication.

Benefits of MINTED

MINTED allows data users to use and cite data in the repository from the beginning of a deployment, providing traceability of the dataset life cycle, so that users can better interpret data integrity, respect the terms and conditions under which the data were accessed, and increase the credibility of users. Data citation also allows users to link datasets to publication DOIs, contributor ORCIDs, funder reference IDs, ARK IDs, and more. 

It makes strides in enabling reproducibility, provides curated datasets to end users, gives credit for publishing and providing data, helps repositories and users track metrics of data usage and impact, and enhances metadata catalogues and products with citation content. Additionally, data citation stimulates related research, as published datasets that have been cited lend themselves more readily to further analysis.

Lastly, the MINTED project also allows ONC to renew its certification with the World Data System (WDS) CoreTrustSeal (CTS), identifying ONC as a repository that has implemented and supports FAIR (Findable, Accessible, Interoperable, and Reusable) data principles and best practices, and encouraging confidence in the content within.

Landing Page

The landing page describes the high-level information (‘metadata’) associated with a dataset. ONC has defined a dataset as one device from one deployment, e.g. Aanderaa Optode 3830 (S/N 911), device ID 20116, deployed at Folger Passage on 11-Sep-2015, recovered 02-May-2017. To see this page, users can enter the DOI of any dataset into the search field.

Example landing page for a DOI:

Landing Page Metadata Sections:

  • The Title is composed of several pieces of information describing the dataset: Deployment Location_Deployment Area_Device Category_Date of Deployment. 
  • The DOI listed by ONC pertains to the entire dataset. It begins with 10. followed by the organization ID assigned by DataCite, then a slash and a string specific to the dataset (for example, 10.21383/5cxhry6t2x). The DOI is included in data citations.
  • The Abstract provides the user with a high level description of the device deployment from which the data was collected, to help them identify if this dataset is appropriate for their purposes. The abstract consists of the device name, the date and site it was deployed at, a description of the site, the device category, a description of the device category, what type of platform it was deployed on, and a note about where the data was archived. 
  • The Creators are listed in this section. ONC has data partners who should be attributed appropriately within the landing page and citation. 
  • The Date sections include the date the record was created, the date when collection of the data began, and the publication year when the dataset was published. 
  • The Resource Type will always be ‘One Deployment’, specifying that the data only covers the length of a single deployment of the instrument.
  • The Rights section points towards ONC’s Data Usage Policy, helping users understand the constraints and licensing of the associated dataset. 
  • The Formats section provides a list of the types of formats the data is available in. 
  • The Geolocations provide the user with the longitude and latitude of the area where the data was collected. Note: for mobile platforms, multiple points are provided to create a polygon or a bounding box. 
  • The Contributors section lists the organization(s) involved with the data and their roles. The ROR is also included in this section. A ROR is an identifier for research organizations, much like an ORCID is for individual researchers, which makes it easier to attribute research outputs to the correct organization. The Research Organization Registry provides this service. 
  • The Citation section provides the actual citation that ONC and our data partners would like users to use in outputs utilizing the data associated with the particular DOI. More and more publishers are requiring datasets to be referenced with DOIs as support for the COPDESS statement grows. Users can simply copy and paste the citation directly into their resources, bibliography, citation list, etc. The citation text was created using the ESIP Data Citation Guidelines for Earth Sciences following their best practices. 
  • The Version History section provides details about the general provenance of the dataset after it was initially archived with a DOI. This can include reprocessing, DOI minting, and more, along with the date of each activity.
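The Title and DOI conventions described above are regular enough to take apart programmatically. The sketch below is illustrative only: `split_doi`, `parse_title`, and `DatasetTitle` are hypothetical names, the mapping of title parts follows the order stated in the Title entry, and it assumes no part of a title contains an underscore. The sample values come from the citation example shown later on this page.

```python
from datetime import date, datetime
from typing import NamedTuple, Tuple


class DatasetTitle(NamedTuple):
    location: str
    area: str
    device_category: str
    deployed_on: date


def split_doi(doi: str) -> Tuple[str, str]:
    """Split a DOI (bare or as a doi.org URL) into (prefix, suffix)."""
    resolver = "https://doi.org/"
    if doi.startswith(resolver):
        doi = doi[len(resolver):]
    # The prefix identifies the registrant; the suffix identifies the dataset.
    prefix, _, suffix = doi.partition("/")
    if not prefix.startswith("10.") or not suffix:
        raise ValueError(f"not a DOI: {doi!r}")
    return prefix, suffix


def parse_title(title: str) -> DatasetTitle:
    """Split a title into its four underscore-separated parts."""
    parts = title.split("_")
    if len(parts) != 4:
        raise ValueError(f"expected 4 parts, got {len(parts)}: {title!r}")
    location, area, category, deployed = parts
    # Deployment dates appear as day-month-year, e.g. 20-Sep-2014.
    deployed_on = datetime.strptime(deployed, "%d-%b-%Y").date()
    return DatasetTitle(location, area, category, deployed_on)


prefix, suffix = split_doi("https://doi.org/10.21383/5cxhry6t2x")
title = parse_title(
    "Central_Strait of Georgia VENUS Instrument Platform_"
    "Conductivity Temperature Depth_20-Sep-2014"
)
print(prefix, suffix)
print(title.device_category, title.deployed_on.isoformat())
```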

Query PID

The landing page for a Query PID has the same dataset metadata as a DOI landing page, plus some additional information. Query Details describe the specific filters and parameters used in a particular query/search. To see this page, users can enter any Query ID into the search field. 

Example landing page for a Query:

→ Query-PID Specific Metadata Sections (orange box):

  • The Data Product section lists the specific type of output(s)/product(s) selected for download in the query parameters. 
  • The Dates section includes the date the query was run, and the date/time range of data the user selected within the dataset itself.
  • The Properties section includes the variables the user selected from the list of available recorded measurements. 
  • The Format section lists the format(s) the user selected to download.
  • The Citation section provides the user with the preferred citation to use, including the Query ID and the Access date.

Web Services

In addition to the DOI/Query PID Search tool, Ocean Networks Canada offers several web services as an alternate method of interacting with data and metadata through our Oceans 2.0 API.

If you are unfamiliar with APIs, or Application Programming Interfaces, there are many easy-to-understand introductions available on the web. Try Codenewbie, What Is an API? or APIs 101. APIs are especially useful for automating computer-to-computer processes, such as batch citation downloads or scheduled data retrievals.

The MINTED project has added two new Ocean Networks Canada web service offerings: a citation text service and a dataset metadata retrieval service.

Data Citation Text Service

The API citation service returns a citation for a given DOI or Query PID, formatted to ESIP guidelines established by the earth science community. When you call our API service using a DOI, it automatically generates a ready citation for that dataset in JSON format, like this:

{"citationText":["Ocean Networks Canada Society. 2015. Central_Strait of Georgia VENUS Instrument Platform_Conductivity Temperature Depth_20-Sep-2014. Ocean Networks Canada Society. https://doi.org/10.21383/5cxhry6t2x. Accessed 2020-02-13."]}
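As a minimal sketch of handling a response shaped like the one above, the snippet below parses the JSON and pulls out the citation string and the DOI it contains. It works on the sample text only and makes no network call, since the service URL and parameters are not given here. The function names are hypothetical, and the only assumption about the response is the `citationText` list shown in the example.

```python
import json
import re
from typing import List, Optional

# Sample response body copied from the example above.
SAMPLE_RESPONSE = (
    '{"citationText":["Ocean Networks Canada Society. 2015. '
    "Central_Strait of Georgia VENUS Instrument Platform_"
    "Conductivity Temperature Depth_20-Sep-2014. "
    "Ocean Networks Canada Society. "
    'https://doi.org/10.21383/5cxhry6t2x. Accessed 2020-02-13."]}'
)


def extract_citations(body: str) -> List[str]:
    """Return the citation strings from a citation-service response body."""
    payload = json.loads(body)
    citations = payload.get("citationText", [])
    if not isinstance(citations, list):
        raise ValueError("expected 'citationText' to be a list")
    return [str(c) for c in citations]


def find_doi(citation: str) -> Optional[str]:
    """Find the DOI in a citation, dropping the sentence-ending period."""
    match = re.search(r"https://doi\.org/(\S+?)\.?(?:\s|$)", citation)
    return match.group(1) if match else None


citations = extract_citations(SAMPLE_RESPONSE)
print(citations[0])
print(find_doi(citations[0]))
```

Because the citation arrives as plain text inside a list, a batch job can loop over many DOIs and append each string to a bibliography file without further formatting.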

To learn more about using this service, go here.

Dataset Metadata Service

Given a DOI or a Query PID, you can use the API to retrieve detailed metadata about the associated dataset or subset. This isn’t a new concept for data citation in general, but the ability to specify a precise subset of data via the Query PID and retrieve exact details about the timestamp of the query, the time series of the data, and other subset parameters is a feature unique to ONC.

To learn more about using this service, go here.

Future Plans

This CANARIE-funded project ends March 31, 2020, but ONC intends to build on the work done under the initial project proposal to enhance its value for our community.

ONC has done extensive research on schema.org and is working to implement these conventions to extend our reach to users outside of the Oceans community, allowing data to be findable through Google Dataset Search. We have a representative in both ESIP and RDA schema.org Working Groups, who is immersed in the community conversation on how best to implement this in the geosciences. 

ONC maintains partnerships with many other research organizations and thus has already implemented RORs in MINTED outputs, but additionally we aim to include ORCIDs for individual researchers who are interested in partnering with ONC, or preserving their data within our Oceans 2.0 database. ORCIDs will allow contributors to be credited appropriately and universally for their intellectual work in the production of data in our system. ORCIDs streamline the process of accurately identifying a specific researcher, since personal names can be represented in several ways and belong to more than one individual, e.g., it is not always clear if Jane Smith is J.Smith or J.M.Smith. 

Another important feature slated for development is the implementation of W3C’s PROV model within our system. Our existing provenance model covers only activities undertaken after data is ingested into Oceans 2.0, such as reprocessing data and assigning a new DOI for reprocessed data. However, it is also important for users to understand how data is received and parsed from the instrument in the first place, as well as how derived data is produced. This was outside the scope of the project during the CANARIE funding period, but efforts to develop this feature are planned for the future.

Finally, ONC is considering the feasibility of adding an aggregation option for datasets. This functionality would best serve users who download several datasets, manually or using ONC’s API. It would allow users to cite the aggregated dataset, instead of several singular datasets. This feature is still in preliminary discussion. 

Recommended Resources

Below is a list of resources that users can investigate at their own pace: