Overview

Ocean Networks Canada (ONC) distributes data in a wide range of formats, both open (non-proprietary) and proprietary. An open, non-proprietary format is a file format that is published and free for anyone to use, i.e., widely accessible and reusable. Proprietary formats are typically controlled by a company or organization for its own benefit, and restrictions on their use by others are enforced through patents or trade secrets [1].

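For a given data source, ONC provides: the raw data format; the manufacturer’s format; at least one non-proprietary, community-standardized, accessible format, such as NetCDF; and a basic data visualization in PNG and PDF formats.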

Offering these options lets users access data in their preferred way, based on their needs and use restrictions. While non-proprietary formats are best for data re-use and long-term data preservation, proprietary formats have value for certain software used within the community and for tracing dataset provenance back to its rawest form (e.g., manufacturer formats). This approach is consistent with the W3C Best Practice recommendations 12 and 16 [1].

There are several types of stakeholders involved in the requirements, design, development, testing and maintenance of data products within Oceans 3.0. These include, but are not limited to: data users, instrument manufacturers, researchers, software developers, data specialists, data stewards, systems administrators, community partners, and funders.

This document outlines ONC best practices for the data product pipeline, preferred data product formats, data product documentation, file naming conventions, and format migration. It also describes how compliance with these best practices is ensured and how they are reviewed.

Data Product Pipeline

Data product development follows a defined and extensive process. The process includes stakeholder input and feedback, project planning and management, plus definition and adherence to software development best practices. The pipeline as a whole is documented internally (not publicly available) and is summarized here.

The following diagram is an overview of the full process: 

First, the need for a new data product is identified and defined, establishing a concept and a business case. The business case is assessed, including an estimate of the development effort, and a go/no-go decision is made at the executive level. The data product concept is then placed in a priority queue within a strategic program plan. The concept may be refined iteratively.

Once the business case is approved, a prototype data product is developed. It can be developed internally where one does not yet exist, or may be contributed partly or entirely by external stakeholders such as participating scientists. User feedback is sought at the prototype stage, with as many improvement and feedback iterations as needed. These initial steps must consider preferred formats as well as content, preferring self-describing products and established standards, which may be topic-specific. For example, NetCDF formats adhere to the CF and ACDD standards, as noted in https://wiki.oceannetworks.ca/display/DP/NetCDF.

The prototype is then developed into an operational version with a limited release (often referred to as a beta version). The Oceans 3.0 permissions system allows data products to be restricted to internal users and/or any group of users. The limited release is used for testing and to seek further feedback. Once the data product meets user expectations and internal standards for naming, consistency, reliability, etc., the restrictions are removed and the data product is released publicly.

After release, the data product is actively supported via established procedures that triage queries, from user support through to bug fixes. Oceans 3.0 includes server-side logging and alerting as well as a user contact form for reporting issues and asking for support. Such issues are recorded in issue tracking and management software (JIRA).

ONC’s software development process is mature. The software team makes use of automated testing (including performance metrics), review and version control processes, linting, and code and style standards (including automated enforcement). Requirements for documentation include both external user documentation (see below) and internal documentation for requirements, design, implementation and testing (both automated and the occasional manual regression and integration tests).

Preferred Data Product Formats

ONC prefers interoperable open file formats maintained by a standards organization [3,4] rather than proprietary [5] or product-specific formats. ONC also prefers formats that offer superior storage performance and accessibility, without loss of data or quality. Longevity of support by the relevant community is also considered.

As per ONC standard practice, raw data are archived unmodified from the state in which they were acquired. For autonomous, offline deployments, raw data may be captured in proprietary, manufacturer-specific formats, such as those from bathymetric multi-beam sonar. With file-based data acquisition, such formats may also be unavoidable; a Coastal Radar system is one example. For online data acquisition mediated by Oceans 3.0, all initial raw data are stored in ONC’s raw log .txt file format, in which all communication between the device and the Oceans 3.0 driver software is timestamped and recorded, either as ASCII or, for binary data, as hexadecimal. The raw log .txt data are then the source data of record, from which all products are derived. As noted in the overview of this document, every data source within ONC’s archive will be offered in the raw data format, as well as: the manufacturer’s format; at least one non-proprietary, community-standardized, accessible format, such as NetCDF; and a basic data visualization in PNG and PDF formats.
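
As a rough illustration of the raw log idea described above, the following Python sketch timestamps one message exchanged with a device and records it as ASCII when it is printable, or as hexadecimal otherwise. The line layout, the function name `format_raw_log_line`, and the `HEX:` marker are assumptions made for illustration only; they are not the actual Oceans 3.0 raw log specification.

```python
from datetime import datetime, timezone


def format_raw_log_line(payload: bytes, received_at: datetime) -> str:
    """Return one log line: a UTC timestamp followed by the payload.

    Hypothetical layout, not the actual Oceans 3.0 raw log format.
    """
    # ISO 8601 basic format with millisecond precision, e.g. 20230810T120000.123Z
    utc = received_at.astimezone(timezone.utc)
    stamp = utc.strftime("%Y%m%dT%H%M%S.%f")[:-3] + "Z"
    try:
        text = payload.decode("ascii")
        is_text = text.isprintable()
    except UnicodeDecodeError:
        is_text = False
    if is_text:
        return f"{stamp} {text}"
    # Any payload with non-ASCII or non-printable bytes (including line
    # terminators) is written as hexadecimal so that no bytes are lost.
    return f"{stamp} HEX:{payload.hex().upper()}"
```

Writing binary payloads as hexadecimal keeps the log a plain text file while still allowing the original bytes to be recovered exactly.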

For standard scalar data, defined as one measurement per reading with a single dimension or dependency (such as temperature data), preferred formats are readily and widely available. Scalar data formats are also standardized, applying to all devices that produce scalar data. For non-scalar data, such as video, audio and multi-dimensional data, the formats are more varied and specific to the device type. The following table lists ONC’s current preferred formats.

| Data Type | Preferred Format(s) |
| --- | --- |
| Time-series scalar data | CSV, ODV, NetCDF |
| Time-series multi-dimensional data (e.g., remote sensing acoustic or radar sources) | HDF5, NetCDF |
| Audio | FLAC |
| Images | PNG |
| Video | MP4, encoded with H.264/MPEG-4 AVC |
| Seismic data | miniSEED |
| Geospatial Metadata | ISO 19115 XML [6] |
| Biodiversity Metadata | Darwin Core [7,8], Ecological Metadata Language. Note: these formats are in development, intended for contributing records to the Ocean Biodiversity Information System (OBIS) |

The preferred approach to file compression is gzip, although other forms of compression (e.g., zip) are occasionally supported.
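
For readers who want to produce gzip-compressed files in the same way, a minimal Python sketch using only the standard library follows. The helper name `gzip_file` and the choice to append `.gz` to the existing file name are assumptions for illustration, not a description of ONC’s archiving software.

```python
import gzip
import shutil
from pathlib import Path


def gzip_file(source: Path) -> Path:
    """Compress `source` into a sibling file named `<source name>.gz`."""
    target = source.with_name(source.name + ".gz")
    # Stream the file in chunks so large data products are not loaded
    # into memory all at once.
    with source.open("rb") as src, gzip.open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return target
```

The original file is left in place; deleting it after compression is left to the caller.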

Data Product Documentation

Documentation on data products can be found under https://wiki.oceannetworks.ca/display/DP/Data+Products+Home 

This page acts as the home for details on every data product that is available on Oceans 3.0. This includes a table of every single data product that has been created along with its associated and unique identification number.

From this table there are individual documentation pages for each data product. These pages give more specific details describing the data and information such as formats, possible data product options, revision history, and example files.

There is also general high-level information on how to download data products in Oceans 3.0, as well as different options, metadata, citations, quality, availability, mobile data convention, file naming conventions, data search features, email notifications, interoperability partners, and file formats.

Additionally, ONC’s contact page is linked on this page under https://www.oceannetworks.ca/contact-us/ for any questions or requests on data products.

New data product documentation is generated each time a new product is introduced (as per the pipeline described above), and updated as data products are updated. The process here includes a dedicated reviewer for approval, and consistency is ensured by using a template. Changes are tracked by the documentation software and linked to software updates. The documentation is regularly verified as part of ONC’s instrument deployment workflows (described in more detail below).

File Naming Conventions

File names start with a unique alphanumeric device-code that is specific to the instrument that produced the data. The device-code is followed by the timestamp at which the data begin, and may include a timestamp at which the data end. A file modifier may be added to distinguish files that would otherwise have the same name, such as data product variations or data acquisition modes. The last part of the file name is the file extension. Timestamps are consistent with the ISO 8601 standard [9].

For more details, refer to the documentation about these conventions (https://wiki.oceannetworks.ca/display/DP/Data+Products+Home#DataProductsHome-Conventions). 
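
The naming convention described in this section can be sketched as a small parser. The exact layout used here (underscores between the device-code and timestamps, a hyphen before the modifier, and timestamps in ISO 8601 basic format such as `20230810T120000.000Z`) is an assumption for illustration, as are the names `FILENAME_RE`, `ParsedName`, and `parse_filename`; the linked conventions page is the authoritative description.

```python
import re
from typing import NamedTuple, Optional

# Assumed timestamp shape: ISO 8601 basic format, optional milliseconds, UTC.
TIMESTAMP = r"\d{8}T\d{6}(?:\.\d{3})?Z"

# Assumed layout: <deviceCode>_<start>[_<end>][-<modifier>].<extension>
FILENAME_RE = re.compile(
    rf"^(?P<device_code>[A-Za-z0-9]+)"
    rf"_(?P<start>{TIMESTAMP})"
    rf"(?:_(?P<end>{TIMESTAMP}))?"
    rf"(?:-(?P<modifier>[A-Za-z0-9]+))?"
    rf"\.(?P<extension>[A-Za-z0-9]+)$"
)


class ParsedName(NamedTuple):
    device_code: str
    start: str
    end: Optional[str]
    modifier: Optional[str]
    extension: str


def parse_filename(name: str) -> ParsedName:
    """Split a file name into its parts, or raise ValueError if it does not fit."""
    match = FILENAME_RE.match(name)
    if match is None:
        raise ValueError(f"filename does not follow the assumed convention: {name!r}")
    # Optional groups that did not match are returned as None.
    return ParsedName(**match.groupdict())
```

For example, under these assumptions `parse_filename("DEVICE01_20230810T120000.000Z.txt")` yields the device-code `DEVICE01`, a start timestamp, no end timestamp or modifier, and the extension `txt`.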

Format Migration

Raw data always remain unmodified from the original format and state in which they were acquired. Only derived formats may be replaced with preferred formats as technologies and standards evolve. ONC shall maintain the capability to generate the replaced format on demand, so that previous datasets and versions are always reproducible. Once the replaced format has been confirmed as fully reproducible, the archived copies may be deleted to free space and reduce cost. On-demand products such as metadata reports, which were never archived, also need to be reproducible, but do not have to be available for active download.
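
One way to check reproducibility before deleting an archived derived file is to regenerate it and compare checksums, as in the sketch below. Treating "fully reproducible" as a byte-for-byte match is an assumption made here for simplicity; formats that embed creation times or software versions would need a content-level comparison instead. The function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_reproducible(archived: Path, regenerated: Path) -> bool:
    """True when the regenerated file is byte-identical to the archived one."""
    return sha256_of(archived) == sha256_of(regenerated)
```

Only when such a check passes for a replaced format would it be safe, under this assumption, to remove the archived copy.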

Compliance

ONC ensures adherence to this strategy by undertaking regular review of the data formats we maintain and provide access to. This process includes reviewing new formats introduced to the research community, as well as guidance associated with new formats. Best practices evolve over time, and ONC is committed to evaluating new developments in the scientific community on a regular and scheduled basis. This frequency of review is defined in the next section of this document. 

For every instrument deployment, task-driven workflows [10,11] are executed for each phase of its life cycle. Numerous tasks ensure compliance with the data product best practices, including:

 

The file naming conventions are upheld by an automated archiving job that verifies filenames match the anticipated conventions and that the device-code exists. If there are any issues, an error is logged. Data Stewards have monthly checks in place to review the logs for archiving failures, and will remedy the situation as deemed appropriate. Moreover, whenever an instrument is deployed there are initial data stream checks to confirm files are archiving properly.
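
The kind of check performed by such an archiving job can be sketched as follows. This is not ONC’s implementation: the file name pattern, the logger name, and the function `find_archiving_failures` are assumptions for illustration, and the set of known device-codes would in practice come from the Oceans 3.0 database.

```python
import logging
import re

logger = logging.getLogger("archive_check")

# Assumed layout: <deviceCode>_<start timestamp>...<.extension>.
# Only the device-code, the start timestamp, and the presence of an
# extension are checked here.
NAME_RE = re.compile(
    r"^(?P<device_code>[A-Za-z0-9]+)_\d{8}T\d{6}(?:\.\d{3})?Z.*\.[A-Za-z0-9]+$"
)


def find_archiving_failures(filenames, known_device_codes):
    """Log and return (filename, reason) pairs for files that fail the checks."""
    failures = []
    for name in filenames:
        match = NAME_RE.match(name)
        if match is None:
            reason = "filename does not match the expected convention"
        elif match["device_code"] not in known_device_codes:
            reason = "unknown device code"
        else:
            continue  # file passes both checks
        logger.error("%s: %s", name, reason)
        failures.append((name, reason))
    return failures
```

Returning the failures as well as logging them makes it straightforward to build the kind of periodic summary that Data Stewards review.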

Review Cycle

The Research Data Management Team Lead convenes representatives of teams within the Digital Observatory Operations division of ONC on an annual basis to review this strategy and update it as necessary. The review process is documented and meetings are set in advance to ensure that this effort is prioritized. Major changes to this strategy are further reviewed at the executive level.

A major revision is a substantial change to the text, figures, or similar content.

A minor revision is a small correction, such as fixing spelling or broken links.

Revision History

| Date | Change | Version | Reviewers |
| --- | --- | --- | --- |
| 11-18-2023 | Create/Publish | 1.0 | DS, BB, CR, RJ |
| 12-02-2024 | Updated links | 1.1 | DS, BB, CR |
| 04-02-2026 | Included Revision History Table to capture review process and decisions made | 1.2 | DS, BB, CR, KM, PC, JC, ST |

For More Information

Contact support.

References

[1] W3C. Data on the Web Best Practices. 2017-01-31.  https://www.w3.org/TR/2017/REC-dwbp-20170131/

[2] Ocean Networks Canada Data Policy. 2021-05-03. https://cdn.onc-prod.intergalactic.space/Data_Policy_2021_05_03_1771622fe1.pdf 

[3] OPF. International Comparison of Recommended File Formats. V 1.2 (April 2022). Accessed from https://openpreservation.org/resources/member-groups/international-comparison-of-recommended-file-formats/ 

[4] OPF Working Group. International Comparison of Recommended File Formats Spreadsheet (2023). Accessed from  https://docs.google.com/spreadsheets/d/1XjEjFBCGF3N1spNZc1y0DG8_Uyw18uG2j8V2bsQdYjk/edit#gid=1719869262 

[5] Proprietary file format. Wikipedia: The Free Encyclopedia. Retrieved August 10, 2023, from https://en.wikipedia.org/wiki/Proprietary_file_format#:~:text=Proprietary%20formats%20are%20typically%20controlled,or%20future 

[6] International Organization for Standardization. (2016). Geographic information - Metadata - Part 3: XML schema implementation for fundamental concepts (ISO Standard No. 19115-3:2016). https://www.iso.org/standard/32579.html

[7] Wieczorek J, Bloom D, Guralnick R, Blum S, Döring M, et al. (2012) Darwin Core: An Evolving Community-Developed Biodiversity Data Standard. PLoS ONE 7(1): e29715. https://doi.org/10.1371/journal.pone.0029715 

[8] Darwin Core Maintenance Interest Group, Biodiversity Information Standards (TDWG) (2014). Darwin Core. Zenodo. https://doi.org/10.5281/zenodo.592792

[9] International Organization for Standardization. (2019). Date and time — Representations for information interchange — Part 1: Basic rules (ISO Standard No. 8601-1:2019). https://www.iso.org/standard/70907.html 

[10] Owens D, Abeysirigunawardena D, Biffard B, Chen Y, Conley P, Jenkyns R, Kerschtien S, Lavallee T, MacArthur M, Mousseau J, Old K, Paulson M, Pirenne B, Scherwath M and Thorne M (2022) The Oceans 2.0/3.0 Data Management and Archival System. Front. Mar. Sci. 9:806452. https://doi.org/10.3389/fmars.2022.806452

[11] Jenkyns R, Tomlin M, Pirenne B, "Instrument task-driven workflow software for cruise and maintenance operations," 2013 OCEANS - San Diego, San Diego, CA, USA, 2013, pp. 1-4, doi: 10.23919/OCEANS.2013.6741251