Note: Some of the links within this policy are only available to ONC staff at this time.
This technology migration policy is established to ensure the sustainable resolution of datasets, such that allocated persistent identifiers and their associated datasets remain resilient to changes in the Oceans 3.0 code, data model and architecture.
This is achieved through software development processes such as code reviews, unit tests and automated testing, with overall regression testing performed prior to each release. Before each software release, internal stakeholders evaluate whether any new code, database scripts or outstanding bugs raise issues that warrant postponing the release for further investigation or a fix. The Systems team manages the servers and databases that provide long-term preservation and access. Data stewards operate the jobs that mint DataCite identifiers, execute the tasks that version datasets, and perform regular checks on the results. They also maintain operational checks via data citation verifications as part of instrument workflows, which follow documented procedures.
As a member of the World Data System, ONC is committed to adhering to the CoreTrustSeal repository certification requirements. Several of these requirements reinforce this technology migration policy, including continuity of access, preservation plans, data discovery and identification, and technical infrastructure. ONC's membership in the DataCite Canada Consortium also includes a signed Participation and Financial Commitment Agreement, renewed on an annual basis. Annual internal stakeholder reviews of the documentation and procedures related to data citations and persistent identifiers will also be conducted.
This section summarizes the software development measures in place to ensure the sustainability of the data citation framework in Oceans 3.0.
An issue tracking ticket is associated with each code commit from a software engineer. Before the code is merged, it is reviewed by a peer following a checklist appropriate to the ticket type. After the merge, the results are reviewed to confirm the intended impact. If any issues are found, the ticket is re-opened or related tickets are created for further work. For full details, refer to CD Design - Code review process.
Unit testing is the part of the software development life cycle that ensures each piece of code performs the function it is designed to perform. Unit tests are written in Java, Matlab and Javascript. As code is developed, unit tests are created that adhere to a naming convention. Test suites are run during the build and evaluated in the code review. Unit tests are intended to cover functional requirements, boundary and edge cases, and termination conditions (e.g., normal execution, system failure conditions). For full details, refer to Unit Test Guidelines.
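As an illustration of this pattern, a minimal Java unit test might look like the following sketch. The DoiFormatter class, the test method names, and the "10.12345" prefix are all hypothetical placeholders; they do not reflect actual Oceans 3.0 code, ONC's registered DOI prefix, or the internal naming convention.

```java
import static org.junit.jupiter.api.Assertions.*;
import org.junit.jupiter.api.Test;

// Hypothetical sketch only: DoiFormatter and the "10.12345" prefix are
// placeholders, not actual Oceans 3.0 code or ONC's registered DOI prefix.
class DoiFormatterTest {

    static class DoiFormatter {
        static String format(String suffix) {
            if (suffix == null || suffix.isEmpty()) {
                throw new IllegalArgumentException("DOI suffix must be non-empty");
            }
            return "10.12345/" + suffix;
        }
    }

    // Functional requirement: normal execution returns the prefixed DOI.
    @Test
    void format_validSuffix_returnsPrefixedDoi() {
        assertEquals("10.12345/abc123", DoiFormatter.format("abc123"));
    }

    // Boundary case and termination condition: empty input fails fast.
    @Test
    void format_emptySuffix_throws() {
        assertThrows(IllegalArgumentException.class, () -> DoiFormatter.format(""));
    }
}
```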
Automated testing is continually being extended toward comprehensive verification of Oceans 3.0 features and services, and related key performance indicators are established to monitor its impact. The areas currently covered by automated tests include:
Before each release, a suite of tests is conducted for each module to minimize negative impact on existing features and services. Results are recorded, and tickets are created for any issues that arise. Dataset citation features fall within the Persistent Identifier module; see PI DOI Dataset Metadata Search for the records of regression testing in this module as executed by the developers. For the regression test checklist and monitoring details, refer to Regression Test Checklist (Based on Modules).
The software release process is well documented at a high level. Each release is accompanied by release notes and a detailed deployment process. The notes identify new blocker or critical tickets, a summary of sprint outputs, and a listing of database scripts. These notes are reviewed for potential risks by stakeholders including the Director of User Engagement, the Software Development Manager, Senior Software Developers, a Systems & Operations representative, the Data Manager and the Data Stewardship Manager. Once all stakeholders are in agreement, the release is scheduled. For additional details, refer to Deployments.
In the event of a significant unanticipated issue, a roll-back may be executed. A formal roll-back plan will be created for higher-risk deployments, such as migrating from one database technology to another or major data model reconfigurations.
In addition to executing code releases, the Systems & Operations team performs operational maintenance and replication for the databases (Postgres, Cassandra, Archive Directory). Their actions serve to preserve the datasets and related metadata. Automated alerts, monitoring tools and security mechanisms ensure a healthy system, with a rotating team member assigned to respond quickly as needed.
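As a rough illustration of the kind of monitoring this involves, the sketch below queries Postgres's built-in pg_stat_replication view to check standby replication health. The connection details are placeholders, and this does not represent ONC's actual monitoring tooling.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Hedged sketch of a replication health check using Postgres's standard
// pg_stat_replication view; host, database and credentials are placeholders.
public class ReplicationCheck {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://primary.example.org:5432/exampledb",
                "monitor_user", "password");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                 "SELECT client_addr, state, replay_lag FROM pg_stat_replication")) {
            // One row per connected standby; replay_lag indicates how far behind it is.
            while (rs.next()) {
                System.out.printf("standby=%s state=%s replay_lag=%s%n",
                        rs.getString("client_addr"),
                        rs.getString("state"),
                        rs.getString("replay_lag"));
            }
        }
    }
}
```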
Operational tasks and monitoring executed by data stewards serve as additional verifications that the software functions as expected. The relevant areas include ongoing minting of dataset DOIs, metadata and dataset versioning, device workflows and instrument settings.
Data stewards execute and monitor the job that performs automated generation of dataset DOIs. Detailed checks are conducted before the job is extended to support a new use case, and monthly reviews look for errors in the machine logs or in the resulting datasets. In the future, metrics will be established to enhance the monitoring and reporting capabilities.
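For illustration, the sketch below shows how an automated job might register a DOI through the public DataCite REST API (POST /dois with a JSON:API payload and basic authentication). The repository credentials, the "10.12345" prefix, the landing page URL and the metadata values are placeholders; this is not the actual Oceans 3.0 minting job, which assembles its metadata from the Oceans 3.0 database.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Base64;

// Hedged sketch of DOI minting against the public DataCite REST API.
// Credentials, prefix, URL and metadata values are placeholders.
public class MintDoi {
    public static void main(String[] args) throws Exception {
        String auth = Base64.getEncoder()
                .encodeToString("REPOSITORY_ID:PASSWORD".getBytes());

        // JSON:API payload; "event": "publish" registers the DOI as findable.
        String payload = """
            {"data": {"type": "dois", "attributes": {
                "prefix": "10.12345",
                "event": "publish",
                "titles": [{"title": "Example oceanographic dataset"}],
                "creators": [{"name": "Ocean Networks Canada"}],
                "publisher": "Ocean Networks Canada",
                "publicationYear": 2024,
                "types": {"resourceTypeGeneral": "Dataset"},
                "url": "https://example.org/dataset-landing-page"
            }}}""";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://api.datacite.org/dois"))
                .header("Content-Type", "application/vnd.api+json")
                .header("Authorization", "Basic " + auth)
                .POST(HttpRequest.BodyPublishers.ofString(payload))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```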
As data stewards execute metadata and dataset versioning, the impacts on the data citations are confirmed. Metadata updates to items like dataset attributions, data product mappings and geospatial extents should trigger DataCite XML updates, whilst dataset modifications via reprocessing, file manipulations and gap filling should trigger a new dataset version. Each data steward task is peer-reviewed for added assurance that the procedures are properly followed and that the results meet expectations.
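The split between the two trigger types can be summarized as a simple classification, sketched below. The enum values merely mirror the examples named in this section; they are not drawn from the actual Oceans 3.0 data model.

```java
// Illustrative classification of the change types described above; the values
// mirror this section's examples, not the actual Oceans 3.0 data model.
enum DatasetChange {
    ATTRIBUTION_UPDATE, DATA_PRODUCT_MAPPING, GEOSPATIAL_EXTENT,   // metadata-only
    REPROCESSING, FILE_MANIPULATION, GAP_FILLING                   // data content
}

class CitationVersioningPolicy {

    enum Action { UPDATE_DATACITE_XML, MINT_NEW_DATASET_VERSION }

    // Metadata-only edits refresh the DataCite XML in place; any change to the
    // underlying data triggers a new, separately citable dataset version.
    static Action actionFor(DatasetChange change) {
        switch (change) {
            case REPROCESSING:
            case FILE_MANIPULATION:
            case GAP_FILLING:
                return Action.MINT_NEW_DATASET_VERSION;
            default:
                return Action.UPDATE_DATACITE_XML;
        }
    }
}
```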
The Device Workflow tool facilitates task management for instruments affected in a given maintenance expedition, ensuring that work is streamlined, traceable and repeatable amongst the various organizational teams involved. Instruments affected by a particular expedition are added to a process group and assigned a relevant process. A summary page for each expedition includes a table listing devices with their processes and most recently completed phase. Once an instrument is assigned a process, a worksheet populates with a list of all its tasks grouped by phase.
The latest versions of the device commissioning phases for instrument installations are being updated to include a task to verify the data citation. This includes confirming the Postgres database entry, the landing page content, the DataCite XML, the ISO 19115 metadata record citation information, and an example query subset. For task details, refer to Data Citation - verify.
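Some of these checks lend themselves to automated spot checks alongside the manual task. The sketch below illustrates two such checks using standard DOI resolution and DataCite content negotiation; the DOI value is a placeholder, and this sketch is not part of the documented verification procedure.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hedged sketch of automatable spot checks for a data citation; the DOI is a
// placeholder and the real verification task follows the documented checklist.
public class VerifyCitation {
    public static void main(String[] args) throws Exception {
        String doi = "10.12345/example-suffix"; // placeholder DOI

        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.ALWAYS)
                .build();

        // 1. The DOI should resolve through doi.org to a live landing page.
        HttpResponse<String> landing = client.send(
                HttpRequest.newBuilder(URI.create("https://doi.org/" + doi)).build(),
                HttpResponse.BodyHandlers.ofString());
        System.out.println("Landing page status: " + landing.statusCode());

        // 2. The registered metadata should be retrievable as DataCite XML
        //    via the API's documented content negotiation.
        HttpResponse<String> metadata = client.send(
                HttpRequest.newBuilder(
                        URI.create("https://api.datacite.org/dois/" + doi))
                        .header("Accept", "application/vnd.datacite.datacite+xml")
                        .build(),
                HttpResponse.BodyHandlers.ofString());
        System.out.println("DataCite XML status: " + metadata.statusCode());
    }
}
```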
Numerous other tasks within the workflow influence the data citation, such as:
The Data Stewardship team maintains detailed documentation for instrument-specific procedures, including data distribution facets like data citations. Other documented facets impact the resulting data citation metadata as well, such as:
CoreTrustSeal Standards and Certification Board. (2019, November 20). CoreTrustSeal Trustworthy Data Repositories Requirements: Extended Guidance 2020–2022 (Version v02.00-2020-2022). Zenodo. http://doi.org/10.5281/zenodo.3632533
CD Design - Code review process
Automated Testing Project Plan
Regression Test Checklist (Based on Modules)
Deployments (including sub-sections)
Postgresql Replication, Load Balancing, Failover