Data Management Principles

Note: This document is for public review. If you have any comments about these principles, please email CoastalCarbon@si.edu by Friday April 13th at 12 pm EST

Types of Data
Digital Object Identifiers
Metadata standards
Variables
- Proposed Level of Disaggregation for Variables
- Hierarchical Structure
Data Storage
Quality Control
Data Use Policy

Types of Data

The Network recognizes three classes of data: (i) data that we curate, (ii) data that we ingest, and (iii) synthesis products we create. Data that we curate will be hosted by on Smithsonian Institution (SI) servers, but the original data submitter, and funding sources will be credited as the dataset’s creators. Data that we ingest will include both data we curate, and data from any outside sources that meet basic availability, archiving, and metadata standards. These data will be pulled into intermediate files in a centralized database using R code. The workflow and files will be archived and publically available on an SI-dedicated GitHub website. This document refers to soil depth profile data throughout, but it is our intention that these general structures and principles be applied to other types of data as the Network evolves.

Digital Object Identifiers

To be cited and used in the Network synthesis, a dataset needs to have a citable digital object identifier (DOI), which points to a repository containing freely downloadable data and associated metadata. When data are discussed in a peer-reviewed journal article or report with a DOI, but the data are not publically available in their original form, then a second DOI specific to the dataset will be needed for inclusion in synthesis activities. A new DOI should also be assigned in cases where the original data are publically available but not in machine-readable format, or when data were missing key details necessary for Network synthesis purposes. In cases where data is continuously updated over time, we will version control the data, rather than issuing a new DOI for every update.

Submitters may archive data according to these outlined standards, host data on an SI server, and apply for a DOI through SI libraries with the assistance of Network personnel. This service is specific to this project. While data will be digitally archived long-term in accordance with SI standards, we cannot guaranty new data will be accepted after Network funding ends. For data that we curate, file trees will be hosted on SI ‘DSpace’ and DOIs will be issued by Smithsonian Institution Libraries (SIL). Landing pages with summaries of projects, sites, cores, will be viewable from the Network website, which will include links to download original data archived on Dspace.

Data submitters can additionally choose to forward DOIs issued outside of the Network for ingestion into the central data structure on the Network’s SI GitHub website. Some DOI-issuing repository services include Figshare and the Environmental Data Initiative. While we recognize that there is no official definition for what constitutes a ‘trusted repository’, repositories associated with DOIs should in general have community recognition and trust in their long-term stability. For data curated by the Network we hope that SI’s reputation, DSpace’s status as an approved technology, and the SIL’s commitment to digital object curation, generate this level of community trust.

Metadata standards

For data curated by the Network, we will use the Environmental Metadata Language standards. This includes an abstract, detailed submitor information, variable definitions, and variable data types (e.g. character, integer, or float). CCRCN personnel will use the R-based EML package in our workflow to create metadata for data that we curate.

Variables

Variable naming should follow good management practices¹. Variable names should be self descriptive and machine readable. They should contain no spaces and must not begin with a number or special character; however, underscores (i.e. ‘pothole case’) are acceptable. We will use controlled vocabulary for variable names that require a consistent format. There will also be non-controlled variable names and variable types are user defined as described in the associated metadata.

Units for all variables need to be defined and in some cases controlled. For some variables which typically have commonly reported units we will recommend submittors format using these controlled units. These include loss_on_ignition (fraction), dry_bulk_density (g cm^-3) and latitude and longitude (decimal degrees [world geographic survey 1988]). For variables that are applicable to the synthesis, but typically have multiple common unit formats we recommend an accompanying column defining these units. Uncommon or data types not included in synthesis projects simply need to have units defined in associated metadata.

Good data practices require consistently formatting “no data” values and categorical variables. We adopt the R-based convention of representing ‘no data’ values as ‘NA’ for all variable types (never blanks). Categorical variables should have descriptive names stored as text, similar to variables names. For example, one may code the categorical variable ‘treatments’ as numeric values ‘0’ and ‘1’ standing in for ‘experimental’ and ‘control’; however, best practices would dictate coding these as descriptive characters (‘experimental’ and ‘control’) rather than numbers.

For data we curate we will use controlled vocabulary units and variable types. For data we ingest, we will keep a file of corresponding controlled variables and ‘aliases’ so that data not complying with controlled vocabulary can still be ingested. We will document transformations made to ingested data to standardize them with the data we curate in R code.

Proposed Level of Disaggregation for Variables

In general we believe that there should be community agreement on the finest level of data disaggregation archived for practical use and reuse. This fundamental unit should be the most detailed unit typically used and reported in the literature. For soils data we will stratify by site, by core, and by depth increment. For calculations such as loss-on-ignition and bulk density, data by depth increment will be the fundamental level of archiving. For age-depth information will archive radiocarbon (¹⁴C) age ± sd for a sample for ¹⁴C dates, and counts per unit dry weight ± sd for ²¹⁰Pb and ¹³⁷Cs profiles.

Hierarchical Structure

We will make an effort to ingest existing data and curate submitted data in a hierarchical framework. Information associated with submitters, projects, sites, cores, and depth profiles will all be hosted in separate tables related by index codes that are unique in that database. Submitter metadata include name and affiliation data for principal investigators. This will be indexed using the principal investigator’s last name and first initial. Numbers will be added in case of conflicts (Example: Jane Doe = DoeJ). A submitter can be associated with one or more projects.

Project metadata will have an abstract and information about coauthors, associated funding source, or set of funding sources, associated publications, and materials and methods. A project should be a discrete unit of research united by consistent personnel, funding sources, and/or materials and methods. A project code should be brief but descriptive and user defined, but should not otherwise conflict with existing project codes. One project should be associated with one DOI. A project can be associated with one or more sites.

Sites refer to discrete geographic or management units and are somewhat nebulous, project specific, and user defined. A site code should follow the same best practices for variable naming, with no spaces, not starting with a number, descriptive, brief, and meaningful to project documentation and design. Site metadata refer to data associated with the sites, such as location, notes on dominant vegetation types, salinity, and site condition. Although there are no standards for what constitutes a site, and different projects could have different names for the same site, this coding should be consistent within a project. A site can have one or more data sets, including one or more core, plot, or instrument locations.

Core/Plot/Instrument-Level Data refer to information specifically about the location of a soil core. This could include positional information such as latitude, longitude, and elevation. It could also include notes that are redundant but more detailed than site metadata, such as vegetation. Each core should have a unique code within a project. A core code should follow the same best practices for variable naming, no spaces, not starting with a number, descriptive, brief, and meaningful to project documentation and design. For future syntheses this level of hierarchy could also house other types of plot or instrumental data.

Depth Series Information: Soil cores have depth-series information which should include minimum and maximum depth increments, as well as measurements presented in their fundamental unit (explained above), with associated methods notes, uncertainties, and/or QA:QC flags if applicable (explained below). Each sample should have its own ‘sample code’. These can be submitter defined, but should conform to general principles for variable naming. In future syntheses time-series data could also be archived in this format with instrument replacing a core, and time replacing depth.

Data Storage

Tabular data will be stored in a flat text file, meaning that no data formatting will be included. We will default to using tab-delimited text files (“.txt”) for simplicity. Although comma separated values “.csv” are also common, these types of files can be corrupted if commas are ever used within a variable rather than to delineate variables. Submissions that are received in other formats, such Microsoft Excel files will be edit-locked and archived as submission documents. However, a working version of this submission will be formatted according to Network standards in flat text files.

Tabular data will be stored in long-form tables, as opposed to wide-form tables. Each column should correspond to one variable, each row should correspond to one entry. Each column needs to have a single data type, and represent only one variable. Extra information such as units, notes, or operator code will not be encoded as an excel note, cell color, or be included in the same cell as a value.

Quality Control

Network personnel will perform a cursory visual check on all data we curate and relay any concerns with the data submitter during the curation process. We will also write scripts to enforce consistent units and data types for controlled variables, to check for completeness, and to ensure there are no conflicts or duplicates with previously archived data. Some key variables will receive quality control flags. For variables that are strongly intercorrelated, quality control flags will be assigned to the whole sample unit. Key variables independent of sample intercorrelation will receive variable-specific quality code flags. For data that we curate, files will be edit-locked following QA:QC. Any updates or corrections will result in a new named ‘version’ of the file, a change logged by Network personnel in a text file associated with the data, and a note of the change sent to the next update of the Network email list members.

Data Use Policy

We refer to users as anyone using either data we curate, or synthesis products we create. Data that is curated, but not created, by the Network, should not be attributed to the Network. Users should cite all dataset DOIs and credit the datasets’ original authors. All synthesis products created by the Network and associated collaborators will be listed under a Creative Commons “With Attribution” license. The Network should be acknowledged and cited appropriately if users utilize any of the data structures, tools, or scripts developed by Network and associated collaborators. We will develop additional tools to assist users in generating lists of citations, but users will be ultimately responsible for correctly citing all data used.

Wilson G, Bryan J, Cranston K, Kitzes J, Nederbragt L, Teal TK (2017) Good enough practices in scientific computing. PLoS Comput Biol 13(6): e1005510. https://doi.org/10.1371/journal.pcbi.1005510 ↩