Data management guide

Ocean DNA data storage and compute resources

The Ocean DNA group has access to the following disk storage. Please be mindful that these are shared resources.

  • Hydra cluster scratch
    • /scratch/nmnh_ocean_dna
    • 90 TB
    • Not backed up by Hydra admin
    • No automatic file purging
    • Accessible by all NMNH Ocean DNA users
  • Hydra cluster store
    • /store/nmnh_ocean_dna
    • 70 TB
    • Not backed up by Hydra admin
    • No automatic file purging
    • Intended for large raw data files and inactive projects
    • Slower read/write speeds
    • Cannot be used for active analysis (not mounted on the compute nodes)
    • Accessible by all NMNH Ocean DNA users
  • Smithsonian network P drive
    • smb://si-ocio-qnas2.si.edu/nmnh/nmnh-all/public/nmnh-ocean-dna
    • 80 TB
    • A separate system from Hydra, managed by OCIO
    • Incrementally backed up daily
    • Fully backed up weekly
    • Only accessible from Smithsonian computers
    • Access limited to NMNH Ocean DNA data managers

Data management flowchart

graph TD;

  GenoHub[**GenoHub**]
  Metadata[**Map File**]
  Analyses(Run quality/adapter trimming, mitogenome assembly, etc.)
  Scratch[(**Hydra Scratch**)]
  Store[(**Hydra Store**)]
  PDrive[(**P Drive**)]
  
  Move1[STEP 1: download FASTQs]
  Move0[STEP 2: create map file]
  Move3[copy important results]
  Move4[STEP 3: move to Hydra Store]
  Move5[STEP 4: backup to P drive]

  Move0-->Metadata
  Metadata-->Move4
  GenoHub-->Move1
  GenoHub-->Move0
  Scratch-->Move4
  Move1-->Scratch
  subgraph " "
    Scratch-->Analyses
    Analyses-->Move3
    Move3-->Store
    Move4-->Store
  end
  Store-->Move5
  Move5-->PDrive

  classDef process stroke:black,color:white,fill:#159BD7,stroke-dasharray: 5 5
  classDef storage stroke:black,color:white,fill:#159BD7
  classDef ccr stroke:black,color:white,fill:#159BD7
  classDef step stroke:black,color:black,fill:#a0c5fa,stroke-dasharray: 5 5

  class Analyses,Move3 process
  class Metadata,GenoHub,Scratch,Store,PDrive storage
  class Move0,Move1,Move4,Move5 step


  linkStyle default stroke:grey, stroke-width:4px

Step-by-step data management workflow

Below are step-by-step instructions for our data management workflow. This workflow was designed for genome skimming datasets, but could be modified for other types of sequencing projects.

Note

These instructions only pertain to raw sequence data. We do not have a standardized data management protocol for other data such as trimmed reads or analysis results.

STEP 1: download from GenoHub

GenoHub will provide a link with instructions to download the raw sequence data. The data should be demultiplexed, compressed sequence reads in FASTQ format (i.e. files should end in “.fastq.gz” or “.fq.gz”).

You can download sequence data directly from GenoHub to Hydra using the AWS command line interface. First, load the AWS module and run configuration. Enter the information provided on the GenoHub download instructions page when prompted.

module load tools/awscli/2.15.27
aws configure

Now you can access the AWS storage bucket containing your sequence data. To list the bucket contents prior to downloading, run the following, replacing “XXXXXX” with your GenoHub project number.

aws s3 ls s3://genohubXXXXXX

Run the following command to download your data. Replace “XXXXXX” with your GenoHub project number and specify the path to your desired download location.

aws s3 sync s3://genohubXXXXXX /full/path/to/your/download/directory

Note

We recommend downloading sequence data to /scratch/public/genomics/YOUR_USER_ID. This directory has the fastest read/write speeds and the largest storage quota.
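
Once the sync finishes, it can be worth confirming that the compressed FASTQ files are intact before doing anything else with them. A minimal sketch, where DL_DIR is an illustrative placeholder for your download directory (not a variable used elsewhere in this guide):

```shell
# Check gzip integrity of every FASTQ archive in the download directory.
# DL_DIR is a placeholder; it defaults to the current directory here.
DL_DIR=${DL_DIR:-.}
for f in "$DL_DIR"/*.fastq.gz; do
  [ -e "$f" ] || continue              # skip if the glob matched nothing
  gzip -t "$f" || echo "CORRUPT: $f"
done
```

This is just an extra safeguard against truncated downloads; gzip -t tests each archive without writing any decompressed output.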

STEP 2: create the sequence data map file

The map file, a comma-separated values (CSV) file, lists all samples in the GenoHub project and is crucial for ensuring your data is both backed up and usable.

Please use the following naming convention for your map file:

genohub-{GENOHUB_PROJECT_NUMBER}_{PROJECT_DESCRIPTION}_mapfile.csv

For example: genohub-8459898_Vietnam_mapfile.csv

Where “8459898” is the GenoHub project number and “Vietnam” is the description given by the project owner.

Warning

Do not include underscores or spaces in the project description.

The first five columns of your map file must be:

  • ID: GenoHub sample name. For example:
    • I-0025-99AD-1N1
  • R1: Read 1 FASTQ file name. For example:
    • I-0025-99AD-1N1_1.fastq.gz
  • R2: Read 2 FASTQ file name. For example:
    • I-0025-99AD-1N1_2.fastq.gz
  • Taxon: Your best guess at taxonomic assignment, no required format. For example:
    • Latin binomial: Urophycis chuss
    • Genus: Urophycis sp.
    • Higher level taxonomy: Phycidae
    • Combination: Gadiformes_Phycidae_Urophycis_chuss
  • UniqID: Identifier linked to a voucher/tissue sample, no required format. For example:
    • USNM catalog number: USNM 477715
    • EMu EZID: http://n2t.net/ark:/65665/3987722bd-99bc-4913-a279-092f58c82d72
    • GEOME BCID: https://n2t.net/ark:/21547/FxY2USNM_Fish_477715.1

You may include other metadata as additional columns in the map file.

Note

We strongly advise the use of globally/universally unique identifiers (GUIDs/UUIDs) for the UniqID column. If you need to generate these IDs, consider creating a project in GEOME.

Important

The first five columns must contain information for all samples in the map file. We will contact you if there is any missing data.
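
Putting the column descriptions together, a minimal map file and a quick completeness check might look like the following sketch. The file contents reuse the example values above; the awk one-liner is a convenience that assumes simple comma-separated text with no quoted or embedded commas.

```shell
# Write a minimal example map file using the sample values described above
cat > genohub-8459898_Vietnam_mapfile.csv <<'EOF'
ID,R1,R2,Taxon,UniqID
I-0025-99AD-1N1,I-0025-99AD-1N1_1.fastq.gz,I-0025-99AD-1N1_2.fastq.gz,Urophycis chuss,USNM 477715
EOF

# Flag any data row where one of the first five columns is empty
awk -F',' 'NR > 1 {
  for (i = 1; i <= 5; i++)
    if ($i == "") print "Row " NR ": column " i " is empty"
}' genohub-8459898_Vietnam_mapfile.csv
```

No output from the awk check means the required columns are complete for every sample.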

STEP 3: move sequence data and map file to Hydra Store

The Ocean DNA Store directory on Hydra contains two subdirectories where you should upload your data.

1. /store/nmnh_ocean_dna/raw_sequence_data

Each directory within raw_sequence_data should contain the results of a GenoHub sequencing project. To save disk space, please ensure that all sequence files are compressed (e.g. sequence1.fastq.gz).

Please use the following naming convention for your directories:

genohub-{GENOHUB_PROJECT_NUMBER}_{PROJECT_DESCRIPTION}

For example: “genohub-8459898_Vietnam”

Where “8459898” is the GenoHub project number and “Vietnam” is the description given by the project owner.

Warning

Do not include underscores or spaces in the project description.

To create a new directory for your project:

cd /store/nmnh_ocean_dna/raw_sequence_data
mkdir genohub-XXXXXX_MY-PROJECT-DESCRIPTION

You can now copy your sequence data to this directory. There are many ways to copy data, including:

  • “Drag and drop” with a file browser (not recommended)
  • The Linux cp command (not recommended)
    • Copies one file at a time, fairly slow
    • Does not check for complete file transfer, which can lead to corrupt files at the destination
  • rclone
    • Available on Hydra with module load tools/rclone/1.66.0
    • Copies multiple files in parallel
    • Automatically performs checks to ensure complete file transfer
    • Can resume an interrupted transfer by running the same rclone command again
    • Example usage:
      • Perform a “dry run” (list files but do not copy):
      rclone copy -v -n /PATH/TO/SOURCE /PATH/TO/DESTINATION
      • Copy files:
      rclone copy -v /PATH/TO/SOURCE /PATH/TO/DESTINATION
      • Copy only files that end in “.fastq.gz”
      rclone copy -v --include "*.fastq.gz" /PATH/TO/SOURCE /PATH/TO/DESTINATION
  • Globus
    • Web-based file transfer service
    • Normally used to transfer data between servers, but can also perform data transfers within Hydra
    • Does not require you to maintain an active connection to Hydra while files transfer
    • See the documentation for additional information

2. /store/nmnh_ocean_dna/raw_sequence_metadata

Each directory in raw_sequence_data must be accompanied by a CSV map file in raw_sequence_metadata. The names of the map files should match the names of the sequence directories (with the addition of _mapfile.csv).

You can upload the map file to raw_sequence_metadata using your favorite SCP or SFTP client (e.g. FileZilla, WinSCP, Cyberduck).
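
Because every directory in raw_sequence_data must have a matching map file, a quick consistency check can catch anything missing before the backup step. A sketch, assuming the Store layout described above; STORE is a placeholder that would be /store/nmnh_ocean_dna on Hydra:

```shell
# Report any sequence directory without a matching map file.
# STORE is a placeholder; it defaults to the current directory here.
STORE=${STORE:-.}
for dir in "$STORE"/raw_sequence_data/genohub-*; do
  [ -d "$dir" ] || continue                  # skip if the glob matched nothing
  name=$(basename "$dir")
  if [ ! -f "$STORE/raw_sequence_metadata/${name}_mapfile.csv" ]; then
    echo "MISSING map file for $name"
  fi
done
```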

STEP 4: backup Hydra Store to P drive

After adding new sequence data and the corresponding map file to Hydra Store, please contact Dan MacGuigan. He will copy your data to the Ocean DNA P drive directory. Data transfer may take several hours, depending on the size of your dataset.

The P drive is a separate system from Hydra and is backed up frequently by the Smithsonian OCIO. To keep these data backups secure, Dan MacGuigan is the only person with access to the Ocean DNA P drive.

Instructions for Data Manager:

Below are the general instructions for backing up data on Hydra Store to the P drive. These instructions are specific to macOS, but could be modified to work on a Windows PC.

To access the P drive, you will need a Smithsonian computer connected to SI-staff wifi or plugged into the SI ethernet. You will also need to be added as a user to the Ocean DNA P drive. Our NMNH IT contact for help with network drives is Joshua Henry.

First, we need to validate that map files and raw data directories contain the same files. Instructions for this step are forthcoming.

Next, establish a connection with the P drive. In the Mac menu bar, select Go > Connect to Server and enter the following address:

smb://si-ocio-qnas2.si.edu/nmnh/nmnh-all/public/nmnh-ocean-dna

This will open a new finder window. The P drive structure is as follows:

  • Hydra_backup
    • store
      • raw_sequence_data
      • raw_sequence_metadata
      • raw_sequence_metadata_backups

There are other directories, but these are the relevant ones for our data management purposes.

Prior to copying new map files to the P drive, you should make a backup of the existing directory. Open the store directory in the terminal and run the following commands. For MONTH_DAY_YEAR, please use the format 09_26_25.

cp -r raw_sequence_metadata raw_sequence_metadata_backups/raw_sequence_metadata_MONTH_DAY_YEAR
cd raw_sequence_metadata_backups
tar -zcvf raw_sequence_metadata_MONTH_DAY_YEAR.tar.gz raw_sequence_metadata_MONTH_DAY_YEAR
rm -r raw_sequence_metadata_MONTH_DAY_YEAR
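
If you prefer not to type the date stamp by hand, it can be generated with date; the 09_26_25 example above corresponds to the format string %m_%d_%y:

```shell
# Build the backup name with an automatic MONTH_DAY_YEAR stamp
STAMP=$(date +%m_%d_%y)
echo "raw_sequence_metadata_${STAMP}"
```

The $STAMP variable can then replace MONTH_DAY_YEAR in the cp, tar, and rm commands above.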

We will use Rclone to copy data from Hydra directly to the P drive. Installation instructions are available here.

If this is the first time using Rclone to connect to Hydra, run rclone config. Select the following options:

  • n - New remote
  • name> SI-Hydra
  • storage> 48 (check the list, 48 should be for sftp)
  • host> hydra-login01.si.edu
  • user> YOUR_USER_ID

For all other config questions, press enter to accept the default values and choose n when asked “Edit advanced config?”. When finished, enter q to quit the config setup.

We can now copy data to/from Hydra using Rclone.

rclone copy

Copy the source to the destination. Does not transfer files that are identical on source and destination, testing by size and modification time or MD5SUM. Doesn’t delete files from the destination.

Note that it is always the contents of the directory that is synced, not the directory itself. So when source:path is a directory, it’s the contents of source:path that are copied, not the directory name and contents.

Note that after Rclone copies a file, it will calculate and compare checksums to ensure successful file transfer.

First we’ll copy the metadata. On the command line, run the following:

cd /Volumes/nmnh-ocean-dna/Hydra_backup/store/raw_sequence_metadata
rclone copy -v -n SI-Hydra:/store/public/oceandna/raw_sequence_metadata . 

-v tells Rclone to produce verbose output; -n tells Rclone to perform a “dry run”, listing the files and their sizes without copying anything.

If this dry run looks good, initiate the transfer with:

rclone copy -v SI-Hydra:/store/public/oceandna/raw_sequence_metadata . 

The metadata should copy quickly. Now let’s do the same process for the sequence data.

cd /Volumes/nmnh-ocean-dna/Hydra_backup/store/raw_sequence_data
rclone copy -v -n SI-Hydra:/store/public/oceandna/raw_sequence_data . 

And again, if the dry run looks good, proceed with transfer.

rclone copy -v SI-Hydra:/store/public/oceandna/raw_sequence_data . 

If either transfer is interrupted, simply rerun the command to resume.

Important: after backing up sequence data to the P drive, the data manager should make all of the FASTQ files on Store read-only. This will mitigate the risk of accidental overwrite or deletion.

Run the following command in the relevant sequence data folders on Store. This will make all fastq.gz files read-only.

chmod 444 *.fastq.gz
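
To apply the same protection across every project directory in one pass, a find-based variant could be used; DATA_DIR is a placeholder that would be /store/nmnh_ocean_dna/raw_sequence_data on Hydra:

```shell
# Make every compressed FASTQ under DATA_DIR read-only in one pass.
# DATA_DIR is a placeholder; it defaults to the current directory here.
DATA_DIR=${DATA_DIR:-.}
find "$DATA_DIR" -name "*.fastq.gz" -exec chmod 444 {} +
```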

More info about chmod can be found here.