Data management guide

Ocean DNA data storage and compute resources

The Ocean DNA group has access to the following disk storage. Please be mindful that these are shared resources.

  • Hydra cluster scratch
    • /scratch/nmnh_ocean_dna
    • 40 TB
    • Not backed up by Hydra admin
    • No automatic file purging
  • Hydra cluster store
    • /store/nmnh_ocean_dna
    • 40 TB
    • Not backed up by Hydra admin
    • No automatic file purging
    • Intended for large raw data files and inactive projects
    • Slower read/write speeds
    • Cannot be used for active analysis (not mounted on compute nodes)
  • Smithsonian network P drive
    • smb://si-ocio-qnas2.si.edu/nmnh/nmnh-all/public/nmnh-ocean-dna
    • 80 TB
    • Separate system from Hydra, managed by OCIO
    • Incrementally backed up daily
    • Fully backed up weekly
    • Only accessible from Smithsonian computers

Data management flowchart

graph TD;

  GenoHub[**GenoHub**]
  Metadata[**Map File**]
  Analyses(Run quality/adapter trimming, mitogenome assembly, etc)
  Scratch[(**Hydra Scratch**)]
  Store[(**Hydra Store**)]
  PDrive[(**P Drive**)]
  
  Move0[STEP 2: create map file]
  Move1[STEP 1: download FASTQs]  
  Move3[copy important results]
  Move4[STEP 3: move to Hydra Store]
  Move5[STEP 4: back up to P drive]

  Move0-->Metadata
  Metadata-->Move4
  GenoHub-->Move1
  GenoHub-->Move0
  Scratch-->Move4
  Move1-->Scratch
  subgraph " "
    Scratch-->Analyses
    Analyses-->Move3
    Move3-->Store
    Move4-->Store
  end
  Store-->Move5
  Move5-->PDrive

  classDef process stroke:black,color:white,fill:#159BD7,stroke-dasharray: 5 5
  classDef storage stroke:black,color:white,fill:#159BD7
  classDef ccr stroke:black,color:white,fill:#159BD7
  classDef step stroke:black,color:black,fill:#a0c5fa,stroke-dasharray: 5 5

  class Analyses,Move3 process
  class Metadata,GenoHub,Scratch,Store,PDrive storage
  class Move0,Move1,Move4,Move5 step

  linkStyle default stroke:grey, stroke-width:4px

Step-by-step data management workflow

Below are step-by-step instructions for our data management workflow. This workflow was designed for genome skimming datasets, but could be modified for other types of sequencing projects.

Note

These instructions only pertain to raw sequence data. We do not have a standardized data management protocol for other data such as trimmed reads or analysis results.

STEP 1: download from GenoHub

GenoHub will provide a link with instructions to download the raw sequence data. The data should be demultiplexed, compressed sequence reads in FASTQ format (i.e. files should end in “.fastq.gz” or “.fq.gz”).

You can download sequence data directly from GenoHub to Hydra using the AWS command line interface. First, load the AWS module and run configuration. Enter the information provided on the GenoHub download instructions page when prompted.

module load tools/awscli/2.15.27
aws configure

Now you can access the AWS storage bucket containing your sequence data. To list the bucket contents prior to downloading, run the following, replacing “XXXXXX” with your GenoHub project number.

aws s3 ls s3://genohubXXXXXX

Run the following command to download your data. Replace “XXXXXX” with your GenoHub project number and specify the path to your desired download location.

aws s3 sync s3://genohubXXXXXX /full/path/to/your/download/directory

Note

We recommend downloading sequence data to /scratch/public/genomics/YOUR_USER_ID. This directory has the fastest read/write speeds and the largest storage quota.
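
After the download finishes, it can be worth confirming that every compressed FASTQ file is intact before moving on. The following is a minimal sketch, assuming gzip is available on your system. The function name check_fastq_gz is hypothetical and is not part of any Hydra or GenoHub tooling.

```shell
# Hypothetical helper: run "gzip -t" on every .fastq.gz / .fq.gz file
# under the given directory. Lists files that fail the integrity test
# and returns non-zero if any are found.
check_fastq_gz() {
    bad=$(find "$1" -type f \( -name '*.fastq.gz' -o -name '*.fq.gz' \) \
        -exec sh -c 'gzip -t "$1" 2>/dev/null || printf "%s\n" "$1"' _ {} \;)
    if [ -n "$bad" ]; then
        printf 'Corrupt or truncated files:\n%s\n' "$bad" >&2
        return 1
    fi
    echo "All FASTQ files passed gzip -t"
}

# Example (path is a placeholder):
# check_fastq_gz /full/path/to/your/download/directory
```

A failed file usually means the transfer was interrupted. Re-running the same aws s3 sync command should fetch anything that is missing or incomplete.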

STEP 2: create the sequence data map file

The map file, a comma-separated values (CSV) file, lists all samples in the GenoHub project and is crucial for ensuring your data is both backed up and usable.

Please use the following naming convention for your map file:

genohub-{GENOHUB_PROJECT_NUMBER}_{PROJECT_DESCRIPTION}_mapfile.csv

For example: genohub-8459898_Vietnam_mapfile.csv

Where “8459898” is the GenoHub project number and “Vietnam” is the description given by the project owner.

Warning

Do not include underscores or spaces in the project description.
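
The naming convention above can be checked automatically before you upload. This is a rough sketch, assuming the GenoHub project number consists only of digits (as in the example). The function name is_valid_mapfile_name is hypothetical.

```shell
# Hypothetical helper: succeed only if the file name follows
# genohub-{NUMBER}_{DESCRIPTION}_mapfile.csv, where the description
# contains no underscores or whitespace.
# Assumption: the project number is made up of digits only.
is_valid_mapfile_name() {
    name=$(basename "$1")
    printf '%s\n' "$name" |
        grep -Eq '^genohub-[0-9]+_[^_[:space:]]+_mapfile\.csv$'
}

# Example:
# is_valid_mapfile_name genohub-8459898_Vietnam_mapfile.csv && echo "name OK"
```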

The first five columns of your map file must be:

  • ID: GenoHub sample name. For example:
    • I-0025-99AD-1N1
  • R1: Read 1 FASTQ file name. For example:
    • I-0025-99AD-1N1_1.fastq.gz
  • R2: Read 2 FASTQ file name. For example:
    • I-0025-99AD-1N1_2.fastq.gz
  • Taxon: Your best guess at taxonomic assignment, no required format. For example:
    • Latin binomial: Urophycis chuss
    • Genus: Urophycis sp.
    • Higher level taxonomy: Phycidae
    • Combination: Gadiformes_Phycidae_Urophycis_chuss
  • UniqID: Identifier linked to a voucher/tissue sample, no required format. For example:
    • USNM catalog number: USNM 477715
    • EMu EZID: http://n2t.net/ark:/65665/3987722bd-99bc-4913-a279-092f58c82d72
    • GEOME BCID: https://n2t.net/ark:/21547/FxY2USNM_Fish_477715.1

You may include other metadata as additional columns in the map file.

Note

We strongly advise the use of globally/universally unique identifiers (GUIDs/UUIDs) for the UniqID column. If you need to generate these IDs, consider creating a project in GEOME.

Important

The first five columns must contain information for all samples in the map file. We will contact you if there is any missing data.
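
To catch missing values before submitting a map file, a quick check along the following lines may help. This is a sketch under two assumptions that the guide does not state: that the header row uses exactly the column names ID, R1, R2, Taxon, and UniqID, and that no field contains a quoted comma. The function name check_mapfile is hypothetical.

```shell
# Hypothetical helper: check that the first five header fields are
# ID,R1,R2,Taxon,UniqID and that no sample row has an empty value in
# any of those columns. Returns non-zero if a problem is found.
# Assumption: fields never contain commas inside quotes.
check_mapfile() {
    awk -F',' '
        { sub(/\r$/, "") }              # tolerate Windows line endings
        NR == 1 {
            if ($1 != "ID" || $2 != "R1" || $3 != "R2" ||
                $4 != "Taxon" || $5 != "UniqID") {
                print "Header must start with ID,R1,R2,Taxon,UniqID"
                bad = 1
            }
            next
        }
        /^[ \t]*$/ { next }             # skip blank lines
        {
            for (i = 1; i <= 5; i++) {
                if ($i ~ /^[ \t]*$/) {
                    printf "Line %d: column %d is empty\n", NR, i
                    bad = 1
                }
            }
        }
        END { exit (bad ? 1 : 0) }
    ' "$1"
}

# Example:
# check_mapfile genohub-8459898_Vietnam_mapfile.csv
```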

STEP 3: move sequence data and map file to Hydra Store

The Ocean DNA Store directory on Hydra contains two subdirectories where you should upload your data.

1. /store/nmnh_ocean_dna/raw_sequence_data

Each directory within raw_sequence_data should contain the results of a GenoHub sequencing project. To save disk space, please ensure that all sequence files are compressed (e.g. sequence1.fastq.gz).

Please use the following naming convention for your directories:

genohub-{GENOHUB_PROJECT_NUMBER}_{PROJECT_DESCRIPTION}

For example: “genohub-8459898_Vietnam”

Where “8459898” is the GenoHub project number and “Vietnam” is the description given by the project owner.

Warning

Do not include underscores or spaces in the project description.

To create a new directory for your project:

cd /store/nmnh_ocean_dna/raw_sequence_data
mkdir genohub-XXXXXX_MY-PROJECT-DESCRIPTION

You can now copy your sequence data to this directory. There are many ways to copy data, including:

  • “Drag and drop” with a file browser (not recommended)
  • The Linux cp command (not recommended)
    • Copies one file at a time, fairly slow
    • Does not check for complete file transfer, which can lead to corrupt files at the destination
  • rclone
    • Available on Hydra with module load XXXXXX
    • Copies multiple files in parallel
    • Automatically performs checks to ensure complete file transfer
    • Can resume an interrupted transfer by running the same rclone command again
    • Example usage:
      • Perform a “dry run” (list files but do not copy):
      rclone copy -v -n /PATH/TO/SOURCE /PATH/TO/DESTINATION
      • Copy files:
      rclone copy -v /PATH/TO/SOURCE /PATH/TO/DESTINATION
      • Copy only files that end in “.fastq.gz”
      rclone copy -v --include "*.fastq.gz" /PATH/TO/SOURCE /PATH/TO/DESTINATION
  • Globus
    • Web-based file transfer service
    • Normally used to transfer data between servers, but can also perform data transfers within Hydra
    • Does not require you to maintain an active connection to Hydra while files transfer
    • See the documentation for additional information
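
Whichever copy method you choose, you may want to confirm independently that the files at the destination match the source. The sketch below compares MD5 checksums and assumes the GNU md5sum utility is available (as it typically is on Linux systems). The function name verify_copy is hypothetical.

```shell
# Hypothetical helper: compute MD5 checksums for every file under the
# source directory, then verify that each file exists at the destination
# with the same checksum. Extra files at the destination are ignored.
verify_copy() {
    src_sums=$(cd "$1" && find . -type f -exec md5sum {} +) || return 1
    if [ -z "$src_sums" ]; then
        echo "No files found in $1" >&2
        return 1
    fi
    # md5sum -c reads the checksum list from stdin and re-checks each file
    (cd "$2" && printf '%s\n' "$src_sums" | md5sum -c --quiet -)
}

# Example (paths are placeholders):
# verify_copy /PATH/TO/SOURCE /PATH/TO/DESTINATION && echo "copy verified"
```

Checksumming reads every file twice, so it can take a while for large datasets.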

2. /store/nmnh_ocean_dna/raw_sequence_metadata

Each directory in raw_sequence_data must be accompanied by a CSV map file in raw_sequence_metadata. The names of the map files should match the names of the sequence directories (with the addition of _mapfile.csv).

You can upload the map file to raw_sequence_metadata using your favorite SCP or SFTP client (e.g. FileZilla, WinSCP, Cyberduck).
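
Once both uploads are done, the pairing between sequence directories and map files can be checked with a short loop. This is a minimal sketch, and the function name find_missing_mapfiles is hypothetical.

```shell
# Hypothetical helper: print the name of every directory in the data
# directory that has no matching {name}_mapfile.csv in the metadata
# directory. Returns non-zero if any map file is missing.
find_missing_mapfiles() {
    data_dir="$1"
    meta_dir="$2"
    missing=0
    for d in "$data_dir"/*/; do
        [ -d "$d" ] || continue
        name=$(basename "$d")
        if [ ! -f "$meta_dir/${name}_mapfile.csv" ]; then
            echo "$name"
            missing=1
        fi
    done
    return "$missing"
}

# Example:
# find_missing_mapfiles /store/nmnh_ocean_dna/raw_sequence_data \
#                       /store/nmnh_ocean_dna/raw_sequence_metadata
```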

STEP 4: back up Hydra Store to P drive

After adding new sequence data and the corresponding map file to Hydra Store, please contact Dan MacGuigan. He will copy your data to the Ocean DNA P drive directory. Data transfer may take several hours, depending on the size of your dataset.

The P drive is a separate system from Hydra and is backed up frequently by the Smithsonian OCIO. To keep these data backups secure, Dan MacGuigan is the only person with access to the Ocean DNA P drive.