Data management guide
Ocean DNA data storage and compute resources
The Ocean DNA group has access to the following disk storage. Please be mindful that these are shared resources.
- Hydra cluster scratch
  - /scratch/nmnh_ocean_dna
  - 40 TB
  - Not backed up by Hydra admin
  - No automatic file purging
- Hydra cluster store
  - /store/nmnh_ocean_dna
  - 40 TB
  - Not backed up by Hydra admin
  - No automatic file purging
  - Intended for large raw data files and inactive projects
  - Slower read/write speeds
  - Cannot be used for active analysis (not mounted on compute nodes)
- Smithsonian network P drive
  - smb://si-ocio-qnas2.si.edu/nmnh/nmnh-all/public/nmnh-ocean-dna
  - 80 TB
  - Separate system from Hydra, managed by OCIO
  - Incrementally backed up daily
  - Fully backed up weekly
  - Only accessible from Smithsonian computers
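If you want to check how full these shared locations are before moving a large dataset, standard Linux commands on Hydra are enough. This is only a sketch; YOUR_DIRECTORY is a hypothetical placeholder for one of your own project directories.
# show total, used, and available space on the shared Ocean DNA filesystems
df -h /scratch/nmnh_ocean_dna /store/nmnh_ocean_dna
# show how much space one of your own directories is using
du -sh /store/nmnh_ocean_dna/YOUR_DIRECTORY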
Data management flowchart
graph TD;
  GenoHub[**GenoHub**]
  Metadata[**Map File**]
  Analyses(Run quality/adapter trimming, mitogenome assembly, etc)
  Scratch[(**Hydra Scratch**)]
  Store[(**Hydra Store**)]
  PDrive[(**P Drive**)]
  Move0[STEP 2: create map file]
  Move1[STEP 1: download FASTQs]
  Move3[copy important results]
  Move4[STEP 3: move to Hydra Store]
  Move5[STEP 4: backup to P drive]
  Move0-->Metadata
  Metadata-->Move4
  GenoHub-->Move1
  GenoHub-->Move0
  Scratch-->Move4
  Move1-->Scratch
  subgraph " "
    Scratch-->Analyses
    Analyses-->Move3
    Move3-->Store
    Move4-->Store
  end
  Store-->Move5
  Move5-->PDrive
  classDef process stroke:black,color:white,fill:#159BD7,stroke-dasharray: 5 5
  classDef storage stroke:black,color:white,fill:#159BD7
  classDef step stroke:black,color:black,fill:#a0c5fa,stroke-dasharray: 5 5
  class Analyses,Move3 process
  class Metadata,GenoHub,Scratch,Store,PDrive storage
  class Move0,Move1,Move4,Move5 step
  linkStyle default stroke:grey, stroke-width:4px
Step-by-step data management workflow
Below are step-by-step instructions for our data management workflow. This workflow was designed for genome skimming datasets, but could be modified for other types of sequencing projects.
These instructions only pertain to raw sequence data. We do not have a standardized data management protocol for other data such as trimmed reads or analysis results.
STEP 1: download from GenoHub
GenoHub will provide a link with instructions to download the raw sequence data. The data should be demultiplexed, compressed sequence reads in FASTQ format (i.e. files should end in “.fastq.gz” or “.fq.gz”).
You can download sequence data directly from GenoHub to Hydra using the AWS command line interface. First, load the AWS module and run the configuration command. Enter the information provided on the GenoHub download instructions page when prompted.
module load tools/awscli/2.15.27
aws configure
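When prompted by aws configure, enter the values from your GenoHub download instructions page. The session looks roughly like this; the angle-bracketed values are placeholders, not real credentials.
AWS Access Key ID [None]: <access key from GenoHub>
AWS Secret Access Key [None]: <secret key from GenoHub>
Default region name [None]: <region from GenoHub>
Default output format [None]: <press Enter to leave unset>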
Now you can access the AWS storage bucket containing your sequence data. To list the bucket contents prior to downloading, run the following, replacing “XXXXXX” with your GenoHub project number.
aws s3 ls s3://genohubXXXXXX
Run the following command to download your data. Replace “XXXXXX” with your GenoHub project number and specify the path to your desired download location.
aws s3 sync s3://genohubXXXXXX /full/path/to/your/download/directory
We recommend downloading sequence data to /scratch/public/genomics/YOUR_USER_ID. This directory has the fastest read/write speeds and the largest storage quota.
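For example, a complete download to the recommended scratch location might look like the following sketch, where YOUR_USER_ID and XXXXXX are placeholders and the genohub-XXXXXX directory name is only a suggestion.
# create a download directory on public scratch
mkdir -p /scratch/public/genomics/YOUR_USER_ID/genohub-XXXXXX
# sync the GenoHub bucket into that directory
aws s3 sync s3://genohubXXXXXX /scratch/public/genomics/YOUR_USER_ID/genohub-XXXXXX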
STEP 2: create the sequence data map file
The map file, a comma-separated values (CSV) file, lists all samples in the GenoHub project and is crucial for ensuring your data is both backed up and usable.
Please use the following naming convention for your map file:
genohub-{GENOHUB_PROJECT_NUMBER}_{PROJECT_DESCRIPTION}_mapfile.csv
For example: genohub-8459898_Vietnam_mapfile.csv
Where “8459898” is the GenoHub project number and “Vietnam” is the description given by the project owner.
Do not include underscores or spaces in the project description.
The first five columns of your map file must be:
- ID: GenoHub sample name. For example: I-0025-99AD-1N1
- R1: Read 1 FASTQ file name. For example: I-0025-99AD-1N1_1.fastq.gz
- R2: Read 2 FASTQ file name. For example: I-0025-99AD-1N1_2.fastq.gz
- Taxon: Your best guess at taxonomic assignment, no required format. For example:
  - Latin binomial: Urophycis chuss
  - Genus: Urophycis sp.
  - Higher level taxonomy: Phycidae
  - Combination: Gadiformes_Phycidae_Urophycis_chuss
- UniqID: Identifier linked to a voucher/tissue sample, no required format. For example:
  - USNM catalog number: USNM 477715
  - EMu EZID: http://n2t.net/ark:/65665/3987722bd-99bc-4913-a279-092f58c82d72
  - GEOME BCID: https://n2t.net/ark:/21547/FxY2USNM_Fish_477715.1
You may include other metadata as additional columns in the map file.
We strongly advise the use of globally/universally unique identifiers (GUIDs/UUIDs) for the UniqID column. If you need to generate these IDs, consider creating a project in GEOME.
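If you just need to mint UUIDs yourself (for example, before a GEOME project exists), one minimal sketch uses the standard uuidgen utility available on most Linux systems; samples.txt is a hypothetical one-ID-per-line file, not something this workflow requires.
# append a freshly generated UUID to each sample ID listed in samples.txt
while read -r SAMPLE; do
  echo "${SAMPLE},$(uuidgen)"
done < samples.txt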
The first five columns must contain information for all samples in the map file. We will contact you if there is any missing data.
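Putting the column definitions together, a map file built from the example values above might begin like this; the pairing of Taxon and UniqID here is only illustrative.
ID,R1,R2,Taxon,UniqID
I-0025-99AD-1N1,I-0025-99AD-1N1_1.fastq.gz,I-0025-99AD-1N1_2.fastq.gz,Urophycis chuss,USNM 477715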
STEP 3: move sequence data and map file to Hydra Store
The Ocean DNA Store directory on Hydra contains two subdirectories where you should upload your data.
1. /store/nmnh_ocean_dna/raw_sequence_data
Each directory within raw_sequence_data should contain the results of a GenoHub sequencing project. To save disk space, please ensure that all sequence files are compressed (e.g. sequence1.fastq.gz).
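If any FASTQ files arrived uncompressed, a minimal sketch using standard find and gzip (run from inside your sequence directory) is:
# list any uncompressed FASTQ files
find . -name "*.fastq" -o -name "*.fq"
# compress them in place; gzip replaces each file with a .gz version
find . \( -name "*.fastq" -o -name "*.fq" \) -exec gzip {} +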
Please use the following naming convention for your directories:
genohub-{GENOHUB_PROJECT_NUMBER}_{PROJECT_DESCRIPTION}
For example: “genohub-8459898_Vietnam”
Where “8459898” is the GenoHub project number and “Vietnam” is the description given by the project owner.
Do not include underscores or spaces in the project description.
To create a new directory for your project:
cd /store/nmnh_ocean_dna/raw_sequence_data
mkdir genohub-XXXXXX_MY-PROJECT-DESCRIPTION
You can now copy your sequence data to this directory. There are many ways to copy data, including:
- “Drag and drop” with a file browser (not recommended)
- The Linux cp command (not recommended)
  - Copies one file at a time, fairly slow
  - Does not check for complete file transfer, which can lead to corrupt files at the destination
- rclone
  - Available on Hydra with module load XXXXXX
  - Copies multiple files in parallel
  - Automatically performs checks to ensure complete file transfer
  - Can resume an interrupted transfer by running the same rclone command again
  - Example usage (a fuller worked example follows this list):
    - Perform a “dry run” (list files but do not copy):
      rclone copy -v -n /PATH/TO/SOURCE /PATH/TO/DESTINATION
    - Copy files:
      rclone copy -v /PATH/TO/SOURCE /PATH/TO/DESTINATION
    - Copy only files that end in “.fastq.gz”:
      rclone copy -v --include "*.fastq.gz" /PATH/TO/SOURCE /PATH/TO/DESTINATION
- Globus
  - Web-based file transfer service
  - Normally used to transfer data between servers, but can also perform data transfers within Hydra
  - Does not require you to maintain an active connection to Hydra while files transfer
  - See the documentation for additional information
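For example, copying the downloaded FASTQs from scratch into the new Store directory with rclone might look like this; both paths are placeholders following the naming conventions above, not a prescribed layout.
# dry run: list the files that would be copied
rclone copy -v -n --include "*.fastq.gz" /scratch/public/genomics/YOUR_USER_ID/genohub-XXXXXX /store/nmnh_ocean_dna/raw_sequence_data/genohub-XXXXXX_MY-PROJECT-DESCRIPTION
# perform the copy
rclone copy -v --include "*.fastq.gz" /scratch/public/genomics/YOUR_USER_ID/genohub-XXXXXX /store/nmnh_ocean_dna/raw_sequence_data/genohub-XXXXXX_MY-PROJECT-DESCRIPTION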
2. /store/nmnh_ocean_dna/raw_sequence_metadata
Each directory in raw_sequence_data must be accompanied by a CSV map file in raw_sequence_metadata. The names of the map files should match the names of the sequence directories (with the addition of _mapfile.csv).
You can upload the map file to raw_sequence_metadata using your favorite SCP or SFTP client (e.g. FileZilla, WinSCP, Cyberduck).
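From a local terminal, a command-line scp upload might look like the following; the login host name is an assumption, so use whichever Hydra login node you normally connect to.
# copy the map file from your local machine to the metadata directory on Hydra
scp genohub-8459898_Vietnam_mapfile.csv YOUR_USER_ID@hydra-login01.si.edu:/store/nmnh_ocean_dna/raw_sequence_metadata/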
STEP 4: back up Hydra Store to P drive
After adding new sequence data and the corresponding map file to Hydra Store, please contact Dan MacGuigan. He will copy your data to the Ocean DNA P drive directory. Data transfer may take several hours, depending on the size of your dataset.
The P drive is a separate system from Hydra and is backed up frequently by the Smithsonian OCIO. To keep these data backups secure, Dan MacGuigan is the only person with access to the Ocean DNA P drive.