Test Project Tutorial • MitoPilot

If you are a Smithsonian NMNH user, please see the 2025 workshop website for the most up-to-date instructions. The workshop documentation may also be helpful for other users.

Installation

We recommend running the included test project (Illumina data for 13 fish species) before trying out MitoPilot with your own samples. The following tutorial provides a step-by-step walkthrough.

First, make sure you have R (>=4.0.0) and Nextflow installed. This tutorial also assumes that you are using RStudio to interface with R. If you’re working on a computing cluster, we recommend checking out RStudio Server.
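Before installing the package, it can help to confirm both prerequisites from within R. The following is a minimal sketch using only base R. It assumes Nextflow is expected on your PATH; on a cluster where Nextflow is loaded through an environment module, the second check may report FALSE even though Nextflow is available.

```r
# Check the running R version against the minimum stated in this tutorial
r_ok <- getRversion() >= "4.0.0"

# Sys.which() returns an empty string when an executable is not on the PATH
nextflow_path <- unname(Sys.which("nextflow"))
nextflow_ok <- nzchar(nextflow_path)

message("R >= 4.0.0: ", r_ok)
message("Nextflow found on PATH: ", nextflow_ok)
```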

We have provided detailed installation and usage instructions for the Smithsonian Hydra and NOAA SEDNA computing clusters.

Next you’ll need to install the MitoPilot R package from GitHub. Within RStudio, run the following.

if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}
if (!requireNamespace("remotes", quietly = TRUE)) {
    install.packages("remotes")
}
BiocManager::install("Smithsonian/MitoPilot")

Project Initialization

Now we can initialize the test project. For your own data, you would use the function MitoPilot::new_project. However, for the test data, we’ll use MitoPilot::new_test_project.

# load the package
library(MitoPilot)

# specify the directory where your test project will be created
# if the directory does not exist, MitoPilot will create it
wd = "/pool/public/genomics/macguigand/MitoPilot/testing/2025_04_01"

# specify an execution environment, "local", "NMNH_Hydra", or "NOAA_SEDNA"
ex = "NMNH_Hydra"

# initialize the test project
MitoPilot::new_test_project(
    path = wd,
    executor = ex,
    full_size = FALSE,
    Rproj = FALSE
)

Note: If you are using an execution environment that is not currently supported, you could use config = config.MyEnv to pass a custom Nextflow config to the MitoPilot::new_test_project function. THIS FEATURE IS CURRENTLY UNDER DEVELOPMENT.

If the test project was successfully initialized, you should see the following.

Creating project directory: /pool/public/genomics/macguigand/MitoPilot/testing/2025_04_01
SRR22396794 - Psychrolutes paradoxus
SRR22396940 - Psenes pellucidus
SRR22396740 - Hoplostethus occidentalis
SRR21844202 - Fundulus majalis
SRR22396640 - Xyrichtys novacula
SRR22396732 - Gephyroberyx darwinii
SRR22396627 - Gigantura indica
SRR21843972 - Stomias affinis
SRR22396843 - Conger oceanicus
SRR22396668 - Erotelis smaragdus
SRR22396758 - Upeneus parvus
SRR22396865 - Paraconger caudilimbatus
Project initialized successfully.
Please open and review the .config file to ensure all required options are specified.
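The .config file is hidden in most file browsers because its name starts with a dot. As a rough sketch, the helper below (find_config_files is a hypothetical name, not a MitoPilot function) lists any file ending in .config inside a directory. The demonstration uses a throwaway directory with a placeholder file rather than a real project.

```r
# Hypothetical helper: list files whose names end in ".config", including hidden ones
find_config_files <- function(project_dir) {
  list.files(project_dir, pattern = "\\.config$", all.files = TRUE, full.names = TRUE)
}

# Demonstration on a temporary directory standing in for a project directory
demo_dir <- file.path(tempdir(), "mitopilot_demo_project")
dir.create(demo_dir, showWarnings = FALSE)
writeLines("// placeholder contents", file.path(demo_dir, ".config"))

configs <- find_config_files(demo_dir)
print(basename(configs))
```

For the test project you would call find_config_files(wd), then open the result with file.edit() to review it in RStudio.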

Exploring the MitoPilot GUI

We can now launch the R Shiny Graphical User Interface (GUI) to examine our test project and start the MitoPilot analysis pipeline.

# the function to launch the GUI must be called from within your project's directory
setwd(wd)
MitoPilot::MitoPilot()

Nice! We can see some basic information about our samples. When initializing your own project, this is pulled from the mapping CSV file.

The dropdown menu at the top left switches between the different MitoPilot modules: Assemble, Annotate, and Export. If necessary, the circular arrow button in the top left will refresh the sample table.

Clicking on a column name will sort the table by that column.

You can filter samples using the search box in the top right. Most columns can also be filtered using the text boxes at the top of the table.

To the left of the sample ID column are two icons. The first icon shows whether the sample is locked or unlocked. If a sample is unlocked, it will be included when running the current pipeline module. A locked sample cannot be edited for the current module, but will be made available for the next pipeline module. You can toggle the lock by selecting the sample (check mark), then pressing LOCK at the top of the window. Multiple samples can be locked or unlocked simultaneously.

The other icon shows the state of the sample. These states vary depending on the analysis module and will be automatically updated when running the pipeline. However, you can also manually change the state of a sample using the STATE button. For example, in the Assemble module you could prevent samples from running through the pipeline by manually setting their states to Hold / Waiting.

Modifying Pipeline Parameters

Within the GUI, we can modify options for each step of the current pipeline module. Click on one of the default links in the Preprocess Opts column.

Here you can see the default options for pre-processing your raw FASTQ files. We can modify these by clicking the edit checkbox. Let’s change the memory to 20 GB. You could also change the options passed to fastp, but we’ll keep those at the defaults for now.

We can save these new parameter options by clicking on the Parameter set name box, typing a new name, and clicking Add YOUR NEW NAME... in the dropdown. Finally, click Update in the bottom right to save your selection.

Once you’ve saved a new parameter set, you can easily access it again using the dropdown menu.

After clicking Update, your table should now show test for all samples in the Preprocess Opts column.

You can set different parameters for different samples by selecting the samples you want to change and repeating the above process. But for this test dataset, let’s keep everything consistent.

Let’s also change the Assembly Opts. Repeat the process above, creating a new himem parameter set with 6 CPUs and 36 GB of memory.

In the Assembly Opts window, you can also modify the parameters for GetOrganelle, including the seeds and labels databases. The default fish reference databases are downloaded from the MitoPilot GitHub repository. You could use custom databases by providing the full path to the appropriate FASTA files on your local computer or cluster. For this test project, we’ll leave all of the GetOrganelle settings at the default values.

Running the Assemble Module

Let’s get started with the pipeline! Select all samples, then click the UPDATE button. A new window should appear with a Nextflow command.

You have three options for running the pipeline. First, you could copy the Nextflow command and run it in a terminal window. This requires you to maintain an open connection while Nextflow is running, which may cause issues for large, complex datasets.

Alternatively, you could embed the Nextflow command within a batch job submission script for a computing cluster. This allows you to run the analyses in the background, which is preferable for datasets with a large number of samples that may take several hours to process.

We have provided instructions for running MitoPilot Nextflow commands as batch jobs on the Smithsonian Hydra and NOAA SEDNA computing clusters.
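As a generic illustration of the batch job approach, the sketch below writes a small submission script from R. It assumes a SLURM scheduler. The helper name write_batch_script and all resource values are placeholders, and the command string is a stand-in for the Nextflow command shown in the GUI. Your cluster may use a different scheduler or require different directives, so treat the cluster-specific instructions as authoritative.

```r
# Hypothetical helper: wrap a command in a minimal SLURM batch script
write_batch_script <- function(cmd, script_path, setup = character(0),
                               job_name = "mitopilot", cpus = 1, mem_gb = 4, hours = 12) {
  lines <- c(
    "#!/bin/bash",
    paste0("#SBATCH --job-name=", job_name),
    paste0("#SBATCH --cpus-per-task=", cpus),
    paste0("#SBATCH --mem=", mem_gb, "G"),
    sprintf("#SBATCH --time=%02d:00:00", as.integer(hours)),
    "",
    setup,  # optional environment setup lines, run before the command
    cmd     # the command copied from the MitoPilot GUI goes here
  )
  writeLines(lines, script_path)
  invisible(script_path)
}

# Placeholder command; replace it with the Nextflow command from the GUI
cmd <- "echo 'replace this line with the Nextflow command shown in the GUI'"
script <- write_batch_script(cmd, tempfile(fileext = ".sh"))
cat(readLines(script), sep = "\n")
```

The resulting file could then be submitted with your scheduler's submission command.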

The final option is the Start Nextflow button, which will run the analysis pipeline within the GUI. This requires you to keep the GUI open while the pipeline is running.

Warning: Currently, the Start Nextflow button is not compatible with the NOAA SEDNA computing cluster. Please use the command line or batch job method, making sure to run mamba activate MitoPilot_deps prior to calling Nextflow.

Since our test dataset is small, let’s run Nextflow within the GUI. Click Start Nextflow to launch the pipeline.

And we’re off! You should see Nextflow output being printed to the Progress window. This window will continually update as the pipeline progresses. As long as the gears in the top right are spinning, the pipeline is still running.

The Assemble pipeline module has 3 steps: preprocessing (filtering) the raw FASTQ files with fastp, assembling the mitogenomes using GetOrganelle, and calculating coverage maps for the assemblies using bowtie2.

If you’re working on a computing cluster, Nextflow distributes the analyses across multiple batch jobs. This can allow you to run hundreds of samples simultaneously. You can use your cluster’s job scheduler to check on the status of MitoPilot jobs. Some samples may finish faster than others.

You may notice that some samples fail at certain steps (though this is unlikely for the test dataset). Often this is due to a memory error, so MitoPilot will retry failed samples with more RAM. The Nextflow log tracks the number of failed samples and retries.

The test dataset should take a few minutes to finish. When it’s done, your Progress window should look something like this.

If you scroll to the bottom of the Progress window, you can find some runtime statistics.

Click Close to return to the samples table.

Inspecting Assembly Results

The sample table should now be updated with new information about the mitogenome assemblies.

If you want to locate the results files for a sample, scroll all the way to the right and click output. This will open the appropriate folder in the Files pane of your RStudio session.

You can also view the results within the MitoPilot GUI. For example, select SRR21843972 (Stomias affinis), then click details. This will open a new window with the mitogenome sequence.

Select the sequence and click the Fasta button in the bottom right. This will copy the FASTA formatted mitogenome to your clipboard, which you can paste into your favorite text editor. This could be useful for a quick BLAST search.

You can also click the view button, which will open a summary figure in a new tab, including mean read depth, sequence error rate, and GC content.
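If you paste a copied FASTA record into R, a few lines of base R are enough to check its length and overall GC content. This is a minimal sketch with a made-up sequence. The function name fasta_summary is hypothetical, and it reports a single whole-sequence value rather than the windowed values shown in the summary figure.

```r
# Hypothetical helper: summarize a single FASTA record supplied as text
fasta_summary <- function(fasta_text) {
  lines <- trimws(unlist(strsplit(fasta_text, "\n", fixed = TRUE)))
  lines <- lines[nzchar(lines)]
  is_header <- startsWith(lines, ">")
  header <- sub("^>", "", lines[is_header][1])
  # join all sequence lines, then split into single bases
  bases <- strsplit(toupper(paste(lines[!is_header], collapse = "")), "", fixed = TRUE)[[1]]
  list(
    header = header,
    length = length(bases),
    gc_fraction = mean(bases %in% c("G", "C"))
  )
}

# Made-up record for illustration only
example_fasta <- ">demo_sequence\nACGTACGT\nGGCC\n"
res <- fasta_summary(example_fasta)
str(res)
```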

Notice that in this figure, read depth drops off dramatically at both ends of the sequence. GetOrganelle was unable to assemble a circular mitogenome for SRR21843972 due to poor read coverage in this region. To assemble a circular mitogenome, you could try running the Assemble module again with different GetOrganelle settings (see their Wiki) or more sequence data.

Problematic Samples

In this test dataset, there are two problematic samples which returned failed states. You can see which samples failed by looking for the exclamation mark state icon. Selecting a sample with that icon and clicking on the STATE button will confirm.

One failed sample is SRR22396758 (Upeneus parvus). We purposefully truncated the data for this sample to contain only 200 reads.

With so few reads, this sample failed to meet the min_depth threshold and returned a message “Insufficient sequencing depth” in the Notes column. When analyzing your own data, you can specify the min_depth threshold with the new_project function.
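To get a rough sense of whether a sample is likely to fall below such a threshold, you could count the reads in a FASTQ file before starting a project. The sketch below assumes standard four-line FASTQ records. Both the helper count_fastq_reads and the threshold value are illustrative. MitoPilot's own check may be applied at a different stage (for example, after filtering), so this is only an approximation.

```r
# Hypothetical helper: count reads in a FASTQ file (plain or gzipped),
# assuming each record spans exactly four lines
count_fastq_reads <- function(path) {
  con <- if (grepl("\\.gz$", path)) gzfile(path, "r") else file(path, "r")
  on.exit(close(con))
  n_lines <- 0L
  repeat {
    chunk <- readLines(con, n = 40000L)
    if (length(chunk) == 0L) break
    n_lines <- n_lines + length(chunk)
  }
  n_lines %/% 4L
}

# Demonstration with a tiny made-up FASTQ file containing three reads
demo_fastq <- tempfile(fileext = ".fastq")
writeLines(rep(c("@read", "ACGT", "+", "IIII"), times = 3), demo_fastq)
n_reads <- count_fastq_reads(demo_fastq)

min_reads <- 1000  # illustrative threshold only, not a MitoPilot default
if (n_reads < min_reads) {
  message("Only ", n_reads, " reads; this sample may not meet the depth threshold.")
}
```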

The other failed sample is SRR21844202 (Fundulus majalis). This sample had plenty of data and we were able to assemble a mitogenome. However, the message in the Notes column says “Unable to resolve single assembly from reads.”

GetOrganelle produced two alternate assembly paths for this sample. This is usually due to complicated assembly graphs near a repeat region. Please refer to the GetOrganelle paper for more information about multiple assembly paths.

Let’s take a peek at SRR21844202 (Fundulus majalis). Select the sample, scroll all the way to the right, and click on details.

We can see two assemblies listed here. Clicking on view will show us the coverage, error, and GC content graphs for each assembly.

Path #1

Path #2

The two assembly paths differ slightly around 13,000 bp.

We can choose to move forward with just one assembly path by clicking the “ignore” button for one of the assembly paths.

Alternatively, we can use the consensus sequence. Select both paths and click the Align button in the bottom right. The sequence alignment will pop up, showing us that the two paths have 99.9897% sequence similarity.

If we scroll through the alignment, we can see a few base pair differences.
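The similarity value can be thought of as the share of alignment columns where the two paths agree. The sketch below computes that share for two short, made-up aligned sequences and reports where they differ. The helper name compare_aligned is hypothetical, and how MitoPilot treats gaps when reporting similarity is not covered here.

```r
# Hypothetical helper: percent identity and mismatch positions for two
# sequences that are already aligned (equal length)
compare_aligned <- function(a, b) {
  x <- strsplit(toupper(a), "", fixed = TRUE)[[1]]
  y <- strsplit(toupper(b), "", fixed = TRUE)[[1]]
  stopifnot(length(x) == length(y))
  list(
    percent_identity = 100 * mean(x == y),
    mismatch_positions = which(x != y)
  )
}

# Made-up aligned sequences that differ at a single position
path1 <- "ACGTACGTAC"
path2 <- "ACGTTCGTAC"
cmp <- compare_aligned(path1, path2)
print(cmp)
```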

Selecting Trim Consensus will remove any conflicting regions of the aligned assembly paths and produce a shorter consensus sequence of both assembly paths. Doing so will automatically ignore the original two assembly paths.

Click Close and return to the samples table. You will see that the # Paths column for SRR21844202 is highlighted and there is a note indicating that the assembly was edited. This sample has also been automatically changed to a locked state.

Warning: Carefully consider your options for samples with multiple assembly paths. You may wish to align each assembly against a reference or compare depth of sequencing coverage. There is no one-size-fits-all solution.

Running the Annotate Module

We can now move on to the Annotate pipeline module. This module consists of an annotation step using Mitos2 and tRNAscan-SE, a taxon-specific curation step to refine the annotation models, and a validation step to flag possible issues or known errors that would be rejected by NCBI GenBank.

First, we need to lock all of the successful samples in the Assemble module. Select all of the samples except SRR22396758 (Upeneus parvus) and click the LOCK button. Then use the dropdown menu in the top left to navigate to the Annotate module.

Like before, let’s edit the Annotate Opts and increase the memory allocation to 60 GB. In this window, you could also edit the Mitos2 and tRNAscan-SE options. In the future, we will allow users to specify custom reference databases for annotation.

You can also edit the curation options. Currently, we only have one set of curation parameters: fish_mito. This will be updated in the near future.

To run the Annotate module, select all samples, click UPDATE, then click Start Nextflow. This will take a few minutes. As long as the gears in the top right are spinning, the pipeline is still running.

Once all steps are complete, the gears will stop spinning and you’ll see a summary message printed at the bottom of the Progress window. Click Close to return to the sample table.

Exploring Annotation Results

There are a LOT of results to explore from the Annotate module. The sample table reports some basic stats about the number of protein-coding genes (PCGs), tRNAs, and rRNAs. The missing column reports which mitochondrial genes were not annotated (based on the provided curation model). The extra column notes the number of potentially duplicated genes.

Lastly, the warnings column indicates how many warning flags were raised during the validation step. Samples with many warnings will require more manual curation to ensure that they are not rejected during submission to NCBI GenBank. Warning messages are explained in further detail here.

The annotation results for each sample can be examined more closely by clicking the details button. First, let’s examine the details for a good sample, SRR19434536 (Rhinecanthus rectangulus).

This sample has the expected number of genes and has no warnings. However, the notes column shows that MitoPilot still made some tweaks to the annotation during the curation step. For example, the start position of rRNA rrnL was moved 22 bp upstream and the stop codon for NAD2 was trimmed by 2 bp.

Nucleotide sequences (and amino acid sequence for PCGs) can be copied to the clipboard using the nt (and aa) buttons on the far right. This could be helpful if you’d like to manually BLAST some genes.

Clicking on the Coverage Map button will show a plot of sequence depth, zooming to the position of the highlighted gene.

Note: The Coverage Map feature does not work consistently on the NMNH Hydra cluster. We are working on a fix.

For protein coding genes, you can click the Alignment button to show the protein alignment of your annotated gene against a reference database. Currently, this shows only the top hits (filtered in BLAST using -best_hit_score_edge 0.01) from the curation process, which may be one or more sequences.

By default, MitoPilot uses RefSeq as the BLAST database. If you would like to use your own custom BLAST database, you can check the Local blast box. MitoPilot will return a message with the instructions: run options('MitoPilot.local.db' = '/path/to/local/blastp/db') within your RStudio session. You will need to restart the MitoPilot GUI for this change to take effect.
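In practice, setting and checking the option looks like this. The path is the placeholder from the message above and should be replaced with the location of your own database.

```r
# Placeholder path taken from the MitoPilot message; replace with your own database
options("MitoPilot.local.db" = "/path/to/local/blastp/db")

# Confirm the option is set for the current R session
local_db <- getOption("MitoPilot.local.db")
print(local_db)
```

Options set this way last only for the current R session, so you would need to set it again after restarting R (or add the line to your .Rprofile).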

Editing Annotation Results

MitoPilot offers some basic functions to manually edit your annotation results. Let’s open the alignment for the SRR19434536 (Rhinecanthus rectangulus) “cox1” gene. Clicking the EDIT button brings up a few options to change the start and stop position annotation.

Try clicking the + button next to START. It will take a few seconds, since MitoPilot has to redo the alignments. But you should see that the “cox1 (focal)” sequence start position has shifted.

This is clearly a worse alignment. Click RESET to undo your changes.

You can also delete annotations by selecting them and clicking Delete at the bottom of the details window. Delete with caution; there’s no easy way to add the annotation back without running the sample through the Annotate module again.

The Linearize button will convert a circular assembly to a linear assembly. This may be useful if the D-loop region was poorly assembled or annotated.

You can use Mark as reviewed/unreviewed to change the state of the Reviewed column, which may be helpful for tracking the progress of your manual edits.

Please note that annotation editing is a bit temperamental and you may encounter bugs. Please report any problems on the MitoPilot GitHub issues page.

Running the Export Module

Let’s move to the final step of the pipeline: the Export module. This module allows you to create groups of samples, then generate alignments and files formatted for submission to GenBank.

First, we need to lock the successful samples in the Annotate module. Select all of the samples and click the LOCK button. Then use the dropdown menu in the top left to navigate to the Export module.

Next, let’s create a group of samples. Sort the samples table by clicking on the Taxon column, then select the first five samples and click the GROUP button.

A new window will appear showing some summary information for the selected samples. Type a name for the group, click Add.., then click Create. You can then click Close to return to the sample table.

The sample table now shows the group assignment in the Group column.

Note: You can assign new samples to an existing group. Simply select the samples you wish to add, click GROUP, then choose the existing group from the dropdown menu.

Note: Each sample can only belong to one group. MitoPilot will produce a warning message if you attempt to re-assign a sample to a new group.

We can proceed to exporting the data for this group. Click EXPORT DATA, which will open a new window. Only samples that have been added to a group can be exported in this manner.

If you have multiple groups, you can select the appropriate one from the dropdown menu.

MitoPilot allows you to create a custom FASTA header for your samples, pulling data from columns of the CSV file you supplied for the mapping_fn argument of the MitoPilot::new_project function. This conveniently allows you to include metadata needed for your NCBI GenBank submission. To reference a column, use curly brackets. For example, organism={Taxon} will autofill values in the “Taxon” column into the FASTA header.

For this test project, we can leave the Fasta Header Template at the default value.
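To make the curly-bracket substitution concrete, here is a small base R sketch of the idea. It is not MitoPilot's implementation. The helper fill_header_template, the ID column name, and the template string are assumptions for illustration; only the {Taxon} placeholder comes from the example above.

```r
# Hypothetical helper: replace each {Column} placeholder with the value
# from the matching column of a one-row data frame
fill_header_template <- function(template, row) {
  fields <- regmatches(template, gregexpr("\\{[^{}]+\\}", template, perl = TRUE))[[1]]
  for (f in unique(fields)) {
    col <- substr(f, 2, nchar(f) - 1)  # strip the surrounding braces
    if (!col %in% names(row)) stop("Column not found in mapping data: ", col)
    template <- gsub(f, as.character(row[[col]]), template, fixed = TRUE)
  }
  template
}

# One-row stand-in for a mapping CSV (column name "ID" is an assumption)
mapping <- data.frame(
  ID = "SRR22396794",
  Taxon = "Psychrolutes paradoxus",
  stringsAsFactors = FALSE
)

header <- fill_header_template("{ID} organism={Taxon}", mapping[1, ])
print(header)
```

Referencing a column that does not exist in the mapping data stops with an error in this sketch, which is a useful check before exporting a large group.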

The Generate Group-level PCG alignment summary option will run alignments of all the protein coding genes in your group. For large groups, this can take a while. But it’s useful for a final quality control check.

Let’s toggle the Export individual genes button. This will generate additional FASTA files and GenBank feature tables for each protein coding gene. This can be helpful if you want to use individual genes for phylogenetic analyses or submit them to GenBank. These gene FASTA files have their own custom header template that you can modify.

Click Export to generate the final files. It may take a couple of minutes, but as long as the gears are spinning, MitoPilot is still running.

Once complete, MitoPilot will print the location of the exported files. By default, that location is YOUR_PROJECT_DIRECTORY/out/export/YOUR_GROUP_NAME.
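If you prefer to inspect the export folder from the R console, you can build the default path yourself and list its contents. This is a minimal sketch. The helper export_path is hypothetical, and the project directory and group name used below are placeholders.

```r
# Hypothetical helper: build the default export location described above
export_path <- function(project_dir, group_name) {
  file.path(project_dir, "out", "export", group_name)
}

# Placeholder values; substitute your own project directory and group name
demo_path <- export_path("/path/to/project", "my_group")
print(demo_path)

# list.files() returns an empty vector if the directory does not exist yet
exported <- list.files(demo_path, recursive = TRUE)
print(exported)
```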

Exploring the Final Results

MitoPilot will produce three output files. First, the .html file contains visualizations of each protein coding gene alignment in your export group. This may be useful to quickly identify samples with poor annotations that need additional manual curation.

For example, the ATP8 annotation for “SRR22396640” clearly stands out when compared with the other samples in this export group.

Next are the two files you will need for submission to NCBI GenBank. The .fasta file contains the mitogenome assemblies for each sample, following the header template you specified when exporting the data. MitoPilot attempts to adjust the start position of every assembly to the start of the trnF gene.
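Adjusting the start position of a circular sequence amounts to a rotation: everything before the new start is moved to the end. The sketch below shows the idea on a made-up string. The helper rotate_circular is hypothetical and only handles the sequence itself, whereas a real adjustment also has to shift the annotation coordinates.

```r
# Hypothetical helper: rotate a circular sequence so that `start` becomes position 1
rotate_circular <- function(seq, start) {
  n <- nchar(seq)
  stopifnot(start >= 1, start <= n)
  if (start == 1) return(seq)
  paste0(substr(seq, start, n), substr(seq, 1, start - 1))
}

# Made-up sequence; pretend the gene of interest begins at position 7
rotated <- rotate_circular("AAACCCGGGTTT", 7)
print(rotated)
```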

Additionally, there is the .tbl file, a standardized 5-column, tab-delimited feature table containing information about the annotated genes for each mitogenome.

With a real dataset, you could use these two files for submission to NCBI GenBank.
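For a quick look at a feature table in R, the sketch below parses the feature lines (start, stop, feature key) and skips the indented qualifier lines. It follows the general layout of the GenBank five-column format, where each record begins with a ">Feature" line. The helper parse_feature_table and the example records are made up, and multi-interval features are not handled.

```r
# Hypothetical helper: extract feature lines from a 5-column feature table
parse_feature_table <- function(lines) {
  seq_id <- NA_character_
  out <- list()
  for (ln in lines) {
    if (startsWith(ln, ">Feature")) {
      # the sequence identifier follows the ">Feature" keyword
      seq_id <- strsplit(trimws(ln), "[ \t]+")[[1]][2]
      next
    }
    f <- strsplit(ln, "\t", fixed = TRUE)[[1]]
    # feature lines have start, stop, and key; qualifier lines start with empty columns
    if (length(f) >= 3 && nzchar(f[1]) && nzchar(f[3])) {
      out[[length(out) + 1]] <- data.frame(
        seq_id = seq_id,
        start = as.integer(gsub("[<>]", "", f[1])),  # drop partial-feature markers
        end = as.integer(gsub("[<>]", "", f[2])),
        key = f[3],
        stringsAsFactors = FALSE
      )
    }
  }
  do.call(rbind, out)
}

# Made-up example records
demo_tbl <- c(
  ">Feature demo_sample",
  "1\t68\ttRNA",
  "\t\t\tproduct\ttRNA-Phe",
  "69\t1020\trRNA",
  "\t\t\tproduct\t12S ribosomal RNA"
)
features <- parse_feature_table(demo_tbl)
print(features)
```

With a real export you would pass readLines() on the .tbl file instead of the demo vector.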

The export directory also contains a sub-directory GFFs with annotations for each sample in GFF3 format. GFF files can be loaded by tools like Geneious for additional manual inspection prior to submission.
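GFF3 files are plain tab-delimited text with nine standard columns, so they can also be read directly into R. The sketch below writes a tiny made-up GFF3 file and reads it back. The helper read_gff3 is hypothetical, and it assumes the file has no embedded FASTA section.

```r
# Hypothetical helper: read a GFF3 file into a data frame, skipping "#" lines
read_gff3 <- function(path) {
  read.delim(
    path, header = FALSE, comment.char = "#", quote = "", stringsAsFactors = FALSE,
    col.names = c("seqid", "source", "type", "start", "end",
                  "score", "strand", "phase", "attributes")
  )
}

# Made-up two-line example written to a temporary file
demo_gff <- tempfile(fileext = ".gff")
writeLines(c(
  "##gff-version 3",
  "demo_sample\texample\tgene\t5500\t7050\t.\t+\t.\tID=gene1;Name=cox1",
  "demo_sample\texample\tCDS\t5500\t7050\t.\t+\t0\tID=cds1;Parent=gene1"
), demo_gff)

gff <- read_gff3(demo_gff)
print(table(gff$type))
```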

Lastly, if you selected the Export individual genes option, there will be another directory genes containing sub-directories for every protein coding gene. In these you will find FASTA files and feature tables for the corresponding gene. There will also be a concatenated FASTA file and feature table containing all protein coding genes, named GROUP_PCGs.fasta/tbl.

Wrap-up

Congratulations, you’ve reached the end of the test project tutorial! Hopefully you now have a solid understanding of the MitoPilot interface and can begin to analyze and explore your own mitogenome datasets.