How to use MitoPilot on the Smithsonian Hydra computing cluster
You will need an account to access the Hydra computing cluster. Instructions are available here.
First time setup
Dan MacGuigan has submitted a request to the Hydra team to install a Nextflow module. Until that module is available, you will need to install your own copy of Nextflow on the cluster. Log in to Hydra and run the following.
# Nextflow installation instructions
# from https://www.nextflow.io/docs/latest/install.html
cd ~
module load tools/java/21.0.2
curl -s https://get.nextflow.io | bash # install Nextflow
chmod +x nextflow # make Nextflow executable
There will now be an executable nextflow file in your home directory. You should move it to a location that is in your PATH. For example:
mkdir -p ~/bin # create bin directory, if needed (-p avoids an error if it already exists)
mv ~/nextflow ~/bin/nextflow # move nextflow to bin directory
echo 'export PATH="${HOME}/bin:${PATH}"' >> ~/.bashrc # add bin directory to PATH, in case it's not already there
source ~/.bashrc
This should allow you to call nextflow from anywhere on the cluster.
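Running the echo line above more than once adds duplicate entries to ~/.bashrc. A minimal sketch of one way to avoid that, assuming a hypothetical helper named add_to_path_once (not part of MitoPilot or Hydra) that appends the export line only when it is not already in the file:

```shell
# Hypothetical helper: append an export line for directory $1 to the
# startup file $2, but only if that exact line is not already present.
add_to_path_once() {
  dir=$1
  rcfile=$2
  line="export PATH=\"${dir}:\${PATH}\""
  # grep -F: fixed string, -x: match the whole line, -q: no output.
  # A missing startup file counts as "line not present".
  if ! grep -Fxq "$line" "$rcfile" 2>/dev/null; then
    printf '%s\n' "$line" >> "$rcfile"
  fi
}
```

For example, add_to_path_once "${HOME}/bin" "${HOME}/.bashrc" followed by source ~/.bashrc should have the same effect as the commands above, while leaving the file unchanged on repeat runs.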
Note: You must load the Hydra Java module (module load tools/java/21.0.2) whenever you wish to use Nextflow.
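Because Nextflow depends on the Java module, it can help to confirm that both tools are reachable before starting a run. A minimal sketch, assuming a hypothetical helper named check_tools (not part of Nextflow or Hydra) that reports any command missing from your PATH:

```shell
# Hypothetical helper: check that every named command can be found on PATH.
# Prints the missing ones to stderr and returns non-zero if any are absent.
check_tools() {
  missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "not found: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

On Hydra, a typical use would be module load tools/java/21.0.2 followed by check_tools java nextflow. If that succeeds, nextflow -version should print the installed version.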
Launching RStudio server
We will use RStudio server to run MitoPilot. RStudio server functions much like RStudio on your local computer, but it uses the Hydra cluster’s data storage and computational resources.
There are two ways to access RStudio server on Hydra.
Tunneling to an RStudio server session
We recommend always using an interactive session when tunneling to RStudio server. This avoids unnecessary computational burden on the login nodes. To launch an interactive session, run the following.
Note: You must include -pe mthread 2 in order to have enough available RAM for building the MitoPilot Singularity image.
Note: Interactive sessions on Hydra can run for a maximum of 24 hours. Additionally, users are limited to one active interactive session at a time.
Once your interactive session has started, launch RStudio server.
# avoid package conflicts (may not be necessary for all users)
conda deactivate
# load the RStudio server module
module load tools/R/RStudio/server
# launch RStudio server
start-rstudio-server
Note: If this is your first time launching RStudio server, you may be asked to run a different command.
You will see something like this printed to your screen.
start-rstudio-server: starting RStudio server on host=login02 and port=8787
you need to create a ssh tunnel on your local machine with
ssh -N -L 8787:login02:8787 macguigand@hydra-login01.si.edu
Point your browser to http://localhost:8787 on your local machine.
Use Control+C in this window to kill the server when done.
TTY detected. Printing informational message about logging configuration. Logging configuration loaded from '/etc/rstudio/logging.conf'. Logging to '/home/macguigand/.local/share/rstudio/log/rserver.log'.
Note: If you get a message saying “ERROR system error 98 (Address already in use)”, someone else has already established a tunnel on the default port (8787). To fix this, try a different port, e.g. start-rstudio-server -port 8890. Any port number between 1025 and 65535 is allowed.
Leave this cluster terminal window open, open a new terminal window on your local computer, and run the ssh command printed by start-rstudio-server.
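Copying the printed ssh command is the safest option. If you need to rebuild it for a different port, the pattern in the example output can be sketched as follows. The helper name tunnel_cmd is hypothetical, and the login host hydra-login01.si.edu is taken from the example output above, so adjust it if your output differs.

```shell
# Hypothetical helper: print the ssh tunnel command for the host running
# RStudio server ($1), the port ($2), and your Hydra username ($3).
# Ports outside the allowed 1025-65535 range are rejected.
tunnel_cmd() {
  host=$1
  port=$2
  user=$3
  case "$port" in
    ''|*[!0-9]*) echo "port must be a number" >&2; return 1 ;;
  esac
  if [ "$port" -lt 1025 ] || [ "$port" -gt 65535 ]; then
    echo "port must be between 1025 and 65535" >&2
    return 1
  fi
  echo "ssh -N -L ${port}:${host}:${port} ${user}@hydra-login01.si.edu"
}

tunnel_cmd login02 8890 macguigand
# prints: ssh -N -L 8890:login02:8890 macguigand@hydra-login01.si.edu
```

The same port number appears twice in the command: once for your local machine and once for the cluster host, which is why the browser address uses that port as well.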
Enter your Hydra password when prompted. If nothing happens after you enter it, you have successfully established an SSH tunnel and can connect to RStudio server.
Leaving both terminal windows open, enter http://localhost:YOUR_PORT_NUMBER in a web browser. We recommend using Chrome or Firefox. There are some known issues running MitoPilot with Safari.
Enter your cluster login credentials to access the RStudio server. This should open a full RStudio session in your browser. Any R commands run in this RStudio window will execute on the cluster.
RStudio Galaxy server
The Hydra Team recently launched a new interactive RStudio environment that is accessible directly via a browser, at https://galaxy.si.edu/R4.
Hydra users can leverage this server to test, debug, and develop R-based workflows using the interactive RStudio GUI (currently running R 4.4.3).
By logging in with your Hydra account credentials, you will have access to the storage under /pool, /scratch, and /store. This server offers resources totaling 192 CPUs and 1.5 TB of RAM.
Notes:
- This is a shared resource and should be used accordingly. Long-running jobs, or jobs that require all of the server's resources, are better suited to a job submission.
- This server is only accessible from trusted computers, not on the public internet. However, if you can access Hydra, you should be able to access this server. For technical reasons, to access this resource via telework.si.edu, go to https://galaxy.si.edu and then choose the “R4 v443” option.
- This is a new resource - please be patient as we test this offering with our user community. We will evaluate this test once Hydra is moved to the new data center and decide whether it should be kept or altered in any way.
Installing MitoPilot
To install MitoPilot, use the RStudio server window to run the following. This might take a while.
# install each helper package only if it is missing
for (pkg in c("BiocManager", "remotes")) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
}
# install MitoPilot from GitHub
BiocManager::install("Smithsonian/MitoPilot")
If the installation was successful, you’re ready to start using MitoPilot!
Updating MitoPilot
If you need to update MitoPilot, simply run the BiocManager installation command again. If you would like to ensure that you’re using the latest MitoPilot version, run remove.packages("MitoPilot") prior to installation.
After updating MitoPilot, we recommend restarting R (in RStudio, Session > Restart R or run .rs.restartR()) and then reloading the package with library(MitoPilot).
We also recommend clearing your Singularity cache with singularity cache clean to ensure you are using the latest MitoPilot Singularity image.
Launching MitoPilot
To load the MitoPilot R package, run library(MitoPilot) within your RStudio server session. You can now utilize all of MitoPilot’s functions, such as initializing a project or opening the R Shiny GUI.
Want to learn how to use MitoPilot? Check out the Test Project Tutorial.
Running Large MitoPilot Jobs
If you have a large number of samples to process (more than a few dozen), we recommend running the assemble and annotate MitoPilot modules as batch jobs.
Running these modules within the R Shiny GUI requires you to maintain an open connection to the cluster. There may be issues restarting if the connection breaks while Nextflow is running. Instead, we can “fire and forget” by submitting batch jobs.
First, initialize your new project and modify any desired parameters using the GUI. Once ready, click UPDATE. A new window should appear.
Rather than clicking the Start Nextflow button, copy the Nextflow command and create a submission script. We have provided a template below. You may wish to modify the job name (-N) and the log file name (-o).
#!/bin/sh
#$ -N MitoPilot_assembly # MODIFY THIS IF DESIRED
#$ -o MitoPilot_assembly.log # MODIFY THIS IF DESIRED
#$ -cwd -j y
#$ -q lTWFM.sq
#$ -l wfmq
#$ -pe mthread 2
#$ -S /bin/sh
echo + `date` job $JOB_NAME started in $QUEUE with jobID=$JOB_ID on $HOSTNAME
source ~/.bashrc
module load tools/java/21.0.2 # required for Nextflow on Hydra
# NEXTFLOW COMMAND, example below
nextflow -log /pool/public/genomics/macguigand/MitoPilot/22030FL-06-02/run_02/.logs/nextflow.log run /home/macguigand/R/x86_64-pc-linux-gnu-library/4.4/MitoPilot/nextflow -c /pool/public/genomics/macguigand/MitoPilot/22030FL-06-02/run_02/.config -entry WF1
echo = `date` job $JOB_NAME done
Note: You must use the options -q lTWFM.sq and -l wfmq. This is a special Hydra queue for workflow managers like Nextflow. You must also include -pe mthread 2 in order to have enough available RAM for building the MitoPilot Singularity image.
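Since a missing directive is easy to overlook when editing the template, a quick check before submission may be useful. A minimal sketch, assuming a hypothetical helper named check_directives (not a Hydra or Grid Engine tool) and assuming each directive sits on its own #$ line as in the template:

```shell
# Hypothetical helper: confirm that the submission script $1 contains the
# three directives required for MitoPilot batch jobs on Hydra.
check_directives() {
  script=$1
  status=0
  for opt in '-q lTWFM.sq' '-l wfmq' '-pe mthread 2'; do
    # grep -F treats the pattern as a fixed string; -q suppresses output
    if ! grep -Fq -- "#\$ $opt" "$script"; then
      echo "missing directive: #\$ $opt" >&2
      status=1
    fi
  done
  return "$status"
}
```

Calling check_directives MY_SCRIPT_NAME.sh returns success when all three lines are present and otherwise lists the missing ones.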
Move the submission script into your MitoPilot run directory (in the above example, /pool/public/genomics/macguigand/MitoPilot/22030FL-06-02/run_02/). Then submit the job using qsub MY_SCRIPT_NAME.sh.
You can monitor the progress of this job using the qstat command and by checking on the log files. Once the job is done, you can relaunch the GUI to inspect the results. The same approach can be used for the annotate module.
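The template script ends by echoing a line of the form "= <date> job <name> done", so the log file can be used to tell whether the script ran to the end. A minimal sketch, assuming a hypothetical helper named job_finished and the default log name from the template:

```shell
# Hypothetical helper: succeed if the log file $1 contains the closing
# "= <date> job <name> done" line written by the template script.
job_finished() {
  grep -q '^= .* job .* done$' "$1"
}
```

For example, job_finished MitoPilot_assembly.log && echo "finished" prints a message only after the closing line has been written. Note that this line shows the script reached its end, not that Nextflow succeeded, so it is still worth reading the log itself. While the job is running, qstat -u "$USER" lists your queued and active jobs, and tail -n 20 MitoPilot_assembly.log shows the most recent log output.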