Home

This repository contains all bioinformatics workflows used on the St. Jude Cloud project. Officially, the repository is in beta — the project is adding workflows as they are developed and put into production.

Resources requirements and task parallelization have been optimized to minimize failures, but they may not reflect the best settings for your use case. Please ensure that you tailor these parameters to fit your needs.

🏠 Homepage

Workflows Structure

The repository contains workflows from multiple categories:

Harmonization - These workflows are standard entrypoints used by the St. Jude Cloud team to harmonize data for release on stjude.cloud.
Reference - These workflows are used to generate the necessary reference file inputs for our other workflows.
Utility - These workflows are generally called as subworkflows of other workflows, but may have utility on their own.
Variant Calling - These workflows are used to call a set of variants for a sample.
Other - These workflows are typically subworkflows of other workflows that should not be called in isolation or they are experimental workflows.
External - These workflows are hosted in other repositories, but are used by one (or more) of the workflows in this repository.

Expected FASTQ file name conventions

The tasks and workflows in this repository which have one or more FASTQ files as an input will also have a prefix input which will determine the filenames for any output files. The prefix input can be specified manually, or it can be left at the default value. The default value will attempt to strip common file suffixes from one of the input FASTQs and determine an appropriate basename to be used by all output files. That evaluation is most commonly performed using the POSIX ERE Regular Expression (([_.][rR](?:ead)?[12])((?:[_.-][^_.-]*?)*?))?\.(fastq|fq)(\.gz)?$. In plain English, this REGEX will at a minimum search for and remove the file extensions .fastq and .fq with or without a .gz GZIP extension. Additionally, if the FASTQ filename contains a "read number" signifier (R1/R2/r1/r2/read1/read2) somewhere before the FASTQ extension, that will be truncated off the basename. This means that everything after the read indicator will be removed. If there is important information encoded in your filenames between the read number and the final extension, we recommend you manually specify an appropriate prefix value.

Examples

Every filename in the following list will have the evaluated prefix "sample" if no override value is provided.

sample_R1_100000.fastq.gz
sample_R2.fq
sample.fq.gz
sample.r1_100000.trimmed.fastq.gz
sample.R2_100000.trimmed-kebab.fastq.gz
sample_r1.100000.trimmed-kebab.foobar.fastq.gz
sample.Read2_100000.trimmed-kebab.fastq
sample_read1.100000.trimmed-kebab.fq

A FASTQ with the filename sample.100000-kebab.foobar.fastq.gz would have a default prefix value of "sample.100000-kebab.foobar".

The following filenames will not result in any trimming of the filename, and should likely be either renamed or have a manually specified prefix:

sample_R_one.FASTQ.gz
sampleR1.Fq
sample_read_two.fq.zip

👤 St. Jude Cloud Team

Website: https://stjude.cloud
Github: @stjudecloud
Twitter: @StJudeResearch

🤝 Contributing