QC Pipeline for Whole Genome, Whole Exome, and RNA Sequencing

Introduction
Motivation
Discussion
Specification

Introduction

This RFC documents an automated workflow for assessing the integrity and quality of St. Jude Cloud omics data. The goal of this RFC is two-fold.

Establish state of the art method for comprehensively evaluating genomics data quality at scale, both at time of receipt and after processing, for whole genome, whole exome, and transcriptome sequencing.
Publish a collection of metrics that end-users of St. Jude Cloud can leverage to assess the quality of the data available. This context should save users time computing the information themselves while also informing appropriate use of the data.

You can find the relevant discussion on the associated pull request.

St. Jude Cloud is a large repository of omics data available for request from the academic and non-profit community. As such, the project processes thousands of samples from whole genome, whole exome, RNA-Seq, and various omics-based assays each year. A standard, robust method to assess pre-processing and post-processing quality for samples has been developed in-house, but there are some shortcomings with our current approach. In particular, this RFC will:

Define the standard set of QC tools used to evaluating omics-based data,
identify and implement key metrics that can be automated to assist in manual observation of the data, and
outline mechanisms to publish those QC results alongside the data already in St. Jude Cloud so that end-users can leverage this information.

Discussion

Types of QC

There are (at least) two different types of QC typically carried out omics-based data. Experimental QC attempts to identify the success of the assay(s) performed. Once one is sufficiently satisfied that the data generated by an experiment is "good", computational QC examines the degree to which computational processing of that data was completed successfully.

By the time data reaches the St. Jude Cloud team from various sources, extensive experimental and computational evaluation have already been carried out. Each contributing project has its own thresholds for quality in both areas, which is dependent on the best practices at that point in time and the goals of the project. Most often, our team takes the computational data, reverts it back to its raw form (such as FastQ files), and reprocesses the data using a harmonization pipeline.

Thus, the scope of this RFC, and the QC of samples on the project in general, is limited to the computational QC of the files produced for publication in St. Jude Cloud. While we do produce results that define experimental results (such as fastqc), these are rarely used to decide which files pass or fail our QC—this is in recognition of the fact that the data, while not always perfect, is extremely valuable due to its relative scarcity. We hope that the inclusion of these results will save end-users time and aid in decision-making about downstream analysis approaches.

General Statistics

General statistics are contained within the summary table at the top of the MultiQC reports. These statistics give a quick overview of the cohort of interest with respect to the defined metrics. Generally, they are summary statistics rather than being comprehensive in nature (those statistics are included in the sections that follow after).

Summary Table

Name	Tool	Description
M Reads Mapped (link)	samtools	Number of reads mapped in millions. This metric is useful
Validation Status (link)	picard	The validation status of the file as determined by `picard ValidateSamFile`.
Percentage of Reads Aligned (link)	picard	Number of mapped reads divided by the total number of reads as a percentage.
Median Insert Size (link)	picard	Median size of the fragment that is inserted between the sequencing adapters (estimated in silico).
Percentage Duplication (link)	picard	Percentage of the reads that are marked as PCR or optical duplicates.
≥ 30X Coverage (link)	mosdepth	The percentage of locations that are covered by at least 30 reads for the whole genome, the exonic regions, and the coding sequence regions specifically.
Predicted Instrument (link)	ngsderive	The predicted sequencing machine that produced this data. This should match your expectations for what machine sequenced the data.
Predicted Read Length (link)	ngsderive	The predicted read length for the sequencing experiment. This should match your expectations for how the library was prepared. A value of -1 indicates that a stable read length could not be predicted.
Probable Encoding (link)	ngsderive	The predicted encoding of the original FASTQ file. For this, all modern data should match `Sanger/Illumina 1.8`.
Percentage of Reads attributed to Homo sapiens (link)	kraken	The percentage of reads that were assigned as originating from Homo sapiens from kraken.
Percentage of Reads attributed to top 5 species (link)	kraken	The percentage of reads that were assigned to the top 5 reported species from kraken.
Percentage of Reads Unclassified (link)	kraken	The percentage of reads that were not classified by kraken.

M Reads Mapped

Produced by samtools

Number of reads mapped in millions. The number of reads in a whole genome sequencing file can vary widely based on the experiment design and the sequencing depth of the project.

Typically, WGS files contain between 200M reads up to 2 billion reads (or higher in some cases). Any deviation outside of this range should be investigated further (particularly if the number is lower, which suggests that the file may be truncated).
This number is especially informative within the context of samples from the same cohort. We expect a normal distribution for this metric within the same sequencing cohort, special attention is required for any samples that do not conform to this distribution.

Percentage of Reads Aligned

Produced by picard

Percentage of the number of reads that were able to be mapped to a given reference genome. Within St. Jude Cloud, this metric is indicative of the amount of non-human material which was sequenced during the experiment.

This number is especially informative within the context of samples from the same cohort. We expect the percentage of aligned reads to be tightly clustered above 95% mapping rate. It is not uncommon for samples to go as low as 80% mapping rate, and this should be considered within an acceptable range. Occasionally you may find small proportion of samples between 51%-79% mapping rate: those samples should be investigated, but typically, you should still include those samples in the release. Anything below 50% mapping rate signals an issue with contamination, and anything less than 30% mapping may be an issue with the mapper or the data integrity (e.g. mismatched read pairs).

Median Insert Size

Produced by picard

Size of the fragment inserted between the sequencing adapters. In whole genome sequencing experiments, typically you have a target fragment length for each sample (commonly ~2x the read length to maximize the amount of information gathered per fragment). The median insert size has proven to be an incredibly valuable statistic for identifying mapping problems and experimental abnormalities.

Typically, the median insert size should be between 1.5x - 2x the read length of the experiment. This criteria may be evaluated loosely, as insert sizes that are close to this range (anything between 1x - 5x read length) do not necessarily indicate a major problem. Samples with an extremely low median insert size (10-40 nucleotides) are likely the cause of mapping artifacts—particularly if bwa mem was used with a low seed size.
Insert size distributions tend to follow a gamma distribution where long, rolling peaks are interspersed with tight peaks close to the target insert size.
This number is especially informative within the context of samples from the same cohort. We expect the median insert size across samples to be tightly clustered within a cohort.

≥ 30X Coverage

Produced by mosdepth

Percentage of the genome which has at least 30 reads covering each position (on average) for the whole genome, the exonic regions, and the coding regions ("30x coverage"). 30x coverage at any location is widely considered to be the minimum number of reads needed to generate high confidence genotype call. In whole genome sequencing, it is desirable to have as much of the genome as possible to be covered by 30 or more reads. Expected genome coverage is often determined at the time of sequencing and may vary from project to project depending on cost and experimental design.

Target sequencing depth and fraction of the genome to be covered are generally set on a project by project basis. Speak with the someone on the sequencing team to ensure you understand what sequencing depth and fraction of the genome covered you should be expecting.
At a minimum, we expect 80% of the whole genome to be covered by at least 30 reads in high-quality whole genome sequencing experiments. For higher depth coverage projects like our Clinical Genomics project, the standard is 60x across 80% of the genome.
Less coverage does not necessarily indicate a problem, but in whole genome sequencing, anything below 65% of the genome at 30X signals that something may be going wrong. Pay close attention to the distribution of the reads across the genome by generating a coverage plot to see where sequencing bias is being introduced.
This number is especially informative within the context of samples from the same cohort. We expect this metric to be tightly clustered across samples within a cohort.

GC Content

Produced by FastQC

Percentage of the nucleotides contained within reads which are Cs (Cytosine) or Gs (Guanine). GC content is important because Gs pair with Cs in DNA with 3 hydrogen bonds whereas As pair with Ts in DNA with 2 hydrogen bonds. Ergo, DNA is considered to be stronger when comprised of more GC bonds, and GC content is a defining characteristic of different genomes. GC content for humans is estimated to be at around 41%.

GC content in whole genome sequencing is typically between 40%-60% depending on sequencing bias. Any deviation outside of this range should be investigated further.
This number is especially informative within the context of samples from the same cohort. We expect GC content across samples to be tightly clustered within a cohort.
If either of the above assumptions do not hold true, probable sources of variance include library preparation abnormalities or contamination issues.

Detailed Charts

mosdepth

Name	Tool	Description
Cumulative coverage distribution	mosdepth	The proportion of bases within the genome with at least `X` coverage.
Coverage distribution	mosdepth	The proportion of bases within the genome with exactly `X` coverage.
Average coverage per contig	mosdepth	Average coverage per contig or chromosome.

For all target types, the mosdepth will produce various charts for the whole genome, exonic, and coding sequence ranges. The most important chart to examine is the "cumulative coverage distribution" chart, which shows the burndown chart of what percentage of the specified region is covered by a particular read depth. In this way, you can assess how much of the specified region is covered by a particular read depth.

When interpreting this chart, the most important thing to examine is (a) the chart's behavior with respect to the sequencing modality you are reviewing and (b) the relative similarities of the plots within a cohort. For (a), you can expect that

Whole genome data will show a significant cliff for which all places in the genome meet a particular coverage metric. There should be a pronounced downward curve that signifies the cliff of "normal" sites across the genome being separated from highly-covered sites of the genome.
Whole exome data will quickly drop off from left to right, as only about 3% of the genome is targetted (the exonic regions). For those regions, however, you can expect upwards of 1000x coverage.
For RNA-Seq, this plot is not as useful. As such, you can ignore it for this sequencing modality.

ngsderive

Name	Tool	Description
`ngsderive instrument`	ngsderive	The predicted sequencing instrument for this file.
`ngsderive readlen`	ngsderive	The predicted read length for the file.
`ngsderive encoding`	ngsderive	The predicted FASTQ quality score encoding scheme for the file.
`ngsderive strandedness`	ngsderive	The predicted strandedness of the RNA-Seq protocol for the file.
`ngsderive junction-annotation`	ngsderive	The predicted breakdown of novel, partially novel, and known splice junctions.

ngsderive is a tool written by the St. Jude Cloud team to provide forensic analysis techniques for evaluating next-generation sequencing data. The following subcommands are particularly useful in the context of these QC reports:

ngsderive instrument. Gives an estimate of which sequencing instrument was used for a particular file. You should check to ensure the result you receive from this subcommand matches your expectations of the data.
ngsderive readlen. Gives an estimate of the pre-trimmed read length. For files with significant adapter trimming, this may not provide conclusive results (giving a result of -1). You should check to ensure the result you receive from this subcommand matches your expectations of the data.
ngsderive encoding. Guesses which FASTQ encoding was used for the original FASTQ file. For most modern data, this should be Sanger/Illumina 1.8.
ngsderive strandedness. For RNA-Seq data, derives the strandedness of the sequencing protocol. This is useful to understand the effectiveness of the assay and also how strong downstream results will be (if you believe your data is Reverse-stranded, but your data appears to be somewhere between Reverse-stranded and Unstranded, then you may have issues with your analysis).
ngsderive junction-annotation. For RNA-Seq data, reports the number of novel junctions, partially novel junctions, and known junctions. This is incredibly helpful when comparing with other samples within the same cohort. Unless you have reason to suspect otherwise, you should expect the majority of junctions to be known, followed by novel, and then partially novel junctions. An abundance of novel junctions can indicate issues with the sample.

For each of these, we recommend you view the graphs provided to pull out any outliers in the data. In particular, the quantitative measures like percent strandedness and junction saturation can give you strong indications if your library preparation acted as you expected. For any outliers, you should contact your lab to report the issue and determine if they noted any quality control issues from their side.

Qualimap RNA-Seq

Name	Tool	Description	WGS	WXS	RNA-Seq
Genomic Origin of Reads	qualimap	The proportion of intronic, exonic, and intergenic reads from RNA-Seq data.
Gene Coverage Profile	qualimap	The distribution of read coverage across gene transcripts.

Qualimap RNA-Seq provides specific tools evaluating more straightforward metrics from RNA-Seq data. In our QC pipeline, we use two graphs: (a) the genomic origin of reads graph and (b) the gene coverage profile graph.

The genomic origin of reads graph shows the breakdown of reads from intronic, intergenic, and exonic regions. Particularly for PolyA selected data, you should expect to see exonic regions dominate the distribution. For Total or ribosomal depletion methods, you can expect to see more intronic regions (as pre-mRNA are typically captured with these library preparation protocols). Note that a high proportion of intergenic reads is often a sign that your library selection has gone awry.
The gene coverage profile graph demonstrates the coverage across different positions of gene transcripts. You should expect to see a gentle slope with a peak towards the middle of each gene. With respect to the ends, the 3' end of the gene typically drops off more quickly than the 5' end of the gene. If you see irregular coverage of gene transcripts, such as erradict patterns in coverage, this can indicate some bias or selection in the fragment selection of the sequencing experiment.

Picard

Name	Tool	Description
Alignment Summary	picard	Number of aligned versus unaligned reads for a given file.
Mean Read Length Distribution	picard	Mean read length of the file.
GC Coverage Bias	picard	Picard's interpretation of a GC bias distribution.
Insert Size Distribution	picard	The distribution of computed insert sizes for the file.
Deduplication stats	picard	Overview of duplication from reads within the file.
Base quality distribution	picard	The count of bases with each quality score.
SAM/BAM validation	picard	Validation of the SAM/BAM files.

Picard provides a variety of important tools for our QC workflow, including:

An alignment summary chart, which illustrates the number of reads and the relative breakdown of mapped versus unmapped reads. This is useful when determining which files are over/under sequenced within a particular cohort and for visually identifying samples with lower mapping rates (an indication of a problem).
The mean read length chart, which we typically ignore in favor of ngsderive's readlen command (note that ngsderive will give a result of -1 if a result cannot be determined, whereas Picard gives you the median read length).
The GC coverage bias chart, which we typically ignore in favor of the "Per Sequence GC Bias" plot from FastQC.
The insert size distribution chart, which is incredibly helpful when determining problems in your data. For example, if you have bacteria contamination in your data, the bacterial reads will map with low quality using bwa mem (and default parameters). This will result in a bimodal distrbution of the insert size estimate, which is highly indicate of some issue in the data. For these plots, (a) ensure that all of the insert size distributions match each other within a cohort and (b) watch out for bimodal distributions with a small (or large) peak around ~20bp, which is an indication of contamination.
The deduplication stats chart, which give an indication of duplication within the sample (note that, for RNA-Seq data, if this plot is shown, biological duplicates cannot be ruled out!).
The base quality distribution chart, which we do not use and instead use FastQC's "Per Sequence Quality Scores" plot.
The file validation chart, all of which should be green.

Samtools

Samtools also outputs various distributions as a beeswarm plot. In our experience, these are not as useful expect in the case of examining a particular sample to see where it lies for all metrics at once—otherwise, the summary statistics in the "General Statistics" section at the top do just as well.

Kraken

Name	Tool	Description	WGS	WXS	RNA-Seq
Top Taxa	kraken	Proportion of reads which are assigned to the indicated species.

Kraken is useful for determining the origin of reads within your sample of interest. Since we mainly are sequencing human subjects, we typically use kraken to give an indication of the level of contamination from various species within our data. You should look to ensure that the vast majority (80% or higher) of your sample originates from humans (Homo sapiens). If you are sequencing a xenograft, pay close attention to the proportion of mouse (Mus musculus) reads in the sample.

FastQC

Name	Tool	Description
Sequencing Counts	FastQC	Unique and duplicate sequences in the file.
Sequencing Quality Histogram	FastQC	Mean quality for every position in a read.
Per Sequence Quality Scores	FastQC	The number of reads with the indicated average quality score.
Per Base Sequence Content	FastQC	The bias towards a particular nucleotide at a given position of a read.
Per Sequence GC Content	FastQC	The GC content distribution.
Per Base N Content	FastQC	The percentile of base calls at each position which contained an N.
Sequence Length Distribution	FastQC	The distribution of read lengths in the file.
Sequence Duplication Levels	FastQC	The relative level of duplication for each sequence.
Overrepresented Sequences	FastQC	The sequences which may be overrepresented in your file.
Adapter Content	FastQC	The presence or absence of adapter contamination in your file.
Status Checks	FastQC	An overview of the pass/warn/fail status for all FastQC checks.

FastQC rounds out the list of important tools to interrogate the quality of the sequencing data itself. We leverage many of the plots produced by FastQC, using it as the authority on most sequencing quality metrics, including:

The sequencing counts chart, which we typically ignore in favor of picard's mark duplicates plot.
The sequence quality histogram, which gives an indication of the sequencing quality across reads. Look for anything that fails FastQC's quite stringent check to see if that correlates with any other failing metrics.
The per sequence quality scores chart, which shows the mean sequence quality scores across the sample (peaks for lower scores in the orange or red areas of the graph are cause for concern).
The per base sequence content, which shows is there are any biases in what nucleotides are showing up at particular locations in the reads. For whole genome and whole exome data, there should be relatively few biases. However, for RNA-Seq, you will see a bias at the 5' region of the reads.
The per sequence GC content chart, which is highly informative of contamination in whole genome and whole exome sequencing data. In those two sequencing modalities, you should check to ensure that all of the distributions are tightly following one another—any derivation from the expected distribution is a sign of contamination. Note that sometimes tools like FastQC plot an "expected" distribution of the GC content; this is usually the expectation of WGS data, but not WXS data, which has a different expected distribution. For RNA-Seq, this graph is generally not used except to examine extreme biases at particular positions.
The per base N content chart, which should generally report very low levels of Ns in modern data.
The sequence length distribution chart, which we typically ignore in favor of ngsderive's readlen plot.
The sequence duplication level chart, which gives an indication of the percentage of sequences that are duplicated and by what factor (though this is a graph that we find, often, even good data fails FastQC's stringent test).
The overrepresented sequences chart, which gives an indication if you're sequencing the same sequence over and over.
The adapter content chart, which we use to ensure there are minimal to no signs of adapters in the data we receive.
And finally, the status checks view, which gives you an overview of which samples passed/warned/failed different checks. Generally, this is a good overview of the quality of the data, though we find that many of the charts are often marked as "failed" for good data (in particular, the "Per Base Sequence Content" [for RNA-Seq], the "Per Sequence GC Content", and the "Sequence Length Distribution" graphs).

Specification

These are generic instructions for running each of the tools in our pipeline. We run our pipeline as a WDL workflow. We have supplied examples of the commands used for each package. For the typical memory requirements of each command, please see our WDL repository.

Dependencies

We presume Anaconda is available and installed. If not, please follow the link to Anaconda first.

conda create --name bio-qc \
    --channel bioconda \
    --channel conda-forge \
    fastqc==0.11.8 \
    picard==2.20.2  \
    qualimap==2.2.2c \
    samtools==1.9 \
    fastq-screen==0.13.0 \
    ngsderive==2.2.0 \
    multiqc==1.10.1 \
    -y

conda activate bio-qc

For linting created fastqs, fqlib must be installed. See installation instructions here.

Workflow

The workflow specification is as follows. Note that arguments that are not integral to the command (such as output directories) or vary between compute environments (such as memory thresholds or number of threads) are not included.

Compute the md5 checksum of the file.
```
md5sum $BAM
```

Use Picard's ValidateSamFile tool to ensure the inner contents of the BAM file are well-formed.

picard ValidateSamFile \
    I=$BAM             \  # specify bam file
    MODE=SUMMARY          # concise output

Run samtools flagstat to gather general statistics such as alignment percentage.
```
samtools flagstat $BAM
```
Run fastqc to collect sequencing and library-related statistics. These are only for informational purposes -- as stated above, we typically do not remove samples based on this information (with rare exception), as the sequencing-related QC work was done upstream in the genomics lab.
```
fastqc $BAM
```
Run ngsderive instrument to infer sequencing instrument.
```
ngsderive instrument $BAM
```
Run ngsderive readlen to infer read lengths.
```
ngsderive readlen $BAM
```

Run ngsderive encoding to infer PHRED score encoding.

ngsderive encoding \
    -n -1          \  # parse the entire file
    $BAM

Run qualimap bamqc to gather more in-depth statistics about read stats, coverage, mapping quality, mismatches, etc.

qualimap bamqc -bam $BAM \  # bam filename
    -nt $NUM_THREADS     \  # threads requested
    -nw 400                 # number of windows

If WGS or WES data, run fastq_screen. For performance, we subsample the input BAM using samtools view -s $computed_fraction before running it through picard SamToFastq. The resulting fastqs are validated with fq lint provided by fqlib.
```
cat $fastq_1 $fastq_2 > $combined_fastq
fastq_screen          \
    --aligner bowtie2 \
    $combined_fastq
```
If RNA-Seq data, run ngsderive strandedness to determine a backwards-computed strandedness of the RNA-Seq experiment.
```
ngsderive strandedness $BAM
```
If RNA-Seq data, run ngsderive junction-annotation to calculate the number of known, novel, and partial-novel junctions.
```
ngsderive junction-annotation $BAM
```

If RNA-Seq data, run qualimap rnaseq to gather QC statistics that are tailored for RNA-Seq files.

qualimap rnaseq --java-mem-size=$MEM_SIZE \  # memory
   -bam $BAM                              \  # bam filename
   [-pe]                                     # specify paired end if paired end

Combine all of the above metrics using multiqc.
```
multiqc . # recurse all files in '.'
```

St. Jude Cloud RFCs