QC Pipeline for Whole Genome, Whole Exome, and RNA Sequencing

Table of Contents

Introduction

This RFC documents an automated workflow for assessing the integrity and quality of St. Jude Cloud omics data. The goal of this RFC is two-fold.

  1. Establish state of the art method for comprehensively evaluating genomics data quality at scale, both at time of receipt and after processing, for whole genome, whole exome, and transcriptome sequencing.
  2. Publish a collection of metrics that end-users of St. Jude Cloud can leverage to assess the quality of the data available. This context should save users time computing the information themselves while also informing appropriate use of the data.

You can find the relevant discussion on the associated pull request.

Motivation

St. Jude Cloud is a large repository of omics data available for request from the academic and non-profit community. As such, the project processes thousands of samples from whole genome, whole exome, RNA-Seq, and various omics-based assays each year. A standard, robust method to assess pre-processing and post-processing quality for samples has been developed in-house, but there are some shortcomings with our current approach. In particular, this RFC will:

  • Define the standard set of QC tools used to evaluating omics-based data,
  • identify and implement key metrics that can be automated to assist in manual observation of the data, and
  • outline mechanisms to publish those QC results alongside the data already in St. Jude Cloud so that end-users can leverage this information.

Discussion

Types of QC

There are (at least) two different types of QC typically carried out omics-based data. Experimental QC attempts to identify the success of the assay(s) performed. Once one is sufficiently satisfied that the data generated by an experiment is "good", computational QC examines the degree to which computational processing of that data was completed successfully.

By the time data reaches the St. Jude Cloud team from various sources, extensive experimental and computational evaluation have already been carried out. Each contributing project has its own thresholds for quality in both areas, which is dependent on the best practices at that point in time and the goals of the project. Most often, our team takes the computational data, reverts it back to its raw form (such as FastQ files), and reprocesses the data using a harmonization pipeline.

Thus, the scope of this RFC, and the QC of samples on the project in general, is limited to the computational QC of the files produced for publication in St. Jude Cloud. While we do produce results that define experimental results (such as fastqc), these are rarely used to decide which files pass or fail our QC—this is in recognition of the fact that the data, while not always perfect, is extremely valuable due to its relative scarcity. We hope that the inclusion of these results will save end-users time and aid in decision-making about downstream analysis approaches.

General Statistics

General statistics are contained within the summary table at the top of the MultiQC reports. These statistics give a quick overview of the cohort of interest with respect to the defined metrics. Generally, they are summary statistics rather than being comprehensive in nature (those statistics are included in the sections that follow after).

Summary Table

NameToolDescriptionWGSWXSRNA-Seq
M Reads Mapped (link)samtoolsNumber of reads mapped in millions. This metric is usefulYesYesYes
Validation Status (link)picardThe validation status of the file as determined by picard ValidateSamFile.YesYesYes
Percentage of Reads Aligned (link)picardNumber of mapped reads divided by the total number of reads as a percentage.YesYesYes
Median Insert Size (link)picardMedian size of the fragment that is inserted between the sequencing adapters (estimated in silico).YesYesNo
Percentage Duplication (link)picardPercentage of the reads that are marked as PCR or optical duplicates.YesYesNo
≥ 30X Coverage (link)mosdepthThe percentage of locations that are covered by at least 30 reads for the whole genome, the exonic regions, and the coding sequence regions specifically.YesYesNo
Predicted Instrument (link)ngsderiveThe predicted sequencing machine that produced this data. This should match your expectations for what machine sequenced the data.YesYesYes
Predicted Read Length (link)ngsderiveThe predicted read length for the sequencing experiment. This should match your expectations for how the library was prepared. A value of -1 indicates that a stable read length could not be predicted.YesYesYes
Probable Encoding (link)ngsderiveThe predicted encoding of the original FASTQ file. For this, all modern data should match Sanger/Illumina 1.8.YesYesYes
Percentage of Reads attributed to Homo sapiens (link)krakenThe percentage of reads that were assigned as originating from Homo sapiens from kraken.YesYesYes
Percentage of Reads attributed to top 5 species (link)krakenThe percentage of reads that were assigned to the top 5 reported species from kraken.YesYesYes
Percentage of Reads Unclassified (link)krakenThe percentage of reads that were not classified by kraken.YesYesYes

M Reads Mapped

Produced by samtools

Number of reads mapped in millions. The number of reads in a whole genome sequencing file can vary widely based on the experiment design and the sequencing depth of the project.

  • Typically, WGS files contain between 200M reads up to 2 billion reads (or higher in some cases). Any deviation outside of this range should be investigated further (particularly if the number is lower, which suggests that the file may be truncated).
  • This number is especially informative within the context of samples from the same cohort. We expect a normal distribution for this metric within the same sequencing cohort, special attention is required for any samples that do not conform to this distribution.

Percentage of Reads Aligned

Produced by picard

Percentage of the number of reads that were able to be mapped to a given reference genome. Within St. Jude Cloud, this metric is indicative of the amount of non-human material which was sequenced during the experiment.

  • This number is especially informative within the context of samples from the same cohort. We expect the percentage of aligned reads to be tightly clustered above 95% mapping rate. It is not uncommon for samples to go as low as 80% mapping rate, and this should be considered within an acceptable range. Occasionally you may find small proportion of samples between 51%-79% mapping rate: those samples should be investigated, but typically, you should still include those samples in the release. Anything below 50% mapping rate signals an issue with contamination, and anything less than 30% mapping may be an issue with the mapper or the data integrity (e.g. mismatched read pairs).

Median Insert Size

Produced by picard

Size of the fragment inserted between the sequencing adapters. In whole genome sequencing experiments, typically you have a target fragment length for each sample (commonly ~2x the read length to maximize the amount of information gathered per fragment). The median insert size has proven to be an incredibly valuable statistic for identifying mapping problems and experimental abnormalities.

  • Typically, the median insert size should be between 1.5x - 2x the read length of the experiment. This criteria may be evaluated loosely, as insert sizes that are close to this range (anything between 1x - 5x read length) do not necessarily indicate a major problem. Samples with an extremely low median insert size (10-40 nucleotides) are likely the cause of mapping artifacts—particularly if bwa mem was used with a low seed size.
  • Insert size distributions tend to follow a gamma distribution where long, rolling peaks are interspersed with tight peaks close to the target insert size.
  • This number is especially informative within the context of samples from the same cohort. We expect the median insert size across samples to be tightly clustered within a cohort.

≥ 30X Coverage

Produced by mosdepth

Percentage of the genome which has at least 30 reads covering each position (on average) for the whole genome, the exonic regions, and the coding regions ("30x coverage"). 30x coverage at any location is widely considered to be the minimum number of reads needed to generate high confidence genotype call. In whole genome sequencing, it is desirable to have as much of the genome as possible to be covered by 30 or more reads. Expected genome coverage is often determined at the time of sequencing and may vary from project to project depending on cost and experimental design.

  • Target sequencing depth and fraction of the genome to be covered are generally set on a project by project basis. Speak with the someone on the sequencing team to ensure you understand what sequencing depth and fraction of the genome covered you should be expecting.
  • At a minimum, we expect 80% of the whole genome to be covered by at least 30 reads in high-quality whole genome sequencing experiments. For higher depth coverage projects like our Clinical Genomics project, the standard is 60x across 80% of the genome.
  • Less coverage does not necessarily indicate a problem, but in whole genome sequencing, anything below 65% of the genome at 30X signals that something may be going wrong. Pay close attention to the distribution of the reads across the genome by generating a coverage plot to see where sequencing bias is being introduced.
  • This number is especially informative within the context of samples from the same cohort. We expect this metric to be tightly clustered across samples within a cohort.

GC Content

Produced by FastQC

Percentage of the nucleotides contained within reads which are Cs (Cytosine) or Gs (Guanine). GC content is important because Gs pair with Cs in DNA with 3 hydrogen bonds whereas As pair with Ts in DNA with 2 hydrogen bonds. Ergo, DNA is considered to be stronger when comprised of more GC bonds, and GC content is a defining characteristic of different genomes. GC content for humans is estimated to be at around 41%.

  • GC content in whole genome sequencing is typically between 40%-60% depending on sequencing bias. Any deviation outside of this range should be investigated further.
  • This number is especially informative within the context of samples from the same cohort. We expect GC content across samples to be tightly clustered within a cohort.
  • If either of the above assumptions do not hold true, probable sources of variance include library preparation abnormalities or contamination issues.

Detailed Charts

mosdepth

NameToolDescriptionWGSWXSRNA-Seq
Cumulative coverage distributionmosdepthThe proportion of bases within the genome with at least X coverage.allexonic/cdsno
Coverage distributionmosdepthThe proportion of bases within the genome with exactly X coverage.allexonic/cdsno
Average coverage per contigmosdepthAverage coverage per contig or chromosome.allallall

For all target types, the mosdepth will produce various charts for the whole genome, exonic, and coding sequence ranges. The most important chart to examine is the "cumulative coverage distribution" chart, which shows the burndown chart of what percentage of the specified region is covered by a particular read depth. In this way, you can assess how much of the specified region is covered by a particular read depth.

When interpreting this chart, the most important thing to examine is (a) the chart's behavior with respect to the sequencing modality you are reviewing and (b) the relative similarities of the plots within a cohort. For (a), you can expect that

  • Whole genome data will show a significant cliff for which all places in the genome meet a particular coverage metric. There should be a pronounced downward curve that signifies the cliff of "normal" sites across the genome being separated from highly-covered sites of the genome.
  • Whole exome data will quickly drop off from left to right, as only about 3% of the genome is targetted (the exonic regions). For those regions, however, you can expect upwards of 1000x coverage.
  • For RNA-Seq, this plot is not as useful. As such, you can ignore it for this sequencing modality.

ngsderive

NameToolDescriptionWGSWXSRNA-Seq
ngsderive instrumentngsderiveThe predicted sequencing instrument for this file.yesyesyes
ngsderive readlenngsderiveThe predicted read length for the file.yesyesyes
ngsderive encodingngsderiveThe predicted FASTQ quality score encoding scheme for the file.yesyesyes
ngsderive strandednessngsderiveThe predicted strandedness of the RNA-Seq protocol for the file.nonoyes
ngsderive junction-annotationngsderiveThe predicted breakdown of novel, partially novel, and known splice junctions.nonoyes

ngsderive is a tool written by the St. Jude Cloud team to provide forensic analysis techniques for evaluating next-generation sequencing data. The following subcommands are particularly useful in the context of these QC reports:

  • ngsderive instrument. Gives an estimate of which sequencing instrument was used for a particular file. You should check to ensure the result you receive from this subcommand matches your expectations of the data.
  • ngsderive readlen. Gives an estimate of the pre-trimmed read length. For files with significant adapter trimming, this may not provide conclusive results (giving a result of -1). You should check to ensure the result you receive from this subcommand matches your expectations of the data.
  • ngsderive encoding. Guesses which FASTQ encoding was used for the original FASTQ file. For most modern data, this should be Sanger/Illumina 1.8.
  • ngsderive strandedness. For RNA-Seq data, derives the strandedness of the sequencing protocol. This is useful to understand the effectiveness of the assay and also how strong downstream results will be (if you believe your data is Reverse-stranded, but your data appears to be somewhere between Reverse-stranded and Unstranded, then you may have issues with your analysis).
  • ngsderive junction-annotation. For RNA-Seq data, reports the number of novel junctions, partially novel junctions, and known junctions. This is incredibly helpful when comparing with other samples within the same cohort. Unless you have reason to suspect otherwise, you should expect the majority of junctions to be known, followed by novel, and then partially novel junctions. An abundance of novel junctions can indicate issues with the sample.

For each of these, we recommend you view the graphs provided to pull out any outliers in the data. In particular, the quantitative measures like percent strandedness and junction saturation can give you strong indications if your library preparation acted as you expected. For any outliers, you should contact your lab to report the issue and determine if they noted any quality control issues from their side.

Qualimap RNA-Seq

NameToolDescriptionWGSWXSRNA-Seq
Genomic Origin of ReadsqualimapThe proportion of intronic, exonic, and intergenic reads from RNA-Seq data.nonoyes
Gene Coverage ProfilequalimapThe distribution of read coverage across gene transcripts.nonoyes

Qualimap RNA-Seq provides specific tools evaluating more straightforward metrics from RNA-Seq data. In our QC pipeline, we use two graphs: (a) the genomic origin of reads graph and (b) the gene coverage profile graph.

  • The genomic origin of reads graph shows the breakdown of reads from intronic, intergenic, and exonic regions. Particularly for PolyA selected data, you should expect to see exonic regions dominate the distribution. For Total or ribosomal depletion methods, you can expect to see more intronic regions (as pre-mRNA are typically captured with these library preparation protocols). Note that a high proportion of intergenic reads is often a sign that your library selection has gone awry.
  • The gene coverage profile graph demonstrates the coverage across different positions of gene transcripts. You should expect to see a gentle slope with a peak towards the middle of each gene. With respect to the ends, the 3' end of the gene typically drops off more quickly than the 5' end of the gene. If you see irregular coverage of gene transcripts, such as erradict patterns in coverage, this can indicate some bias or selection in the fragment selection of the sequencing experiment.

Picard

NameToolDescriptionWGSWXSRNA-Seq
Alignment SummarypicardNumber of aligned versus unaligned reads for a given file.yesyesyes
Mean Read Length DistributionpicardMean read length of the file.nonono
GC Coverage BiaspicardPicard's interpretation of a GC bias distribution.nonono
Insert Size DistributionpicardThe distribution of computed insert sizes for the file.yesyesyes
Deduplication statspicardOverview of duplication from reads within the file.yesyeswarn
Base quality distributionpicardThe count of bases with each quality score.nonono
SAM/BAM validationpicardValidation of the SAM/BAM files.yesyesyes

Picard provides a variety of important tools for our QC workflow, including:

  • An alignment summary chart, which illustrates the number of reads and the relative breakdown of mapped versus unmapped reads. This is useful when determining which files are over/under sequenced within a particular cohort and for visually identifying samples with lower mapping rates (an indication of a problem).
  • The mean read length chart, which we typically ignore in favor of ngsderive's readlen command (note that ngsderive will give a result of -1 if a result cannot be determined, whereas Picard gives you the median read length).
  • The GC coverage bias chart, which we typically ignore in favor of the "Per Sequence GC Bias" plot from FastQC.
  • The insert size distribution chart, which is incredibly helpful when determining problems in your data. For example, if you have bacteria contamination in your data, the bacterial reads will map with low quality using bwa mem (and default parameters). This will result in a bimodal distrbution of the insert size estimate, which is highly indicate of some issue in the data. For these plots, (a) ensure that all of the insert size distributions match each other within a cohort and (b) watch out for bimodal distributions with a small (or large) peak around ~20bp, which is an indication of contamination.
  • The deduplication stats chart, which give an indication of duplication within the sample (note that, for RNA-Seq data, if this plot is shown, biological duplicates cannot be ruled out!).
  • The base quality distribution chart, which we do not use and instead use FastQC's "Per Sequence Quality Scores" plot.
  • The file validation chart, all of which should be green.

Samtools

Samtools also outputs various distributions as a beeswarm plot. In our experience, these are not as useful expect in the case of examining a particular sample to see where it lies for all metrics at once—otherwise, the summary statistics in the "General Statistics" section at the top do just as well.

Kraken

NameToolDescriptionWGSWXSRNA-Seq
Top TaxakrakenProportion of reads which are assigned to the indicated species.yesyesyes

Kraken is useful for determining the origin of reads within your sample of interest. Since we mainly are sequencing human subjects, we typically use kraken to give an indication of the level of contamination from various species within our data. You should look to ensure that the vast majority (80% or higher) of your sample originates from humans (Homo sapiens). If you are sequencing a xenograft, pay close attention to the proportion of mouse (Mus musculus) reads in the sample.

FastQC

NameToolDescriptionWGSWXSRNA-Seq
Sequencing CountsFastQCUnique and duplicate sequences in the file.nonono
Sequencing Quality HistogramFastQCMean quality for every position in a read.yesyesyes
Per Sequence Quality ScoresFastQCThe number of reads with the indicated average quality score.yesyesyes
Per Base Sequence ContentFastQCThe bias towards a particular nucleotide at a given position of a read.yesyesyes
Per Sequence GC ContentFastQCThe GC content distribution.yesyesyes
Per Base N ContentFastQCThe percentile of base calls at each position which contained an N.yesyesyes
Sequence Length DistributionFastQCThe distribution of read lengths in the file.nonono
Sequence Duplication LevelsFastQCThe relative level of duplication for each sequence.yesyesyes
Overrepresented SequencesFastQCThe sequences which may be overrepresented in your file.yesyesyes
Adapter ContentFastQCThe presence or absence of adapter contamination in your file.yesyesyes
Status ChecksFastQCAn overview of the pass/warn/fail status for all FastQC checks.yesyesyes

FastQC rounds out the list of important tools to interrogate the quality of the sequencing data itself. We leverage many of the plots produced by FastQC, using it as the authority on most sequencing quality metrics, including:

  • The sequencing counts chart, which we typically ignore in favor of picard's mark duplicates plot.
  • The sequence quality histogram, which gives an indication of the sequencing quality across reads. Look for anything that fails FastQC's quite stringent check to see if that correlates with any other failing metrics.
  • The per sequence quality scores chart, which shows the mean sequence quality scores across the sample (peaks for lower scores in the orange or red areas of the graph are cause for concern).
  • The per base sequence content, which shows is there are any biases in what nucleotides are showing up at particular locations in the reads. For whole genome and whole exome data, there should be relatively few biases. However, for RNA-Seq, you will see a bias at the 5' region of the reads.
  • The per sequence GC content chart, which is highly informative of contamination in whole genome and whole exome sequencing data. In those two sequencing modalities, you should check to ensure that all of the distributions are tightly following one another—any derivation from the expected distribution is a sign of contamination. Note that sometimes tools like FastQC plot an "expected" distribution of the GC content; this is usually the expectation of WGS data, but not WXS data, which has a different expected distribution. For RNA-Seq, this graph is generally not used except to examine extreme biases at particular positions.
  • The per base N content chart, which should generally report very low levels of Ns in modern data.
  • The sequence length distribution chart, which we typically ignore in favor of ngsderive's readlen plot.
  • The sequence duplication level chart, which gives an indication of the percentage of sequences that are duplicated and by what factor (though this is a graph that we find, often, even good data fails FastQC's stringent test).
  • The overrepresented sequences chart, which gives an indication if you're sequencing the same sequence over and over.
  • The adapter content chart, which we use to ensure there are minimal to no signs of adapters in the data we receive.
  • And finally, the status checks view, which gives you an overview of which samples passed/warned/failed different checks. Generally, this is a good overview of the quality of the data, though we find that many of the charts are often marked as "failed" for good data (in particular, the "Per Base Sequence Content" [for RNA-Seq], the "Per Sequence GC Content", and the "Sequence Length Distribution" graphs).

Specification

These are generic instructions for running each of the tools in our pipeline. We run our pipeline as a WDL workflow. We have supplied examples of the commands used for each package. For the typical memory requirements of each command, please see our WDL repository.

Dependencies

We presume Anaconda is available and installed. If not, please follow the link to Anaconda first.

conda create --name bio-qc \
    --channel bioconda \
    --channel conda-forge \
    fastqc==0.11.8 \
    picard==2.20.2  \
    qualimap==2.2.2c \
    samtools==1.9 \
    fastq-screen==0.13.0 \
    ngsderive==2.2.0 \
    multiqc==1.10.1 \
    -y

conda activate bio-qc

For linting created fastqs, fqlib must be installed. See installation instructions here.

Workflow

The workflow specification is as follows. Note that arguments that are not integral to the command (such as output directories) or vary between compute environments (such as memory thresholds or number of threads) are not included.

  1. Compute the md5 checksum of the file.

    md5sum $BAM
    
  2. Use Picard's ValidateSamFile tool to ensure the inner contents of the BAM file are well-formed.

    picard ValidateSamFile \
        I=$BAM             \  # specify bam file
        MODE=SUMMARY          # concise output
    
  3. Run samtools flagstat to gather general statistics such as alignment percentage.

    samtools flagstat $BAM
    
  4. Run fastqc to collect sequencing and library-related statistics. These are only for informational purposes -- as stated above, we typically do not remove samples based on this information (with rare exception), as the sequencing-related QC work was done upstream in the genomics lab.

    fastqc $BAM
    
  5. Run ngsderive instrument to infer sequencing instrument.

    ngsderive instrument $BAM
    
  6. Run ngsderive readlen to infer read lengths.

    ngsderive readlen $BAM
    
  7. Run ngsderive encoding to infer PHRED score encoding.

    ngsderive encoding \
        -n -1          \  # parse the entire file
        $BAM
    
  8. Run qualimap bamqc to gather more in-depth statistics about read stats, coverage, mapping quality, mismatches, etc.

    qualimap bamqc -bam $BAM \  # bam filename
        -nt $NUM_THREADS     \  # threads requested
        -nw 400                 # number of windows
    
  9. If WGS or WES data, run fastq_screen. For performance, we subsample the input BAM using samtools view -s $computed_fraction before running it through picard SamToFastq. The resulting fastqs are validated with fq lint provided by fqlib.

    cat $fastq_1 $fastq_2 > $combined_fastq
    fastq_screen          \
        --aligner bowtie2 \
        $combined_fastq
    
  10. If RNA-Seq data, run ngsderive strandedness to determine a backwards-computed strandedness of the RNA-Seq experiment.

    ngsderive strandedness $BAM
    
  11. If RNA-Seq data, run ngsderive junction-annotation to calculate the number of known, novel, and partial-novel junctions.

    ngsderive junction-annotation $BAM
    
  12. If RNA-Seq data, run qualimap rnaseq to gather QC statistics that are tailored for RNA-Seq files.

    qualimap rnaseq --java-mem-size=$MEM_SIZE \  # memory
       -bam $BAM                              \  # bam filename
       [-pe]                                     # specify paired end if paired end
    
  13. Combine all of the above metrics using multiqc.

    multiqc . # recurse all files in '.'