scRNA-Seq Pipeline

Introduction
Motivation
Discussion
Specification

Introduction

This RFC lays out the specification for the scRNA-Seq mapping pipeline.

Motivation

To provide the community access to data from scRNA-Seq experiments performed at St. Jude Children's Research Hospital, we propose the following data harmonization pipeline. The goal of this pipeline is to provide harmonized alignments for scRNA-Seq data. For this pipeline, we will make no recommendations on downstream analysis, focusing instead on harmonizing the underlying sequencing data and leaving analysis decisions to the user.

Discussion

The development of bulk RNA-Seq enabled efficient profiling of transcripts in a sample. The ability to interrogate the transcriptome in an unbiased manner, largely replaced previous methods such as microarrays or RT-qPCR. A bulk RNA-Seq experiment sample is a mixture of cells. This method is useful for analyses such as comparing expression between tumor and normal samples. However, this limits the analysis to the average expression over a population of cells and is unable to capture the heterogeneity of gene expression across individual cells. Newer methods have enabled RNA-Seq to be applied to single cells. This enables new questions about cell-specific transcriptome changes. This methodology has been employed in large projects such as the Human Cell Atlas to capture the cellular diversity within organisms.

scRNA-Seq methodology

Generally, scRNA-Seq differs somewhat from bulk RNA-Seq. Most approaches generate three key pieces of information: 1. cDNA fragment from the RNA transcript, 2. a cell barcode that identifies in which cell the RNA was expressed, and 3. a unique molecular identifier (UMI) to enable collapse of PCR duplicates.

There are several commerical options for performing scRNA-Seq. One such option is the 10x Chromium platform. In their approach using paired-end reads, the celluar barcode and UMI are found in read 1 and the transcript sequence is found in read 2. The 10x method is the most popular commercially available method, so handling data for that platform will be the priority.

The canonical tool for processing 10x Chromium scRNA-Seq data is their Cell Ranger tool. This tool uses STAR to perform alignment. It then uses the transcriptome annotation GTF to label reads as exonic, intronic, or intergenic. It then performs some additional refinement on the alignments to determine which reads are transcriptomic. These reads are then carried into UMI counting. The UMI counting step produces a raw gene by cell matrix of counts.

Processing Pipeline

For processing of bulk RNA-Seq data, a common standard workflow is to align against the genome using STAR. Quantification is then done using a package such as HTSeq-count. An example of a bulk RNA-Seq pipeline is the St. Jude Cloud RNA-Seq standard workflow v2.0.0.

Aligner choice

For consistency with other St. Jude Cloud datasets, we are not considering psuedoalignment methods such as kallisto and Alevin. This leaves the previously described Cell Ranger method and STARsolo as the primary candidates for use in St. Jude Cloud.

Bruning, et. al. (2022) provide a useful comparison of commonly used scRNA-Seq methods. However due to our preference for an alignment-based method, we will choose between Cell Ranger and STARsolo. Cell Ranger is the most commonly used method for processing data generated from the 10x protocol. Downstream analysis tools have been developed to directly ingest data in the formats produced by Cell Ranger. The tool is an integrated system that performs alignment, feature counting, and quality control.

Quantification choice

Quantification is a product of both the Cell Ranger and STARsolo packages. Quantification could also be accomplished using a pseudoalignment method, such as Kallisto or Alevin, if there was no desire to share the raw data in BAM format. However that would be a departure from the normal St. Jude Cloud methodology, so that will not be discussed here. Therefore, there is no additional quantification choice to be made beyond that of the aligner package.

Using either STARsolo or Cell Ranger for analysis requires selection of reference files. For the genome, we will continue using GRCh38_no_alt. The transcript annotation GTF is more complicated. 10x provides several versions of the annotation files used in Cell Ranger. For simplicity, we will use the Cell Ranger reference files version 2020-A which incorporates GENCODE v32.

Specification

Metadata

Sample-based

Library Prep Kit - e.g. Illumina TruSeq RNA Library Prep Kit V2
Library Selection Protocol - e.g. Total, Poly-A
Sequencing Machine - e.g. Illumina HiSeq 2000, Illumina NovaSeq 6000
Library Layout - Single-End, Paired-End
RNA-Seq Strandedness - Unstranded, Stranded-Reverse, Stranded-Forward
Read Length - Number of bp per read
Tissue Preservative - e.g. FFPE, Fresh/Frozen
Cell Isolation Protocol - e.g. Droplet, Microfluidics
Flowcell design - Multiplexed, split across flowcells, etc.

Cell-based

TODO

Dependencies

If you'd like the full conda environment, you can install it using the following command. Obviously, you'll need to install anaconda first.

conda create -n scrnaseq-mapping \
    -c conda-forge \
    -c bioconda \
    picard==2.20.2 \
    samtools==1.9 \
    ngsderive==2.2.0 \
    tabix==1.11 \
    -y

Additionally, you will want to install our fqlib library to check that FastQ files are properly paired and have no duplicate read names. Installation of the Rust programming language is required.

cargo install --git https://github.com/stjude/fqlib.git --tag v0.8.0

Additionally, you will need to install Cell Ranger from 10x genomics to perform the alignment and feature calling.

curl -o cellranger-7.0.0.tar.gz "https://cf.10xgenomics.com/releases/cell-exp/cellranger-7.0.0.tar.gz?Expires=1659064381&Policy=eyJTdGF0ZW1lbnQiOlt7IlJlc291cmNlIjoiaHR0cHM6Ly9jZi4xMHhnZW5vbWljcy5jb20vcmVsZWFzZXMvY2VsbC1leHAvY2VsbHJhbmdlci03LjAuMC50YXIuZ3oiLCJDb25kaXRpb24iOnsiRGF0ZUxlc3NUaGFuIjp7IkFXUzpFcG9jaFRpbWUiOjE2NTkwNjQzODF9fX1dfQ__&Signature=SGKFTznrgspKBPQ2oRZ5C9KazslNWsTkxtkGnnaaPtrOR2iihhvkJ5frOlT5ahkABzBlHf~8EjLqcJ6HofbiQ4McbrHsmAnRggfv2QK0WScF40qDE3kGm~Z57VumO5homkdJPEf9r5DlbMzlArE5-UBH91F0rMMribGjwABXJCjUsAuet0klbUg~~SzCxoNMfTvo3Nn7Mxv7ls51LQf72riNit9zne6WsHynIY7YBHaquVftTi-C6bMZw5A3NoekyA7LBTZwc6sFjrsFrf3s3BpYKeRkBtPdkDMljME9JLDo9wXM7Lf1SfTj1vtNVUDoNhMnTFplDNzyBaDEDm7GVg__&Key-Pair-Id=APKAI7S6A5RYOXBWRPDA"

echo '30855cb96a097c9cab6b02bdb520423f  cellranger-7.0.0.tar.gz' > cellranger-7.0.0.tar.gz.md5

md5sum -c cellranger-7.0.0.tar.gz.md5

tar -xzvf cellranger-7.0.0.tar.gz

export PATH=$(pwd)/cellranger-7.0.0:$PATH

Reference files

The following reference files are used as the basis of the scRNA-Seq Workflow v1.0.0:

We will use the 10x provided reference files version 2020-A. You can get the file by running the following commands:

wget https://cf.10xgenomics.com/supp/cell-exp/refdata-gex-GRCh38-2020-A.tar.gz -O 10x-GRCh38-2020-A.tar.gz

echo "dfd654de39bff23917471e7fcc7a00cd  10x-GRCh38-2020-A.tar.gz" > 10x-GRCh38-2020-A.tar.gz.md5
md5sum -c 10x-GRCh38-2020-A.tar.gz.md5
# > 10x-GRCh38-2020-A.tar.gz: OK

tar -zxf 10x-GRCh38-2020-A.tar.gz
mv refdata-gex-GRCh38-2020-A 10x-GRCh38-2020-A

Workflow

Here are the resulting steps in the scRNA-Seq Workflow v1.0.0 pipeline. There might be slight alterations in the actual implementation, which can be found in the St. Jude Cloud workflows repository.

Run picard ValidateSam on the incoming BAM to ensure that it is well-formed enough to convert back to FastQ.

picard ValidateSamFile \
                  I=$INPUT_BAM \                     # Input BAM.
                  IGNORE=INVALID_PLATFORM_VALUE \    # Validations to ignore.
                  IGNORE=MISSING_PLATFORM_VALUE

Split BAM file into multiple BAMs on the different read groups using samtools split. See the samtools documentation for more information.
```
samtools split -u $UNACCOUNTED_BAM_NAME \ # Reads that do not belong to a read group or the read group is unrecognized go here.
               -f '%*_%!.%.'              # Format of output BAM file names.
```
If the BAM has unaccounted reads, those will need to be triaged and this step will need to be rerun.

Run Picard SamToFastq on each of the BAMs generated in the previous step.

   picard SamToFastq \
          INPUT=$INPUT_BAM \
          FASTQ=$FASTQ_R1 \
          SECOND_END_FASTQ=$FASTQ_R2 \
          RE_REVERSE=true \
          VALIDATION_STRINGENCY=SILENT

Run fq lint on each of the FastQ pairs that was generated in the previous step as a sanity check. You can see the checks that the fq tool performs here.
```
fq lint $FASTQ_R1 $FASTQ_R2 # Files for read 1 and read 2.
```

Run Cell Ranger.

     cellranger count \
                --id=$OUTPUT_DIR_NAME \      # A unique run id and output folder name [a-zA-Z0-9_-]+
                --transcriptome=$REFERENCE \ # Path of folder containing 10x-compatible transcriptome reference
                --fastqs=$FASTQ_DIR \        # Path to input FASTQ data
                --sample=$FASTQ_PREFIX \     # Prefix of the filenames of FASTQs to select
                --localcores=$N_CORES \      # Maximum number of cores to use
                --localmem=$MEM_GB           # Maximum memory (GB) to use

Run picard ValidateSamFile on the aligned and sorted BAM file.

picard ValidateSamFile \
               I=$CELL_RANGER_SORTED_BAM \     # Cell Ranger-aligned, coordinate-sorted BAM.
               IGNORE=INVALID_PLATFORM_VALUE \ # Validations to ignore.
               IGNORE=MISSING_PLATFORM_VALUE

Run ngsderive's strandedness subcommand to confirm that the lab's information on strandedness reflects what was is computed. Manually triage any discrepancies. This is particularly useful for historical samples. Additionally, if the value for strandedness isn't known at run time, we can use the inferred value (if reported).
```
ngsderive strandedness $CELL_RANGER_SORTED_BAM \  # Cell Ranger-aligned, coordinate-sorted BAM.
                       -g $GENCODE_GTF_V32_GZ     # GENCODE v32 GTF (gzipped)
```

St. Jude Cloud RFCs