Gatk4

split_n_cigar_reads

description: Splits reads that contain Ns in their CIGAR strings into multiple reads.
external_help: https://gatk.broadinstitute.org/hc/en-us/articles/360036858811-SplitNCigarReads
outputs: {'split_n_reads_bam': 'BAM file with reads split at N CIGAR elements and updated CIGAR strings.', 'split_n_reads_bam_index': 'Index file for the split BAM', 'split_n_reads_bam_md5': 'MD5 checksum for the split BAM'}

Inputs

Required

_runtime (Any, required)
bam (File, required): Input BAM format file with unsplit reads containing Ns in their CIGAR strings.
bam_index (File, required): BAM index file corresponding to the input BAM
dict (File, required): Dictionary file for FASTA format genome
fasta (File, required): Reference genome in FASTA format. Must be uncompressed.
fasta_index (File, required): Index for FASTA format genome

Defaults

memory_gb (Int, default=25): RAM to allocate for task, specified in GB
modify_disk_size_gb (Int, default=0): Add to or subtract from dynamic disk space allocation. Default disk size is determined by the size of the inputs. Specified in GB.
ncpu (Int, default=8): Number of cores to allocate for task
prefix (String, default=basename(bam,".bam") + ".split"): Prefix for the output BAM file. The extension .bam will be added.
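
The default prefix above is built with the WDL basename() function, so the output name follows from the input BAM path. A minimal Python sketch of that derivation is shown here; wdl_basename and split_bam_name are hypothetical helper names used only for illustration, and the example path is made up.

```python
from pathlib import PurePosixPath
from typing import Optional


def wdl_basename(path: str, suffix: str = "") -> str:
    """Mimic WDL basename(): drop directories, then strip the suffix if present."""
    name = PurePosixPath(path).name
    if suffix and name.endswith(suffix):
        name = name[: -len(suffix)]
    return name


def split_bam_name(bam: str, prefix: Optional[str] = None) -> str:
    """Return the expected output BAM name for split_n_cigar_reads.

    Assumption: the task appends ".bam" to the prefix, as the prefix
    description states.
    """
    if prefix is None:
        # Default from the task: basename(bam, ".bam") + ".split"
        prefix = wdl_basename(bam, ".bam") + ".split"
    return prefix + ".bam"


print(split_bam_name("/data/sample1.bam"))
print(split_bam_name("/data/sample1.bam", prefix="custom"))
```

With the default prefix, an input named sample1.bam would yield sample1.split.bam. Supplying prefix replaces everything before the .bam extension.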

Outputs

split_n_reads_bam (File)
split_n_reads_bam_index (File)
split_n_reads_bam_md5 (File)

base_recalibrator

description: Generates recalibration report for base quality score recalibration.
external_help: https://gatk.broadinstitute.org/hc/en-us/articles/360036897372-BaseRecalibratorSpark-BETA
outputs: {'recalibration_report': 'Recalibration report file'}

Inputs

Required

_runtime (Any, required)
bam (File, required): Input BAM format file on which to recalibrate base quality scores
bam_index (File, required): BAM index file corresponding to the input BAM
dbSNP_vcf (File, required): dbSNP VCF file
dbSNP_vcf_index (File, required): dbSNP VCF index file
dict (File, required): Dictionary file for FASTA format genome
fasta (File, required): Reference genome in FASTA format
fasta_index (File, required): Index for FASTA format genome
known_indels_sites_indices (Array[File], required): List of VCF index files corresponding to the VCF files in known_indels_sites_vcfs
known_indels_sites_vcfs (Array[File], required): List of VCF files containing known indels

Defaults

memory_gb (Int, default=25): RAM to allocate for task, specified in GB
modify_disk_size_gb (Int, default=0): Add to or subtract from dynamic disk space allocation. Default disk size is determined by the size of the inputs. Specified in GB.
ncpu (Int, default=4): Number of cores to allocate for task
outfile_name (String, default=basename(bam,".bam") + ".recal.txt"): Name for the output recalibration report.
use_original_quality_scores (Boolean, default=false): Use original quality scores from the input BAM. Default is to use recalibrated quality scores.
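
Because known_indels_sites_vcfs and known_indels_sites_indices are separate arrays, it can help to check that they line up before submitting a run. The sketch below is one way to do that in Python. It assumes the two arrays are given in the same order and that an index is named after its VCF with a .tbi or .idx suffix; neither assumption is stated in the task documentation, and check_paired is a hypothetical helper.

```python
from pathlib import PurePosixPath

# Assumption: indices are named "<vcf name>.tbi" (bgzipped VCF) or "<vcf name>.idx".
INDEX_SUFFIXES = (".tbi", ".idx")


def check_paired(vcfs, indices):
    """Return a list of problems; an empty list means the arrays look consistent."""
    problems = []
    if len(vcfs) != len(indices):
        problems.append(f"{len(vcfs)} VCFs but {len(indices)} indices")
    # zip() stops at the shorter list, so extra entries are only
    # reported by the length check above.
    for vcf, idx in zip(vcfs, indices):
        vcf_name = PurePosixPath(vcf).name
        idx_name = PurePosixPath(idx).name
        if not any(idx_name == vcf_name + suffix for suffix in INDEX_SUFFIXES):
            problems.append(f"{idx_name} does not look like an index for {vcf_name}")
    return problems


vcfs = ["refs/known_a.vcf.gz", "refs/known_b.vcf.gz"]  # made-up paths
indices = ["refs/known_a.vcf.gz.tbi", "refs/known_b.vcf.gz.tbi"]
print(check_paired(vcfs, indices))
```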

Outputs

recalibration_report (File)

apply_bqsr

description: Applies base quality score recalibration to a BAM file.
external_help: https://gatk.broadinstitute.org/hc/en-us/articles/360040097972-ApplyBQSRSpark-BETA
outputs: {'recalibrated_bam': 'Recalibrated BAM file', 'recalibrated_bam_index': 'Index file for the recalibrated BAM'}

Inputs

Required

_runtime (Any, required)
bam (File, required): Input BAM format file on which to apply base quality score recalibration
bam_index (File, required): BAM index file corresponding to the input BAM
recalibration_report (File, required): Recalibration report file

Defaults

memory_gb (Int, default=25): RAM to allocate for task, specified in GB
modify_disk_size_gb (Int, default=0): Add to or subtract from dynamic disk space allocation. Default disk size is determined by the size of the inputs. Specified in GB.
ncpu (Int, default=4): Number of cores to allocate for task
prefix (String, default=basename(bam,".bam")): Prefix for the output recalibrated BAM. The extension .bqsr.bam will be added.
use_original_quality_scores (Boolean, default=false): Use original quality scores from the input BAM. Default is to use recalibrated quality scores.

Outputs

recalibrated_bam (File)
recalibrated_bam_index (File)

haplotype_caller

description: Calls germline SNPs and indels via local re-assembly of haplotypes.
external_help: https://gatk.broadinstitute.org/hc/en-us/articles/360037225632-HaplotypeCaller
outputs: {'vcf': 'VCF file containing called variants', 'vcf_index': 'Index file for the VCF'}

Inputs

Required

_runtime (Any, required)
bam (File, required): Input BAM format file on which to call variants
bam_index (File, required): BAM index file corresponding to the input BAM
dbSNP_vcf (File, required): dbSNP VCF file
dbSNP_vcf_index (File, required): dbSNP VCF index file
dict (File, required): Dictionary file for FASTA format genome
fasta (File, required): Reference genome in FASTA format
fasta_index (File, required): Index for FASTA format genome
interval_list (File, required); description: Interval list indicating regions in which to call variants; external_help: https://gatk.broadinstitute.org/hc/en-us/articles/360035531852-Intervals-and-interval-lists

Defaults

memory_gb (Int, default=25): RAM to allocate for task, specified in GB
modify_disk_size_gb (Int, default=0): Add to or subtract from dynamic disk space allocation. Default disk size is determined by the size of the inputs. Specified in GB.
ncpu (Int, default=4): Number of cores to allocate for task
prefix (String, default=basename(bam,".bam")): Prefix for the output VCF. The extension .vcf.gz will be added.
stand_call_conf (Int, default=20); description: Minimum confidence threshold for calling variants; external_help: https://gatk.broadinstitute.org/hc/en-us/articles/360037225632-HaplotypeCaller#--standard-min-confidence-threshold-for-calling
use_soft_clipped_bases (Boolean, default=false): Use soft clipped bases in variant calling. Default is to ignore soft clipped bases.

Outputs

vcf (File)
vcf_index (File)

variant_filtration

description: Filters variants based on specified criteria.
external_help: https://gatk.broadinstitute.org/hc/en-us/articles/360037434691-VariantFiltration
outputs: {'vcf_filtered': 'Filtered VCF file', 'vcf_filtered_index': 'Index file for the filtered VCF'}

Inputs

Required

_runtime (Any, required)
dict (File, required): Dictionary file for FASTA format genome
fasta (File, required): Reference genome in FASTA format
fasta_index (File, required): Index for FASTA format genome
vcf (File, required): Input VCF format file to filter
vcf_index (File, required): VCF index file corresponding to the input VCF

Defaults

cluster (Int, default=3): Number of SNPs that must be present in a window to filter
filter_expressions (Array[String], default=["FS > 30.0", "QD < 2.0"]); description: Expressions for the filters; external_help: https://gatk.broadinstitute.org/hc/en-us/articles/360037434691-VariantFiltration#--filter-expression
filter_names (Array[String], default=["FS", "QD"]); description: Names of the filters to apply; external_help: https://gatk.broadinstitute.org/hc/en-us/articles/360037434691-VariantFiltration#--filter-name
modify_disk_size_gb (Int, default=0): Add to or subtract from dynamic disk space allocation. Default disk size is determined by the size of the inputs. Specified in GB.
ncpu (Int, default=1): Number of cores to allocate for task
prefix (String, default=basename(vcf,".vcf.gz")): Prefix for the output filtered VCF. The extension .filtered.vcf.gz will be added.
window (Int, default=35): Size of the window (in bases) for filtering
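
The filter_names and filter_expressions arrays are parallel: the name at position i labels records that match the expression at position i. The Python sketch below illustrates that pairing with the default values. It is a simplification: GATK evaluates these expressions with JEXL, whereas this sketch only understands "ANNOTATION operator number", and it assumes a missing annotation does not trigger a filter. failed_filters is a hypothetical helper, and the annotation values are made up.

```python
import operator
import re

_OPS = {
    ">": operator.gt,
    "<": operator.lt,
    ">=": operator.ge,
    "<=": operator.le,
    "==": operator.eq,
}
# Matches simple comparisons such as "FS > 30.0"; anything richer is rejected.
_SIMPLE = re.compile(r"^\s*(\w+)\s*(>=|<=|==|>|<)\s*(-?\d+(?:\.\d+)?)\s*$")


def failed_filters(annotations, names, expressions):
    """Return the filter names whose expression is true for one variant record."""
    if len(names) != len(expressions):
        raise ValueError("filter_names and filter_expressions must have the same length")
    failed = []
    for name, expr in zip(names, expressions):
        match = _SIMPLE.match(expr)
        if match is None:
            raise ValueError(f"unsupported expression: {expr!r}")
        key, op, threshold = match.groups()
        if key not in annotations:
            continue  # assumption: a missing annotation does not trigger the filter
        if _OPS[op](annotations[key], float(threshold)):
            failed.append(name)
    return failed


names = ["FS", "QD"]
expressions = ["FS > 30.0", "QD < 2.0"]
print(failed_filters({"FS": 45.2, "QD": 10.0}, names, expressions))
```

A record with FS of 45.2 and QD of 10.0 matches only the first expression, so it would be labeled with the FS filter name.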

Outputs

vcf_filtered (File)
vcf_filtered_index (File)

mark_duplicates_spark

description: Marks duplicate reads in the input BAM file using GATK's Spark implementation of Picard's MarkDuplicates.
external_help: https://gatk.broadinstitute.org/hc/en-us/articles/13832682540699-MarkDuplicatesSpark
outputs: {'duplicate_marked_bam': 'The input BAM with computationally determined duplicates marked.', 'duplicate_marked_bam_index': 'The .bai BAM index file associated with duplicate_marked_bam', 'mark_duplicates_metrics': {'description': 'The METRICS_FILE result of picard MarkDuplicates', 'external_help': 'http://broadinstitute.github.io/picard/picard-metric-definitions.html#DuplicationMetrics'}}

Inputs

Required

_runtime (Any, required)
bam (File, required): Input BAM format file in which to mark duplicates

Defaults

create_bam (Boolean, default=true); description: Create the duplicate-marked BAM (true), or output only the MarkDuplicates metrics (false).; common: true
duplicate_scoring_strategy (String, default="SUM_OF_BASE_QUALITIES"); description: Strategy for scoring duplicates.; choices: ['SUM_OF_BASE_QUALITIES', 'TOTAL_MAPPED_REFERENCE_LENGTH', 'RANDOM']
modify_disk_size_gb (Int, default=0): Add to or subtract from dynamic disk space allocation. Default disk size is determined by the size of the inputs. Specified in GB.
modify_memory_gb (Int, default=0): Add to or subtract from the default memory allocation. Default memory allocation is determined by the size of the input BAM. Specified in GB.
ncpu (Int, default=4): Number of cores to allocate for task
optical_distance (Int, default=0): Maximum distance between read coordinates to consider them optical duplicates. If 0, optical duplicate marking is disabled. Suggested settings are 100 for unpatterned versions of the Illumina platform (e.g. HiSeq) or 2500 for patterned flowcell models (e.g. NovaSeq). The distance calculation depends on coordinate data embedded in the read names, typically produced by Illumina sequencing machines. Optical duplicate detection will not work on non-standard read names without modifying read_name_regex.
prefix (String, default=basename(bam,".bam") + ".MarkDuplicates"): Prefix for the MarkDuplicates result files. The extensions .bam, .bam.bai, and .metrics.txt will be added.
read_name_regex (String, default="^[!-9;-?A-~:]+:([!-9;-?A-~]+):([0-9]+):([0-9]+)$"): Regular expression for extracting tile names, x coordinates, and y coordinates from read names. The default works for typical Illumina read names.
tagging_policy (String, default="All"); description: Tagging policy for the output BAM.; choices: ['DontTag', 'OpticalOnly', 'All']
validation_stringency (String, default="SILENT"); description: Validation stringency for parsing the input BAM.; choices: ['STRICT', 'LENIENT', 'SILENT']; tool_default: STRICT
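
To make the interaction between read_name_regex and optical_distance concrete, the Python sketch below applies the default regular expression to an Illumina-style read name and compares two reads against a distance threshold. The read names are made up, and the per-axis comparison within a single tile is an assumption for illustration; the actual tool applies additional criteria (for example, it only compares reads already identified as duplicates of one another). parse_location and within_optical_distance are hypothetical helpers.

```python
import re

# Default read_name_regex from the task; the three groups are tile, x, and y.
READ_NAME_REGEX = re.compile(r"^[!-9;-?A-~:]+:([!-9;-?A-~]+):([0-9]+):([0-9]+)$")


def parse_location(read_name):
    """Return (tile, x, y) from a read name, or None if the name does not match."""
    match = READ_NAME_REGEX.match(read_name)
    if match is None:
        return None
    tile, x, y = match.groups()
    return tile, int(x), int(y)


def within_optical_distance(name_a, name_b, optical_distance):
    """Illustrative check: same tile and both coordinates within the threshold."""
    if optical_distance <= 0:
        return False  # 0 disables optical duplicate marking
    loc_a, loc_b = parse_location(name_a), parse_location(name_b)
    if loc_a is None or loc_b is None:
        return False  # non-standard names cannot be compared
    same_tile = loc_a[0] == loc_b[0]
    close_x = abs(loc_a[1] - loc_b[1]) <= optical_distance
    close_y = abs(loc_a[2] - loc_b[2]) <= optical_distance
    return same_tile and close_x and close_y


read_a = "A00123:45:HXXXXXXXX:1:1101:1234:5678"
read_b = "A00123:45:HXXXXXXXX:1:1101:1300:5700"
print(parse_location(read_a))  # ('1101', 1234, 5678)
print(within_optical_distance(read_a, read_b, 100))  # True
```

A read name without the colon-delimited tile and coordinate fields does not match the default expression, which is why non-standard names need a custom read_name_regex.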

Outputs

duplicate_marked_bam (File?)
duplicate_marked_bam_index (File?)
mark_duplicates_metrics (File)