split_n_cigar_reads

description
Splits reads that contain Ns in their CIGAR strings into multiple reads.
external_help
https://gatk.broadinstitute.org/hc/en-us/articles/360036858811-SplitNCigarReads
outputs
{'split_n_reads_bam': 'BAM file with reads split at N CIGAR elements and updated CIGAR strings.', 'split_n_reads_bam_index': 'Index file for the split BAM', 'split_n_reads_bam_md5': 'MD5 checksum for the split BAM'}

Inputs

Required

  • _runtime (Any, required)
  • bam (File, required): Input BAM format file with unsplit reads containing Ns in their CIGAR strings.
  • bam_index (File, required): BAM index file corresponding to the input BAM
  • dict (File, required): Dictionary file for FASTA format genome
  • fasta (File, required): Reference genome in FASTA format. Must be uncompressed.
  • fasta_index (File, required): Index for FASTA format genome

Defaults

  • memory_gb (Int, default=25): RAM to allocate for task, specified in GB
  • modify_disk_size_gb (Int, default=0): Add to or subtract from dynamic disk space allocation. Default disk size is determined by the size of the inputs. Specified in GB.
  • ncpu (Int, default=8): Number of cores to allocate for task
  • prefix (String, default=basename(bam,".bam") + ".split"): Prefix for the BAM file. The extension .bam will be added.
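The default prefix follows WDL's basename(path, suffix) behavior: drop the directory, strip the given suffix, then append ".split". A minimal Python sketch of that naming rule follows. The helper names and the input path are hypothetical and only illustrate the documented default.

```python
import os


def wdl_basename(path: str, suffix: str = "") -> str:
    """Mimic WDL's basename(): drop the directory, then an optional trailing suffix."""
    name = os.path.basename(path)
    if suffix and name.endswith(suffix):
        name = name[: -len(suffix)]
    return name


def default_split_prefix(bam: str) -> str:
    # Mirrors the documented default: basename(bam, ".bam") + ".split"
    return wdl_basename(bam, ".bam") + ".split"


# Hypothetical input path, for illustration only.
prefix = default_split_prefix("/data/sample1.Aligned.bam")
split_bam_name = prefix + ".bam"  # the task adds the .bam extension
```

Overriding prefix replaces only the part before the .bam extension.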

Outputs

  • split_n_reads_bam (File)
  • split_n_reads_bam_index (File)
  • split_n_reads_bam_md5 (File)

base_recalibrator

description
Generates recalibration report for base quality score recalibration.
external_help
https://gatk.broadinstitute.org/hc/en-us/articles/360036897372-BaseRecalibratorSpark-BETA
outputs
{'recalibration_report': 'Recalibration report file'}

Inputs

Required

  • _runtime (Any, required)
  • bam (File, required): Input BAM format file on which to recalibrate base quality scores
  • bam_index (File, required): BAM index file corresponding to the input BAM
  • dbSNP_vcf (File, required): dbSNP VCF file
  • dbSNP_vcf_index (File, required): dbSNP VCF index file
  • dict (File, required): Dictionary file for FASTA format genome
  • fasta (File, required): Reference genome in FASTA format
  • fasta_index (File, required): Index for FASTA format genome
  • known_indels_sites_indices (Array[File], required): List of VCF index files corresponding to the VCF files in known_indels_sites_vcfs
  • known_indels_sites_vcfs (Array[File], required): List of VCF files containing known indels

Defaults

  • memory_gb (Int, default=25): RAM to allocate for task, specified in GB
  • modify_disk_size_gb (Int, default=0): Add to or subtract from dynamic disk space allocation. Default disk size is determined by the size of the inputs. Specified in GB.
  • ncpu (Int, default=4): Number of cores to allocate for task
  • outfile_name (String, default=basename(bam,".bam") + ".recal.txt"): Name for the output recalibration report.
  • use_original_quality_scores (Boolean, default=false): Use original quality scores from the input BAM. Default is to use recalibrated quality scores.
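modify_disk_size_gb acts as an offset on a disk size that is computed from the inputs. The sketch below illustrates the idea in Python. The multiplier and the 1 GB floor are assumptions made for illustration, since the task's actual formula is not stated here.

```python
import math


def disk_size_gb(input_sizes_gb, modify_disk_size_gb=0, multiplier=2.0):
    """Estimate a disk allocation from input sizes plus a user-supplied offset.

    `multiplier` is a hypothetical headroom factor, not the task's real value.
    """
    base = math.ceil(sum(input_sizes_gb) * multiplier)
    # A negative offset shrinks the allocation; keep at least 1 GB (assumed floor).
    return max(1, base + modify_disk_size_gb)
```

Under these assumptions, a 12.5 GB BAM with 0.1 GB of auxiliary files and modify_disk_size_gb=10 would request 36 GB.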

Outputs

  • recalibration_report (File)

apply_bqsr

description
Applies base quality score recalibration to a BAM file.
external_help
https://gatk.broadinstitute.org/hc/en-us/articles/360040097972-ApplyBQSRSpark-BETA
outputs
{'recalibrated_bam': 'Recalibrated BAM file', 'recalibrated_bam_index': 'Index file for the recalibrated BAM'}

Inputs

Required

  • _runtime (Any, required)
  • bam (File, required): Input BAM format file on which to apply base quality score recalibration
  • bam_index (File, required): BAM index file corresponding to the input BAM
  • recalibration_report (File, required): Recalibration report file

Defaults

  • memory_gb (Int, default=25): RAM to allocate for task, specified in GB
  • modify_disk_size_gb (Int, default=0): Add to or subtract from dynamic disk space allocation. Default disk size is determined by the size of the inputs. Specified in GB.
  • ncpu (Int, default=4): Number of cores to allocate for task
  • prefix (String, default=basename(bam,".bam")): Prefix for the output recalibrated BAM. The extension .bqsr.bam will be added.
  • use_original_quality_scores (Boolean, default=false): Use original quality scores from the input BAM. Default is to use recalibrated quality scores.

Outputs

  • recalibrated_bam (File)
  • recalibrated_bam_index (File)

haplotype_caller

description
Calls germline SNPs and indels via local re-assembly of haplotypes.
external_help
https://gatk.broadinstitute.org/hc/en-us/articles/360037225632-HaplotypeCaller
outputs
{'vcf': 'VCF file containing called variants', 'vcf_index': 'Index file for the VCF'}

Inputs

Required

  • _runtime (Any, required)
  • bam (File, required): Input BAM format file on which to call variants
  • bam_index (File, required): BAM index file corresponding to the input BAM
  • dbSNP_vcf (File, required): dbSNP VCF file
  • dbSNP_vcf_index (File, required): dbSNP VCF index file
  • dict (File, required): Dictionary file for FASTA format genome
  • fasta (File, required): Reference genome in FASTA format
  • fasta_index (File, required): Index for FASTA format genome
  • interval_list (File, required); description: Interval list indicating regions in which to call variants; external_help: https://gatk.broadinstitute.org/hc/en-us/articles/360035531852-Intervals-and-interval-lists

Defaults

  • memory_gb (Int, default=25): RAM to allocate for task, specified in GB
  • modify_disk_size_gb (Int, default=0): Add to or subtract from dynamic disk space allocation. Default disk size is determined by the size of the inputs. Specified in GB.
  • ncpu (Int, default=4): Number of cores to allocate for task
  • prefix (String, default=basename(bam,".bam")): Prefix for the output VCF. The extension .vcf.gz will be added.
  • stand_call_conf (Int, default=20); description: Minimum confidence threshold for calling variants; external_help: https://gatk.broadinstitute.org/hc/en-us/articles/360037225632-HaplotypeCaller#--standard-min-confidence-threshold-for-calling
  • use_soft_clipped_bases (Boolean, default=false): Use soft clipped bases in variant calling. Default is to ignore soft clipped bases.
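stand_call_conf is a Phred-scaled confidence, so the default of 20 corresponds to an error probability of 0.01. A short Python sketch of the conversion follows. The function names are illustrative and not part of the task.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Convert a Phred-scaled value Q to a probability: p = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Convert a probability p to a Phred-scaled value: Q = -10 * log10(p)."""
    return -10 * math.log10(p)
```

Raising stand_call_conf to 30 would tighten the threshold to an error probability of 0.001.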

Outputs

  • vcf (File)
  • vcf_index (File)

variant_filtration

description
Filters variants based on specified criteria.
external_help
https://gatk.broadinstitute.org/hc/en-us/articles/360037434691-VariantFiltration
outputs
{'vcf_filtered': 'Filtered VCF file', 'vcf_filtered_index': 'Index file for the filtered VCF'}

Inputs

Required

  • _runtime (Any, required)
  • dict (File, required): Dictionary file for FASTA format genome
  • fasta (File, required): Reference genome in FASTA format
  • fasta_index (File, required): Index for FASTA format genome
  • vcf (File, required): Input VCF format file to filter
  • vcf_index (File, required): VCF index file corresponding to the input VCF

Defaults

  • cluster (Int, default=3): Number of SNPs that must fall within a window for them to be filtered as a cluster
  • filter_expressions (Array[String], default=["FS > 30.0", "QD < 2.0"]); description: Expressions for the filters; external_help: https://gatk.broadinstitute.org/hc/en-us/articles/360037434691-VariantFiltration#--filter-expression
  • filter_names (Array[String], default=["FS", "QD"]); description: Names of the filters to apply; external_help: https://gatk.broadinstitute.org/hc/en-us/articles/360037434691-VariantFiltration#--filter-name
  • modify_disk_size_gb (Int, default=0): Add to or subtract from dynamic disk space allocation. Default disk size is determined by the size of the inputs. Specified in GB.
  • ncpu (Int, default=1): Number of cores to allocate for task
  • prefix (String, default=basename(vcf,".vcf.gz")): Prefix for the output filtered VCF. The extension .filtered.vcf.gz will be added.
  • window (Int, default=35): Size of the window (in bases) used to detect SNP clusters
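The filter_names and filter_expressions arrays are most naturally read as parallel lists, and cluster and window together describe a sliding-window test. The Python sketch below illustrates both ideas. It assumes names and expressions are paired by position and passed through the --filter-name and --filter-expression options referenced in the help links above. The cluster check is a simplified model, and GATK's exact boundary handling may differ.

```python
def filter_args(filter_names, filter_expressions):
    """Pair each filter name with its expression, by position."""
    if len(filter_names) != len(filter_expressions):
        raise ValueError("filter_names and filter_expressions must have equal length")
    args = []
    for name, expr in zip(filter_names, filter_expressions):
        args += ["--filter-name", name, "--filter-expression", expr]
    return args


def clustered_snps(positions, cluster=3, window=35):
    """Return SNP positions that sit in a run of `cluster` SNPs spanning <= `window` bases.

    Simplified illustration only; not GATK's implementation.
    """
    pos = sorted(positions)
    flagged = set()
    for i in range(len(pos) - cluster + 1):
        j = i + cluster - 1
        if pos[j] - pos[i] + 1 <= window:
            flagged.update(pos[i : j + 1])
    return sorted(flagged)
```

With the defaults, SNPs at positions 100, 110, and 120 would be flagged as a cluster in this model, while an isolated SNP at position 500 would not.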

Outputs

  • vcf_filtered (File)
  • vcf_filtered_index (File)

mark_duplicates_spark

description
Marks duplicate reads in the input BAM file using GATK's Spark implementation of Picard's MarkDuplicates.
external_help
https://gatk.broadinstitute.org/hc/en-us/articles/13832682540699-MarkDuplicatesSpark
outputs
{'duplicate_marked_bam': 'The input BAM with computationally determined duplicates marked.', 'duplicate_marked_bam_index': 'The .bai BAM index file associated with duplicate_marked_bam', 'mark_duplicates_metrics': {'description': 'The METRICS_FILE result of Picard MarkDuplicates', 'external_help': 'http://broadinstitute.github.io/picard/picard-metric-definitions.html#DuplicationMetrics'}}

Inputs

Required

  • _runtime (Any, required)
  • bam (File, required): Input BAM format file in which to mark duplicates

Defaults

  • create_bam (Boolean, default=true); description: If true, create the duplicate-marked BAM. If false, output only the MarkDuplicates metrics.; common: true
  • duplicate_scoring_strategy (String, default="SUM_OF_BASE_QUALITIES"); description: Strategy for scoring duplicates.; choices: ['SUM_OF_BASE_QUALITIES', 'TOTAL_MAPPED_REFERENCE_LENGTH', 'RANDOM']
  • modify_disk_size_gb (Int, default=0): Add to or subtract from dynamic disk space allocation. Default disk size is determined by the size of the inputs. Specified in GB.
  • modify_memory_gb (Int, default=0): Add to or subtract from the default memory allocation. Default memory allocation is determined by the size of the input BAM. Specified in GB.
  • ncpu (Int, default=4): Number of cores to allocate for task
  • optical_distance (Int, default=0): Maximum distance between read coordinates to consider them optical duplicates. If 0, then optical duplicate marking is disabled. Suggested settings of 100 for unpatterned versions of the Illumina platform (e.g. HiSeq) or 2500 for patterned flowcell models (e.g. NovaSeq). Calculation of distance depends on coordinate data embedded in the read names, typically produced by the Illumina sequencing machines. Optical duplicate detection will not work on non-standard names without modifying read_name_regex.
  • prefix (String, default=basename(bam,".bam") + ".MarkDuplicates"): Prefix for the MarkDuplicates result files. The extensions .bam, .bam.bai, and .metrics.txt will be added.
  • read_name_regex (String, default="^[!-9;-?A-~:]+:([!-9;-?A-~]+):([0-9]+):([0-9]+)$"): Regular expression for extracting tile names, x coordinates, and y coordinates from read names. The default works for typical Illumina read names.
  • tagging_policy (String, default="All"); description: Tagging policy for the output BAM.; choices: ['DontTag', 'OpticalOnly', 'All']
  • validation_stringency (String, default="SILENT"); description: Validation stringency for parsing the input BAM.; choices: ['STRICT', 'LENIENT', 'SILENT']; tool_default: STRICT
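Optical duplicate detection depends on read_name_regex capturing the tile, x coordinate, and y coordinate from each read name. The Python sketch below applies the default pattern to a made-up Illumina-style read name. The read names and the helper function are hypothetical, and Python's re module stands in for the regex engine the tool itself uses.

```python
import re

# Default pattern from read_name_regex; the three groups are tile, x, and y.
READ_NAME_REGEX = r"^[!-9;-?A-~:]+:([!-9;-?A-~]+):([0-9]+):([0-9]+)$"


def parse_read_name(name):
    """Return (tile, x, y) for a matching read name, or None if it does not match."""
    m = re.match(READ_NAME_REGEX, name)
    if m is None:
        return None
    tile, x, y = m.groups()
    return tile, int(x), int(y)


# Hypothetical read names, for illustration only.
illumina_style = parse_read_name("A00123:45:HXXXXXXXX:1:1101:1000:2000")
non_standard = parse_read_name("SRR123456.1")
```

A name without the trailing colon-separated fields, such as the second example, does not match, which is why non-standard names need a custom read_name_regex before optical_distance can take effect.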

Outputs

  • duplicate_marked_bam (File?)
  • duplicate_marked_bam_index (File?)
  • mark_duplicates_metrics (File)