Samtools

Homepage

quickcheck

description: Runs Samtools quickcheck on the input BAM file. This checks that the BAM file appears to be intact, e.g. header exists and the end-of-file marker exists.
outputs: {'check': 'Dummy output to enable caching'}

Inputs

Required

_runtime (Any, required)
bam (File, required): Input BAM format file to quickcheck

Defaults

modify_disk_size_gb (Int, default=0): Add to or subtract from dynamic disk space allocation. Default disk size is determined by the size of the inputs. Specified in GB.

Outputs

check (String)
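
As a rough illustration of the kind of check described above, the following Python sketch tests two properties of a BAM file: that it starts with the gzip magic bytes and that it ends with the 28-byte empty BGZF block that the SAM/BAM specification defines as the end-of-file marker. The function name looks_intact is hypothetical, and this is a simplified stand-in for part of what quickcheck inspects, not a reimplementation of it.

```python
# 28-byte empty BGZF block that terminates a BAM file (from the SAM/BAM spec).
BGZF_EOF = bytes.fromhex(
    "1f 8b 08 04 00 00 00 00 00 ff 06 00 42 43 02 00"
    " 1b 00 03 00 00 00 00 00 00 00 00 00"
)
GZIP_MAGIC = b"\x1f\x8b"


def looks_intact(path):
    """Return True if the file starts like BGZF and ends with the EOF block.

    Hypothetical helper: it only inspects the first 2 and last 28 bytes.
    """
    with open(path, "rb") as handle:
        head = handle.read(len(GZIP_MAGIC))
        handle.seek(0, 2)  # jump to the end to learn the file size
        if handle.tell() < len(BGZF_EOF):
            return False
        handle.seek(-len(BGZF_EOF), 2)
        tail = handle.read()
    return head == GZIP_MAGIC and tail == BGZF_EOF
```

A truncated upload or an interrupted write usually loses the trailing block, which is why checking the tail is a cheap first test before running heavier tools.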

split

description: Runs Samtools split on the input BAM file. This splits the BAM by read group into one or more output files. It optionally errors if there are reads present that do not belong to a read group.
outputs: {'split_bams': 'The split BAM files. The extensions will contain read group IDs, and will end in .bam.'}

Inputs

Required

_runtime (Any, required)
bam (File, required); description: Input BAM format file to split; stream: true

Defaults

modify_disk_size_gb (Int, default=0): Add to or subtract from dynamic disk space allocation. Default disk size is determined by the size of the inputs. Specified in GB.
ncpu (Int, default=2); description: Number of cores to allocate for task; common: true
prefix (String, default=basename(bam,".bam")): Prefix for the split BAM files. The extensions will contain read group IDs, and will end in .bam.
reject_empty_output (Boolean, default=true); description: If true, error if any output BAMs are empty.; common: true
reject_unaccounted_reads (Boolean, default=true); description: If true, error if there are reads present that do not have read group information matching the header.; common: true
use_all_cores (Boolean, default=false); description: Use all cores? Recommended for cloud environments.; common: true

Outputs

split_bams (Array[File])

flagstat

description: Produces a samtools flagstat report containing statistics about the alignments based on the bit flags set in the BAM
outputs: {'flagstat_report': 'samtools flagstat STDOUT redirected to a file'}

Inputs

Required

_runtime (Any, required)
bam (File, required): Input BAM format file to generate flagstat for

Defaults

modify_disk_size_gb (Int, default=0): Add to or subtract from dynamic disk space allocation. Default disk size is determined by the size of the inputs. Specified in GB.
ncpu (Int, default=2); description: Number of cores to allocate for task; common: true
outfile_name (String, default=basename(bam,".bam") + ".flagstat.txt"): Name for the flagstat report file
use_all_cores (Boolean, default=false); description: Use all cores? Recommended for cloud environments.; common: true

Outputs

flagstat_report (File)
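
Several defaults in this file are written as WDL expressions such as basename(bam,".bam") + ".flagstat.txt". The Python sketch below mimics that expression to show what the default resolves to for a given input path. The helper names wdl_basename and default_flagstat_name are hypothetical, and the sketch assumes POSIX-style paths.

```python
import posixpath


def wdl_basename(path, suffix=""):
    """Mimic WDL's basename(): drop the directory, then an optional suffix."""
    name = posixpath.basename(path)
    if suffix and name.endswith(suffix):
        name = name[: -len(suffix)]
    return name


def default_flagstat_name(bam):
    """Mirror the documented default for outfile_name."""
    return wdl_basename(bam, ".bam") + ".flagstat.txt"
```

For an input of /data/sample.bam this yields sample.flagstat.txt. If the input does not end in .bam, the suffix is left in place, so the report name keeps the original extension.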

index

description: Creates a .bai BAM index for the input BAM
outputs: {'bam_index': "A .bai BAM index associated with the input BAM. Filename will be basename(bam) + '.bai'."}

Inputs

Required

_runtime (Any, required)
bam (File, required): Input BAM format file to index

Defaults

modify_disk_size_gb (Int, default=0): Add to or subtract from dynamic disk space allocation. Default disk size is determined by the size of the inputs. Specified in GB.
ncpu (Int, default=2); description: Number of cores to allocate for task; common: true
use_all_cores (Boolean, default=false); description: Use all cores? Recommended for cloud environments.; common: true

Outputs

bam_index (File)

subsample

description: Randomly subsamples the input BAM, in order to produce an output BAM with approximately the desired number of reads.
help: A desired_reads greater than zero must be supplied. A desired_reads <= 0 will result in task failure. Sampling is probabilistic, so the output read count will approximate desired_reads rather than match it exactly. A sampled_bam will not be produced if the input BAM read count is less than or equal to desired_reads.
outputs: {'orig_read_count': 'A TSV report containing the original read count before subsampling. If subsampling was requested but the input BAM had fewer reads than desired_reads, no read count will be filled in (instead there will be a dash).', 'sampled_bam': 'The subsampled input BAM. Only present if subsampling was performed.'}

Inputs

Required

_runtime (Any, required)
bam (File, required): Input BAM format file to subsample
desired_reads (Int, required): How many reads should be in the output BAM? Output BAM read count will be approximate to this value. Must be greater than zero. A desired_reads <= 0 will result in task failure.

Defaults

modify_disk_size_gb (Int, default=0): Add to or subtract from dynamic disk space allocation. Default disk size is determined by the size of the inputs. Specified in GB.
ncpu (Int, default=2); description: Number of cores to allocate for task; common: true
prefix (String, default=basename(bam,".bam")): Prefix for the BAM file. The extension .sampled.bam will be added.
use_all_cores (Boolean, default=false); description: Use all cores? Recommended for cloud environments.; common: true

Outputs

orig_read_count (File)
sampled_bam (File?)
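
The decision logic described in the help text can be sketched in Python as follows. The function name plan_subsample is hypothetical, and computing the sampling fraction as desired_reads divided by the input read count is an assumption about how the task derives its rate; the documentation only states that sampling is probabilistic.

```python
def plan_subsample(total_reads, desired_reads):
    """Return a sampling fraction, or None when no sampled BAM is produced.

    Assumption: the fraction is desired_reads / total_reads.
    """
    if desired_reads <= 0:
        # Documented behaviour: desired_reads <= 0 results in task failure.
        raise ValueError("desired_reads must be greater than zero")
    if total_reads <= desired_reads:
        # Documented behaviour: no sampled_bam in this case.
        return None
    return desired_reads / total_reads
```

Because each read is kept independently with this probability, the output read count varies around desired_reads from run to run.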

filter

description: Filters a BAM based on its bitwise flag value.
help: This task is a wrapper around samtools view. This task will fail if there are no reads in the output BAM. This can happen either because the input BAM was empty or because the supplied bitwise_filter was too strict. If you want to down-sample a BAM, use the subsample task instead.
outputs: {'filtered_bam': 'BAM file that has been filtered based on the input flags'}

Inputs

Required

_runtime (Any, required)
bam (File, required): Input BAM format file to filter
bitwise_filter (FlagFilter, required): A set of 4 possible read filters to apply. This is a FlagFilter object (see ../data_structures/flag_filter.wdl for more information).

Defaults

modify_disk_size_gb (Int, default=0): Add to or subtract from dynamic disk space allocation. Default disk size is determined by the size of the inputs. Specified in GB.
ncpu (Int, default=2); description: Number of cores to allocate for task; common: true
prefix (String, default=basename(bam,".bam") + ".filtered"): Prefix for the filtered BAM file. The extension .bam will be added.
use_all_cores (Boolean, default=false); description: Use all cores? Recommended for cloud environments.; common: true

Outputs

filtered_bam (File)
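
To make the four FlagFilter fields concrete, the Python sketch below evaluates a single read's bitwise flag against a filter. The semantics are inferred from the field names and are assumed to correspond to the samtools view options -f, -F, --rf, and -G; see flag_filter.wdl for the authoritative definition. The function name passes_filter is hypothetical.

```python
def passes_filter(flag, flag_filter):
    """Return True if a read with this bitwise flag would be kept.

    flag_filter maps the four FlagFilter field names to strings such as
    "0x900" or "2304" (hex or decimal).
    """
    include_if_all = int(flag_filter["include_if_all"], 0)
    exclude_if_any = int(flag_filter["exclude_if_any"], 0)
    include_if_any = int(flag_filter["include_if_any"], 0)
    exclude_if_all = int(flag_filter["exclude_if_all"], 0)

    if flag & include_if_all != include_if_all:
        return False  # at least one required bit is missing
    if flag & exclude_if_any:
        return False  # at least one forbidden bit is set
    if include_if_any and not flag & include_if_any:
        return False  # none of the "any of these" bits is set
    if exclude_if_all and flag & exclude_if_all == exclude_if_all:
        return False  # every bit of the "all of these" mask is set
    return True


# Filter that drops secondary (0x100) and supplementary (0x800) reads.
NO_SECONDARY_OR_SUPPLEMENTARY = {
    "include_if_all": "0x0",
    "exclude_if_any": "0x900",
    "include_if_any": "0x0",
    "exclude_if_all": "0x0",
}
```

With this filter, a primary read with flag 99 is kept, while a read with the secondary bit (flag 355) or the supplementary bit (flag 2147) is dropped. A filter strict enough to reject every read is what causes the task failure described in the help text.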

merge

description: Merges multiple sorted BAMs into a single BAM
outputs: {'merged_bam': 'The BAM resulting from merging all the input BAMs'}

Inputs

Required

_runtime (Any, required)
bams (Array[File], required): An array of BAMs to merge into one combined BAM
prefix (String, required): Prefix for the BAM file. The extension .bam will be added.

Optional

new_header (File?): Use the lines of FILE as @ headers to be copied to the merged BAM, replacing any header lines that would otherwise be copied from the first BAM file in the list. (File may actually be in SAM format, though any alignment records it may contain are ignored.)

Defaults

attach_rg (Boolean, default=true); description: Attach an RG tag to each alignment. The tag value is inferred from file names.; common: true
combine_pg (Boolean, default=true); description: Similarly to combine_rg: for each @PG ID in the set of files to merge, use the @PG line of the first file we find that ID in rather than adding a suffix to differentiate similar IDs.; common: true
combine_rg (Boolean, default=true); description: When several input files contain @RG headers with the same ID, emit only one of them (namely, the header line from the first file we find that ID in) to the merged output file. Combining these similar headers is usually the right thing to do when the files being merged originated from the same file. Without -c, all @RG headers appear in the output file, with random suffixes added to their IDs where necessary to differentiate them.; common: true
modify_disk_size_gb (Int, default=0): Add to or subtract from dynamic disk space allocation. Default disk size is determined by the size of the inputs. Specified in GB.
name_sorted (Boolean, default=false); description: Are all input BAMs queryname sorted (true)? Or are all input BAMs coordinate sorted (false)?; common: true
ncpu (Int, default=2); description: Number of cores to allocate for task; common: true
region (String, default=""): Merge files in the specified region (Format: chr:start-end)
use_all_cores (Boolean, default=false); description: Use all cores? Recommended for cloud environments.; common: true

Outputs

merged_bam (File)

addreplacerg

description: Adds or replaces read group tags
outputs: {'tagged_bam': 'The transformed input BAM after read group modifications have been applied'}

Inputs

Required

_runtime (Any, required)
bam (File, required): Input BAM format file to add read group information to

Optional

read_group_id (String?): Allows you to specify the read group ID of an existing @RG line and applies it to the reads specified by the orphan_only option

Defaults

modify_disk_size_gb (Int, default=0): Add to or subtract from dynamic disk space allocation. Default disk size is determined by the size of the inputs. Specified in GB.
ncpu (Int, default=2); description: Number of cores to allocate for task; common: true
orphan_only (Boolean, default=true); description: Only add RG tags to orphans (true)? Or also overwrite all existing RG tags (including any in the header) (false)?; common: true
overwrite_header_record (Boolean, default=false); description: Overwrite an existing @RG line, if a new one with the same ID value is provided?; common: true
prefix (String, default=basename(bam,".bam") + ".addreplacerg"): Prefix for the BAM file. The extension .bam will be added.
read_group_line (Array[String], default=[]); description: Allows you to specify a read group line to append to (or replace in) the header and applies it to the reads specified by the orphan_only option. Each String in the Array should correspond to one field of the read group line. Tab literals will be inserted between each entry in the final BAM. Only one read group line can be supplied per invocation of this task.; common: true
use_all_cores (Boolean, default=false); description: Use all cores? Recommended for cloud environments.; common: true

Outputs

tagged_bam (File)
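
The read_group_line input takes one array entry per field, with tabs inserted between entries. The Python sketch below shows how such an array would flatten into a single tab-delimited line. The function name and the example field values are hypothetical, and whether a leading @RG token has to be supplied as its own entry is not stated here, so the example omits it.

```python
def build_read_group_line(fields):
    """Join read group fields with tab characters, one field per entry."""
    for field in fields:
        if "\t" in field:
            raise ValueError("each entry must be a single field, without tabs")
    return "\t".join(fields)


# Hypothetical example values for a single read group.
example_fields = ["ID:rg1", "SM:sample1", "PL:ILLUMINA"]
example_line = build_read_group_line(example_fields)
```

Here example_line is the three fields separated by two tab characters. Keeping each field as its own array entry avoids having to escape tab literals in a JSON inputs file.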

collate

description: Runs samtools collate on the input BAM file. Shuffles and groups reads together by their names.
outputs: {'collated_bam': 'A collated BAM (reads sharing a name next to each other, no other guarantee of sort order)'}

Inputs

Required

_runtime (Any, required)
bam (File, required): Input BAM format file to collate

Defaults

fast_mode (Boolean, default=true); description: Use fast mode (output primary alignments only)?; common: true
modify_disk_size_gb (Int, default=0): Add to or subtract from dynamic disk space allocation. Default disk size is determined by the size of the inputs. Specified in GB.
modify_memory_gb (Int, default=0): Add to or subtract from dynamic memory allocation. Default memory is determined by the size of the inputs. Specified in GB.
ncpu (Int, default=2); description: Number of cores to allocate for task; common: true
prefix (String, default=basename(bam,".bam") + ".collated"): Prefix for the collated BAM file. The extension .bam will be added.
use_all_cores (Boolean, default=false); description: Use all cores? Recommended for cloud environments.; common: true

Outputs

collated_bam (File)

bam_to_fastq

description: Converts an input BAM file into FASTQ(s) using samtools fastq.
help: If paired_end == false, then all reads in the BAM will be output to a single FASTQ file. Use bitwise_filter argument to remove any unwanted reads. An exit-code of 42 indicates that no reads were present in the output FASTQs. An exit-code of 43 indicates that unexpected reads were discovered in the input BAM.
outputs: {'collated_bam': 'A collated BAM (reads sharing a name next to each other, no other guarantee of sort order). Only generated if retain_collated_bam and paired_end are both true. Has the name ~{prefix}.collated.bam.', 'read_one_fastq_gz': 'Gzipped FASTQ file with 1st reads in pair. Only generated if paired_end is true and interleaved is false. Has the name ~{prefix}.R1.fastq.gz.', 'read_two_fastq_gz': 'Gzipped FASTQ file with 2nd reads in pair. Only generated if paired_end is true and interleaved is false. Has the name ~{prefix}.R2.fastq.gz.', 'singleton_reads_fastq_gz': 'Gzipped FASTQ containing singleton reads. Only generated if paired_end and output_singletons are both true. Has the name ~{prefix}.singleton.fastq.gz.', 'interleaved_reads_fastq_gz': 'Interleaved gzipped Paired-End FASTQ. Only generated if paired_end and interleaved are both true. Has the name ~{prefix}.fastq.gz. The conditions under which this output and single_end_reads_fastq_gz are created are mutually exclusive, but since they share the same literal filename they will always evaluate to the same file (or undefined if neither is created).', 'single_end_reads_fastq_gz': 'A gzipped FASTQ containing all reads. Only generated if paired_end is false. Has the name ~{prefix}.fastq.gz. The conditions under which this output and interleaved_reads_fastq_gz are created are mutually exclusive, but since they share the same literal filename they will always evaluate to the same file (or undefined if neither is created).'}
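
The conditions in the outputs description can be summarized as a small Python function that maps input settings to the output files expected to exist. This is a sketch derived from the descriptions in this section, not from the task source; the function name expected_outputs is hypothetical. It also assumes that no collated BAM is produced when collated is true, since retain_collated_bam is documented as ignored in that case.

```python
def expected_outputs(prefix, paired_end=True, interleaved=False,
                     output_singletons=False, retain_collated_bam=False,
                     collated=False):
    """Map each defined output name to its file name for the given settings."""
    outputs = {}
    if paired_end and retain_collated_bam and not collated:
        outputs["collated_bam"] = prefix + ".collated.bam"
    if paired_end and not interleaved:
        outputs["read_one_fastq_gz"] = prefix + ".R1.fastq.gz"
        outputs["read_two_fastq_gz"] = prefix + ".R2.fastq.gz"
    if paired_end and output_singletons:
        outputs["singleton_reads_fastq_gz"] = prefix + ".singleton.fastq.gz"
    if (paired_end and interleaved) or not paired_end:
        # Both outputs share the literal name prefix.fastq.gz, so they
        # resolve to the same file whenever either condition holds.
        outputs["interleaved_reads_fastq_gz"] = prefix + ".fastq.gz"
        outputs["single_end_reads_fastq_gz"] = prefix + ".fastq.gz"
    return outputs
```

With the default settings only the R1 and R2 files are defined. Downstream code that needs to tell interleaved data from single-end data should rely on the paired_end and interleaved inputs rather than on which of the two shared-name outputs is defined.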

Inputs

Required

_runtime (Any, required)
bam (File, required): Input BAM format file to convert to FASTQ(s)

Defaults

append_read_number (Boolean, default=true); description: Append /1 and /2 suffixes to read names?; common: true
bitwise_filter (FlagFilter, default={"include_if_all": "0x0", "exclude_if_any": "0x900", "include_if_any": "0x0", "exclude_if_all": "0x0"}): A set of 4 possible read filters to apply during conversion to FASTQ. This is a FlagFilter object (see ../data_structures/flag_filter.wdl for more information). By default, it will remove secondary and supplementary reads from the output FASTQs.
collated (Boolean, default=false); description: Is the BAM collated (or name-sorted)? If collated == true, then the input BAM will be run through samtools fastq without preprocessing. If collated == false, then samtools collate must be run on the input BAM before conversion to FASTQ. Ignored if paired_end == false.; common: true
fail_on_unexpected_reads (Boolean, default=false): The definition of 'unexpected' depends on whether the values of paired_end and output_singletons are true or false. If paired_end is false, no reads are considered unexpected, and every read (not caught by bitwise_filter) will be present in the resulting FASTQ regardless of first/last bit settings. This setting will be ignored in that case. If paired_end is true then reads that don't satisfy first XOR last are considered unexpected (i.e. reads that have neither first nor last set or reads that have both first and last set). If output_singletons is false, singleton reads are considered unexpected. A singleton read is a read with either the first or the last bit set (but not both) and that possesses a unique QNAME; i.e. it is a read without a pair when all reads are expected to be paired. But if output_singletons is true, these singleton reads will be output as their own FASTQ instead of causing the task to fail. If fail_on_unexpected_reads is false, then all the above cases will be ignored. Any 'unexpected' reads will be silently discarded.; description: Should the task fail if reads with an unexpected first/last bit setting are discovered?; common: true
fast_mode (Boolean, default=!retain_collated_bam); description: Fast mode for samtools collate? If true, this removes secondary and supplementary reads during the collate step. If false, secondary and supplementary reads will be retained in the collated_bam output (if created). Defaults to the opposite of retain_collated_bam. Ignored if collated == true or paired_end == false.; common: true
interleaved (Boolean, default=false); description: Create an interleaved FASTQ file from Paired-End data? Ignored if paired_end == false.; common: true
modify_disk_size_gb (Int, default=0): Add to or subtract from dynamic disk space allocation. Default disk size is determined by the size of the inputs. Specified in GB.
modify_memory_gb (Int, default=0): Add to or subtract from dynamic memory allocation. Default memory is determined by the size of the inputs. Specified in GB.
ncpu (Int, default=2); description: Number of cores to allocate for task; common: true
output_singletons (Boolean, default=false): Output singleton reads as their own FASTQ? Ignored if paired_end == false.
paired_end (Boolean, default=true); description: Is the data Paired-End? If paired_end == false, then all reads in the BAM will be output to a single FASTQ file. Use bitwise_filter argument to remove any unwanted reads.; common: true
prefix (String, default=basename(bam,".bam")): Prefix for the collated BAM and FASTQ files. The extensions .collated.bam and [,.R1,.R2,.singleton].fastq.gz will be added.
retain_collated_bam (Boolean, default=false); description: Save the collated BAM to disk and output it (true)? This slows performance and substantially increases storage requirements. Be aware that collated BAMs occupy much more space than either position sorted or name sorted BAMs (due to the compression algorithm). Ignored if collated == true or paired_end == false.; common: true
use_all_cores (Boolean, default=false); description: Use all cores? Recommended for cloud environments.; common: true

Outputs

collated_bam (File?)
read_one_fastq_gz (File?)
read_two_fastq_gz (File?)
singleton_reads_fastq_gz (File?)
interleaved_reads_fastq_gz (File?)
single_end_reads_fastq_gz (File?)
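
The definition of an unexpected read given under fail_on_unexpected_reads can be sketched in Python as follows. The function name unexpected_reads and the (QNAME, flag) tuple representation are hypothetical; the sketch assumes the reads have already passed bitwise_filter and uses the standard SAM flag bits 0x40 (first in pair) and 0x80 (last in pair).

```python
from collections import Counter

FIRST_IN_PAIR = 0x40
LAST_IN_PAIR = 0x80


def unexpected_reads(reads, paired_end=True, output_singletons=False):
    """Return the (qname, flag) tuples considered unexpected.

    reads: list of (qname, flag) tuples that already passed bitwise_filter.
    """
    if not paired_end:
        return []  # nothing is unexpected for single-end data
    name_counts = Counter(qname for qname, _ in reads)
    unexpected = []
    for qname, flag in reads:
        is_first = bool(flag & FIRST_IN_PAIR)
        is_last = bool(flag & LAST_IN_PAIR)
        if is_first == is_last:
            # Neither bit or both bits set: fails "first XOR last".
            unexpected.append((qname, flag))
        elif name_counts[qname] == 1 and not output_singletons:
            # A singleton: exactly one of the bits set and a unique QNAME.
            unexpected.append((qname, flag))
    return unexpected
```

With fail_on_unexpected_reads set to true, a non-empty result corresponds to the failure case (exit code 43 per the help text); with it set to false, these reads are silently discarded.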

fixmate

description: Runs samtools fixmate on the name-collated input BAM file. This fills in mate coordinates and insert size fields among other tags and fields.
help: This task assumes a name-sorted or name-collated input BAM. If you have a position-sorted BAM, please use the position_sorted_fixmate task. This task runs fixmate and outputs a BAM in the same order as the input.
outputs: {'fixmate_bam': 'The BAM resulting from running samtools fixmate on the input BAM'}

Inputs

Required

_runtime (Any, required)
bam (File, required); description: Input BAM format file to add mate information to. Must be name-sorted or name-collated.; stream: true

Defaults

add_cigar (Boolean, default=true); description: Add template cigar ct tag; tool_default: false; common: true
add_mate_score (Boolean, default=true); description: Add mate score tags. These are used by markdup to select the best reads to keep.; tool_default: false; common: true
disable_flag_sanitization (Boolean, default=false): Disable all flag sanitization?
disable_proper_pair_check (Boolean, default=false): Disable proper pair check [ensure one forward and one reverse read in each pair]
extension (String, default=".bam"); description: File format extension to use for output file.; choices: ['.bam', '.cram']; common: true
modify_disk_size_gb (Int, default=0): Add to or subtract from dynamic disk space allocation. Default disk size is determined by the size of the inputs. Specified in GB.
ncpu (Int, default=2); description: Number of cores to allocate for task; common: true
prefix (String, default=basename(bam,".bam") + ".fixmate"): Prefix for the output file. The extension specified with the extension parameter will be added.
remove_unaligned_and_secondary (Boolean, default=false): Remove unmapped and secondary reads
use_all_cores (Boolean, default=false); description: Use all cores? Recommended for cloud environments.; common: true

Outputs

fixmate_bam (File)

position_sorted_fixmate

description: Runs samtools fixmate on the position-sorted input BAM file and outputs a position-sorted BAM. fixmate fills in mate coordinates and insert size fields, among other tags and fields. samtools fixmate assumes a name-sorted or name-collated input BAM. If you already have a collated BAM, please use the fixmate task. This task collates the input BAM, runs fixmate, and then re-sorts the output into a position-sorted BAM.
outputs: {'fixmate_bam': 'BAM file with mate information added'}

Inputs

Required

_runtime (Any, required)
bam (File, required): Input BAM format file to add mate information to. Must be position-sorted.

Defaults

add_cigar (Boolean, default=true); description: Add template cigar ct tag; tool_default: false; common: true
add_mate_score (Boolean, default=true); description: Add mate score tags. These are used by markdup to select the best reads to keep.; tool_default: false; common: true
disable_flag_sanitization (Boolean, default=false): Disable all flag sanitization?
disable_proper_pair_check (Boolean, default=false): Disable proper pair check [ensure one forward and one reverse read in each pair]?
fast_mode (Boolean, default=false); description: Use fast mode (output primary alignments only)?; common: true
modify_disk_size_gb (Int, default=0): Add to or subtract from dynamic disk space allocation. Default disk size is determined by the size of the inputs. Specified in GB.
modify_memory_gb (Int, default=0): Add to or subtract from dynamic memory allocation. Default memory is determined by the size of the inputs. Specified in GB.
ncpu (Int, default=2); description: Number of cores to allocate for task; common: true
prefix (String, default=basename(bam,".bam") + ".fixmate"): Prefix for the output file. The extension .bam will be added.
remove_unaligned_and_secondary (Boolean, default=false): Remove unmapped and secondary reads
use_all_cores (Boolean, default=false); description: Use all cores? Recommended for cloud environments.; common: true

Outputs

fixmate_bam (File)

markdup

description: [DEPRECATED] Runs samtools markdup on the position-sorted input BAM file. This creates a report and optionally a new BAM with duplicate reads marked.
help: This task assumes samtools fixmate has already been run on the input BAM. If it has not, then the output may be incorrect. A name-sorted or collated BAM can be run through the fixmate task (and then position-sorted prior to this task) or a position-sorted BAM can be run through the position_sorted_fixmate task. Deprecated due to extremely high memory usage for certain RNA-Seq samples when searching for optical duplicates. Use mark_duplicates in ./picard.wdl instead.
deprecated: true

Inputs

Required

_runtime (Any, required)
bam (File, required): Input BAM format file to mark duplicates in

Defaults

coordinates_order (String, default="txy"); description: The order of the elements captured in the read_coords_regex regular expression. Default is txy where t is a part of the read name selected for string comparison and x/y are the coordinates used for optical duplicate detection. Ignored if optical_distance == 0.; choices: ['txy', 'tyx', 'xyt', 'yxt', 'xty', 'ytx', 'xy', 'yx']
create_bam (Boolean, default=true): Create a new BAM with duplicate reads marked? If false, then only a markdup report will be generated.
duplicate_count (Boolean, default=false): Record the original primary read duplication count (include itself) in a dc tag? Ignored if create_bam == false.
duplicates_of_duplicates_check (Boolean, default=false): Check duplicates of duplicates for correctness? Performs further checks to make sure all optical duplicates are found. Also operates on mark_duplicates_with_do_tag tagging where reads may be tagged with the best quality read. Disabling this option can speed up duplicate marking when there are a great many duplicates for each original read. Ignored if create_bam == false or optical_distance == 0.
include_qc_fails (Boolean, default=false): Include reads that have the QC-failed flag set in duplicate marking? This can increase the number of duplicates found. Ignored if create_bam == false.
json (Boolean, default=false): Output a JSON report instead of a text report? Either format is parseable by MultiQC.
mark_duplicates_with_do_tag (Boolean, default=false): Mark duplicates with the do (duplicate original) tag? The do tag contains the name of the "original" read that was duplicated. Ignored if create_bam == false.
mark_supp_or_sec_or_unmapped_as_duplicates (Boolean, default=false): Mark supplementary, secondary, or unmapped alignments of duplicates as duplicates? As this takes a quick second pass over the data it will increase running time. Ignored if create_bam == false.
max_readlen (Int, default=300): Expected maximum read length.
modify_disk_size_gb (Int, default=0): Add to or subtract from dynamic disk space allocation. Default disk size is determined by the size of the inputs. Specified in GB.
modify_memory_gb (Int, default=0): Add to or subtract from dynamic memory allocation. Default memory is determined by the size of the inputs. Specified in GB.
ncpu (Int, default=2); description: Number of cores to allocate for task; common: true
optical_distance (Int, default=0): Maximum distance between read coordinates to consider them optical duplicates. If 0, then optical duplicate marking is disabled. Suggested settings of 100 for HiSeq style platforms or about 2500 for NovaSeq ones. When set above 0, duplicate reads are tagged with dt:Z:SQ for optical duplicates and dt:Z:LB otherwise. Calculation of distance depends on coordinate data embedded in the read names, typically produced by the Illumina sequencing machines. Optical duplicate detection will not work on non-standard names without modifying read_coords_regex. If changing read_coords_regex, make sure that coordinates_order matches.
prefix (String, default=basename(bam,".bam") + ".markdup"): Prefix for the output file.
read_coords_regex (String, default="[!-9;-?A-~:]+:([!-9;-?A-~]+):([0-9]+):([0-9]+)"); description: Regular expression to extract read coordinates from the QNAME field. This takes a POSIX regular expression for at least x and y to be used in optical duplicate marking. It can also include another part of the read name to test for equality, e.g. lane:tile elements. Elements wanted are captured with parentheses. The default is meant to capture information from Illumina style read names. Ignored if optical_distance == 0. If changing read_coords_regex, make sure that coordinates_order matches.; tool_default: ([!-9;-?A-~]+:[0-9]+:[0-9]+:[0-9]+:[0-9]+):([0-9]+):([0-9]+)
remove_duplicates (Boolean, default=false): Remove duplicates from the output BAM? Ignored if create_bam == false.
use_all_cores (Boolean, default=false); description: Use all cores? Recommended for cloud environments.; common: true
use_read_groups (Boolean, default=false): Only mark duplicates within the same Read Group? Ignored if create_bam == false.

Outputs

markdup_report (File)
markdup_bam (File?)
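
To show how read_coords_regex and coordinates_order fit together, the Python sketch below applies the documented default pattern to an Illumina-style read name and labels the captured groups according to the order string. Python's re module is used here as a stand-in for the POSIX regex engine that samtools uses, so edge-case matching behaviour may differ. The function name extract_coords and the example read name are hypothetical.

```python
import re

# Default value of read_coords_regex as documented above.
READ_COORDS_REGEX = r"[!-9;-?A-~:]+:([!-9;-?A-~]+):([0-9]+):([0-9]+)"


def extract_coords(qname, order="txy", pattern=READ_COORDS_REGEX):
    """Label the captured groups with the letters of coordinates_order.

    Returns a dict such as {"t": ..., "x": ..., "y": ...}, or None when the
    read name does not match the pattern.
    """
    match = re.match(pattern, qname)
    if match is None:
        return None
    groups = match.groups()
    if len(groups) != len(order):
        raise ValueError("coordinates_order must name every captured group")
    return dict(zip(order, groups))


# Hypothetical Illumina-style read name: ...:lane:tile:x:y
example = extract_coords("A00123:45:HXXXXXXXX:1:1101:1000:2000")
```

For this read name the default pattern captures the tile as t and the last two numeric fields as x and y. A custom pattern with only two groups would need an order of xy or yx, which is why the two inputs must be changed together.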

faidx

description: Creates a .fai FASTA index for the input FASTA
outputs: {'fasta_index': "A .fai FASTA index associated with the input FASTA. Filename will be basename(fasta) + '.fai'."}

Inputs

Required

_runtime (Any, required)
fasta (File, required): Input FASTA format file to index. Optionally gzip compressed.

Defaults

modify_disk_size_gb (Int, default=0): Add to or subtract from dynamic disk space allocation. Default disk size is determined by the size of the inputs. Specified in GB.

Outputs

fasta_index (File)