arriba

description
Run Arriba structural variant caller on an RNA-Seq BAM file.
help
Typical input is a STAR-aligned BAM. Arriba also supports DRAGEN-aligned BAMs and any spec-compliant BAM; that is, discordant mates must have BAM_FPROPER_PAIR (0x2), split reads must have BAM_FSUPPLEMENTARY (0x800), and the anchor read must have an SA tag. Arriba also uses the HI tag to group supplementary alignments.
outputs
{'fusions': 'Output file of fusions in TSV format', 'discarded_fusions': 'Output file of discarded fusions in TSV format'}
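
The help text above refers to two SAM flag bits by name. As a minimal sketch (not part of the task itself), those bits can be tested on a read's FLAG value with plain bitwise operations. The helper name describe_flag is hypothetical; only the two bit values come from the help text.

```python
# Flag bit values as named in the help text (defined by the SAM specification).
BAM_FPROPER_PAIR = 0x2
BAM_FSUPPLEMENTARY = 0x800


def describe_flag(flag: int) -> dict:
    """Report whether the two flag bits mentioned above are set (hypothetical helper)."""
    return {
        "proper_pair": bool(flag & BAM_FPROPER_PAIR),
        "supplementary": bool(flag & BAM_FSUPPLEMENTARY),
    }


if __name__ == "__main__":
    # 2051 == 0x803: the paired, proper-pair, and supplementary bits are set.
    print(describe_flag(2051))
    # 65 == 0x41: only the paired and first-in-pair bits are set.
    print(describe_flag(65))
```

Checking the SA and HI tags requires a BAM parser and is left out of this sketch.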

Inputs

Required

  • _runtime (Any, required)
  • bam (File, required): Input BAM format file from which to call fusions
  • gtf (File, required): GTF features file. Gzipped or uncompressed.
  • reference_fasta_gz (File, required): Gzipped reference genome in FASTA format

Optional

  • annotate_fusions (File?); description: Optional input file in tab delimited format of fusions to annotate with tags; external_help: https://arriba.readthedocs.io/en/v2.4.0/input-files/#tags
  • chimeric_sam (File?): Optional input file of chimeric reads in SAM format, from older versions of STAR
  • exclude_list (File?); description: Optional input file of regions to exclude from analysis in tab delimited format; external_help: https://arriba.readthedocs.io/en/v2.4.0/input-files/#blacklist
  • known_fusions (File?); description: Optional input file of known fusions in tab delimited format; external_help: https://arriba.readthedocs.io/en/v2.4.0/input-files/#known-fusions
  • protein_domains (File?); description: Optional input file of protein domains coordinates in GFF3 format; external_help: https://arriba.readthedocs.io/en/v2.4.0/input-files/#protein-domains
  • wgs_svs (File?); description: Optional input file of structural variants found by WGS in tab delimited or VCF format; external_help: https://arriba.readthedocs.io/en/v2.4.0/input-files/#structural-variant-calls-from-wgs

Defaults

  • coverage_fraction (Float, default=0.05): Minimum fraction of viral contig transcription.
  • disable_filters (Array[String], default=[]); description: Array of filters to disable.; choices: ['top_expressed_viral_contigs', 'viral_contigs', 'low_coverage_viral_contigs', 'uninteresting_contigs', 'no_genomic_support', 'short_anchor', 'select_best', 'many_spliced', 'long_gap', 'merge_adjacent', 'hairpin', 'small_insert_size', 'same_gene', 'genomic_support', 'read_through', 'no_coverage', 'mismatches', 'homopolymer', 'low_entropy', 'multimappers', 'inconsistently_clipped', 'duplicates', 'homologs', 'blacklist', 'mismappers', 'spliced', 'relative_support', 'min_support', 'known_fusions', 'end_to_end', 'non_coding_neighbors', 'isoforms', 'intronic', 'in_vitro', 'intragenic_exonic', 'internal_tandem_duplication']
  • exonic_fraction (Float, default=0.33): Minimum fraction of exonic sequence between breakpoints.
  • feature_name (String, default="gene_name=gene_name|gene_id,gene_id=gene_id,transcript_id=transcript_id,feature_exon=exon,feature_CDS=CDS"): The Arriba default is designed to handle RefSeq, GENCODE, or ENSEMBL format annotations. feature_name expects a string of space/comma separated options. The required fields are gene_name, gene_id, transcript_id, feature_exon, and feature_CDS. The fields should be space separated. The values should be provided with field=value. Multiple values can be provided and separated by a pipe (|), e.g. =value1|value2. A complete example is gene_name=gene_name|gene_id gene_id=gene_id transcript_id=transcript_id feature_exon=exon feature_CDS=CDS.; description: List of feature names to use in GTF.; external_help: https://arriba.readthedocs.io/en/v2.4.0/command-line-options/; common: false
  • fill_gaps (Boolean, default=false): Fill gaps in assembled transcripts with reference bases. Expands the fusion sequence to the complete sequence of the fusion gene.
  • fragment_length (Int, default=200): For single-end data, this is the fragment length. With paired-end reads, this is ignored and determined automatically.
  • homopolymer_length (Int, default=6): Maximum homopolymer length adjacent to breakpoints.
  • interesting_contigs (Array[String], default=["1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19", "20", "21", "22", "X", "Y", "AC_", "NC_"]): Array of contigs to consider for analysis. Contigs can be specified with or without the prefix chr.
  • many_spliced_events (Int, default=4): Recover fusions with at least this many spliced breakpoints.
  • mark_duplicates (Boolean, default=true): Arriba performs marking of duplicates internally based on identical mapping coordinates. When this switch is set, internal marking of duplicates is disabled and Arriba assumes that duplicates have been marked by a preceding program. In this case, Arriba only discards alignments flagged with the BAM_FDUP flag. This makes sense when duplicates cannot be reliably identified solely based on their mapping coordinates, e.g. when unique molecular identifiers (UMIs) are used or when independently generated libraries are merged in a single BAM file and the read group must be interrogated to distinguish duplicates from reads that map to the same coordinates by chance. In addition, when this switch is set, duplicate reads are not considered for the calculation of the coverage at fusion breakpoints (columns coverage1 and coverage2 in the output file).; description: Mark duplicates in the input BAM file with Arriba.
  • max_e_value (Float, default=0.3): Maximum E-value for read support.
  • max_genomic_breakpoint_distance (Int, default=1000000): With 'wgs_svs', threshold for relating genomic and transcriptomic events.
  • max_homolog_identity (Float, default=0.3): Maximum fraction of homologous sequence for genes.
  • max_itd_length (Int, default=100): Maximum length of internal tandem duplications.
  • max_kmer_content (Float, default=0.6): Maximum fraction of repetitive 3-mer content in the fusion region.
  • max_mismappers (Float, default=0.8): Maximum fraction of mismapped reads in the fusion region.
  • max_mismatch_pvalue (Float, default=0.01): Maximum p-value for mismatches in the fusion region.
  • max_reads (Int, default=300): Subsample fusions with more than this number of reads.
  • min_anchor_length (Int, default=23): Minimum anchor length for split reads.
  • min_itd_allele_fraction (Float, default=0.07): Minimum supporting read fraction for internal tandem duplications.
  • min_itd_supporting_reads (Int, default=10): Minimum number of supporting reads for internal tandem duplications.
  • min_supporting_reads (Int, default=2): Minimum number of supporting reads for a fusion.
  • modify_disk_size_gb (Int, default=0): Add to or subtract from dynamic disk space allocation. Default disk size is determined by the size of the inputs. Specified in GB.
  • modify_memory_gb (Int, default=0): Add to or subtract from dynamic memory allocation. Default memory is determined by the size of the inputs. Specified in GB.
  • prefix (String, default=basename(bam,".bam") + ".fusions"): Prefix for the fusion result files. The extensions .tsv and .discarded.tsv will be added.
  • quantile (Float, default=0.998): Genes with expression above the given quantile are eligible for filtering.
  • read_through_distance (Int, default=10000): Minimum distance between breakpoints for read-through events.
  • report_additional_columns (Boolean, default=false): Report additional columns ['fusion_transcript', 'peptide_sequence', 'read_identifiers'] in the discarded fusions file.
  • strandedness (String, default="auto"); description: Strandedness of the input data.; external_help: https://arriba.readthedocs.io/en/v2.4.0/command-line-options/; choices: ['auto', 'yes', 'no', 'reverse']
  • top_n (Int, default=5): Only report the top N most highly expressed viral integration sites.
  • viral_contigs (Array[String], default=["AC_", "NC_"]): Array of contigs to consider for viral integration site analysis.
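
The feature_name format described in the list above (field=value pairs, with alternative values separated by a pipe) can be checked before a run is submitted. The following is a minimal sketch, not Arriba's own parser. The function name parse_feature_name is hypothetical, and accepting both commas and whitespace as separators is an assumption based on the parameter description.

```python
import re

# Required fields, as listed in the feature_name description.
REQUIRED_FIELDS = ("gene_name", "gene_id", "transcript_id", "feature_exon", "feature_CDS")


def parse_feature_name(spec: str) -> dict:
    """Split a feature_name string into {field: [value, ...]} (hypothetical helper)."""
    fields = {}
    # Assumption: options may be separated by commas and/or whitespace.
    for item in re.split(r"[,\s]+", spec.strip()):
        if not item:
            continue
        key, sep, value = item.partition("=")
        if not sep or not key or not value:
            raise ValueError(f"expected field=value, got {item!r}")
        # Alternative values for one field are separated by a pipe.
        fields[key] = value.split("|")
    missing = [name for name in REQUIRED_FIELDS if name not in fields]
    if missing:
        raise ValueError("missing required fields: " + ", ".join(missing))
    return fields


if __name__ == "__main__":
    default = (
        "gene_name=gene_name|gene_id,gene_id=gene_id,"
        "transcript_id=transcript_id,feature_exon=exon,feature_CDS=CDS"
    )
    for field, values in parse_feature_name(default).items():
        print(field, values)
```

A string that omits one of the five required fields, or contains an item without an equals sign, raises ValueError in this sketch.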

Outputs

  • fusions (File)
  • discarded_fusions (File)
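
Given the prefix default and the extensions listed for it, the names of the two output files can be predicted from the BAM path. This sketch assumes the task writes exactly prefix + ".tsv" and prefix + ".discarded.tsv", as the prefix description states. wdl_basename is a hypothetical stand-in for the WDL basename(path, suffix) function, and arriba_output_names is likewise made up for illustration.

```python
import posixpath
from typing import Optional, Tuple


def wdl_basename(path: str, suffix: str = "") -> str:
    """Strip the directory and, if present, the given suffix (mimics WDL basename)."""
    name = posixpath.basename(path)
    if suffix and name.endswith(suffix):
        name = name[: -len(suffix)]
    return name


def arriba_output_names(bam: str, prefix: Optional[str] = None) -> Tuple[str, str]:
    """Return (fusions, discarded_fusions) file names for a given BAM path."""
    if prefix is None:
        # Default from the parameter list: basename(bam, ".bam") + ".fusions"
        prefix = wdl_basename(bam, ".bam") + ".fusions"
    return prefix + ".tsv", prefix + ".discarded.tsv"


if __name__ == "__main__":
    fusions, discarded = arriba_output_names("/data/sample1.bam")
    print(fusions)
    print(discarded)
```

Passing an explicit prefix overrides the default in the same way the prefix input would.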

arriba_tsv_to_vcf

description
Convert Arriba TSV format fusions to VCF format.
outputs
{'fusions_vcf': 'Output file of fusions in VCF format'}

Inputs

Required

  • _runtime (Any, required)
  • fusions (File, required): Input fusions in TSV format to convert to VCF
  • reference_fasta (File, required): Reference genome in FASTA format. Either gzipped or uncompressed.

Defaults

  • modify_disk_size_gb (Int, default=0): Add to or subtract from dynamic disk space allocation. Default disk size is determined by the size of the inputs. Specified in GB.
  • prefix (String, default=basename(fusions,".tsv")): Output file name for fusions in VCF format. The extension .vcf will be appended.

Outputs

  • fusions_vcf (File)

arriba_extract_fusion_supporting_alignments

description
Extract alignments that support fusions.
outputs
{'fusion_bams': 'Array of BAM files corresponding to fusions in the input file', 'fusion_bam_indexes': "Array of BAM indexes corresponding to the BAMs in 'fusion_bams'"}

Inputs

Required

  • _runtime (Any, required)
  • bam (File, required): Input BAM format file from which fusions were called
  • bam_index (File, required): BAM index file corresponding to the input BAM
  • fusions (File, required): Input fusions in TSV format for which to extract supporting alignments

Defaults

  • modify_disk_size_gb (Int, default=0): Add to or subtract from dynamic disk space allocation. Default disk size is determined by the size of the inputs. Specified in GB.
  • prefix (String, default=basename(fusions,".tsv")): Output file name prefix for the extracted BAM files. The extension .bam will be appended.

Outputs

  • fusion_bams (Array[File])
  • fusion_bam_indexes (Array[File])

arriba_annotate_exon_numbers

description
Annotate fusions with exon numbers.
outputs
{'fusion_tsv': 'TSV file with fusions annotated with exon numbers'}

Inputs

Required

  • _runtime (Any, required)
  • fusions (File, required): Input fusions in TSV format for which to annotate gene exon numbers
  • gtf (File, required): GTF features file. Gzipped or uncompressed.

Defaults

  • modify_disk_size_gb (Int, default=0): Add to or subtract from dynamic disk space allocation. Default disk size is determined by the size of the inputs. Specified in GB.
  • prefix (String, default=basename(fusions,".tsv")): Output file name for annotated fusions in TSV format. The extension .annotated.tsv will be appended.

Outputs

  • fusion_tsv (File)