Command-line usage
im-fusion build
Description
im-fusion build is used to create the augmented reference genome that is required to identify gene-transposon fusions using Tophat2.
Arguments
--reference_seq  Path to the reference sequence (in Fasta format).
--reference_gtf  Path to the reference gtf file. (Expected to conform to Ensembl's GTF file structure.)
--transposon_seq  Path to the transposon sequence (in Fasta format). This file should typically contain a single sequence.
--output  Output path for the augmented reference.
--blacklist_regions  Regions of the reference to blacklist. Should be specified as ‘chromosome:start-end’.
--blacklist_genes  Genes to blacklist. Should be specified as gene IDs (Ensembl IDs for Ensembl gtfs).
--no_index  Suppresses building of the bowtie index.
--no_transcriptome_index  Suppresses building of the transcriptome index.
--force_overwrite  Overwrite any existing files.
Examples
Basic:
im-fusion build --reference_seq ./reference.fa \
--reference_gtf ./reference.gtf \
--transposon_seq ./transposon.fa \
--output ./augmented.fa
With blacklisted region:
im-fusion build --reference_seq ./reference.fa \
--reference_gtf ./reference.gtf \
--transposon_seq ./transposon.fa \
--blacklist_regions 13:31623816-31633406 \
--output ./augmented.fa
With blacklisted gene:
im-fusion build --reference_seq reference.fa \
--reference_gtf reference.gtf \
--transposon_seq transposon.fa \
--blacklist_genes ENSMUSG00000039095 \
--output ./augmented.fa
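The ‘chromosome:start-end’ format expected by --blacklist_regions can be checked before calling im-fusion build. The following is a minimal sketch; the parse_region helper and the Region type are hypothetical names used for illustration and are not part of im-fusion.

```python
import re
from typing import NamedTuple


class Region(NamedTuple):
    """A genomic region as written in 'chromosome:start-end' notation."""
    chromosome: str
    start: int
    end: int


# Hypothetical helper pattern (not part of im-fusion): a chromosome name
# without colons, followed by two integer coordinates.
_REGION_RE = re.compile(r"^(?P<chrom>[^:]+):(?P<start>\d+)-(?P<end>\d+)$")


def parse_region(text: str) -> Region:
    """Parse a 'chromosome:start-end' string into its components."""
    match = _REGION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"Expected 'chromosome:start-end', got {text!r}")
    start, end = int(match.group("start")), int(match.group("end"))
    if start > end:
        raise ValueError(f"Region start exceeds end in {text!r}")
    return Region(match.group("chrom"), start, end)


# The region used in the blacklist example above.
region = parse_region("13:31623816-31633406")
print(region)
```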
im-fusion insertions
Description
im-fusion insertions identifies insertions in a sample by using Tophat2 to identify gene-transposon fusions in RNA-sequencing data. To do so, the command first calls Tophat2 to align reads to the augmented reference genome and to identify gene fusions. It then parses the fusions identified by Tophat (from Tophat’s fusions.out output file) and selects fusions between the endogenous reference and the transposon sequence. These gene-transposon fusions are annotated to identify the involved gene(s) and transposon feature(s) and converted to insertions with an (approximate) genomic location. Insertions are written to a tab-separated file (insertions.txt) in the output directory.
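The selection step described above can be illustrated with a small sketch. It assumes a simplified in-memory fusion record (the Fusion class and its fields are hypothetical, not Tophat's file format or im-fusion's internal representation) and keeps only fusions in which exactly one side lies on the transposon sequence and the read support reaches the --min_support default of 2.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Fusion:
    """Simplified, hypothetical fusion record used for illustration only."""
    seqname_a: str  # sequence on one side of the fusion
    seqname_b: str  # sequence on the other side of the fusion
    support: int    # number of reads supporting the fusion


def select_transposon_fusions(fusions: Iterable[Fusion],
                              transposon_name: str,
                              min_support: int = 2) -> List[Fusion]:
    """Keep fusions joining the endogenous reference to the transposon."""
    selected = []
    for fusion in fusions:
        on_transposon = (fusion.seqname_a == transposon_name,
                         fusion.seqname_b == transposon_name)
        # Exactly one side must be the transposon: this drops both
        # gene-gene fusions and transposon-internal fusions.
        if sum(on_transposon) == 1 and fusion.support >= min_support:
            selected.append(fusion)
    return selected


candidates = [
    Fusion("1", "T2Onc2", support=5),       # gene-transposon, kept
    Fusion("1", "2", support=8),            # gene-gene, dropped
    Fusion("T2Onc2", "T2Onc2", support=9),  # within transposon, dropped
    Fusion("T2Onc2", "7", support=1),       # too little support, dropped
]
selected = select_transposon_fusions(candidates, "T2Onc2")
print(selected)
```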
Arguments
Basic arguments
--fastq  Path(s) to the sample's fastq files.
--fastq2  Path(s) to the second-pair fastq files (for paired-end sequencing data). Should be given in the same order as for fastq.
--reference_index  Path to the index of the augmented reference generated by im-fusion build.
--reference_gtf  Path to the reference gtf file. Typically the same file as used in im-fusion build.
--transposon_name  Name of the transposon sequence in the augmented reference. Should match the name used for the transposon sequence in the transposon.fa file that was used in im-fusion build.
--transposon_features  Path to a tab-separated file describing the transposon features (splice acceptors/donors etc.). The file should contain the columns name, start, end, strand and type. Start, end and strand define the position and orientation of each feature. The type column indicates whether a feature represents a splice acceptor or donor and should contain only SA, SD or empty values. See the data directory on Github for examples.
--output_dir  The sample's output directory.
--sample_id  Sample id to use for the given sample. Defaults to the basename of the output directory.
Filtering options
--blacklist  IDs of genes for which insertions should be filtered (typically genes that share homologous sequences with the transposon).
--min_flank  Minimum required size of aligned sequences on either side of the fusion (default = 20).
--min_support  Minimum number of reads that should support any identified fusions (default = 2).
Tophat options
--transcriptome_index  Path to the transcriptome index. Only needs to be provided if the transcriptome index location differs from the default relative path used by im-fusion build.
--tophat_args  String with extra command-line arguments that should be passed to Tophat2 for the alignment.
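As an illustration of the file expected by --transposon_features, the sketch below writes a tab-separated file with the columns name, start, end, strand and type. The feature names and coordinates are placeholders, and encoding the strand as 1/-1 is an assumption; the example files in the data directory on Github remain the authoritative reference.

```python
import csv
import tempfile
from pathlib import Path

COLUMNS = ["name", "start", "end", "strand", "type"]
ALLOWED_TYPES = {"SA", "SD", ""}  # splice acceptor, splice donor, or empty

# Hypothetical features: names and coordinates are placeholders only,
# and the 1/-1 strand encoding is an assumption.
FEATURES = [
    {"name": "SA_example", "start": 100, "end": 300, "strand": -1, "type": "SA"},
    {"name": "promoter_example", "start": 400, "end": 900, "strand": 1, "type": ""},
    {"name": "SD_example", "start": 900, "end": 1000, "strand": 1, "type": "SD"},
]


def write_features(path, features):
    """Validate the features and write them as a tab-separated file."""
    for feature in features:
        if feature["type"] not in ALLOWED_TYPES:
            raise ValueError(f"Unexpected type for {feature['name']!r}")
        if feature["start"] > feature["end"]:
            raise ValueError(f"Start exceeds end for {feature['name']!r}")
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=COLUMNS,
                                delimiter="\t", lineterminator="\n")
        writer.writeheader()
        writer.writerows(features)


with tempfile.TemporaryDirectory() as tmp_dir:
    features_path = Path(tmp_dir) / "example.features.txt"
    write_features(features_path, FEATURES)
    content = features_path.read_text()

print(content)
```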
Examples
For single-end data:
im-fusion insertions --fastq s1.R1.fastq.gz \
--reference_index augmented \
--reference_gtf reference.gtf \
--transposon_name T2Onc2 \
--transposon_features T2Onc2.features.txt \
--output_dir ./output/s1 \
--sample_id s1
For paired-end data add the fastq2 argument:
im-fusion insertions --fastq s1.R1.fastq.gz \
--fastq2 s1.R2.fastq.gz \
--reference_index augmented \
--reference_gtf reference.gtf \
--transposon_name T2Onc2 \
--transposon_features T2Onc2.features.txt \
--output_dir ./output/s1 \
--sample_id s1
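When many samples need to be processed, the command line can be assembled programmatically. The sketch below uses only the flags listed under Arguments; the build_insertions_cmd helper, the sample names and the file-naming pattern are hypothetical. It only prints the commands; passing each list to subprocess.run(cmd, check=True) would execute them.

```python
import shlex
from typing import List, Optional


def build_insertions_cmd(sample_id: str,
                         fastq: str,
                         output_dir: str,
                         fastq2: Optional[str] = None,
                         reference_index: str = "augmented",
                         reference_gtf: str = "reference.gtf",
                         transposon_name: str = "T2Onc2",
                         transposon_features: str = "T2Onc2.features.txt"
                         ) -> List[str]:
    """Assemble an 'im-fusion insertions' call as an argument list."""
    cmd = ["im-fusion", "insertions", "--fastq", fastq]
    if fastq2 is not None:
        # Only added for paired-end data.
        cmd += ["--fastq2", fastq2]
    cmd += [
        "--reference_index", reference_index,
        "--reference_gtf", reference_gtf,
        "--transposon_name", transposon_name,
        "--transposon_features", transposon_features,
        "--output_dir", output_dir,
        "--sample_id", sample_id,
    ]
    return cmd


# Hypothetical paired-end samples following the naming used above.
commands = [
    build_insertions_cmd(
        sample_id=sample,
        fastq=f"{sample}.R1.fastq.gz",
        fastq2=f"{sample}.R2.fastq.gz",
        output_dir=f"./output/{sample}",
    )
    for sample in ["s1", "s2"]
]

for cmd in commands:
    print(" ".join(shlex.quote(arg) for arg in cmd))
```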
im-fusion expression
Description
im-fusion expression generates exon expression counts for an individual sample using the alignment generated by im-fusion insertions. The counts are generated using the featureCounts tool, which must be available in PATH. The generated counts are written to the sample's output directory as the tab-separated file ‘exon_counts.txt’.
Arguments
--sample_dir  Path to the sample directory. Effectively the same as output_dir in im-fusion insertions.
--exon_gtf  Path to the exon gtf file, which contains a flattened representation of the exons in the previously used reference gtf.
--sample_id  Sample id to use in the generated counts file. Should be the same sample id as used by im-fusion insertions. Defaults to the name of the input directory.
--paired  Generate counts by counting fragments instead of reads (for paired-end data).
--stranded  Perform strand-specific read counting. Possible values: 0 (unstranded), 1 (stranded) and 2 (reversely stranded). 0 by default.
--threads  Number of threads to use in featureCounts.
--extra_kwargs  Extra command-line options to pass to featureCounts.
The easiest way to generate the required exon gtf file is to generate it from the previously used reference.gtf file using the dexseq_prepare_annotation.py script from DEXSeq. After extracting the script from the DEXSeq package, the exon gtf can be generated using the following command:
python dexseq_prepare_annotation.py --aggr no reference.gtf exons.gtf
Examples
For single-end data:
im-fusion expression --sample_dir ./output/s1 \
--exon_gtf exons.gtf \
--sample_id s1
For paired-end data:
im-fusion expression --sample_dir ./output/s1 \
--exon_gtf exons.gtf \
--sample_id s1 \
--paired
im-fusion merge
Description
im-fusion merge merges individual samples into a combined dataset that can be used in the CTG analysis. The command effectively concatenates the individual results into a combined insertions.txt file and a combined exon_counts.txt file.
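The concatenation described above can be sketched for the insertions files as follows. This is a rough illustration rather than the command's actual implementation: it assumes each sample sub-directory contains an insertions.txt with an identical header, the merge_tables helper is hypothetical, and the column names in the demo data are placeholders. How the expression counts are combined is not shown.

```python
import csv
import tempfile
from pathlib import Path


def merge_tables(base_dir, file_name, output_path):
    """Concatenate <base_dir>/<sample>/<file_name>, keeping one header.

    Returns the number of data rows written.
    """
    header = None
    n_rows = 0
    with open(output_path, "w", newline="") as out_handle:
        writer = csv.writer(out_handle, delimiter="\t", lineterminator="\n")
        for sample_path in sorted(Path(base_dir).glob(f"*/{file_name}")):
            with open(sample_path, newline="") as in_handle:
                reader = csv.reader(in_handle, delimiter="\t")
                sample_header = next(reader, None)
                if sample_header is None:
                    continue  # skip empty files
                if header is None:
                    header = sample_header
                    writer.writerow(header)
                elif sample_header != header:
                    raise ValueError(f"Header mismatch in {sample_path}")
                for row in reader:
                    writer.writerow(row)
                    n_rows += 1
    return n_rows


# Demo with two hypothetical samples; the columns are placeholders.
with tempfile.TemporaryDirectory() as tmp_dir:
    base_dir = Path(tmp_dir)
    for sample_id, position in [("s1", 100), ("s2", 250)]:
        sample_dir = base_dir / sample_id
        sample_dir.mkdir()
        (sample_dir / "insertions.txt").write_text(
            f"id\tseqname\tposition\n{sample_id}.INS_1\t1\t{position}\n")
    merged_path = base_dir / "merged.insertions.txt"
    n_rows = merge_tables(base_dir, "insertions.txt", merged_path)
    merged_lines = merged_path.read_text().splitlines()

print(n_rows, merged_lines)
```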
Arguments
--base_dir  Path to a base directory that contains outputs for individual samples as sub-directories.
--output_base  Base name of the merged output files.
--samples  IDs of samples to subset the output to.
Examples
Without subsetting:
im-fusion merge --base_dir ./output \
--output_base ./output/merged
With subsetting:
im-fusion merge --base_dir ./output \
--output_base ./output/merged \
--samples s1 s2
im-fusion ctg
Description
im-fusion ctg uses the combined insertions/expression dataset to identify genes that are recurrently mutated across samples AND differentially expressed by their insertions. To do so, the command performs two distinct significance tests. The first test compares the number of insertions within a gene to what would be expected by chance (modeled using a Poisson distribution). Genes with significant p-values are selected as Commonly Targeted Genes (CTGs).
The second test is a differential expression test that compares the ratio of expression before/after insertions in a CTG between samples with/without insertions in the CTG. A CTG is considered to be differentially expressed if samples with an insertion in the gene show a significant increase/decrease in expression after the insertion site compared to samples without an insertion. Genes that pass both tests are written to an output file.
Note that the differential expression test can be omitted by not providing the expression file as an argument. In this case, only the CTG test is performed.
For more details on the implementation of the tests, please see our publication.
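To make the first test more concrete, the sketch below computes a one-sided Poisson p-value for the number of insertions observed in a gene window. It is an illustration under stated assumptions, not the published implementation: the expected count is assumed to scale with the fraction of integration sites (matches of the --pattern expression) that fall inside the window, overlapping matches are counted, and no multiple-testing correction is applied. All function names are hypothetical.

```python
import math
import re


def count_sites(sequence: str, pattern: str = "(AT|TA)") -> int:
    """Count matches of the integration pattern, including overlaps."""
    # A zero-width lookahead lets matches start at adjacent positions.
    lookahead = re.compile(f"(?=(?:{pattern}))", re.IGNORECASE)
    return sum(1 for _ in lookahead.finditer(sequence))


def poisson_sf(k: int, lam: float) -> float:
    """Return P(X >= k) for X ~ Poisson(lam)."""
    if k <= 0:
        return 1.0
    cdf = sum(math.exp(-lam) * lam ** i / math.factorial(i)
              for i in range(k))
    return max(0.0, 1.0 - cdf)


def ctg_p_value(n_in_window: int, n_total: int,
                sites_in_window: int, sites_in_genome: int) -> float:
    """P-value for seeing at least n_in_window insertions by chance."""
    # Assumption: insertions are spread uniformly over integration sites.
    expected = n_total * sites_in_window / sites_in_genome
    return poisson_sf(n_in_window, expected)


# Toy numbers only: 200 insertions in total, 12 of them in a window
# that holds 0.5% of all integration sites (expected count = 1).
p_value = ctg_p_value(n_in_window=12, n_total=200,
                      sites_in_window=5_000, sites_in_genome=1_000_000)
print(count_sites("TATA"), p_value)
```

A gene would then be called a CTG if its p-value, after correction for multiple testing, falls below the value given by --threshold.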
Arguments
Basic arguments
--insertions  Path to the merged insertions file from im-fusion merge.
--reference_seq  Path to the reference genome sequence (in fasta format). Can either be the augmented reference genome or the original reference.
--reference_gtf  Path to the reference gtf file. Typically the same file as used in im-fusion build.
--output  Path for the output CTG file.
--threshold  Corrected p-value threshold below which genes are selected as CTGs (default = 0.05).
--pattern  Regular expression reflecting the nucleotide sequence at which the used transposon typically integrates (if any). Used to correct for sequence integration biases along the genome. For example, the pattern (AT|TA) is used for the T2Onc2 transposon, which is biased towards integrations at TA sites.
--window  Window around the gene within which we test a given gene for enrichment in insertions.
Insertion selection
--chromosomes  Chromosomes to consider. Used to omit specific chromosomes from the CTG analysis.
--min_depth  Minimum number of supporting reads for insertions to be included in the CTG analysis. Can be used to omit insertions with low support for more confidence in the analysis.
Differential expression
--expression  Path to the merged expression file from im-fusion merge.
--exon_gtf  Path to the exon gtf file. Typically the same file as used in im-fusion expression.
--de_threshold  P-value threshold below which a CTG is considered differentially expressed.
Examples
With differential expression:
im-fusion ctg --insertions ./merged.insertions.txt \
--expression ./merged.exon_counts.txt \
--reference_seq ./reference.fa \
--reference_gtf ./reference.gtf \
--exon_gtf ./exons.gtf \
--output ctgs.txt
Without differential expression:
im-fusion ctg --insertions ./merged.insertions.txt \
--reference_seq ./reference.fa \
--reference_gtf ./reference.gtf \
--output ctgs.txt
With non-default significance thresholds:
im-fusion ctg --insertions ./merged.insertions.txt \
--expression ./merged.exon_counts.txt \
--reference_seq ./reference.fa \
--reference_gtf ./reference.gtf \
--exon_gtf ./exons.gtf \
--output ctgs.txt \
--threshold 0.01 \
--de_threshold 0.1