Command-line usage

im-fusion build

Description

im-fusion build is used to create the augmented reference genome that is required to identify gene-transposon fusions using Tophat2.

Arguments

--reference_seq

Path to the reference sequence (in Fasta format).

--reference_gtf

Path to the reference gtf file. (Expected to conform to Ensembl's GTF file structure.)

--transposon_seq

Path to the transposon sequence (in Fasta format). This file should typically contain a single sequence.

--output

Output path for the augmented reference.

--blacklist_regions

Regions of the reference to blacklist. Should be specified as ‘chromosome:start-end’.

--blacklist_genes

Genes to blacklist. Should be specified as gene IDs (Ensembl IDs for Ensembl gtfs).

--no_index

Suppresses building of the bowtie index.

--no_transcriptome_index

Suppresses building of the transcriptome index.

--force_overwrite

Overwrite any existing files.
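The ‘chromosome:start-end’ notation expected by --blacklist_regions can be illustrated with a short parser. This is a minimal sketch rather than im-fusion's own code: the parse_region helper is hypothetical, and the rule that start must not exceed end is an assumption.

```python
def parse_region(region):
    """Split a 'chromosome:start-end' string into (chromosome, start, end)."""
    # Split on the last colon, then on the first hyphen of the coordinate part.
    chromosome, _, span = region.rpartition(':')
    start_str, sep, end_str = span.partition('-')
    if not chromosome or not sep:
        raise ValueError('expected chromosome:start-end, got %r' % region)
    start, end = int(start_str), int(end_str)
    if start > end:  # assumed sanity check, not a documented rule
        raise ValueError('start must not exceed end in %r' % region)
    return chromosome, start, end


print(parse_region('13:31623816-31633406'))
```

The region string is the one used in the blacklist example of this section.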

Examples

Basic:

im-fusion build --reference_seq ./reference.fa \
                --reference_gtf ./reference.gtf \
                --transposon_seq ./transposon.fa \
                --output ./augmented.fa

With blacklisted region:

im-fusion build --reference_seq ./reference.fa \
                --reference_gtf ./reference.gtf \
                --transposon_seq ./transposon.fa \
                --blacklist_regions 13:31623816-31633406 \
                --output ./augmented.fa

With blacklisted gene:

im-fusion build --reference_seq ./reference.fa \
                --reference_gtf ./reference.gtf \
                --transposon_seq ./transposon.fa \
                --blacklist_genes ENSMUSG00000039095 \
                --output ./augmented.fa

im-fusion insertions

Description

im-fusion insertions identifies insertions in a sample by using Tophat2 to identify gene-transposon fusions in RNA-sequencing data. To do so, the command first calls Tophat2 to align reads to the augmented reference genome and to identify gene fusions. It then parses the fusions identified by Tophat (from Tophat’s fusions.out output file) and selects fusions between the endogenous reference and the transposon sequence. These gene-transposon fusions are annotated to identify the involved gene(s) and transposon feature(s) and converted to insertions with an (approximate) genomic location. Insertions are written to a tab-separated file (insertions.txt) in the output directory.
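The selection step described above (keeping only fusions between the reference and the transposon) can be sketched as follows. The Fusion record and the select_transposon_fusions helper are hypothetical and heavily simplified; Tophat's actual output contains more fields, and im-fusion's own implementation may differ.

```python
from collections import namedtuple

# Hypothetical, simplified fusion record; not Tophat's actual file layout.
Fusion = namedtuple('Fusion', ['seq_a', 'pos_a', 'seq_b', 'pos_b', 'support'])


def select_transposon_fusions(fusions, transposon_name):
    """Keep fusions with exactly one side on the transposon sequence."""
    selected = []
    for fusion in fusions:
        a_on_transposon = fusion.seq_a == transposon_name
        b_on_transposon = fusion.seq_b == transposon_name
        # Exactly one side must be the transposon; the other is the reference.
        if a_on_transposon != b_on_transposon:
            selected.append(fusion)
    return selected


# Made-up records for illustration only.
fusions = [
    Fusion('13', 31625000, 'T2Onc2', 1541, 12),  # reference-transposon fusion
    Fusion('2', 500000, '7', 900000, 4),         # fusion between chromosomes
    Fusion('T2Onc2', 100, 'T2Onc2', 900, 3),     # transposon-internal event
]
for fusion in select_transposon_fusions(fusions, 'T2Onc2'):
    print(fusion)
```

Only the first record is kept, because it is the only one that joins a reference chromosome to the sequence named by --transposon_name.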

Arguments

Basic arguments

--fastq

Path(s) to the sample's fastq files.

--fastq2

Path(s) to the second-mate fastq files (for paired-end sequencing data). Should be given in the same order as for --fastq.

--reference_index

Path to the index of the augmented reference generated by im-fusion build.

--reference_gtf

Path to the reference gtf file. Typically the same file as used in im-fusion build.

--transposon_name

Name of the transposon sequence in the augmented reference. Should reflect the name used for the transposon sequence in the transposon.fa file that was used in im-fusion build.

--transposon_features

Path to a tab-separated file containing the description of the transposon features (splice acceptors/donors etc.). The file should contain the columns name, start, end, strand and type. Start, end and strand are used to define the position and orientation of the features. The type column indicates whether a feature represents a splice acceptor or donor feature and should contain only SA, SD or empty values. See the data directory on Github for examples.

--output_dir

The sample's output directory.

--sample_id

Sample id to use for the given sample. Defaults to the basename of the output directory.
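To make the layout of the --transposon_features file more concrete, the sketch below reads a small tab-separated table and checks it against the documented columns and type values. The read_features helper, the feature names, the coordinates and the 1/-1 strand encoding are all illustrative assumptions; the example files in the data directory on Github remain the authoritative reference.

```python
import csv
import io

REQUIRED_COLUMNS = ['name', 'start', 'end', 'strand', 'type']
ALLOWED_TYPES = {'SA', 'SD', ''}  # splice acceptor, splice donor, or empty


def read_features(handle):
    """Read a tab-separated features table and check the documented columns."""
    reader = csv.DictReader(handle, delimiter='\t')
    fieldnames = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in fieldnames]
    if missing:
        raise ValueError('missing columns: %s' % ', '.join(missing))
    features = []
    for row in reader:
        if row['type'] not in ALLOWED_TYPES:
            raise ValueError('unexpected type %r for feature %s'
                             % (row['type'], row['name']))
        row['start'], row['end'] = int(row['start']), int(row['end'])
        features.append(row)
    return features


# Made-up feature names, coordinates and strand values, for illustration only.
example = io.StringIO(
    'name\tstart\tend\tstrand\ttype\n'
    'promoter\t1\t500\t1\t\n'
    'donor\t501\t600\t1\tSD\n'
    'acceptor\t900\t1000\t-1\tSA\n'
)
for feature in read_features(example):
    print(feature['name'], feature['start'], feature['end'],
          feature['type'] or '-')
```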

Filtering options

--blacklist

IDs of genes for which insertions should be filtered (typically genes that share homologous sequences with the transposon).

--min_flank

Minimum required size of aligned sequences on either side of the fusion (default = 20).

--min_support

Minimum number of reads that should support any identified fusions (default = 2).
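How the three filtering options combine can be sketched in a few lines. The keep_insertion helper and its parameter names are hypothetical, and treating both minimums as inclusive bounds is an assumption rather than documented behaviour.

```python
def keep_insertion(gene_id, flank_a, flank_b, support,
                   blacklist=(), min_flank=20, min_support=2):
    """Return True if a candidate passes the blacklist, flank and support filters."""
    if gene_id in blacklist:
        return False
    # Both aligned flanks must reach the minimum size (assumed inclusive).
    if min(flank_a, flank_b) < min_flank:
        return False
    return support >= min_support


print(keep_insertion('ENSMUSG00000039095', 40, 35, 10,
                     blacklist={'ENSMUSG00000039095'}))  # blacklisted gene
print(keep_insertion('gene_a', 40, 12, 10))  # one flank shorter than 20
print(keep_insertion('gene_a', 40, 35, 2))   # meets both default minimums
```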

Tophat options

--transcriptome_index

Path to the transcriptome index. Only needs to be provided if the transcriptome index location differs from the default relative path used by im-fusion build.

--tophat_args

String with extra command-line arguments that should be passed to Tophat2 for the alignment.

Examples

For single-end data:

im-fusion insertions --fastq s1.R1.fastq.gz \
                     --reference_index augmented \
                     --reference_gtf reference.gtf \
                     --transposon_name T2Onc2 \
                     --transposon_features T2Onc2.features.txt \
                     --output_dir ./output/s1 \
                     --sample_id s1

For paired-end data, add the fastq2 argument:

im-fusion insertions --fastq s1.R1.fastq.gz \
                     --fastq2 s1.R2.fastq.gz \
                     --reference_index augmented \
                     --reference_gtf reference.gtf \
                     --transposon_name T2Onc2 \
                     --transposon_features T2Onc2.features.txt \
                     --output_dir ./output/s1 \
                     --sample_id s1

im-fusion expression

Description

im-fusion expression generates exon expression counts for an individual sample using the alignment generated by im-fusion insertions. The counts are generated using the featureCounts tool, which must be available in PATH. The generated counts are written to the sample's output directory as the tab-separated file ‘exon_counts.txt’.

Arguments

--sample_dir

Path to the sample directory. Effectively the same as output_dir in im-fusion insertions.

--exon_gtf

Path to the exon gtf file, which contains a flattened representation of the exons in the previously used reference gtf.

--sample_id

Sample id to use in the generated counts file. Should reflect the same sample id as used by im-fusion insertions. Defaults to the name of the input directory.

--paired

Generate counts by counting fragments instead of reads (for paired-end data).

--stranded

Perform strand-specific read counting. Possible values: 0 (unstranded), 1 (stranded) and 2 (reversely stranded). 0 by default.

--threads

Number of threads to use in featureCounts.

--extra_kwargs

Extra command line options to pass to featureCounts.

The easiest way to obtain the required exon gtf file is to derive it from the previously used reference.gtf file with the dexseq_prepare_annotation.py script from DEXSeq. After extracting the script from the DEXSeq package, the exon gtf can be generated using the following command:

python dexseq_prepare_annotation.py --aggr no reference.gtf exons.gtf

Examples

For single-end data:

im-fusion expression --sample_dir ./output/s1 \
                     --exon_gtf exons.gtf \
                     --sample_id s1

For paired-end data:

im-fusion expression --sample_dir ./output/s1 \
                     --exon_gtf exons.gtf \
                     --sample_id s1 \
                     --paired

im-fusion merge

Description

im-fusion merge merges individual samples into a combined dataset that can be used in the CTG analysis. The command effectively concatenates the individual results into a combined insertions.txt file and a combined exon_counts.txt file.
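The concatenation of per-sample insertion files can be sketched as below. This is an illustration under stated assumptions, not im-fusion's implementation: the merge_insertions helper is hypothetical, it assumes every sample file shares the same header, and the two columns in the demo are made up. The expression counts would need their own merging logic, which is not shown here.

```python
import csv
import tempfile
from pathlib import Path


def merge_insertions(base_dir, output_path, samples=None):
    """Concatenate base_dir/<sample>/insertions.txt files, writing the header once."""
    header_written = False
    n_rows = 0
    with open(output_path, 'w', newline='') as out_handle:
        writer = csv.writer(out_handle, delimiter='\t', lineterminator='\n')
        for sample_dir in sorted(p for p in Path(base_dir).iterdir() if p.is_dir()):
            if samples is not None and sample_dir.name not in samples:
                continue  # subset to the requested sample IDs
            table = sample_dir / 'insertions.txt'
            if not table.exists():
                continue
            with open(table, newline='') as in_handle:
                reader = csv.reader(in_handle, delimiter='\t')
                header = next(reader, None)
                if header is None:
                    continue  # empty file
                if not header_written:
                    writer.writerow(header)
                    header_written = True
                for row in reader:
                    writer.writerow(row)
                    n_rows += 1
    return n_rows


with tempfile.TemporaryDirectory() as tmp:
    for sample in ('s1', 's2', 's3'):
        sample_dir = Path(tmp) / sample
        sample_dir.mkdir()
        # Hypothetical columns; the real insertions.txt has its own layout.
        (sample_dir / 'insertions.txt').write_text(
            'id\tsample\n%s.INS_1\t%s\n' % (sample, sample))
    merged = Path(tmp) / 'merged.insertions.txt'
    print(merge_insertions(tmp, merged, samples={'s1', 's2'}))
    print(merged.read_text())
```

Passing samples=None keeps every sample directory, which mirrors running the command without --samples.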

Arguments

--base_dir

Path to a base directory that contains outputs for individual samples as sub-directories.

--output_base

Base name of the merged output files.

--samples

IDs of samples to subset the output to.

Examples

Without subsetting:

im-fusion merge --base_dir ./output \
                --output_base ./output/merged

With subsetting:

im-fusion merge --base_dir ./output \
                --output_base ./output/merged \
                --samples s1 s2

im-fusion ctg

Description

im-fusion ctg uses the combined insertions/expression dataset to identify genes that are recurrently mutated across samples AND differentially expressed by their insertions. To do so, the command performs two distinct significance tests. The first test compares the number of insertions within a gene to what would be expected by chance (modeled using a Poisson distribution). Genes with significant p-values are selected as Commonly Targeted Genes (CTGs).
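The idea behind the first test can be illustrated with an upper-tail Poisson probability. This is a sketch under simplifying assumptions: the poisson_sf helper is hypothetical, the expected count is derived here from a made-up share of candidate integration sites, and multiple-testing correction is left out. The publication describes the model that is actually used.

```python
import math


def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam), computed as 1 - P(X <= k - 1)."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)  # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= lam / i    # P(X = i) from P(X = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)


# Made-up numbers: 100 insertions in total, and a gene window that holds
# 2,000 of the 2,000,000 candidate integration sites in the genome.
total_insertions = 100
expected = total_insertions * 2000 / 2000000  # expected insertions by chance
observed = 3
print(poisson_sf(observed, expected))
```

A small probability means that the observed number of insertions in the window is unlikely under the chance model, which is what marks a gene as a candidate CTG.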

The second test is a differential expression test that compares the ratio of expression before/after insertions in a CTG between samples with/without insertions in the CTG. A CTG is considered to be differentially expressed if samples with an insertion in the gene show a significant increase/decrease in expression after the insertion site compared to samples without an insertion. Genes that pass both tests are written to an output file.
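The before/after ratio that this test is based on can be sketched as follows. The after_before_ratio helper, the pseudocount and the exon counts are assumptions made for illustration; the statistical comparison between the two sample groups is not shown, and the publication describes the procedure that is actually used.

```python
def after_before_ratio(exon_counts, split_index, pseudocount=1.0):
    """Ratio of summed counts for exons after vs. before the insertion site.

    exon_counts is ordered along the direction of transcription and
    split_index is the index of the first exon after the insertion site.
    The pseudocount (an assumption) avoids division by zero.
    """
    before = sum(exon_counts[:split_index])
    after = sum(exon_counts[split_index:])
    return (after + pseudocount) / (before + pseudocount)


# Made-up exon counts for one gene with an insertion after the second exon.
with_insertion = [[10, 12, 80, 95], [8, 11, 70, 88]]
without_insertion = [[11, 9, 10, 12], [14, 13, 12, 15]]

print([round(after_before_ratio(c, 2), 2) for c in with_insertion])
print([round(after_before_ratio(c, 2), 2) for c in without_insertion])
```

In this toy data the samples with an insertion show a much higher ratio than the samples without one, which is the kind of shift the differential expression test looks for.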

Note that the differential expression test can be omitted by not providing the expression file as an argument. In this case, only the CTG test is performed.

For more details on the implementation of the tests, please see our publication.

Arguments

Basic arguments

--insertions

Path to the merged insertions file from im-fusion merge.

--reference_seq

Path to the reference genome sequence (in fasta format). Can either be the augmented reference genome or the original reference.

--reference_gtf

Path to the reference gtf file. Typically the same file as used in im-fusion build.

--output

Path for the output CTG file.

--threshold

Maximum corrected p-value for a gene to be selected as a CTG (default = 0.05).

--pattern

Regular expression reflecting the nucleotide sequence at which the used transposon typically integrates (if any). Used to correct for sequence integration biases along the genome. For example, the pattern (AT|TA) is used for the T2Onc2 transposon, which is biased towards integrations at TA sites.

--window

Window around the gene within which the gene is tested for enrichment in insertions.
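To illustrate what a --pattern value such as (AT|TA) matches, the sketch below counts candidate integration sites in a short sequence. The count_sites helper is hypothetical, and counting non-overlapping, case-insensitive matches is an assumption about how such a pattern might be applied, not a description of im-fusion's internals.

```python
import re


def count_sites(sequence, pattern):
    """Count non-overlapping, case-insensitive matches of pattern in sequence."""
    regex = re.compile(pattern, re.IGNORECASE)
    return sum(1 for _ in regex.finditer(sequence))


# Made-up sequence: contains one TA and one AT dinucleotide.
print(count_sites('GGTACCATGG', '(AT|TA)'))
```

Site counts of this kind could be used to scale the number of insertions expected by chance in a region, so that regions rich in preferred integration sites are not flagged simply because they offer more targets.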

Insertion selection

--chromosomes

Chromosomes to consider. Used to omit specific chromosomes from the CTG analysis.

--min_depth

Minimum supporting number of reads for insertions to be included in the CTG analysis. Can be used to omit insertions with low support for more confidence in the analysis.

Differential expression

--expression

Path to the merged expression file from im-fusion merge.

--exon_gtf

Path to the exon gtf file. Typically the same file as used in im-fusion expression.

--de_threshold

Maximum p-value for a CTG to be considered as differentially expressed.

Examples

With differential expression:

im-fusion ctg --insertions ./merged.insertions.txt  \
              --expression ./merged.exon_counts.txt \
              --reference_seq ./reference.fa \
              --reference_gtf ./reference.gtf \
              --exon_gtf ./exons.gtf \
              --output ctgs.txt

Without differential expression:

im-fusion ctg --insertions ./merged.insertions.txt  \
              --reference_seq ./reference.fa \
              --reference_gtf ./reference.gtf \
              --output ctgs.txt

With non-default significance thresholds:

im-fusion ctg --insertions ./merged.insertions.txt \
              --expression ./merged.exon_counts.txt \
              --reference_seq ./reference.fa \
              --reference_gtf ./reference.gtf \
              --exon_gtf ./exons.gtf \
              --output ctgs.txt \
              --threshold 0.01 \
              --de_threshold 0.1