Command-line usage

im-fusion build

Description

im-fusion build is used to create the augmented reference genome that is required to identify gene-transposon fusions using Tophat2.

Arguments

--reference_seq
 
Path to the reference sequence (in Fasta
format).
--reference_gtf
 
Path to the reference gtf file. (Expected to
conform to Ensembl's GTF file structure.)
--transposon_seq
 
Path to the transposon sequence (in Fasta
format). This file should typically contain
a single sequence.
--output
Output path for the augmented reference.
--blacklist_regions
 
Regions of the reference to blacklist. Should
be specified as ‘chromosome:start-end’.
--blacklist_genes
 
Genes to blacklist. Should be specified as
gene IDs (Ensembl IDs for Ensembl gtfs).
--no_index
Suppresses building of the bowtie index.
--no_transcriptome_index
 
Suppresses building of the transcriptome index.
--force_overwrite
 
Overwrite any existing files.
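
The ‘chromosome:start-end’ format expected by --blacklist_regions can be checked before starting a long build. The sketch below is a hypothetical helper (is_valid_region is not part of im-fusion) and assumes that start and end are unsigned integers with start not larger than end:

```shell
# Hypothetical pre-flight check for --blacklist_regions values.
# Assumes the form 'chromosome:start-end' with integer coordinates.
is_valid_region() {
    region="$1"
    chrom="${region%%:*}"    # text before the first ':'
    range="${region#*:}"     # text after the first ':'
    start="${range%%-*}"     # text before the first '-' in the range
    end="${range#*-}"        # text after the first '-' in the range

    # The value must contain both separators and a chromosome name.
    case "$region" in *:*-*) ;; *) return 1 ;; esac
    [ -n "$chrom" ] || return 1

    # Both coordinates must be non-empty and consist of digits only.
    case "$start" in ''|*[!0-9]*) return 1 ;; esac
    case "$end" in ''|*[!0-9]*) return 1 ;; esac

    # Succeeds only if start does not exceed end.
    [ "$start" -le "$end" ]
}

is_valid_region "13:31623816-31633406" && echo "valid region"
is_valid_region "13:200-100" || echo "invalid region: start exceeds end"
```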

Examples

Basic:

im-fusion build --reference_seq ./reference.fa \
                --reference_gtf ./reference.gtf \
                --transposon_seq ./transposon.fa \
                --output ./augmented.fa

With blacklisted region:

im-fusion build --reference_seq ./reference.fa \
                --reference_gtf ./reference.gtf \
                --transposon_seq ./transposon.fa \
                --blacklist_regions 13:31623816-31633406 \
                --output ./augmented.fa

With blacklisted gene:

im-fusion build --reference_seq reference.fa \
                --reference_gtf reference.gtf \
                --transposon_seq transposon.fa \
                --blacklist_genes ENSMUSG00000039095 \
                --output ./augmented.fa

im-fusion insertions

Description

im-fusion insertions identifies insertions in a sample by using Tophat2 to identify gene-transposon fusions in RNA-sequencing data. To do so, the command first calls Tophat2 to align reads to the augmented reference genome and to identify gene fusions. It then parses the fusions identified by Tophat (from Tophat’s fusion.out output file) and selects fusions between the endogenous reference and the transposon sequence. These gene-transposon fusions are annotated to identify the involved gene(s) and transposon feature(s) and converted to insertions with an (approximate) genomic location. Insertions are written to a tab-separated file (insertions.txt) in the output directory.

Arguments

Basic arguments

--fastq
Path(s) to the sample's fastq files.
--fastq2
Path(s) to the fastq files containing the second
mates (for paired-end sequencing data). Should be
given in the same order as for --fastq.
--reference_index
 
Path to the index of the augmented reference
generated by im-fusion build.
--reference_gtf
 
Path to the reference gtf file. Typically the
same file as used in im-fusion build.
--transposon_name
 
Name of the transposon sequence in the augmented
reference. Should reflect the name used for the
transposon sequence in the transposon.fa
file that was used in im-fusion build.
--transposon_features
 
Path to a tab-separated file containing the
description of the transposon features (splice
acceptors/donors etc.). The file should contain
the columns name, start, end, strand and type.
Start, end and strand are used to define
the position and orientation of the features.
The type column indicates whether a feature
represents a splice acceptor or donor feature and
should contain only SA, SD or empty values. See
the data directory on Github for examples.
--output_dir
The sample's output directory.
--sample_id
Sample id to use for the given sample. Defaults
to the basename of the output directory.
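
As an illustration of the --transposon_features file layout described above, the following sketch writes a small tab-separated file with the columns name, start, end, strand and type, then checks it. The feature names and coordinates are placeholders rather than real T2Onc2 annotations, and the 1/-1 strand encoding is an assumption; the example files in the data directory on Github are the authoritative reference.

```shell
# Write a placeholder features file (values are illustrative only).
features_file="example.features.txt"
{
    printf 'name\tstart\tend\tstrand\ttype\n'
    printf 'ExampleSA\t100\t250\t-1\tSA\n'      # splice acceptor
    printf 'ExampleSD\t900\t1000\t1\tSD\n'      # splice donor
    printf 'ExamplePromoter\t300\t800\t1\t\n'   # other feature: empty type
} > "$features_file"

# Sanity check: each data row needs five tab-separated fields, and the
# type column may only contain SA, SD or an empty value.
awk -F'\t' '
    NR > 1 && (NF != 5 || ($5 != "SA" && $5 != "SD" && $5 != "")) { bad = 1 }
    END { exit bad + 0 }
' "$features_file" && echo "features file looks well-formed"
```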

Filtering options

--blacklist
IDs of genes for which insertions
should be filtered (typically genes that share
homologous sequences with the transposon).
--min_flank
Minimum required size of aligned sequences on
either side of the fusion (default = 20).
--min_support
Minimum number of reads that should support
any identified fusions (default = 2).

Tophat options

--transcriptome_index
 
Path to the transcriptome index. Only needs to
be provided if the transcriptome index location
differs from the default relative path used by
im-fusion build.
--tophat_args
String with extra commandline arguments that
should be passed to Tophat2 for the alignment.

Examples

For single-end data:

im-fusion insertions --fastq s1.R1.fastq.gz \
                     --reference_index augmented \
                     --reference_gtf reference.gtf \
                     --transposon_name T2Onc2 \
                     --transposon_features T2Onc2.features.txt \
                     --output_dir ./output/s1 \
                     --sample_id s1

For paired-end data add the fastq2 argument:

im-fusion insertions --fastq s1.R1.fastq.gz \
                     --fastq2 s1.R2.fastq.gz \
                     --reference_index augmented \
                     --reference_gtf reference.gtf \
                     --transposon_name T2Onc2 \
                     --transposon_features T2Onc2.features.txt \
                     --output_dir ./output/s1 \
                     --sample_id s1
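
Because im-fusion insertions processes one sample at a time, a cohort is typically handled with a loop. The sketch below is one possible way to do this for single-end data; the fastq/ directory layout, the run wrapper and the process_samples function are assumptions made for illustration. By default the commands are only printed, so the loop can be inspected before anything is executed.

```shell
# Print commands by default; set DRY_RUN=0 to actually execute them.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Call im-fusion insertions once per sample id given as an argument.
# Assumes (hypothetically) that reads live in fastq/<sample>.R1.fastq.gz.
process_samples() {
    for sample in "$@"; do
        run im-fusion insertions \
            --fastq "fastq/${sample}.R1.fastq.gz" \
            --reference_index augmented \
            --reference_gtf reference.gtf \
            --transposon_name T2Onc2 \
            --transposon_features T2Onc2.features.txt \
            --output_dir "./output/${sample}" \
            --sample_id "${sample}"
    done
}

process_samples s1 s2
```

Writing each sample to its own sub-directory of ./output keeps the layout compatible with the --base_dir argument of im-fusion merge.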

im-fusion expression

Description

im-fusion expression generates exon expression counts for an individual sample using the alignment generated by im-fusion insertions. The counts are generated using the featureCounts tool, which must be available in PATH. The generated counts are written to the sample's output directory as the tab-separated file ‘exon_counts.txt’.

Arguments

--sample_dir
Path to the sample directory. Effectively the
same as output_dir in im-fusion insertions.
--exon_gtf
Path to the exon gtf file, which contains
a flattened representation of the exons in
the previously used reference gtf.
--sample_id
Sample id to use in the generated counts file.
Should reflect the same sample id as used by
im-fusion insertions. Defaults to the name
of the input directory.
--paired
Generate counts by counting fragments instead of
reads (for paired-end data).
--stranded
Perform strand-specific read counting. Possible
values: 0 (unstranded), 1 (stranded) and 2
(reversely stranded). 0 by default.
--threads
Number of threads to use in featureCounts.
--extra_kwargs
Extra command line options to pass to
featureCounts.

The easiest way to generate the required exon gtf file is to generate it from the previously used reference.gtf file using the dexseq_prepare_annotation.py script from DEXSeq. After extracting the script from the DEXSeq package, the exon gtf can be generated using the following command:

python dexseq_prepare_annotation.py --aggr no reference.gtf exons.gtf

Examples

For single-end data:

im-fusion expression --sample_dir ./output/s1 \
                     --exon_gtf exons.gtf \
                     --sample_id s1

For paired-end data:

im-fusion expression --sample_dir ./output/s1 \
                     --exon_gtf exons.gtf \
                     --sample_id s1 \
                     --paired

im-fusion merge

Description

im-fusion merge merges individual samples into a combined dataset that can be used in the CTG analysis. The command effectively concatenates the individual results into a combined insertions.txt file and a combined exon_counts.txt file.
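
To give an idea of what concatenating per-sample results involves, the sketch below joins tab-separated files while keeping only the first header line. It is not the implementation used by im-fusion merge; the concat_with_header helper and the two-column demo files are hypothetical and do not reflect the real insertions.txt layout.

```shell
# Concatenate files that share a header, keeping the header only once.
concat_with_header() {
    # Skip the first line of every file except the first one.
    awk 'FNR == 1 && NR != 1 { next } { print }' "$@"
}

# Tiny demonstration with placeholder columns.
demo_dir="$(mktemp -d)"
printf 'id\tsample\ns1.INS_1\ts1\n' > "$demo_dir/s1.txt"
printf 'id\tsample\ns2.INS_1\ts2\n' > "$demo_dir/s2.txt"

concat_with_header "$demo_dir/s1.txt" "$demo_dir/s2.txt"
```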

Arguments

--base_dir
Path to a base directory that contains outputs
for individual samples as sub-directories.
--output_base
Base name of the merged output files.
--samples
IDs of samples to subset the output to.

Examples

Without subsetting:

im-fusion merge --base_dir ./output \
                --output_base ./output/merged

With subsetting:

im-fusion merge --base_dir ./output \
                --output_base ./output/merged \
                --samples s1 s2

im-fusion ctg

Description

im-fusion ctg uses the combined insertions/expression dataset to identify genes that are both recurrently mutated across samples and differentially expressed as a result of their insertions. To do so, the command performs two distinct significance tests. The first test compares the number of insertions within a gene to what would be expected by chance (modeled using a Poisson distribution). Genes with significant p-values are selected as Commonly Targeted Genes (CTGs).
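
As a rough sketch of the first test, write k_g for the number of insertions observed within the window of gene g and λ_g for the number expected by chance. Both symbols are notation introduced here, and the one-sided form is an assumption; how λ_g is actually derived is described in the publication. Under a Poisson null model, the p-value is the probability of observing at least k_g insertions:

```latex
X_g \sim \mathrm{Poisson}(\lambda_g), \qquad
p_g = P(X_g \ge k_g)
    = 1 - \sum_{i=0}^{k_g - 1} \frac{e^{-\lambda_g}\,\lambda_g^{\,i}}{i!}
```

After correction for multiple testing, these p-values are compared against the value given by --threshold.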

The second test is a differential expression test that compares the ratio of expression before/after insertions in a CTG between samples with/without insertions in the CTG. A CTG is considered to be differentially expressed if samples with an insertion in the gene show a significant increase/decrease in expression after the insertion site compared to samples without an insertion. Genes that pass both tests are written to an output file.

Note that the differential expression test can be omitted by not providing the expression file as an argument. In this case, only the CTG test is performed.

For more details on the implementation of the tests, please see our publication.

Arguments

Basic arguments

--insertions
Path to the merged insertions file from
im-fusion merge.
--reference_seq
 
Path to the reference genome sequence (in
fasta format). Can either be the augmented
reference genome or the original reference.
--reference_gtf
 
Path to the reference gtf file. Typically the
same file as used in im-fusion build.
--output
Path for the output CTG file.
--threshold
Maximum corrected p-value for a gene to be
selected as a CTG (default = 0.05).
--pattern
Regular expression reflecting the nucleotide
sequence at which the transposon used typically
integrates (if any). Used to correct for
sequence integration biases along the genome.
For example, the pattern (AT|TA) is used for
the T2Onc2 transposon, which is biased towards
integrations at TA sites.
--window
Window around the gene within which we test
a given gene for enrichment in insertions.
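
To get a feel for what a --pattern expression matches, it can be tried on a short sequence. The count_sites helper below is hypothetical and only counts non-overlapping matches as reported by grep -o; it is not how im-fusion ctg computes its correction.

```shell
# Count non-overlapping matches of an extended regular expression
# in a nucleotide sequence. $1: pattern, $2: sequence.
count_sites() {
    printf '%s\n' "$2" | grep -oE "$1" | wc -l | tr -d ' '
}

# 'GGTACCATGG' contains one TA and one AT match.
count_sites '(AT|TA)' 'GGTACCATGG'   # prints 2
```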

Insertion selection

--chromosomes
Chromosomes to consider. Used to omit
specific chromosomes from the CTG analysis.
--min_depth
Minimum supporting number of reads for insertions
to be included in the CTG analysis. Can be used
to omit insertions with low support for more
confidence in the analysis.

Differential expression

--expression
Path to the merged expression file from
im-fusion merge.
--exon_gtf
Path to the exon gtf file. Typically the
same file as used in im-fusion expression.
--de_threshold
Maximum p-value for a CTG to be considered
as differentially expressed.

Examples

With differential expression:

im-fusion ctg --insertions ./merged.insertions.txt  \
              --expression ./merged.exon_counts.txt \
              --reference_seq ./reference.fa \
              --reference_gtf ./reference.gtf \
              --exon_gtf ./exons.gtf \
              --output ctgs.txt

Without differential expression:

im-fusion ctg --insertions ./merged.insertions.txt  \
              --reference_seq ./reference.fa \
              --reference_gtf ./reference.gtf \
              --output ctgs.txt

With non-default significance thresholds:

im-fusion ctg --insertions ./merged.insertions.txt  \
              --expression ./merged.exon_counts.txt \
              --reference_seq ./reference.fa \
              --reference_gtf ./reference.gtf \
              --exon_gtf ./exons.gtf \
              --output ctgs.txt \
              --threshold 0.01 \
              --de_threshold 0.1