API

imfusion.aligners

The aligners module contains the different aligners that can be used to identify insertion sites from gene-transposon fusions. Currently only Tophat2 is supported, though other aligners may be added in the future. The aligner modules only implement functionality that is specific to the aligner. Generic code for building reference genomes and identifying insertions is contained in the util modules util.fusions, util.insertions and util.reference.

imfusion.aligners.tophat2

The tophat2 submodule contains the functionality that is used to identify insertions from gene-transposon fusions using Tophat2. The module contains two main functions: build_reference and identify_insertions. The function build_reference is used to build an augmented reference genome and the corresponding indices for Tophat2. This augmented reference can then be supplied to identify_insertions to perform the actual insertion identification.

imfusion.aligners.tophat2.build_reference(ref_seq_path, ref_gtf_path, tr_seq_path, output_path, blacklist_regions=None, create_index=True, create_transcriptome_index=True)

Builds an augmented reference genome for Tophat2.

This is the main function responsible for building augmented reference genomes, which are used as a reference for Tophat2 when identifying transposon insertions in RNA-seq samples from an insertional mutagenesis screen. An augmented reference contains an extra sequence which reflects the sequence of the transposon that was used in the screen.

Optionally, specific regions of the original reference can be blacklisted if they contain sequences that are also present in the transposon sequence, as these may be problematic in the alignment. For example, in our Sleeping Beauty screen we masked regions in the Foxf2 and En2 genes, as the T2onc2 transposon we used contains sequences from these genes. Sequences in blacklisted regions are replaced by Ns, effectively removing the regions from the reference.

Parameters:
  • ref_seq_path (pathlib.Path) – Path to the fasta file containing the reference genome.
  • ref_gtf_path (pathlib.Path) – Path to the gtf file containing the genes of the reference genome. Assumed to have the same structure as Ensembl's reference GTF files.
  • tr_seq_path (pathlib.Path) – Path to the fasta file containing the transposon sequence.
  • output_path (pathlib.Path) – Output path for the augmented reference genome (which is written as a new fasta file).
  • blacklist_regions (List[tuple(str, int, int)]) – List of regions that should be blacklisted in the augmented reference genome. Regions are specified as a tuple of (chromosome, start_position, end_position). For example: ('1', 2000, 2200).
  • create_index (bool) – Whether a bowtie index should be created.
  • create_transcriptome_index (bool) – Whether a Tophat2 transcriptome index should be created.

imfusion.aligners.tophat2.identify_insertions(fastqs, index_path, reference_gtf_path, transposon_name, transposon_features, sample_id, work_dir, min_flank, tophat_kws=None, transcriptome_index=None)

Identifies insertions from RNA-seq fusions using Tophat2.

Main function for identifying fusions from RNA-seq fastq files using Tophat2. The function essentially consists of four main steps:

  • Identification of gene-transposon fusions using Tophat2.
  • Annotation of the identified fusions with gene/transposon features.
  • Derivation of approximate locations for the corresponding insertions.
  • Filtering of fusions that are biologically implausible (for example, due to their relative orientation).

The function returns the list of insertions that were identified by Tophat2. The generated alignment is also symlinked into the work directory as ‘alignment.bam’ for convenient access.

Parameters:
  • fastqs (list[pathlib.Path] or list[tuple(pathlib.Path, pathlib.Path)]) – Paths to the fastq files that should be used for the Tophat2 alignment. Can be given as a list of file paths for single-end sequencing data, or a list of path tuples for paired-end sequencing data. The fastqs are treated as belonging to a single sample.
  • index_path (pathlib.Path) – Path to the bowtie index of the (augmented) genome that should be used in the alignment. This index is typically generated by the build_reference function.
  • reference_gtf_path (pathlib.Path) – Path to the gtf file containing genomic features. This file is used by Tophat2 for known gene features and for the annotation of gene features for identified fusions.
  • transposon_name (str) – Name of the transposon sequence in the augmented reference genome.
  • transposon_features (pandas.DataFrame) – Dataframe containing positions for the features present in the transposon sequence. Used to identify transposon features (such as splice acceptors or donors) that are involved in the identified fusions.
  • sample_id (str) – Sample name that the identified insertions should be assigned to.
  • work_dir (pathlib.Path) – Path to the working directory.
  • min_flank (int) – Minimum amount of flanking region that should be surrounding the fusion. Used by Tophat2 in its identification of fusions during the alignment.
  • tophat_kws (dict) – Dict of extra arguments for Tophat2.
Yields:

Insertion – Next insertion that was identified in the given sample.
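The two accepted layouts for fastqs (plain paths for single-end data, path tuples for paired-end data) can be told apart by checking whether the entries are tuples. The following is a minimal sketch, not the library's implementation: fastq_args is a hypothetical helper, the file names are made up, and the assumption is that the reads are eventually handed to the Tophat2 command line as comma-separated lists (one list for single-end data, two for paired-end data).

```python
from pathlib import Path


def fastq_args(fastqs):
    """Flatten fastq inputs into comma-separated read arguments.

    Hypothetical helper for illustration; not part of imfusion.
    Single-end input yields one argument, paired-end input yields two.
    """
    if not fastqs:
        raise ValueError("At least one fastq file is required")

    if all(isinstance(item, tuple) for item in fastqs):
        # Paired-end: split the (mate1, mate2) tuples into two lists.
        mates1, mates2 = zip(*fastqs)
        return [",".join(str(p) for p in mates1),
                ",".join(str(p) for p in mates2)]

    if any(isinstance(item, tuple) for item in fastqs):
        raise ValueError("Cannot mix single-end and paired-end inputs")

    # Single-end: one comma-separated list of files.
    return [",".join(str(p) for p in fastqs)]


# Made-up file names, all treated as belonging to one sample.
single = [Path("s1_L1.fastq.gz"), Path("s1_L2.fastq.gz")]
paired = [(Path("s1_L1_R1.fastq.gz"), Path("s1_L1_R2.fastq.gz"))]

print(fastq_args(single))
print(fastq_args(paired))
```

Rejecting mixed input early keeps the single-sample assumption explicit instead of silently dropping mates.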

imfusion.aligners.tophat2.get_version(path=None)

Get the version of Tophat2 in path.

Parameters:
  • path (pathlib.Path) – Path to use for the Tophat2 executable.
Returns:

Version of the Tophat2 executable in path.

Return type:

str

imfusion.aligners.tophat2.get_bowtie_version(path=None)

Get the version of Bowtie in path.

Parameters:
  • path (pathlib.Path) – Path to use for the Bowtie executable.
Returns:

Version of the Bowtie executable in path.

Return type:

str

imfusion.ctg

The ctg module contains the functions for identifying commonly targeted genes (CTGs) from a collection of insertions from multiple samples. The module contains two main functions: test_ctgs and test_de. The test_ctgs function performs the actual enrichment test and returns CTGs and their corresponding (corrected) p-values. The function test_de takes the CTG frame and tests each of the genes for differential expression, filtering out CTGs that are not significantly differentially expressed.

imfusion.ctg.test_ctgs(insertion_frame, reference_seq, reference_gtf, chromosomes=None, pattern=None, gene_ids=None, per_sample=True, window=None, threshold=0.05)

Identifies genes that are significantly enriched for insertions (CTGs).

This function takes a DataFrame of insertions, coming from multiple samples, and identifies if any genes are more frequently affected by an insertion than would be expected by chance. These genes are called Commonly Targeted Genes (CTGs). CTGs are selected by comparing the number of insertions within the gene to the number of insertions that would be expected from the background insertion rate, which is modeled using a Poisson distribution.

Parameters:
  • insertion_frame (pd.DataFrame) – Insertions to test (in DataFrame format).
  • reference_seq (pyfaidx.Fasta) – Fasta sequence of the reference genome.
  • reference_gtf (GtfFile) – GtfFile containing reference genes.
  • chromosomes (list[str]) – List of chromosomes to include, defaults to all chromosomes in reference_gtf.
  • pattern (str) – Specificity pattern of the used transposon.
  • gene_ids (list[str]) – List of genes to test (defaults to all genes with an insertion).
  • per_sample (bool) – Whether to perform the per-sample test (recommended), which effectively collapses insertions per sample/gene combination. This avoids counting an insertion multiple times when it is detected more than once or may have hopped within the gene locus.
  • window (tuple(int, int)) – Window to include around gene (in bp). Specified as (upstream_dist, downstream_dist). For example: (-2000, 2000) specifies a window extending 2 kb upstream and downstream of each gene.
  • threshold (float) – Maximum p-value for selected CTGs.
Returns:

Results of the CTG test for the tested genes. Contains three columns: gene_id, p_val and p_val_corr. The last column, p_val_corr, represents the p-value of the gene after correcting for multiple testing using Bonferroni correction.

Return type:

pandas.DataFrame
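The enrichment test described for test_ctgs can be illustrated with a small standard-library sketch. It is a simplification under stated assumptions, not the library's implementation: the background rate is taken to be uniform over the genome (the pattern and window options are ignored), insertions are plain (sample, gene_id) pairs, and ctg_pvalues, poisson_sf and the toy gene lengths are hypothetical names and values.

```python
import math
from collections import Counter


def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam), computed from the lower tail."""
    term = math.exp(-lam)  # P(X = 0)
    cdf = 0.0
    for i in range(k):
        cdf += term
        term *= lam / (i + 1)  # P(X = i + 1) from P(X = i)
    return max(0.0, 1.0 - cdf)


def ctg_pvalues(insertions, gene_lengths, genome_length, threshold=0.05):
    """Return {gene_id: (p_val, p_val_corr)} for genes passing the threshold.

    Assumes a uniform background insertion rate (illustrative only).
    """
    # Per-sample test: collapse repeated insertions per sample/gene pair.
    collapsed = set(insertions)
    observed = Counter(gene for _, gene in collapsed)
    total = len(collapsed)

    results = {}
    for gene, count in observed.items():
        # Expected number of insertions in the gene under the background.
        lam = total * gene_lengths[gene] / genome_length
        p_val = poisson_sf(count, lam)
        # Bonferroni: multiply by the number of tested genes, cap at 1.
        p_corr = min(1.0, p_val * len(observed))
        results[gene] = (p_val, p_corr)

    return {g: r for g, r in results.items() if r[1] <= threshold}


# Toy data: geneA is small but hit in four samples, geneB is large.
toy_insertions = [
    ("s1", "geneA"), ("s1", "geneA"),  # duplicate, collapsed per sample
    ("s2", "geneA"), ("s3", "geneA"), ("s4", "geneA"),
    ("s1", "geneB"), ("s2", "geneB"),
]
toy_lengths = {"geneA": 10_000, "geneB": 500_000}

ctgs = ctg_pvalues(toy_insertions, toy_lengths, genome_length=1_000_000)
print(sorted(ctgs))
```

In this toy example only geneA survives the correction: four hits in a gene covering 1% of the genome are far more than the background predicts, whereas two hits in a gene covering half the genome are not.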

imfusion.ctg.test_de(ctgs, insertions, dexseq_gtf, exon_counts_path, threshold=0.05)

Tests identified CTGs for differential expression.

This function takes the CTG frame produced by test_ctgs and tests each of the identified CTGs for differential expression using the groupwise exon-level differential expression test (de_exon). The resulting DE p-values are added to the DataFrame, and CTGs that are not differentially expressed are dropped.

Parameters:
  • ctgs (pandas.DataFrame) – DataFrame containing the identified CTGs (as generated by test_ctgs).
  • insertions (List[Insertion]) – List of insertions to use in the test. Should be the same insertions as used to identify CTGs.
  • dexseq_gtf (imfusion.util.tabix.GtfFile) – GtfFile instance containing the flattened exon representation of the original reference_gtf. The corresponding gtf file is typically generated using DEXSeq's script for preparing exon annotations.
  • exon_counts_path (pathlib.Path) – Path to the file containing exon counts for all samples.
  • threshold (float) – Maximum p-value for differential expression.
Returns:

CTG DataFrame containing the differential expression test results.

Return type:

pandas.DataFrame

imfusion.expression

The expression module contains functionality for generating expression counts and testing for differential expression. The counts submodule handles the count generation using the featureCounts tool. The de_test submodule implements the various differential expression tests.

imfusion.expression.counts

imfusion.expression.counts.exon_counts(bam_files, gff_path, names=None, extra_kws=None, **kwargs)

Generates exon counts for given bam files using featureCounts.

This function is used to generate an m-by-n matrix (m = number of exons, n = number of samples) of exon expression counts. This matrix is generated using featureCounts, whose results are then parsed and returned.

Parameters:
  • bam_files (list[pathlib.Path]) – List of paths to the bam files for which counts should be generated.
  • gff_path (pathlib.Path) – Path to the gene feature file containing gene features.
  • names (dict[str, str]) – Alternative names to use for the given bam files. Keys of the dict should correspond to bam file paths, values should reflect the sample names that should be used in the resulting count matrix.
  • extra_kws (dict[str, any]) – Dictionary of extra arguments that should be passed to featureCounts. Keys should correspond to argument names (including dashes), values should correspond to the argument value. Arguments without values (flags) should be given with the boolean value True.
  • **kwargs – Any kwargs are passed to feature_counts.

Returns:

DataFrame containing counts. The index of the DataFrame contains gene ids corresponding to exons in the gff file; the columns correspond to samples/bam files. Column names are either the bam file paths or, if given, the alternative sample names.

Return type:

pandas.DataFrame
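The extra_kws convention (keys are argument names including dashes, flags carry the value True) maps naturally onto a flat argument list. The sketch below shows one way this could be done; flatten_kws is a hypothetical helper rather than the library's code, and skipping False/None values is an assumption that the description above does not cover.

```python
def flatten_kws(extra_kws):
    """Convert a dict of options into a flat list of command-line arguments.

    Hypothetical helper for illustration; not part of imfusion.
    """
    args = []
    # Sort keys so the generated command line is reproducible.
    for name, value in sorted(extra_kws.items()):
        if value is True:
            args.append(name)  # flag without a value
        elif value is False or value is None:
            continue  # assumption: disabled options are simply omitted
        else:
            args.extend([name, str(value)])
    return args


# -T (threads), -Q (minimum mapping quality) and --primary are
# featureCounts options; the values here are arbitrary.
print(flatten_kws({"-T": 4, "--primary": True, "-Q": 10}))
```

Building a list rather than a single string avoids shell quoting problems when the arguments are later passed to a subprocess call.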

imfusion.expression.counts.feature_counts(bam_files, gff_path, names=None, extra_kws=None, tmp_dir=None, keep_tmp=False)

Runs featureCounts on bam files with given options.

Main function used to run featureCounts. Used by gene_counts and exon_counts to generate expression counts.

Parameters:
  • bam_files (list[pathlib.Path]) – List of paths to the bam files for which counts should be generated.
  • gff_path (pathlib.Path) – Path to the gff file containing gene features.
  • names (dict[str, str]) – Dictionary with sample names, used to rename columns from file paths to sample names. Keys of the Dictionary should correspond with the bam file paths, values should reflect the desired sample name for the respective bam file.
  • extra_kws (dict[str, any]) – Dictionary containing extra command line arguments that should be passed to featureCounts.
  • tmp_dir (pathlib.Path) – Temp directory to use for the generated counts.
  • keep_tmp (bool) – Whether to keep the temp directory (default = False).
Returns:

DataFrame containing feature counts for the given bam files. The rows correspond to the counted features, the columns correspond to the index values (chromosome, position etc.) and the bam files.

Return type:

pandas.DataFrame

imfusion.expression.de_test

imfusion.expression.de_test.de_exon(insertions, gene_id, dexseq_gtf, exon_counts, pos_samples=None, neg_samples=None)

Performs the groupwise exon-level differential expression test.

Tests if the expression of exons after the insertion site(s) in a gene is significantly increased or decreased in samples with an insertion (pos_samples) compared to samples without an insertion (neg_samples). The test is performed by comparing normalized counts after the insertion sites between samples with and without an insertion in the gene, using the non-parametric Mann-Whitney-U test.

Note that the before/after split for the groupwise test is taken as the common set of before/after exons over all samples with an insertion. In cases where either set is empty, for example due to insertions before the first exon of the gene, we attempt to drop samples that prevent a proper split and perform the test without these samples.

Parameters:
  • insertions (pandas.DataFrame) – DataFrame containing all insertions.
  • gene_id (str) – ID of the gene of interest. Should correspond with a gene in the DEXSeq gtf file.
  • dexseq_gtf (GtfFile) – Gtf file containing exon features generated using DEXSeq. Can either be given as a GtfFile object or as a string specifying the path to the gtf file.
  • exon_counts (pandas.DataFrame or pathlib.Path) – DataFrame containing exon counts. The DataFrame is expected to contain samples as columns, and have a multi-index containing the chromosome, start, end and strand of the exon. This index should correspond with the annotation in the DEXSeq gtf. The samples should correspond with samples in the insertions frame. If a Path is given, it should point to a TSV file containing the counts.
  • pos_samples (set[str]) – Set of positive samples (with insertion) to use in the test. Defaults to all samples with an insertion in the gene of interest.
  • neg_samples (set[str]) – Set of negative samples (without insertion) to use in the test. Defaults to all samples not in the positive set.
Returns:

Result of the differential expression test.

Return type:

DeExonResult
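The groupwise comparison in de_exon rests on the Mann-Whitney U test applied to the normalized 'after' counts of the two sample groups. The following standard-library sketch shows the idea under simplifying assumptions: it uses the normal approximation without tie or continuity correction, the counts are made up, and mann_whitney_u is a hypothetical helper, not the function the library calls.

```python
import math


def mann_whitney_u(pos, neg):
    """Two-sided Mann-Whitney U test using the normal approximation.

    Simplified for illustration: no tie or continuity correction, and
    both groups must be non-empty. Returns (u_statistic, p_value).
    """
    n1, n2 = len(pos), len(neg)
    # U counts the (pos, neg) pairs in which the positive sample is
    # larger; ties contribute one half.
    u = sum((x > y) + 0.5 * (x == y) for x in pos for y in neg)

    mean = n1 * n2 / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean) / sd
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail area
    return u, p_value


# Made-up normalized 'after' counts for samples with/without insertion.
pos_after = [5.0, 6.0, 7.0]
neg_after = [1.0, 2.0, 3.0]

u_stat, p_val = mann_whitney_u(pos_after, neg_after)
# Direction: 1 if expression is higher in samples with an insertion.
direction = 1 if u_stat > len(pos_after) * len(neg_after) / 2 else -1

print(u_stat, p_val, direction)
```

For small groups such as these, an exact test is preferable to the normal approximation; the approximation is used here only to keep the sketch free of external dependencies.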

class imfusion.expression.de_test.DeExonResult(sums, sample_split, exon_split, direction, p_value)

Class embodying the results of the groupwise exon-level DE test.

sums

pandas.DataFrame

DataFrame of before/after expression counts for all samples.

sample_split

tuple(List[str], List[str])

Split of samples into positive/negative samples.

exon_split

tuple

Split of exons into before/after groups.

direction

int

Direction of the differential expression (1 = positive, -1 = negative).

p_value

float

P-value of the differential expression test.

plot_boxplot(log=False, ax=None, show_points=True, **kwargs)

Plots boxplot of ‘after’ expression for samples with/without insertions in the gene.

plot_sums(log=False, **kwargs)

Plots the distribution of before/after counts for the samples.

imfusion.expression.de_test.de_exon_single(insertions, gene_id, insertion_id, dexseq_gtf, exon_counts)

Performs the single-sample exon-level differential expression test.

Tests if the expression of exons after the insertion site of the given sample is significantly increased or decreased compared to samples without an insertion. This test is performed by comparing the (normalized) after count of the given sample to a background distribution of normalized counts of samples without an insertion, which is modeled using a negative binomial distribution.

Note: this function requires Rpy2 to be installed, as R functions are used to fit the negative binomial distribution.

Parameters:
  • insertions (pandas.DataFrame) – DataFrame containing all insertions.
  • gene_id (str) – ID of the gene of interest. Should correspond with a gene in the DEXSeq gtf file.
  • insertion_id (str) – ID of the insertion of interest. Should correspond with an insertion in the list of insertions.
  • dexseq_gtf (GtfFile) – Gtf file containing exon features generated using DEXSeq. Can either be given as a GtfFile object or as a string specifying the path to the gtf file.
  • exon_counts (pandas.DataFrame) – DataFrame containing exon counts. The DataFrame is expected to contain samples as columns, and have a multi-index containing the chromosome, start, end and strand of the exon. This index should correspond with the annotation in the DEXSeq gtf. The samples should correspond with samples in the insertions frame.
Returns:

Result of the differential expression test.

Return type:

DeExonSingleResult
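The single-sample test compares one 'after' count against a negative binomial background. The sketch below illustrates that comparison using only the standard library. It swaps in a method-of-moments fit for the R-based fit that the library uses, reports only the upper tail, and uses made-up counts; nb_fit_moments, nb_logpmf and nb_sf are hypothetical helper names.

```python
import math
from statistics import mean, variance


def nb_logpmf(k, r, p):
    """Log-probability of k under a negative binomial with size r, prob p."""
    return (math.lgamma(k + r) - math.lgamma(k + 1) - math.lgamma(r)
            + r * math.log(p) + k * math.log1p(-p))


def nb_fit_moments(counts):
    """Method-of-moments estimate of (r, p); needs variance > mean."""
    m, v = mean(counts), variance(counts)
    if v <= m:
        raise ValueError("Method of moments requires variance > mean")
    return m * m / (v - m), m / v


def nb_sf(k, r, p):
    """P(X >= k) under the fitted background distribution."""
    cdf = sum(math.exp(nb_logpmf(i, r, p)) for i in range(k))
    return max(0.0, 1.0 - cdf)


# Made-up 'after' counts for samples without an insertion.
background = [10, 14, 8, 20, 12, 16, 9, 18]
r_hat, p_hat = nb_fit_moments(background)

# Made-up count for the sample carrying the insertion of interest.
sample_count = 40
p_upper = nb_sf(sample_count, r_hat, p_hat)

print(r_hat, p_hat, p_upper)
```

A small upper-tail probability suggests that the sample's expression after the insertion site is higher than expected from the background; testing for decreased expression would use the lower tail in the same way.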

class imfusion.expression.de_test.DeExonSingleResult(sums, sample_split, exon_split, nb_fit, direction, p_value)

Class containing the results of the single-sample exon-level DE test.

sums

pandas.DataFrame

DataFrame of before/after expression counts for all samples.

sample_split

tuple(List[str], List[str])

Split of samples into positive/negative samples.

exon_split

tuple

Split of exons into before/after groups.

nb_fit

imfusion.expression.de_test.stats.NegativeBinomial

Fitted negative binomial background distribution.

direction

int

Direction of the differential expression (1 = positive, -1 = negative).

p_value

float

P-value of the differential expression test.

plot_fit(ax=None)

Plots the sample expression on the background distribution.

plot_sums(log=False, **kwargs)

Plots the distribution of before/after counts for the samples.

imfusion.merge

The merge module contains functions for merging the results of the individual sample analyses (the insertions and expression counts) into a single combined dataset. This combined dataset is used as input for the CTG and differential expression analysis.

imfusion.merge.merge_samples(dir_paths, samples=None, with_expression=True)

Merges samples in dir_paths to a single insertions/exon counts frame.

Parameters:
  • dir_paths (List[pathlib.Path]) – Paths to the sample directories.
  • samples (List[str]) – Samples to subset the results to.
  • with_expression (bool) – Whether to include expression.
Returns:

Two DataFrames respectively containing the merged insertions and the merged exon counts. If with_expression is False, the merged counts frame is returned as None.

Return type:

tuple(pandas.DataFrame, pandas.DataFrame)

imfusion.model

Two model classes, Fusion and Insertion, are used to represent fusions and insertions respectively. These classes are mainly used to track which attributes fusions and insertions have and to convert between model instances and DataFrame representations.
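The conversion between model instances and tabular records can be pictured with a small named-tuple sketch. This is an illustration only: InsertionRecord is a hypothetical stand-in carrying a subset of the Insertion fields, and to_rows/from_rows are made-up helpers, not the library's conversion methods.

```python
from typing import NamedTuple, Optional


class InsertionRecord(NamedTuple):
    """Illustrative stand-in for a subset of the Insertion fields."""
    id: str
    seqname: str
    position: int
    strand: int
    sample_id: str
    # Optional annotation fields default to None when unavailable.
    gene_id: Optional[str] = None
    gene_name: Optional[str] = None


def to_rows(records):
    """Convert records to a list of dicts (one dict per table row)."""
    return [dict(rec._asdict()) for rec in records]


def from_rows(rows):
    """Rebuild records from row dicts."""
    return [InsertionRecord(**row) for row in rows]


records = [
    InsertionRecord("INS_1", "1", 182_000, 1, "sample_a", gene_name="Foxf2"),
    InsertionRecord("INS_2", "5", 28_100, -1, "sample_b"),
]

rows = to_rows(records)
print(rows[0]["gene_name"], rows[1]["gene_id"])
```

Row dicts of this shape can be passed to pandas.DataFrame.from_records, and DataFrame.to_dict(orient="records") produces the same shape for the reverse direction.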

class imfusion.model.Fusion

Class representing a gene-transposon fusion.

Used by fusion identification tools (such as Tophat2) to return the fusions that are identified. Not all fields are required if these are not available. However, the following fields should at least be provided: seqname, anchor_genome, anchor_transposon, strand_genome, strand_transposon.

seqname

str

Chromosome involved in the fusion.

anchor_genome

int

Genomic fusion breakpoint.

anchor_transposon

int

Transposon fusion breakpoint.

strand_genome

int

Strand of fusion in genome (-1 or 1)

strand_transposon

int

Strand of fusion in transposon (-1 or 1)

flank_genome

int

Size of flanking region on genome.

flank_transposon

int

Size of flanking region in transposon.

gene_id

str

ID of affected gene.

gene_name

str

Name of affected gene.

gene_strand

int

Strand of affected gene.

feature_name

str

Name of affected transposon feature.

feature_type

str

Feature type (SD/SA).

feature_strand

int

Strand of affected transposon feature.

spanning_reads

int

Number of supporting single-end reads.

supporting_mates

int

Number of mate pairs that support the fusion, but do not span the breakpoint with either mate.

supporting_spanning_mates

int

Number of mate pairs that support the fusion and have at least one mate spanning the breakpoint.

class imfusion.model.Insertion

Class representing an RNA-seq transposon insertion site.

Used to represent insertions derived from RNA-seq fusions. Not all fields are required, though at least the following should be specified: id, seqname, position, strand and sample_id.

id

str

ID of the insertion.

seqname

str

Chromosome of the insertion.

position

int

Genomic position of the insertion.

strand

int

Strand of the insertion (-1 or 1).

sample_id

str

Sample in which the insertion was identified.

gene_id

str

ID of the gene involved in the fusion.

gene_name

str

Name of the gene involved in the fusion.

gene_strand

int

Strand of the gene involved in the fusion.

orientation

str

Relative orientation of the insertion.

feature_name

str

Name of the transposon feature involved in the fusion.

feature_strand

int

Strand of transposon feature involved in the fusion.

anchor_genome

int

Genomic fusion breakpoint.

anchor_transposon

int

Transposon fusion breakpoint.

flank_genome

int

Size of flanking region on genome.

flank_transposon

int

Size of flanking region in transposon.

spanning_reads

int

Number of supporting single-end reads.

supporting_mates

int

Number of mate pairs that support the fusion, but do not span the breakpoint with either mate.

spanning_mates

int

Number of mate pairs that support the fusion and have at least one mate spanning the breakpoint.

imfusion.util

The util module contains various helper modules shared between different parts of im-fusion. The most important submodules are fusions, insertions and reference, which contain functions that are used by the aligners (currently only Tophat2) to generate augmented reference genomes and identify insertion sites.

imfusion.util.check

Utility functions for checking the validity of inputs.

imfusion.util.check.check_features(transposon_features)

Checks if a transposon feature frame is valid.

imfusion.util.fusions

Utility functions for annotating fusions and converting fusions into insertions by determining the approximate position of the corresponding insertion in the genome (‘placing’ the fusion).

imfusion.util.fusions.annotate_fusions(fusions, reference_gtf, transposon_features)

Annotates fusions with gene and transposon features.

Main function for annotating identified gene-transposon fusions. Adds the following annotations to fusions: gene features, transposon features and the relative orientation of the corresponding insertion with respect to the identified gene.

Parameters:
  • fusions (List[Fusion]) – Fusions to annotate.
  • reference_gtf (GtfFile) – GtfFile instance containing reference gene features.
  • transposon_features (pandas.DataFrame) – Dataframe containing positions for the features present in the transposon sequence. Used to identify transposon features (such as splice acceptors or donors) that are involved in the identified fusions.
Yields:

Fusion – Next fusion, annotated with gene/transposon features.

imfusion.util.fusions.place_fusions(fusions, sample_id, reference_gtf, offset=20, max_dist=5000)

Derives insertions by placing fusions at approximate genomic locations.

Main function for deriving insertions from annotated gene-transposon fusions. Derives insertions by determining an approximate genomic location that is compatible with the gene/transposon feature annotations of the fusions. Fusions are therefore expected to be properly annotated for gene/transposon features.

An insertion is essentially ‘placed’ by looking for the first genomic position that does not overlap with a reference feature, in the direction that is compatible with the insertion's orientation w.r.t. its target gene.

Parameters:
  • fusions (List[Fusion]) – List of fusions to convert.
  • sample_id (str) – Sample id that should be used for the insertions.
  • reference_gtf (GtfFile) – GtfFile containing the reference features. Expected to conform to the Ensembl reference gtf format.
  • offset (int) – Minimum offset of the transposon to the closest reference gene feature.
  • max_dist (int) – Maximum distance that an insertion may be placed from the genomic anchor of the fusion.
Yields:

Insertion – Next insertion derived from the given fusions.
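The placement rule described above (walk away from the genomic anchor until a position is clear of reference features) can be sketched as a simple linear scan. This is a simplified illustration under stated assumptions: features are inclusive (start, end) intervals on one chromosome, the direction is already known as 1 or -1, and place_position is a hypothetical helper, not the library's algorithm.

```python
def place_position(anchor, direction, features, offset=20, max_dist=5000):
    """Find the first position, walking from anchor in the given direction,
    that lies at least `offset` bp away from every feature.

    Returns None if no such position exists within `max_dist` bp.
    Hypothetical helper for illustration; not part of imfusion.
    """
    for dist in range(max_dist + 1):
        pos = anchor + direction * dist
        clear = all(pos <= start - offset or pos >= end + offset
                    for start, end in features)
        if clear:
            return pos
    return None


# One exon spanning 100-200; the fusion anchor lies inside it.
exons = [(100, 200)]

print(place_position(150, 1, exons))    # walking downstream
print(place_position(150, -1, exons))   # walking upstream
print(place_position(150, 1, exons, max_dist=50))  # too far away
```

A real implementation would query an indexed annotation for nearby features instead of scanning base by base, but the stopping conditions (offset and max_dist) play the same role.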

imfusion.util.insertions

Utility functions for filtering invalid/unwanted insertions.

imfusion.util.insertions.filter_invalid_insertions(insertions)

Filters invalid insertions.

Main function for filtering invalid insertions. Effectively applies both the filter_wrong_orientation and filter_unexpected_sites filters to remove insertions that have the wrong orientation (w.r.t. the transposon feature and the annotated gene) or involve features of the transposon that we are not interested in (typically non-splice acceptor/donor features).

Parameters:
  • insertions (List[Insertion]) – Insertions to filter.
Yields:

Insertion – Next filtered insertion.

imfusion.util.insertions.filter_wrong_orientation(insertions, drop_na=False)

Filters insertions with wrong feature orientations w.r.t. their genes.

This filter removes any insertions with a transposon feature that is in the wrong orientation with respect to the annotated gene. This is based on the premise that, for example, a splice acceptor can only splice to a gene that is in the same orientation as the acceptor.

Parameters:
  • insertions (List[Insertion]) – Insertions to filter.
Yields:

Insertion – Next filtered insertion.

imfusion.util.insertions.filter_unexpected_sites(insertions)

Filters insertions that have non splice-acceptor/donor features.

This filter removes any insertions that splice to transposon features that aren’t splice-acceptors or splice-donors. This is based on the premise that these other sites are unlikely to be involved in any splicing and that therefore these insertions are likely to be false positives of the fusion identification.

Parameters:
  • insertions (List[Insertion]) – Insertions to filter.
Yields:

Insertion – Next filtered insertion.
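Because the orientation and site filters both yield insertions lazily, they can be chained as generators. The sketch below illustrates this with plain dicts. It rests on assumptions that the text above does not confirm: the field names are made up for the example, and a feature is treated as compatible when the product of the insertion strand and the feature strand equals the gene strand. The function names are hypothetical, not the library's.

```python
def drop_wrong_orientation(insertions):
    """Yield insertions whose transposon feature matches the gene strand.

    Assumption: the feature's genomic orientation is the product of the
    insertion strand and the feature strand (both -1 or 1).
    """
    for ins in insertions:
        if ins["strand"] * ins["feature_strand"] == ins["gene_strand"]:
            yield ins


def drop_unexpected_sites(insertions, allowed=("SA", "SD")):
    """Yield insertions that involve a splice acceptor or donor."""
    for ins in insertions:
        if ins["feature_type"] in allowed:
            yield ins


# Made-up records; only the fields used by the filters are included.
candidates = [
    {"id": "a", "strand": 1, "feature_strand": 1, "gene_strand": 1,
     "feature_type": "SA"},
    {"id": "b", "strand": 1, "feature_strand": -1, "gene_strand": 1,
     "feature_type": "SA"},   # feature points away from the gene
    {"id": "c", "strand": -1, "feature_strand": -1, "gene_strand": 1,
     "feature_type": "promoter"},  # not a splice site
]

# Chain the generators; nothing is evaluated until the list is built.
kept = list(drop_unexpected_sites(drop_wrong_orientation(candidates)))
print([ins["id"] for ins in kept])
```

Chaining generators keeps memory use flat, since each insertion passes through both filters before the next one is read.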
imfusion.util.insertions.filter_blacklist(insertions, gene_ids, reference_gtf=None, filter_overlap=True)

Filters insertions for blacklisted genes.

Parameters:
  • insertions (List[Insertion]) – Insertions to filter.
  • gene_ids (set[str]) – IDs of the blacklisted genes.
  • reference_gtf (GtfFile) – GtfFile instance containing reference gene features. (Only needed if filter_overlap is True).
  • filter_overlap (bool) – Whether to filter any insertions overlapping with the listed genes (default = True). If False, only insertions explicitly splicing to the listed genes are filtered.
Yields:

Insertion – Next filtered insertion.

imfusion.util.reference

Utility functions used for generating augmented reference genomes.

imfusion.util.reference.concatenate_fastas(fasta_paths, output_path)

Concatenates multiple fasta files into a single file.

This function combines multiple fasta files into a single output file. It is mainly used to generate the combined reference genome that contains both the reference genome sequence and the transposon sequence.

Parameters:
  • fasta_paths (List[pathlib.Path]) – Paths to fasta files that should be concatenated.
  • output_path (pathlib.Path) – Path for the combined output file.

imfusion.util.reference.mask_regions(refseq_path, blacklist_regions)

Masks blacklisted regions in a given reference sequence.

This function removes blacklisted regions from a given reference genome. Blacklisted regions are removed by replacing their original sequence with a sequence of ‘N’ nucleotides.

Parameters:
  • refseq_path (pathlib.Path) – Path to the reference sequence in fasta format.
  • blacklist_regions (List[tuple(str, int, int)]) – List of regions that should be blacklisted in the reference sequence. Regions should be specified as a tuple of (chromosome, start_position, end_position). For example: ('1', 2000, 2200).
Returns:

Path to edited reference sequence.

Return type:

pathlib.Path
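Masking amounts to overwriting each blacklisted interval with 'N' characters of the same length, so that coordinates outside the region are unchanged. The sketch below works on in-memory strings rather than fasta files and assumes 0-based, half-open coordinates; the coordinate convention used by the library is not stated here, and mask_reference is a hypothetical helper.

```python
def mask_reference(reference, blacklist_regions):
    """Return a copy of `reference` with blacklisted regions set to 'N'.

    reference: dict mapping chromosome name to sequence string.
    blacklist_regions: list of (chromosome, start, end) tuples, assumed
    here to be 0-based and half-open (illustrative convention only).
    """
    masked = {chrom: list(seq) for chrom, seq in reference.items()}

    for chrom, start, end in blacklist_regions:
        chars = masked[chrom]
        # Clip the region to the chromosome boundaries.
        start, end = max(0, start), min(len(chars), end)
        # Same-length replacement keeps downstream coordinates intact.
        chars[start:end] = "N" * max(0, end - start)

    return {chrom: "".join(chars) for chrom, chars in masked.items()}


toy_reference = {"1": "ACGTACGT", "2": "TTTT"}
toy_masked = mask_reference(toy_reference, [("1", 2, 5)])

print(toy_masked["1"])
print(toy_masked["2"])
```

Replacing bases instead of deleting them matters because gene annotations and insertion positions refer to coordinates in the original reference.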

imfusion.util.reference.blacklist_for_regions(region_strs)

Builds blacklist region list for region strings.

Parses region strings into a list of blacklist region tuples. Region strings should be provided in the following format: ‘chromosome:start-end’. For example, ‘X:1000-2000’ denotes a region on chromosome X from position 1000 to 2000.

Parameters:
  • region_strs (List[str]) – List of region strings.
Returns:

List of blacklist region tuples.

Return type:

List[tuple(str, int, int)]
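Parsing the ‘chromosome:start-end’ format takes only a couple of string splits. The following is a minimal sketch of such a parser; parse_region and parse_regions are hypothetical names, and splitting on the last colon (so that chromosome names containing colons still parse) is a design choice of this example, not documented behaviour.

```python
def parse_region(region_str):
    """Parse 'chromosome:start-end' into (chromosome, start, end)."""
    # Split on the last colon so the chromosome name may contain colons.
    chrom, _, span = region_str.rpartition(":")
    start, _, end = span.partition("-")
    if not chrom or not start or not end:
        raise ValueError("Invalid region string: {!r}".format(region_str))
    # int() raises ValueError for non-numeric positions.
    return chrom, int(start), int(end)


def parse_regions(region_strs):
    """Parse a list of region strings into a list of region tuples."""
    return [parse_region(s) for s in region_strs]


print(parse_regions(["X:1000-2000", "1:2000-2200"]))
```

The resulting tuples have the same (chromosome, start_position, end_position) shape that the blacklist_regions parameters expect.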
imfusion.util.reference.blacklist_for_genes(gene_ids, reference_gtf)

Builds blacklist region list for given genes.

Returns a list of blacklist regions encompassing the regions spanned by the genes corresponding to the given gene ids.

Parameters:
  • gene_ids (List[str]) – List of (Ensembl) gene ids.
  • reference_gtf (GtfFile) – GtfFile instance containing reference gene features.
Returns:

List of blacklist region tuples.

Return type:

List[tuple(str, int, int)]

imfusion.util.tabix

Utility classes used for fast access to Gtf and Bed files.

class imfusion.util.tabix.GtfFile(file_path)

classmethod compress(file_path, out_path=None, sort=True, create_index=True)

Compresses and indexes a gtf file using bgzip and tabix.

fetch(reference=None, start=None, end=None, filters=None, incl_left=True, incl_right=True)

Fetches records for the given region.

get_gene(gene_id, feature_type='gene', field_name='gene_id', **kwargs)

Fetches a given gene by id.

get_region(reference=None, start=None, end=None, filters=None, incl_left=True, incl_right=True)

Fetches DataFrame of features for the given region.

classmethod sort(file_path, out_path)

Sorts a gtf file by position, required for indexing by tabix.