API¶
imfusion.aligners¶
The aligners module contains the different aligners that can be used to identify insertion sites from gene-transposon fusions. Currently only Tophat2 is supported, though other aligners may be added in the future. The aligner modules only implement functionality that is specific to the aligner. Generic code for building reference genomes and identifying insertions is contained in the util modules util.fusions, util.insertions and util.reference.
imfusion.aligners.tophat2¶
The tophat2 submodule contains the functionality that is used to identify insertions from gene-transposon fusions using Tophat2. The module contains two main functions: build_reference and identify_insertions. The function build_reference is used to build an augmented reference genome and the corresponding indices for Tophat2. This augmented reference can then be supplied to identify_insertions to perform the actual insertion identification.
-
imfusion.aligners.tophat2.build_reference(ref_seq_path, ref_gtf_path, tr_seq_path, output_path, blacklist_regions=None, create_index=True, create_transcriptome_index=True)¶ Builds an augmented reference genome for Tophat2.
This is the main function responsible for building augmented reference genomes, which are used as a reference for Tophat2 when identifying transposon insertions in RNA-seq samples from an insertional mutagenesis screen. An augmented reference contains an extra sequence which reflects the sequence of the transposon that was used in the screen.
Optionally, specific regions of the original reference can be blacklisted if these regions contain sequences that are also in the transposon sequence, as these may be problematic in the alignment. For example, in our Sleeping Beauty screen we masked regions in the Foxf2 and in En2 genes, as the T2onc2 transposon we used contains sequences from these genes. Sequences in blacklisted regions are replaced by Ns, effectively removing the regions from the reference.
Parameters: - ref_seq_path (pathlib.Path) – Path to the fasta file containing the reference genome.
- ref_gtf_path (pathlib.Path) – Path to the gtf file containing the genes of the reference genome. Assumed to have the same structure as Ensembl's reference GTF files.
- tr_seq_path (pathlib.Path) – Path to the fasta file containing the transposon sequence.
- output_path (pathlib.Path) – Output path for the augmented reference genome (which is written as a new fasta file).
- blacklist_regions (List[tuple(str, int, int)]) – List of regions that should be blacklisted in the augmented reference genome. Regions are specified as a tuple of (chromosome, start_position, end_position). For example: (‘1’, 2000, 2200).
- create_index (bool) – Whether a bowtie index should be created.
- create_transcriptome_index (bool) – Whether a Tophat2 transcriptome index should be created.
-
imfusion.aligners.tophat2.identify_insertions(fastqs, index_path, reference_gtf_path, transposon_name, transposon_features, sample_id, work_dir, min_flank, tophat_kws=None, transcriptome_index=None)¶ Identifies insertions from RNA-seq fusions using Tophat2.
Main function for identifying fusions from RNA-seq fastq files using Tophat2. The function essentially consists of four main steps:
- The identification of gene-transposon fusions using Tophat2
- Annotation of the found fusions for gene/transposon features
- Deriving approximate locations for the corresponding insertions.
- Filtering of fusions that are biologically implausible (for example due to their relative orientation)
The function returns the list of insertions that were identified by Tophat2. The generated alignment is also symlinked into the work directory as ‘alignment.bam’ for convenient access.
Parameters: - fastqs (list[pathlib.Path] or list[tuple(pathlib.Path, pathlib.Path)]) – Paths to the fastq files that should be used for the Tophat2 alignment. Can be given as a list of file paths for single-end sequencing data, or a list of path tuples for paired-end sequencing data. The fastqs are treated as belonging to a single sample.
- index_path (pathlib.Path) – Path to the bowtie index of the (augmented) genome that should be used in the alignment. This index is typically generated by the build_reference function.
- reference_gtf_path (pathlib.Path) – Path to the gtf file containing genomic features. This file is used by Tophat2 for known gene features and for the annotation of gene features for identified fusions.
- transposon_name (str) – Name of the transposon sequence in the augmented reference genome.
- transposon_features (pandas.DataFrame) – Dataframe containing positions for the features present in the transposon sequence. Used to identify transposon features (such as splice acceptors or donors) that are involved in the identified fusions.
- sample_id (str) – Sample name that the identified insertions should be assigned to.
- work_dir (pathlib.Path) – Path to the working directory.
- min_flank (int) – Minimum amount of flanking region that should be surrounding the fusion. Used by Tophat2 in its identification of fusions during the alignment.
- tophat_kws (dict) – Dict of extra arguments for Tophat2.
Yields: Insertion – Next insertion that was identified in the given sample.
-
imfusion.aligners.tophat2.get_version(path=None)¶ Get the version of Tophat2 in path.
Parameters: path (pathlib.Path) – Path to use for the Tophat2 executable.
Returns: Version of the Tophat2 executable in path.
Return type: str
-
imfusion.aligners.tophat2.get_bowtie_version(path=None)¶ Get the version of Bowtie in path.
Parameters: path (pathlib.Path) – Path to use for the bowtie executable.
Returns: Version of the Bowtie executable in path.
Return type: str
imfusion.ctg¶
The ctg module contains the functions for identifying commonly targeted genes (CTGs) from a collection of insertions from multiple samples. The module contains two main functions: test_ctgs and test_de. The test_ctgs function performs the actual enrichment test and returns CTGs and their corresponding (corrected) p-values. The function test_de takes the CTG frame and tests each of the genes for differential expression, filtering out CTGs that are not significantly differentially expressed.
-
imfusion.ctg.test_ctgs(insertion_frame, reference_seq, reference_gtf, chromosomes=None, pattern=None, gene_ids=None, per_sample=True, window=None, threshold=0.05)¶ Identifies genes that are significantly enriched for insertions (CTGs).
This function takes a DataFrame of insertions, coming from multiple samples, and identifies if any genes are more frequently affected by an insertion than would be expected by chance. These genes are called Commonly Targeted Genes (CTGs). CTGs are selected by comparing the number of insertions within the gene to the number of insertions that would be expected from the background insertion rate, which is modeled using a Poisson distribution.
Parameters: - insertion_frame (pd.DataFrame) – Insertions to test (in DataFrame format).
- reference_seq (pyfaidx.Fasta) – Fasta sequence of the reference genome.
- reference_gtf (GtfFile) – GtfFile containing reference genes.
- chromosomes (list[str]) – List of chromosomes to include, defaults to all chromosomes in reference_gtf.
- pattern (str) – Specificity pattern of the used transposon.
- gene_ids (list[str]) – List of genes to test (defaults to all genes with an insertion).
- per_sample (bool) – Whether to perform the per sample test (recommended), which effectively collapses insertions per sample/gene combination. This avoids issues in which insertions that are detected multiple times or that may have hopped inside the gene locus are counted multiple times.
- window (tuple(int, int)) – Window to include around gene (in bp). Specified as (upstream_dist, downstream_dist). For example: (-2000, 2000) specifies a 2 kb window on either side of each gene.
- threshold (float) – Maximum p-value for selected CTGs.
Returns: Results of CTG test for tested genes. Contains three columns: gene_id, p_val and p_val_corr. The last column, p_val_corr, represents the p-value of the gene after correcting for multiple testing using Bonferroni correction.
Return type: pandas.DataFrame
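The enrichment test described above can be illustrated with a minimal, standard-library-only sketch. It is not the implementation used by test_ctgs: the helper names `poisson_sf` and `bonferroni`, the gene names, the counts and the expected rates are all hypothetical, and the sketch assumes that the expected number of insertions per gene has already been derived from the background insertion rate and that the threshold is applied to the corrected p-value.

```python
import math


def poisson_sf(k, lam):
    """Return P(X >= k) for X ~ Poisson(lam)."""
    # Sum P(X = i) for all i < k, then take the complement.
    cdf = sum(math.exp(-lam) * lam ** i / math.factorial(i) for i in range(k))
    return max(0.0, 1.0 - cdf)


def bonferroni(p_values):
    """Bonferroni correction: multiply by the number of tests, cap at 1."""
    n = len(p_values)
    return [min(1.0, p * n) for p in p_values]


# Hypothetical input: gene -> (observed insertions, expected under background).
observed = {"geneA": (8, 1.0), "geneB": (1, 1.0), "geneC": (3, 2.5)}

genes = list(observed)
p_vals = [poisson_sf(k, lam) for k, lam in observed.values()]
p_corr = bonferroni(p_vals)

# Assumption: genes are selected on their corrected p-value.
threshold = 0.05
ctgs = [gene for gene, p in zip(genes, p_corr) if p <= threshold]
```

In this toy example only geneA, with eight insertions against an expectation of one, passes the threshold after correction.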
-
imfusion.ctg.test_de(ctgs, insertions, dexseq_gtf, exon_counts_path, threshold=0.05)¶ Tests identified CTGs for differential expression.
This function takes the CTG frame produced by test_ctgs and tests each of the identified CTGs for differential expression using the groupwise exon-level differential expression test (de_exon). The resulting DE p-values are added to the DataFrame and CTGs that are not differentially expressed are dropped.
Parameters: - ctgs (pandas.DataFrame) – DataFrame containing the identified CTGs (as generated by test_ctgs).
- insertions (List[Insertion]) – List of insertions to use in the test. Should be the same insertions as used to identify CTGs.
- dexseq_gtf (imfusion.util.tabix.GtfFile) – GtfFile instance containing the flattened exon representation of the original reference_gtf. The corresponding gtf file is typically generated using DEXSeq's script for preparing exon annotations.
- exon_counts_path (pathlib.Path) – Path to the file containing exon counts for all samples.
- threshold (float) – Maximum p-value for differential expression.
Returns: CTG DataFrame containing the differential expression test results.
Return type: pandas.DataFrame
imfusion.expression¶
The expression module contains functionality for generating expression counts and testing for differential expression. The counts submodule handles the count generation using the featureCounts tool. The de_test submodule implements the various differential expression tests.
imfusion.expression.counts¶
-
imfusion.expression.counts.exon_counts(bam_files, gff_path, names=None, extra_kws=None, **kwargs)¶ Generates exon counts for given bam files using featureCounts.
This function is used to generate an m-by-n matrix (m = number of exons, n = number of samples) of exon expression counts. This matrix is generated using featureCounts, whose results are then parsed and returned.
Parameters: - bam_files (list[pathlib.Path]) – List of paths to the bam files for which counts should be generated.
- gff_path (pathlib.Path) – Path to the gene feature file containing gene features.
- names (dict[str, str]) – Alternative names to use for the given bam files. Keys of the dict should correspond to bam file paths, values should reflect the sample names that should be used in the resulting count matrix.
- extra_kws (dict[str, any]) – Dictionary of extra arguments that should be passed to feature counts. Keys should correspond to argument names (including dashes), values should correspond to the argument value. Arguments without values (flags) should be given with the boolean value True.
- **kwargs – Any kwargs are passed to feature_counts.
Returns: DataFrame containing counts. The index of the DataFrame contains gene ids corresponding to exons in the gff file, the columns correspond to samples/bam files. Column names are either the bam file paths, or the alternative sample names if given.
Return type: pandas.DataFrame
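The extra_kws convention described above (keys are argument names including dashes, flags are given as True) can be sketched as follows. This is an illustrative helper, not the library's code: the name `kws_to_args` is hypothetical, and skipping options whose value is False or None is an assumption rather than documented behaviour.

```python
def kws_to_args(extra_kws):
    """Flatten a dict of extra options into a command-line argument list."""
    args = []
    for name, value in extra_kws.items():
        if value is True:
            # Flag without a value: only the argument name is emitted.
            args.append(name)
        elif value is False or value is None:
            # Assumption: disabled or unset options are left out entirely.
            continue
        else:
            args.extend([name, str(value)])
    return args


# -T (threads), -p (paired-end) and --minOverlap are featureCounts options.
example_kws = {"-T": 4, "-p": True, "--minOverlap": 10}
example_args = kws_to_args(example_kws)
```

With the dictionary above, the flattened list is `-T 4 -p --minOverlap 10`, in the insertion order of the dict.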
-
imfusion.expression.counts.feature_counts(bam_files, gff_path, names=None, extra_kws=None, tmp_dir=None, keep_tmp=False)¶ Runs featureCounts on bam files with given options.
Main function used to run featureCounts. Used by gene_counts and exon_counts to generate expression counts.
Parameters: - bam_files (list[pathlib.Path]) – List of paths to the bam files for which counts should be generated.
- gff_path (pathlib.Path) – Path to the gff file containing gene features.
- names (dict[str, str]) – Dictionary with sample names, used to rename columns from file paths to sample names. Keys of the Dictionary should correspond with the bam file paths, values should reflect the desired sample name for the respective bam file.
- extra_kws (dict[str, any]) – Dictionary containing extra command line arguments that should be passed to featureCounts.
- tmp_dir (pathlib.Path) – Temp directory to use for the generated counts.
- keep_tmp (bool) – Whether to keep the temp directory (default = False).
Returns: DataFrame containing feature counts for the given bam files. The rows correspond to the counted features, the columns correspond to the index values (chromosome, position etc.) and the bam files.
Return type: pandas.DataFrame
imfusion.expression.de_test¶
-
imfusion.expression.de_test.de_exon(insertions, gene_id, dexseq_gtf, exon_counts, pos_samples=None, neg_samples=None)¶ Performs the groupwise exon-level differential expression test.
Tests if the expression of exons after the insertion site(s) in a gene is significantly increased or decreased in samples with an insertion (pos_samples) compared to samples without an insertion (neg_samples). The test is performed by comparing normalized counts after the insertion sites between samples with and without an insertion in the gene, using the non-parametric Mann-Whitney-U test.
Note that the before/after split for the groupwise test is taken as the common set of before/after exons over all samples with an insertion. In cases where either set is empty, for example due to insertions before the first exon of the gene, we attempt to drop samples that prevent a proper split and perform the test without these samples.
Parameters: - insertions (pandas.DataFrame) – DataFrame containing all insertions.
- gene_id (str) – ID of the gene of interest. Should correspond with a gene in the DEXSeq gtf file.
- dexseq_gtf (GtfFile) – Gtf file containing exon features generated using DEXSeq. Can either be given as a GtfFile object or as a string specifying the path to the gtf file.
- exon_counts (pandas.DataFrame or pathlib.Path) – DataFrame containing exon counts. The DataFrame is expected to contain samples as columns, and have a multi-index containing the chromosome, start, end and strand of the exon. This index should correspond with the annotation in the DEXSeq gtf. The samples should correspond with samples in the insertions frame. If a Path is given, it should point to a TSV file containing the counts.
- pos_samples (set[str]) – Set of positive samples (with insertion) to use in the test. Defaults to all samples with an insertion in the gene of interest.
- neg_samples (set[str]) – Set of negative samples (without insertion) to use in the test. Defaults to all samples not in the positive set.
Returns: Result of the differential expression test.
Return type: DeExonResult
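The comparison at the heart of the groupwise test can be sketched with a small, standard-library-only Mann-Whitney U implementation. This is a simplified illustration rather than the code used by de_exon: the function name `mann_whitney_u` and the example counts are hypothetical, the p-value uses the normal approximation without tie or continuity correction, and the normalization and before/after exon split are assumed to have been done already.

```python
import math


def mann_whitney_u(pos, neg):
    """Two-sided Mann-Whitney U test (normal approximation).

    pos/neg: normalized 'after' counts for samples with/without an insertion.
    Returns (U statistic for pos, direction, p-value).
    """
    n1, n2 = len(pos), len(neg)

    # U counts the pairs in which the positive sample is larger;
    # ties contribute one half.
    u_pos = sum(
        1.0 if x > y else 0.5 if x == y else 0.0
        for x in pos
        for y in neg
    )

    # Mean and standard deviation of U under the null hypothesis.
    mean = n1 * n2 / 2.0
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)

    z = (u_pos - mean) / sd
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided

    direction = 1 if u_pos >= mean else -1
    return u_pos, direction, p_value


# Hypothetical normalized counts after the insertion site.
pos_after = [120.0, 150.0, 98.0, 143.0]
neg_after = [30.0, 45.0, 28.0, 51.0, 39.0]
u_stat, direction, p_value = mann_whitney_u(pos_after, neg_after)
```

For small sample groups an exact test is generally preferable to the normal approximation; established implementations such as scipy.stats.mannwhitneyu handle this, along with ties.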
-
class imfusion.expression.de_test.DeExonResult(sums, sample_split, exon_split, direction, p_value)¶ Class embodying the results of the groupwise exon-level DE test.
-
sums¶ pandas.DataFrame
DataFrame of before/after expression counts for all samples.
-
sample_split¶ tuple(List[str], List[str])
Split of samples into positive/negative samples.
-
exon_split¶ tuple
Split of exons into before/after groups.
-
direction¶ int
Direction of the differential expression (1 = positive, -1 = negative).
-
p_value¶ float
P-value of the differential expression test.
-
plot_boxplot(log=False, ax=None, show_points=True, **kwargs)¶ Plots boxplot of ‘after’ expression for samples with/without insertions in the gene.
-
plot_sums(log=False, **kwargs)¶ Plots the distribution of before/after counts for the samples.
-
-
imfusion.expression.de_test.de_exon_single(insertions, gene_id, insertion_id, dexseq_gtf, exon_counts)¶ Performs the single-sample exon-level differential expression test.
Tests if the expression of exons after the insertion site of the given sample is significantly increased or decreased compared to samples without an insertion. This test is performed by comparing the (normalized) after count of the given sample to a background distribution of normalized counts of samples without an insertion, which is modeled using a negative binomial distribution.
Note: this function requires Rpy2 to be installed, as R functions are used to fit the negative binomial distribution.
Parameters: - insertions (pandas.DataFrame) – DataFrame containing all insertions.
- gene_id (str) – ID of the gene of interest. Should correspond with a gene in the DEXSeq gtf file.
- insertion_id (str) – ID of the insertion of interest. Should correspond with an insertion in the list of insertions.
- dexseq_gtf (GtfFile) – Gtf file containing exon features generated using DEXSeq. Can either be given as a GtfFile object or as a string specifying the path to the gtf file.
- exon_counts (pandas.DataFrame) – DataFrame containing exon counts. The DataFrame is expected to contain samples as columns, and have a multi-index containing the chromosome, start, end and strand of the exon. This index should correspond with the annotation in the DEXSeq gtf. The samples should correspond with samples in the insertions frame.
Returns: Result of the differential expression test.
Return type: DeExonSingleResult
-
class imfusion.expression.de_test.DeExonSingleResult(sums, sample_split, exon_split, nb_fit, direction, p_value)¶ Class containing the results of the single-sample exon-level DE test.
-
sums¶ pandas.DataFrame
DataFrame of before/after expression counts for all samples.
-
sample_split¶ tuple(List[str], List[str])
Split of samples into positive/negative samples.
-
exon_split¶ tuple
Split of exons into before/after groups.
-
nb_fit¶ imfusion.expression.de_test.stats.NegativeBinomial
Fit negative-binomial background distribution.
-
direction¶ int
Direction of the differential expression (1 = positive, -1 = negative).
-
p_value¶ float
P-value of the differential expression test.
-
plot_fit(ax=None)¶ Plots the sample expression on the background distribution.
-
plot_sums(log=False, **kwargs)¶ Plots the distribution of before/after counts for the samples.
-
imfusion.merge¶
The merge module contains functions for merging the results of the individual sample analyses (the insertions and expression counts) into a single combined dataset. This combined dataset is used as input for the CTG and differential expression analysis.
-
imfusion.merge.merge_samples(dir_paths, samples=None, with_expression=True)¶ Merges samples in dir_paths to a single insertions/exon counts frame.
Parameters: - dir_paths (List[pathlib.Path]) – Paths to the sample directories.
- samples (List[str]) – Samples to subset the results to.
- with_expression (bool) – Whether to include expression.
Returns: Two DataFrames respectively containing the merged insertions and the merged exon counts. If with_expression is False, the merged counts frame is returned as None.
Return type: tuple(pandas.DataFrame, pandas.DataFrame)
imfusion.model¶
Two model classes, Fusion and Insertion, are used to represent fusions and insertions respectively. These classes are mainly used to track which attributes fusions and insertions have and to convert between model instances and DataFrame representations.
-
class imfusion.model.Fusion¶ Class representing a gene-transposon fusion.
Used by fusion identification tools (such as Tophat2) to return the fusions that are identified. Fields that are not available may be left unspecified. However, at least the following fields should be provided: seqname, anchor_genome, anchor_transposon, strand_genome, strand_transposon.
-
seqname¶ str
Chromosome involved in the fusion.
-
anchor_genome¶ int
Genomic fusion breakpoint.
-
anchor_transposon¶ int
Transposon fusion breakpoint.
-
strand_genome¶ int
Strand of fusion in genome (-1 or 1)
-
strand_transposon¶ int
Strand of fusion in transposon (-1 or 1)
-
flank_genome¶ int
Size of flanking region on genome.
-
flank_transposon¶ int
Size of flanking region in transposon.
-
gene_id¶ str
ID of affected gene.
-
gene_name¶ str
Name of affected gene.
-
gene_strand¶ int
Strand of affected gene.
-
feature_name¶ str
Name of affected transposon feature.
-
feature_type¶ str
Feature type (SD/SA).
-
feature_strand¶ int
Strand of affected transposon feature.
-
spanning_reads¶ int
Number of supporting single-end reads.
-
supporting_mates¶ int
Number of mate pairs that support the fusion, but do not span the breakpoint with either mate.
-
supporting_spanning_mates¶ int
Number of mate pairs that support the fusion and have at least one mate spanning the breakpoint.
-
-
class imfusion.model.Insertion¶ Class representing an RNA-seq transposon insertion site.
Used to represent insertions derived from RNA-seq fusions. Not all fields are required, though at least the following should be specified: id, seqname, position, strand and sample_id.
-
id¶ str
ID of the insertion.
-
seqname¶ str
Chromosome of the insertion.
-
position¶ int
Genomic position of the insertion.
-
strand¶ int
Strand of the insertion (-1 or 1).
-
sample_id¶ str
Sample in which the insertion was identified.
-
gene_id¶ str
ID of the gene involved in the fusion.
-
gene_name¶ str
Name of the gene involved in the fusion.
-
gene_strand¶ int
Strand of the gene involved in the fusion.
-
orientation¶ str
Relative orientation of the insertion.
-
feature_name¶ str
Name of the transposon feature involved in the fusion.
-
feature_strand¶ int
Strand of transposon feature involved in the fusion.
-
anchor_genome¶ int
Genomic fusion breakpoint.
-
anchor_transposon¶ int
Transposon fusion breakpoint.
-
flank_genome¶ int
Size of flanking region on genome.
-
flank_transposon¶ int
Size of flanking region in transposon.
-
spanning_reads¶ int
Number of supporting single-end reads.
-
supporting_mates¶ int
Number of mate pairs that support the fusion, but do not span the breakpoint with either mate.
-
spanning_mates¶ int
Number of mate pairs that support the fusion and have at least one mate spanning the breakpoint.
-
imfusion.util¶
The util module contains various helper modules shared between different parts of im-fusion. The most important submodules are fusions, insertions and reference, which contain functions that are used by the aligners (currently only Tophat2) to generate augmented reference genomes and identify insertion sites.
imfusion.util.check¶
Utility functions for checking the validity of inputs.
-
imfusion.util.check.check_features(transposon_features)¶ Checks if a transposon feature frame is valid.
imfusion.util.fusions¶
Utility functions for annotating fusions and converting fusions into insertions by determining the approximate position of the corresponding insertion in the genome (‘placing’ the fusion).
-
imfusion.util.fusions.annotate_fusions(fusions, reference_gtf, transposon_features)¶ Annotates fusions with gene and transposon features.
Main function for annotating identified gene-transposon fusions. Adds the following annotations to fusions: gene features, transposon features and the relative orientation of the corresponding insertion with respect to the identified gene.
Parameters: - fusions (List[Fusion]) – Fusions to annotate.
- reference_gtf (GtfFile) – GtfFile instance containing reference gene features.
- transposon_features (pandas.DataFrame) – Dataframe containing positions for the features present in the transposon sequence. Used to identify transposon features (such as splice acceptors or donors) that are involved in the identified fusions.
Yields: Fusion – Next fusion, annotated with gene/transposon features.
-
imfusion.util.fusions.place_fusions(fusions, sample_id, reference_gtf, offset=20, max_dist=5000)¶ Derives insertions by placing fusions at approximate genomic locations.
Main function for deriving insertions from annotated gene-transposon fusions. Derives insertions by determining an approximate genomic location that is compatible with the gene/transposon feature annotations of the fusions. Fusions are therefore expected to be properly annotated for gene/transposon features.
An insertion is essentially ‘placed’ by looking for the first genomic position that does not overlap with a reference feature, in the direction that is compatible with the insertion's orientation w.r.t. its target gene.
Parameters: - fusions (List[Fusion]) – List of fusions to convert.
- sample_id (str) – Sample id that should be used for the insertions.
- reference_gtf (GtfFile) – GtfFile containing the reference features. Expected to conform to the Ensembl reference gtf format.
- offset (int) – Minimum offset of the transposon to the closest reference gene feature.
- max_dist (int) – Maximum distance that an insertion may be placed from the genomic anchor of the fusion.
Yields: Insertion – Next insertion derived from the given fusions.
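The placement idea described above can be sketched in a few lines. This is purely illustrative and makes several assumptions that are not taken from the library: the function name `place_insertion` is hypothetical, features are given as inclusive (start, end) intervals, the search direction is passed in directly as +1 or -1, and offset is treated as padding on both sides of every feature. The actual rules applied by place_fusions may differ.

```python
def place_insertion(anchor, direction, features, offset=20, max_dist=5000):
    """Find the nearest position from `anchor` that is clear of all features.

    anchor: genomic breakpoint of the fusion.
    direction: +1 to search towards higher coordinates, -1 towards lower.
    features: list of inclusive (start, end) intervals.
    Returns None if the position found lies further than max_dist away.
    """
    position = anchor
    moved = True
    while moved:
        moved = False
        for start, end in features:
            # A position is blocked if it lies within `offset` of a feature.
            if start - offset <= position <= end + offset:
                if direction == 1:
                    position = end + offset + 1
                else:
                    position = start - offset - 1
                moved = True

    if abs(position - anchor) > max_dist:
        return None
    return position


# Hypothetical exon intervals around a fusion anchored at position 150.
exons = [(100, 200), (210, 300)]
placed_forward = place_insertion(150, 1, exons)
placed_reverse = place_insertion(150, -1, exons)
too_far = place_insertion(150, -1, exons, max_dist=50)
```

Searching forward skips both exons because the second one lies within the padding of the first, whereas searching in reverse only needs to clear the first exon.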
imfusion.util.insertions¶
Utility functions for filtering invalid/unwanted insertions.
-
imfusion.util.insertions.filter_invalid_insertions(insertions)¶ Filters invalid insertions.
Main function for filtering invalid insertions. Effectively applies both the filter_wrong_orientation and filter_unexpected_sites filters to filter insertions that have the wrong orientation (w.r.t. the transposon feature and the annotated gene) or involve features of the transposon that we are not interested in (typically non-splice acceptor/donor features).
Parameters: insertions (List[Insertion]) – Insertions to filter.
Yields: Insertion – Next filtered insertion.
-
imfusion.util.insertions.filter_wrong_orientation(insertions, drop_na=False)¶ Filters insertions with wrong feature orientations w.r.t. their genes.
This filter removes any insertions with a transposon feature that is in the wrong orientation with respect to the annotated gene. This is based on the premise that, for example, a splice acceptor can only splice to a gene that is in the same orientation as the acceptor.
Parameters: insertions (List[Insertion]) – Insertions to filter.
Yields: Insertion – Next filtered insertion.
-
imfusion.util.insertions.filter_unexpected_sites(insertions)¶ Filters insertions that have non splice-acceptor/donor features.
This filter removes any insertions that splice to transposon features that aren’t splice-acceptors or splice-donors. This is based on the premise that these other sites are unlikely to be involved in any splicing and that therefore these insertions are likely to be false positives of the fusion identification.
Parameters: insertions (List[Insertion]) – Insertions to filter.
Yields: Insertion – Next filtered insertion.
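Because the filters above yield insertions one at a time, they can be chained like ordinary Python generators. The sketch below illustrates that pattern with a reduced, hypothetical record type; it does not use the real Insertion class. The `feature_type` field, the function names and the rule that the feature strand (assumed here to be expressed in genomic orientation) must equal the gene strand are all assumptions made for illustration.

```python
from collections import namedtuple

# Hypothetical, reduced record; the real Insertion class has more fields.
Ins = namedtuple("Ins", ["id", "gene_strand", "feature_strand", "feature_type"])


def keep_expected_sites(insertions, expected=("SA", "SD")):
    """Yield insertions that involve a splice acceptor or splice donor."""
    for ins in insertions:
        if ins.feature_type in expected:
            yield ins


def keep_matching_orientation(insertions):
    """Yield insertions whose feature has the same orientation as the gene."""
    for ins in insertions:
        if ins.feature_strand == ins.gene_strand:
            yield ins


candidates = [
    Ins("a", 1, 1, "SA"),          # kept
    Ins("b", 1, -1, "SA"),         # dropped: wrong orientation
    Ins("c", -1, -1, "promoter"),  # dropped: not a splice site
    Ins("d", -1, -1, "SD"),        # kept
]

# Chain both filters; nothing is evaluated until the result is consumed.
kept_ids = [
    ins.id for ins in keep_matching_orientation(keep_expected_sites(candidates))
]
```

Chaining generators this way avoids building intermediate lists, which is convenient when a filter is applied directly to the output of another generator.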
-
imfusion.util.insertions.filter_blacklist(insertions, gene_ids, reference_gtf=None, filter_overlap=True)¶ Filters insertions for blacklisted genes.
Parameters: - insertions (List[Insertion]) – Insertions to filter.
- gene_ids (set[str]) – IDs of the blacklisted genes.
- reference_gtf (GtfFile) – GtfFile instance containing reference gene features. (Only needed if filter_overlap is True).
- filter_overlap (bool) – Whether to filter any insertions overlapping with the listed genes (default = True). If False, only insertions explicitly splicing to a listed gene are filtered.
Yields: Insertion – Next filtered insertion.
imfusion.util.reference¶
Utility functions used for generating augmented reference genomes.
-
imfusion.util.reference.concatenate_fastas(fasta_paths, output_path)¶ Concatenates multiple fasta files into a single file.
This function combines multiple fasta files into a single output file. It is mainly used to generate the combined reference genome that contains both the reference genome sequence and the transposon sequence.
Parameters: - fasta_paths (List[pathlib.Path]) – Paths to fasta files that should be concatenated.
- output_path (pathlib.Path) – Path for the combined output file.
-
imfusion.util.reference.mask_regions(refseq_path, blacklist_regions)¶ Masks blacklisted regions in a given reference sequence.
This function removes blacklisted regions from a given reference genome. Blacklisted regions are removed by replacing their original sequence with a sequence of ‘N’ nucleotides.
Parameters: - refseq_path (pathlib.Path) – Path to the reference sequence in fasta format.
- blacklist_regions (List[tuple(str, int, int)]) – List of regions that should be blacklisted in the reference sequence. Regions should be specified as a tuple of (chromosome, start_position, end_position). For example: (‘1’, 2000, 2200).
Returns: Path to edited reference sequence.
Return type: pathlib.Path
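The masking step can be sketched on in-memory sequences as follows. This is an illustration only: mask_regions itself works on a fasta file and returns a path, whereas the hypothetical `mask_sequences` helper below takes a dict of sequences. The sketch also assumes 0-based, half-open coordinates, which may not match the convention used by the library.

```python
def mask_sequences(sequences, blacklist_regions):
    """Return a copy of `sequences` with blacklisted regions replaced by Ns.

    sequences: dict mapping chromosome name -> sequence string.
    blacklist_regions: list of (chromosome, start, end) tuples.
    Assumption: coordinates are 0-based and half-open.
    """
    masked = dict(sequences)
    for chrom, start, end in blacklist_regions:
        seq = masked[chrom]
        # Clamp to the sequence length so that the length never changes.
        end = min(end, len(seq))
        start = min(start, end)
        masked[chrom] = seq[:start] + "N" * (end - start) + seq[end:]
    return masked


# Toy example with a hypothetical 10 bp chromosome.
toy_reference = {"1": "ACGTACGTAC"}
masked_reference = mask_sequences(toy_reference, [("1", 2, 6)])
```

Replacing the bases with Ns, rather than deleting them, keeps all downstream coordinates valid for the annotation in the reference gtf.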
-
imfusion.util.reference.blacklist_for_regions(region_strs)¶ Builds blacklist region list for region strings.
Parses region strings into a list of blacklist region tuples. Region strings should be provided in the following format: ‘chromosome:start-end’. For example, ‘X:1000-2000’ denotes a region on chromosome X from position 1000 to 2000.
Parameters: region_strs (List[str]) – List of region strings.
Returns: List of blacklist region tuples.
Return type: List[tuple(str, int, int)]
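The region string format can be parsed with a few lines of standard Python. The helper below is a minimal sketch with a hypothetical name (`parse_region`); it is not the library's implementation and performs no validation beyond the integer conversion.

```python
def parse_region(region_str):
    """Parse 'chromosome:start-end' into a (chromosome, start, end) tuple."""
    # Split on the last colon so the chromosome name is kept intact.
    chrom, _, span = region_str.rpartition(":")
    start, _, end = span.partition("-")
    return chrom, int(start), int(end)


region_strs = ["X:1000-2000", "1:2000-2200"]
blacklist_regions = [parse_region(region) for region in region_strs]
```

The resulting tuples have the same (chromosome, start_position, end_position) shape as the blacklist_regions argument of build_reference and mask_regions.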
-
imfusion.util.reference.blacklist_for_genes(gene_ids, reference_gtf)¶ Builds blacklist frame for given genes.
Returns a list of blacklist regions encompassing the regions spanned by the genes corresponding to the given gene ids.
Parameters: - gene_ids (List[str]) – List of (Ensembl) gene ids.
- reference_gtf (GtfFile) – GtfFile instance containing reference gene features.
Returns: List of blacklist region tuples.
Return type: List[tuple(str, int, int)]
imfusion.util.tabix¶
Utility classes used for fast access to Gtf and Bed files.
-
class imfusion.util.tabix.GtfFile(file_path)¶
-
classmethod compress(file_path, out_path=None, sort=True, create_index=True)¶ Compresses and indexes a gtf file using bgzip and tabix.
-
fetch(reference=None, start=None, end=None, filters=None, incl_left=True, incl_right=True)¶ Fetches records for the given region.
-
get_gene(gene_id, feature_type='gene', field_name='gene_id', **kwargs)¶ Fetches a given gene by id.
-
get_region(reference=None, start=None, end=None, filters=None, incl_left=True, incl_right=True)¶ Fetches DataFrame of features for the given region.
-
classmethod sort(file_path, out_path)¶ Sorts a gtf file by position, required for indexing by tabix.
-