Regions module¶

Module for handling genomic regions such as chromosomes, bins, and restriction fragments.

The class RegionsTable is an implementation of the RegionBased interface from the genomic_regions package. More details on how to use the RegionBased interface can be found in the genomic_regions documentation, but here is an example to get you started:

import fanc
rt = fanc.RegionsTable()

# demo how to add regions
regions = []
for chromosome in ['chr1', 'chr2']:
    for start in range(1, 10000, 1000):
        r = fanc.GenomicRegion(chromosome=chromosome, start=start, end=start+999)
        regions.append(r)
rt.add_regions(regions)

# query regions on chromosome 1
for r in rt.regions('chr1'):
    print(r)  # chr1:1-1000, chr1:1001-2000, ..., chr1:9001-10000

The class Chromosome holds chromosome information, specifically the chromosome name in the reference genome, its length in base pairs, and its DNA sequence. Multiple Chromosome objects can be grouped into a Genome, which provides convenient access to its chromosomes and has useful functions for in silico digestion and genome binning.

import fanc

# create chromosomes
chromosome1 = fanc.Chromosome(name='chr1', sequence='AAGTCCGTGCTGTCGATCATAGCTAGCTAGCTA')
chromosome2 = fanc.Chromosome(name='chr2', sequence='GTGTCGATCAAATCGAAA')
len(chromosome1)  # 33

# create genome
genome = fanc.Genome()
genome.add_chromosome(chromosome1)
genome.add_chromosome(chromosome2)

# in silico digestion
restriction_fragments = genome.get_regions('MboI')  # cuts 'GATC'
for r in restriction_fragments.regions:
    print(r)  # chr1:1-15, chr1:16-33, chr2:1-6, chr2:7-18

# binning
bins = genome.get_regions(10)  # cuts 'GATC'
for r in bins.regions:
    print(r)  # chr1:1-10, chr1:11-20, chr1:21-30, chr1:31-33, chr2:1-10, chr2:11-18

# load genome from file
genome_from_file = Genome.from_string("hg19_chr18_19.fa")
genome_from_file.chromosomes() # ['chr18', 'chr19']

class fanc.regions.Chromosome(name=None, length=None, sequence=None)¶

Bases: object

Chromosome data type.

name¶: Name of the chromosome

length¶: Length of the chromosome in base-pairs

sequence¶: Base-pair sequence of DNA in the chromosome

classmethod from_fasta(file_name, name=None, include_sequence=True)¶

Create a Chromosome from a FASTA file.

This class method will load a FASTA file and convert it into a Chromosome object. If the FASTA file contains multiple sequences, the output will be a list of Chromosome objects.

Parameters:	file_name – Path to the FASTA file name – Chromosome name. If None (default), will be read from the FASTA file header include_sequence – If True (default), stores the chromosome sequence in memory. Else, the sequence attribute will be set to None.
Returns:	`Chromosome` if there is only a single FASTA sequence in the file, list(`Chromosome`) if there are multiple sequences.

get_restriction_sites(restriction_enzyme)¶

Find the restriction sites of a provided enzyme in this chromosome.

Internally uses Biopython to find RE sites.

Parameters:	restriction_enzyme – The name of the restriction enzyme (e.g. HindIII)
Returns:	List of RE sites in base-pairs (1-based)

class fanc.regions.Genome(file_name=None, chromosomes=None, mode='a', tmpdir=None, _table_name_chromosomes='chromosomes')¶

Bases: fanc.general.FileGroup

Class representing a collection of chromosomes.

Provides some convenience batch methods that call Chromosome methods for every chromosome in this object.

class ChromosomeDefinition¶: Bases: tables.description.IsDescription

add_chromosome(chromosome)¶

Add a Chromosome to this object.

Will choose suitable defaults for missing attributes.

Parameters:	chromosome – `Chromosome` object or similar object (e.g. dict) with the same fields

chromosomes()¶: Get list of chromosomes in this object

close(copy_tmp=True, remove_tmp=True)¶

Close this HDF5 file and run exit operations.

If file was opened with tmpdir in read-only mode: close file and delete temporary copy.

If file was opened with tmpdir in write or append mode: Replace original file with copy and delete copy.

Parameters:	copy_tmp – If False, does not overwrite original with modified file. remove_tmp – If False, does not delete temporary copy of file.

classmethod from_folder(folder_name, file_name=None, exclude=None, include_sequence=True, tmpdir=None)¶

Load every FASTA file from a folder as a chromosome.

Parameters:	folder_name – Path to the folder to load file_name – File to save Genome object to exclude – List or set of chromosome names that should NOT be loaded include_sequence – If True, will save the chromosome sequences in the Genome object

classmethod from_string(genome_string, file_name=None, tmpdir=None, mode='a')¶

Convenience function to load a Genome from a string.

Parameters:	genome_string – Path to FASTA file, path to folder with FASTA files, comma-separated list of paths to FASTA files, path to HDF5 file file_name – Path to save file
Returns:	A `Genome` object

get_regions(split, file_name=None, chromosomes=None)¶

Extract genomic regions from genome.

Provides two options:

Splits chromosomes at restriction sites if the split parameter is the name of a restriction enzyme.
Splits chromosomes at equi-distant points if split is an integer

Parameters:	split – Name of a restriction enzyme or positive integer file_name – Name of a file if the result of this method should be saved to file chromosomes – List of chromosome names to include. Default: all
Returns:	`GenomicRegions`

sub_sequence(chromosome, start=None, end=None)¶

Extract the chromosome DNA sequence between start and end.

Parameters:	chromosome – Name of chromosome start – start position in bp (1-based, inclusive) end – end position in bp (1-based, inclusive)
Returns:	str

class fanc.regions.RegionsTable(file_name=None, mode='a', tmpdir=None, additional_fields=None, _table_name_regions='regions')¶

Bases: fanc.regions.RegionBasedWithBins, fanc.general.FileGroup

PyTables Table wrapper for storing genomic regions.

This class is inherited by objects working with lists of genomic regions, such as equidistant bins along chromosomes in a genome (Hic) or restriction fragments of genomic DNA (ReadPairs)

Internally, each genomic region is encoded in a PyTables Table and the following region attributes are represented as table columns: ix, chromosome, start, end, and strand. To add additional region attributes, such as a score, use the “additional_fields” parameter of the __init__ method. This must be a dict where the keys are str and values are PyTables column descriptors, such as StringCol. Example for adding a score field:

import fanc
import tables
rt = fanc.RegionsTable(
        additional_fields={'score': tables.Float32Col()}
     )

class ChromosomeDescription¶

Bases: tables.description.IsDescription

Description of the chromosomes in this object.

class RegionDescription¶

Bases: tables.description.IsDescription

Description of a genomic region for PyTables Table

add_region(region, *args, **kwargs)¶

Add a genomic region to this object.

This method offers some flexibility in the types of objects that can be loaded. See parameters for details.

Parameters:	region – Can be a `GenomicRegion`, a str in the form ‘<chromosome>:<start>-<end>[:<strand>], a dict with at least the fields ‘chromosome’, ‘start’, and ‘end’, optionally ‘ix’, or a list of length 3 (chromosome, start, end) or 4 (ix, chromosome, start, end).

add_regions(regions, *args, **kwargs)¶

Bulk insert multiple genomic regions.

Parameters:	regions – List (or any iterator) with objects that describe a genomic region. See `add_region` for options.

static bin_intervals(intervals, bins, interval_range=None, smoothing_window=None, nan_replacement=None, zero_to_nan=False)¶

Bin a given set of intervals into a fixed number of bins.

Parameters:

intervals – iterator of tuples (start, end, score)
bins – Number of bins to divide the region into
interval_range – Optional. Tuple (start, end) in base pairs of range of interval to be binned. Useful if intervals argument does not cover to exact genomic range to be binned.
smoothing_window – Size of window (in bins) to smooth scores over
nan_replacement – NaN values in the scores will be replaced with this value
zero_to_nan – If True, will convert bins with score 0 to NaN

Returns:

iterator of tuples: (start, end, score)

static bin_intervals_equidistant(intervals, bin_size, interval_range=None, smoothing_window=None, nan_replacement=None, zero_to_nan=False)¶

Bin a given set of intervals into bins with a fixed size.

Parameters:

intervals – iterator of tuples (start, end, score)
bin_size – Size of each bin in base pairs
interval_range – Optional. Tuple (start, end) in base pairs of range of interval to be binned. Useful if intervals argument does not cover to exact genomic range to be binned.
smoothing_window – Size of window (in bins) to smooth scores over
nan_replacement – NaN values in the scores will be replaced with this value
zero_to_nan – If True, will convert bins with score 0 to NaN

Returns:

iterator of tuples: (start, end, score)

bin_size¶

Return the length of the first region in the dataset.

Assumes all bins have equal size.

Returns:	int

binned_regions(region=None, bins=None, bin_size=None, smoothing_window=None, nan_replacement=None, zero_to_nan=False, *args, **kwargs)¶

Same as region_intervals, but returns GenomicRegion objects instead of tuples.

Parameters:

region – String or class:~GenomicRegion object denoting the region to be binned
bins – Number of bins to divide the region into
bin_size – Size of each bin (alternative to bins argument)
smoothing_window – Size of window (in bins) to smooth scores over
nan_replacement – NaN values in the scores will be replaced with this value
zero_to_nan – If True, will convert bins with score 0 to NaN
args – Arguments passed to _region_intervals
kwargs – Keyword arguments passed to _region_intervals

Returns:

iterator of GenomicRegion objects

bins_to_distance(bins)¶

Convert fraction of bins to base pairs

Parameters:	bins – float, fraction of bins
Returns:	int, base pairs

chromosome_bins¶

Returns a dictionary of chromosomes and the start and end index of the bins they cover.

Returned list is range-compatible, i.e. chromosome bins [0,5] cover chromosomes 1, 2, 3, and 4, not 5.

chromosome_lengths¶: Returns a dictionary of chromosomes and their length in bp.

chromosomes()¶: List all chromosomes in this regions table. :return: list of chromosome names.

close(copy_tmp=True, remove_tmp=True)¶

Close this HDF5 file and run exit operations.

If file was opened with tmpdir in read-only mode: close file and delete temporary copy.

If file was opened with tmpdir in write or append mode: Replace original file with copy and delete copy.

Parameters:	copy_tmp – If False, does not overwrite original with modified file. remove_tmp – If False, does not delete temporary copy of file.

distance_to_bins(distance)¶

Convert base pairs to fraction of bins.

Parameters:	distance – distance in base pairs
Returns:	float, distance as fraction of bin size

find_region(query_regions, _regions_dict=None, _region_ends=None, _chromosomes=None)¶

Find the region that is at the center of a region.

Parameters:	query_regions – Region selector string, :class:~GenomicRegion, or list of the former
Returns:	index (or list of indexes) of the region at the center of the query region

flush()¶: Write buffered data to file.

intervals(*args, **kwargs)¶: Alias for region_intervals.

region_bins(*args, **kwargs)¶

Return slice of start and end indices spanned by a region.

Parameters:	args – provide a `GenomicRegion` here to get the slice of start and end bins of onlythis region. To get the slice over all regions leave this blank.
Returns:

region_data(key, value=None)¶

Retrieve or add vector-data to this object. If there is existing data in this object with the same name, it will be replaced

Parameters:	key – Name of the data column value – vector with region-based data (one entry per region)

region_intervals(region, bins=None, bin_size=None, smoothing_window=None, nan_replacement=None, zero_to_nan=False, score_field='score', *args, **kwargs)¶

Return equally-sized genomic intervals and associated scores.

Use either bins or bin_size argument to control binning.

Parameters:

region – String or class:~GenomicRegion object denoting the region to be binned
bins – Number of bins to divide the region into
bin_size – Size of each bin (alternative to bins argument)
smoothing_window – Size of window (in bins) to smooth scores over
nan_replacement – NaN values in the scores will be replaced with this value
zero_to_nan – If True, will convert bins with score 0 to NaN
args – Arguments passed to _region_intervals
kwargs – Keyword arguments passed to _region_intervals

Returns:

iterator of tuples: (start, end, score)

region_subset(region, *args, **kwargs)¶

Takes a class:~GenomicRegion and returns all regions that overlap with the supplied region.

Parameters:	region – String or class:~GenomicRegion object for which covered bins will be returned.

regions¶

Iterate over genomic regions in this object.

Will return a GenomicRegion object in every iteration. Can also be used to get the number of regions by calling len() on the object returned by this method.

Returns:	RegionIter

regions_dict¶

Return a dictionary with region index as keys and regions as values.

Returns:	dict {region.ix: region, …}

to_bed(file_name, subset=None, **kwargs)¶

Export regions as BED file

Parameters:	file_name – Path of file to write regions to subset – optional `GenomicRegion` or str to write only regions overlapping this region kwargs – Passed to `write_bed()`

to_bigwig(file_name, subset=None, **kwargs)¶

Export regions as BigWig file.

Parameters:	file_name – Path of file to write regions to subset – optional `GenomicRegion` or str to write only regions overlapping this region kwargs – Passed to `write_bigwig()`

to_gff(file_name, subset=None, **kwargs)¶

Export regions as GFF file

Parameters:	file_name – Path of file to write regions to subset – optional `GenomicRegion` or str to write only regions overlapping this region kwargs – Passed to `write_gff()`

class fanc.regions.LazyGenomicRegion(row, ix=None, auto_update=True)¶

Bases: genomic_regions.regions.GenomicRegion

A GenomicRegion object with lazy attribute loading.

This class is central to an efficient retrieval of regions from objects subclassing RegionsTable. Its handling should be mostly identical to GenomicRegion, but attributes will only be loaded on demand. Changes to attributes will change the underlying row in the HDF5 regions table if the auto_update parameter is set to True (default). Else you can manually call update().

as_bed_line(score_field='score', name_field='name')¶

Return a representation of this object as line in a BED file.

Parameters:	score_field – name of the attribute to be used in the ‘score’ field of the BED line name_field – name of the attribute to be used in the ‘name’ field of the BED line
Returns:	str

as_gff_line(source_field='source', feature_field='feature', score_field='score', frame_field='frame', float_format='.2e')¶

Return a representation of this object as line in a GFF file.

Parameters:

source_field – name of the attribute to be used in the ‘source’ field of the GFF line
feature_field – name of the attribute to be used in the ‘feature’ field of the GFF line
score_field – name of the attribute to be used in the ‘score’ field of the GFF line
frame_field – name of the attribute to be used in the ‘frame’ field of the GFF line
float_format – Formatting string for the float fields

Returns:

str

attributes¶: List of all attributes in this region.

center¶

Return the center coordinate of the GenomicRegion.

Returns:	float

contains(region)¶

Check if the specified region is completely contained in this region.

Parameters:	region – `GenomicRegion` object or string

copy()¶

Return a (shallow) copy of this GenomicRegion

Returns:	`GenomicRegion`

expand(absolute=None, relative=None, absolute_left=0, absolute_right=0, relative_left=0.0, relative_right=0.0, copy=True, from_center=False)¶

Expand this region by a relative or an absolute amount.

Parameters:

absolute – Absolute amount in base pairs by which to expand the region represented by this GenomicRegion object on both sides. New region start will be <old start - absolute>, new region end will be <old end + absolute>
relative – Relative amount as fraction of region by which to expand the region represented by this GenomicRegion object on both sides. New region start will be <old start - relative*len(self)>, new region end will be <old end + relative*(len(self)>
absolute_left – Absolute amount in base pairs by which to expand the region represented by this GenomicRegion object on the left side
absolute_right – Absolute amount in base pairs by which to expand the region represented by this GenomicRegion object on the right side
relative_left – Relative amount in base pairs by which to expand the region represented by this GenomicRegion object on the left side
relative_right – Relative amount in base pairs by which to expand the region represented by this GenomicRegion object on the right side
copy – If True, return a copy of the original region, if False will modify the existing region in place
from_center – If True measures distance from center rather than start and end of the old region

Returns:

GenomicRegion

five_prime¶

Return the position of the 5’ end of this GenomicRegion on the reference.

Returns:	int

fix_chromosome(copy=False)¶

Change chromosome representation from chr<NN> to <NN> or vice versa.

Parameters:	copy – If True, make copy of region, otherwise will modify existing region in place.
Returns:	`GenomicRegion`

classmethod from_string(region_string)¶

Convert a string into a GenomicRegion.

This is a very useful convenience function to quickly define a GenomicRegion object from a descriptor string. Numbers can be abbreviated as ‘12k’, ‘1.5M’, etc.

Parameters:	region_string – A string of the form <chromosome>[:<start>-<end>[:<strand>]] (with square brackets indicating optional parts of the string). If any optional part of the string is omitted, intuitive defaults will be chosen.
Returns:	`GenomicRegion`

is_forward()¶

Return True if this region is on the forward strand of the reference genome.

Returns:	True if on ‘+’ strand, False otherwise.

is_reverse()¶

Return True if this region is on the reverse strand of the reference genome.

Returns:	True if on ‘-’ strand, False otherwise.

ix¶

Region index.

Location in underlying list of regions.

overlap(region)¶

Return the overlap in base pairs between this region and another region.

Parameters:	region – `GenomicRegion` to find overlap for
Returns:	overlap as int in base pairs

overlaps(region)¶

Check if this region overlaps with the specified region.

Parameters:	region – `GenomicRegion` object or string

set_attribute(attribute, value)¶

Safely set an attribute on the GenomicRegion object.

This automatically decodes bytes objects into UTF-8 strings. If you do not care about this, you can also use region.<attribute> = <value> directly.

Parameters:	attribute – Name of the attribute to be set value – Value of the attribute to be set

strand¶

Strand this region is located on as int.

Returns:	1: forward strand, -1 reverse strand, 0 or None: unknown.

strand_string¶

Return the ‘strand’ attribute as string.

Returns:	strand as str (‘+’, ‘-’, or ‘.’)

three_prime¶

Return the position of the 3’ end of this GenomicRegion on the reference.

Returns:	int

to_string()¶

Convert this GenomicRegion to its string representation.

Returns:	str

fanc.regions.genome_regions(re_or_file, restriction_enzyme=None)¶

Obtain RE fragments or bins of equal size from a reference genome.

Parameters:

re_or_file – Path to or RegionBased file with regions corresponding to restriction fragments, or path to FASTA file with chromosomes
restriction_enzyme – If the first argument is a FASTA file, you must provide the name of a restriction enzyme here for in silico genome digestion. You can also provide an integer, in which case the genome will be binned into regions of this size in bp.

Returns:

RegionsTable of restriction fragments