Regions module

Module for handling genomic regions such as chromosomes, bins, and restriction fragments.

The class RegionsTable is an implementation of the RegionBased interface from the genomic_regions package. More details on how to use the RegionBased interface can be found in the genomic_regions documentation, but here is an example to get you started:

import fanc
rt = fanc.RegionsTable()

# demo how to add regions
regions = []
for chromosome in ['chr1', 'chr2']:
    for start in range(1, 10000, 1000):
        r = fanc.GenomicRegion(chromosome=chromosome, start=start, end=start+999)
        regions.append(r)
rt.add_regions(regions)

# query regions on chromosome 1
for r in rt.regions('chr1'):
    print(r)  # chr1:1-1000, chr1:1001-2000, ..., chr1:9001-10000

The class Chromosome holds chromosome information, specifically the chromosome name in the reference genome, its length in base pairs, and its DNA sequence. Multiple Chromosome objects can be grouped into a Genome, which provides convenient access to its chromosomes and has useful functions for in silico digestion and genome binning.

import fanc

# create chromosomes
chromosome1 = fanc.Chromosome(name='chr1', sequence='AAGTCCGTGCTGTCGATCATAGCTAGCTAGCTA')
chromosome2 = fanc.Chromosome(name='chr2', sequence='GTGTCGATCAAATCGAAA')
len(chromosome1)  # 33

# create genome
genome = fanc.Genome()
genome.add_chromosome(chromosome1)
genome.add_chromosome(chromosome2)

# in silico digestion
restriction_fragments = genome.get_regions('MboI')  # cuts 'GATC'
for r in restriction_fragments.regions:
    print(r)  # chr1:1-15, chr1:16-33, chr2:1-6, chr2:7-18

# binning
bins = genome.get_regions(10)  # cuts 'GATC'
for r in bins.regions:
    print(r)  # chr1:1-10, chr1:11-20, chr1:21-30, chr1:31-33, chr2:1-10, chr2:11-18

# load genome from file
genome_from_file = Genome.from_string("hg19_chr18_19.fa")
genome_from_file.chromosomes() # ['chr18', 'chr19']
class fanc.regions.Chromosome(name=None, length=None, sequence=None)

Bases: object

Chromosome data type.

name

Name of the chromosome

length

Length of the chromosome in base-pairs

sequence

Base-pair sequence of DNA in the chromosome

classmethod from_fasta(file_name, name=None, include_sequence=True)

Create a Chromosome from a FASTA file.

This class method will load a FASTA file and convert it into a Chromosome object. If the FASTA file contains multiple sequences, the output will be a list of Chromosome objects.

Parameters:
  • file_name – Path to the FASTA file
  • name – Chromosome name. If None (default), will be read from the FASTA file header
  • include_sequence – If True (default), stores the chromosome sequence in memory. Else, the sequence attribute will be set to None.
Returns:

Chromosome if there is only a single FASTA sequence in the file, list(Chromosome) if there are multiple sequences.

get_restriction_sites(restriction_enzyme)

Find the restriction sites of a provided enzyme in this chromosome.

Internally uses Biopython to find RE sites.

Parameters:restriction_enzyme – The name of the restriction enzyme (e.g. HindIII)
Returns:List of RE sites in base-pairs (1-based)
class fanc.regions.Genome(file_name=None, chromosomes=None, mode='a', tmpdir=None, _table_name_chromosomes='chromosomes')

Bases: fanc.general.FileGroup

Class representing a collection of chromosomes.

Provides some convenience batch methods that call Chromosome methods for every chromosome in this object.

class ChromosomeDefinition

Bases: tables.description.IsDescription

add_chromosome(chromosome)

Add a Chromosome to this object.

Will choose suitable defaults for missing attributes.

Parameters:chromosomeChromosome object or similar object (e.g. dict) with the same fields
chromosomes()

Get list of chromosomes in this object

close(copy_tmp=True, remove_tmp=True)

Close this HDF5 file and run exit operations.

If file was opened with tmpdir in read-only mode: close file and delete temporary copy.

If file was opened with tmpdir in write or append mode: Replace original file with copy and delete copy.

Parameters:
  • copy_tmp – If False, does not overwrite original with modified file.
  • remove_tmp – If False, does not delete temporary copy of file.
classmethod from_folder(folder_name, file_name=None, exclude=None, include_sequence=True, tmpdir=None)

Load every FASTA file from a folder as a chromosome.

Parameters:
  • folder_name – Path to the folder to load
  • file_name – File to save Genome object to
  • exclude – List or set of chromosome names that should NOT be loaded
  • include_sequence – If True, will save the chromosome sequences in the Genome object
classmethod from_string(genome_string, file_name=None, tmpdir=None, mode='a')

Convenience function to load a Genome from a string.

Parameters:
  • genome_string – Path to FASTA file, path to folder with FASTA files, comma-separated list of paths to FASTA files, path to HDF5 file
  • file_name – Path to save file
Returns:

A Genome object

get_regions(split, file_name=None, chromosomes=None)

Extract genomic regions from genome.

Provides two options:

  • Splits chromosomes at restriction sites if the split parameter is the name of a restriction enzyme.
  • Splits chromosomes at equi-distant points if split is an integer
Parameters:
  • split – Name of a restriction enzyme or positive integer
  • file_name – Name of a file if the result of this method should be saved to file
  • chromosomes – List of chromosome names to include. Default: all
Returns:

GenomicRegions

sub_sequence(chromosome, start=None, end=None)

Extract the chromosome DNA sequence between start and end.

Parameters:
  • chromosome – Name of chromosome
  • start – start position in bp (1-based, inclusive)
  • end – end position in bp (1-based, inclusive)
Returns:

str

class fanc.regions.RegionsTable(file_name=None, mode='a', tmpdir=None, additional_fields=None, _table_name_regions='regions')

Bases: fanc.regions.RegionBasedWithBins, fanc.general.FileGroup

PyTables Table wrapper for storing genomic regions.

This class is inherited by objects working with lists of genomic regions, such as equidistant bins along chromosomes in a genome (Hic) or restriction fragments of genomic DNA (ReadPairs)

Internally, each genomic region is encoded in a PyTables Table and the following region attributes are represented as table columns: ix, chromosome, start, end, and strand. To add additional region attributes, such as a score, use the “additional_fields” parameter of the __init__ method. This must be a dict where the keys are str and values are PyTables column descriptors, such as StringCol. Example for adding a score field:

import fanc
import tables
rt = fanc.RegionsTable(
        additional_fields={'score': tables.Float32Col()}
     )
class ChromosomeDescription

Bases: tables.description.IsDescription

Description of the chromosomes in this object.

class RegionDescription

Bases: tables.description.IsDescription

Description of a genomic region for PyTables Table

add_region(region, *args, **kwargs)

Add a genomic region to this object.

This method offers some flexibility in the types of objects that can be loaded. See parameters for details.

Parameters:region – Can be a GenomicRegion, a str in the form ‘<chromosome>:<start>-<end>[:<strand>], a dict with at least the fields ‘chromosome’, ‘start’, and ‘end’, optionally ‘ix’, or a list of length 3 (chromosome, start, end) or 4 (ix, chromosome, start, end).
add_regions(regions, *args, **kwargs)

Bulk insert multiple genomic regions.

Parameters:regions – List (or any iterator) with objects that describe a genomic region. See add_region for options.
static bin_intervals(intervals, bins, interval_range=None, smoothing_window=None, nan_replacement=None, zero_to_nan=False)

Bin a given set of intervals into a fixed number of bins.

Parameters:
  • intervals – iterator of tuples (start, end, score)
  • bins – Number of bins to divide the region into
  • interval_range – Optional. Tuple (start, end) in base pairs of range of interval to be binned. Useful if intervals argument does not cover to exact genomic range to be binned.
  • smoothing_window – Size of window (in bins) to smooth scores over
  • nan_replacement – NaN values in the scores will be replaced with this value
  • zero_to_nan – If True, will convert bins with score 0 to NaN
Returns:

iterator of tuples: (start, end, score)

static bin_intervals_equidistant(intervals, bin_size, interval_range=None, smoothing_window=None, nan_replacement=None, zero_to_nan=False)

Bin a given set of intervals into bins with a fixed size.

Parameters:
  • intervals – iterator of tuples (start, end, score)
  • bin_size – Size of each bin in base pairs
  • interval_range – Optional. Tuple (start, end) in base pairs of range of interval to be binned. Useful if intervals argument does not cover to exact genomic range to be binned.
  • smoothing_window – Size of window (in bins) to smooth scores over
  • nan_replacement – NaN values in the scores will be replaced with this value
  • zero_to_nan – If True, will convert bins with score 0 to NaN
Returns:

iterator of tuples: (start, end, score)

bin_size

Return the length of the first region in the dataset.

Assumes all bins have equal size.

Returns:int
binned_regions(region=None, bins=None, bin_size=None, smoothing_window=None, nan_replacement=None, zero_to_nan=False, *args, **kwargs)

Same as region_intervals, but returns GenomicRegion objects instead of tuples.

Parameters:
  • region – String or class:~GenomicRegion object denoting the region to be binned
  • bins – Number of bins to divide the region into
  • bin_size – Size of each bin (alternative to bins argument)
  • smoothing_window – Size of window (in bins) to smooth scores over
  • nan_replacement – NaN values in the scores will be replaced with this value
  • zero_to_nan – If True, will convert bins with score 0 to NaN
  • args – Arguments passed to _region_intervals
  • kwargs – Keyword arguments passed to _region_intervals
Returns:

iterator of GenomicRegion objects

bins_to_distance(bins)

Convert fraction of bins to base pairs

Parameters:bins – float, fraction of bins
Returns:int, base pairs
chromosome_bins

Returns a dictionary of chromosomes and the start and end index of the bins they cover.

Returned list is range-compatible, i.e. chromosome bins [0,5] cover chromosomes 1, 2, 3, and 4, not 5.

chromosome_lengths

Returns a dictionary of chromosomes and their length in bp.

chromosomes()

List all chromosomes in this regions table. :return: list of chromosome names.

close(copy_tmp=True, remove_tmp=True)

Close this HDF5 file and run exit operations.

If file was opened with tmpdir in read-only mode: close file and delete temporary copy.

If file was opened with tmpdir in write or append mode: Replace original file with copy and delete copy.

Parameters:
  • copy_tmp – If False, does not overwrite original with modified file.
  • remove_tmp – If False, does not delete temporary copy of file.
distance_to_bins(distance)

Convert base pairs to fraction of bins.

Parameters:distance – distance in base pairs
Returns:float, distance as fraction of bin size
find_region(query_regions, _regions_dict=None, _region_ends=None, _chromosomes=None)

Find the region that is at the center of a region.

Parameters:query_regions – Region selector string, :class:~GenomicRegion, or list of the former
Returns:index (or list of indexes) of the region at the center of the query region
flush()

Write buffered data to file.

intervals(*args, **kwargs)

Alias for region_intervals.

region_bins(*args, **kwargs)

Return slice of start and end indices spanned by a region.

Parameters:args – provide a GenomicRegion here to get the slice of start and end bins of onlythis region. To get the slice over all regions leave this blank.
Returns:
region_data(key, value=None)

Retrieve or add vector-data to this object. If there is existing data in this object with the same name, it will be replaced

Parameters:
  • key – Name of the data column
  • value – vector with region-based data (one entry per region)
region_intervals(region, bins=None, bin_size=None, smoothing_window=None, nan_replacement=None, zero_to_nan=False, score_field='score', *args, **kwargs)

Return equally-sized genomic intervals and associated scores.

Use either bins or bin_size argument to control binning.

Parameters:
  • region – String or class:~GenomicRegion object denoting the region to be binned
  • bins – Number of bins to divide the region into
  • bin_size – Size of each bin (alternative to bins argument)
  • smoothing_window – Size of window (in bins) to smooth scores over
  • nan_replacement – NaN values in the scores will be replaced with this value
  • zero_to_nan – If True, will convert bins with score 0 to NaN
  • args – Arguments passed to _region_intervals
  • kwargs – Keyword arguments passed to _region_intervals
Returns:

iterator of tuples: (start, end, score)

region_subset(region, *args, **kwargs)

Takes a class:~GenomicRegion and returns all regions that overlap with the supplied region.

Parameters:region – String or class:~GenomicRegion object for which covered bins will be returned.
regions

Iterate over genomic regions in this object.

Will return a GenomicRegion object in every iteration. Can also be used to get the number of regions by calling len() on the object returned by this method.

Returns:RegionIter
regions_dict

Return a dictionary with region index as keys and regions as values.

Returns:dict {region.ix: region, …}
to_bed(file_name, subset=None, **kwargs)

Export regions as BED file

Parameters:
  • file_name – Path of file to write regions to
  • subset – optional GenomicRegion or str to write only regions overlapping this region
  • kwargs – Passed to write_bed()
to_bigwig(file_name, subset=None, **kwargs)

Export regions as BigWig file.

Parameters:
  • file_name – Path of file to write regions to
  • subset – optional GenomicRegion or str to write only regions overlapping this region
  • kwargs – Passed to write_bigwig()
to_gff(file_name, subset=None, **kwargs)

Export regions as GFF file

Parameters:
  • file_name – Path of file to write regions to
  • subset – optional GenomicRegion or str to write only regions overlapping this region
  • kwargs – Passed to write_gff()
class fanc.regions.LazyGenomicRegion(row, ix=None, auto_update=True)

Bases: genomic_regions.regions.GenomicRegion

A GenomicRegion object with lazy attribute loading.

This class is central to an efficient retrieval of regions from objects subclassing RegionsTable. Its handling should be mostly identical to GenomicRegion, but attributes will only be loaded on demand. Changes to attributes will change the underlying row in the HDF5 regions table if the auto_update parameter is set to True (default). Else you can manually call update().

as_bed_line(score_field='score', name_field='name')

Return a representation of this object as line in a BED file.

Parameters:
  • score_field – name of the attribute to be used in the ‘score’ field of the BED line
  • name_field – name of the attribute to be used in the ‘name’ field of the BED line
Returns:

str

as_gff_line(source_field='source', feature_field='feature', score_field='score', frame_field='frame', float_format='.2e')

Return a representation of this object as line in a GFF file.

Parameters:
  • source_field – name of the attribute to be used in the ‘source’ field of the GFF line
  • feature_field – name of the attribute to be used in the ‘feature’ field of the GFF line
  • score_field – name of the attribute to be used in the ‘score’ field of the GFF line
  • frame_field – name of the attribute to be used in the ‘frame’ field of the GFF line
  • float_format – Formatting string for the float fields
Returns:

str

attributes

List of all attributes in this region.

center

Return the center coordinate of the GenomicRegion.

Returns:float
contains(region)

Check if the specified region is completely contained in this region.

Parameters:regionGenomicRegion object or string
copy()

Return a (shallow) copy of this GenomicRegion

Returns:GenomicRegion
expand(absolute=None, relative=None, absolute_left=0, absolute_right=0, relative_left=0.0, relative_right=0.0, copy=True, from_center=False)

Expand this region by a relative or an absolute amount.

Parameters:
  • absolute – Absolute amount in base pairs by which to expand the region represented by this GenomicRegion object on both sides. New region start will be <old start - absolute>, new region end will be <old end + absolute>
  • relative – Relative amount as fraction of region by which to expand the region represented by this GenomicRegion object on both sides. New region start will be <old start - relative*len(self)>, new region end will be <old end + relative*(len(self)>
  • absolute_left – Absolute amount in base pairs by which to expand the region represented by this GenomicRegion object on the left side
  • absolute_right – Absolute amount in base pairs by which to expand the region represented by this GenomicRegion object on the right side
  • relative_left – Relative amount in base pairs by which to expand the region represented by this GenomicRegion object on the left side
  • relative_right – Relative amount in base pairs by which to expand the region represented by this GenomicRegion object on the right side
  • copy – If True, return a copy of the original region, if False will modify the existing region in place
  • from_center – If True measures distance from center rather than start and end of the old region
Returns:

GenomicRegion

five_prime

Return the position of the 5’ end of this GenomicRegion on the reference.

Returns:int
fix_chromosome(copy=False)

Change chromosome representation from chr<NN> to <NN> or vice versa.

Parameters:copy – If True, make copy of region, otherwise will modify existing region in place.
Returns:GenomicRegion
classmethod from_string(region_string)

Convert a string into a GenomicRegion.

This is a very useful convenience function to quickly define a GenomicRegion object from a descriptor string. Numbers can be abbreviated as ‘12k’, ‘1.5M’, etc.

Parameters:region_string – A string of the form <chromosome>[:<start>-<end>[:<strand>]] (with square brackets indicating optional parts of the string). If any optional part of the string is omitted, intuitive defaults will be chosen.
Returns:GenomicRegion
is_forward()

Return True if this region is on the forward strand of the reference genome.

Returns:True if on ‘+’ strand, False otherwise.
is_reverse()

Return True if this region is on the reverse strand of the reference genome.

Returns:True if on ‘-’ strand, False otherwise.
ix

Region index.

Location in underlying list of regions.

overlap(region)

Return the overlap in base pairs between this region and another region.

Parameters:regionGenomicRegion to find overlap for
Returns:overlap as int in base pairs
overlaps(region)

Check if this region overlaps with the specified region.

Parameters:regionGenomicRegion object or string
set_attribute(attribute, value)

Safely set an attribute on the GenomicRegion object.

This automatically decodes bytes objects into UTF-8 strings. If you do not care about this, you can also use region.<attribute> = <value> directly.

Parameters:
  • attribute – Name of the attribute to be set
  • value – Value of the attribute to be set
strand

Strand this region is located on as int.

Returns:1: forward strand, -1 reverse strand, 0 or None: unknown.
strand_string

Return the ‘strand’ attribute as string.

Returns:strand as str (‘+’, ‘-’, or ‘.’)
three_prime

Return the position of the 3’ end of this GenomicRegion on the reference.

Returns:int
to_string()

Convert this GenomicRegion to its string representation.

Returns:str
fanc.regions.genome_regions(re_or_file, restriction_enzyme=None)

Obtain RE fragments or bins of equal size from a reference genome.

Parameters:
  • re_or_file – Path to or RegionBased file with regions corresponding to restriction fragments, or path to FASTA file with chromosomes
  • restriction_enzyme – If the first argument is a FASTA file, you must provide the name of a restriction enzyme here for in silico genome digestion. You can also provide an integer, in which case the genome will be binned into regions of this size in bp.
Returns:

RegionsTable of restriction fragments