Pairs module

class fanc.pairs.BwaMemQualityFilter(cutoff=0.9, mask=None)

Bases: fanc.pairs.ReadFilter

Filters bwa mem generated alignments based on the alignment score (normalized by the length of the alignment).

valid_read(read)

Check if a read has a high alignment score.

class fanc.pairs.BwaMemUniquenessFilter(strict=False, mask=None)

Bases: fanc.pairs.ReadFilter

Filters bwa mem generated alignments based on whether they are unique or not.

valid_read(read)

Determine if a read is valid or should be filtered.

When implementing custom read filters this method must be overridden. It should return False for reads that are to be filtered and True otherwise.

Internally, the ReadPairs object will iterate over all Read instances to determine their validity on an individual basis.

Parameters:read – A Read object
Returns:True if Read is valid, False otherwise
class fanc.pairs.ContaminantFilter(contaminant_reads, mask=None)

Bases: fanc.pairs.ReadFilter

Filter reads that also map to a contaminant genome.

valid_read(read)

Check if a read also maps to a contaminant

class fanc.pairs.FourDNucleomePairGenerator(pairs_file)

Bases: fanc.pairs.TxtReadPairGenerator

Read pair generator that works on 4D Nucleome “.pairs” files.

For details on the 4D Nucleome pairs format see: https://github.com/4dn-dcic/pairix/blob/master/pairs_format_specification.md

add_filter(read_filter)

Add a ReadFilter to this object.

The filter will be applied during read pair generation.

Parameters:read_filterReadFilter
stats()

Return filter statistics collected during read pair generation.

The __iter__() filters reads based on the filters added using add_filter(). During filtering, it keeps track of the numbers of read pairs that were filtered. This function returns a dict of the form <filter_name>: <number of reads filtered out>.

Returns:dict
class fanc.pairs.FragmentRead(fragment=None, position=None, strand=0)

Bases: object

Class representing a fragment-mapped read.

fragment

A GenomicRegion delineated by restriction sites.

position

The position of this read in base-pairs (1-based) from the start of the chromosome it maps to.

strand

The strand this read maps to (-1 or +1).

re_distance()

Get the distance of the alignment to the nearest restriction site. :return: int

class fanc.pairs.FragmentReadPair(left_read, right_read, ix=None)

Bases: object

Container for two paired FragmentRead objects.

get_gap_size()

Get the gap size in base pairs between the fragments these reads map to.

Returns:0 if reads map to the same fragment or neighboring fragments, the distance between fragments if they are on the same chromosome, None otherwise
is_inward_pair()

Check if reads form an “inward-facing” pair.

A pair is considered inward-facing if the left read maps to the plus the right read to the minus strand and both reads map to the same chromosome.

Returns:True if reads are inward-facing, False otherwise
is_outward_pair()

Check if reads form an “outward-facing” pair.

A pair is considered outward-facing if the left read maps to the minus the right read to the plus strand and both reads map to the same chromosome.

Returns:True if reads are outward-facing, False otherwise
is_same_chromosome()

Check if both reads are mapped to the same chromosome.

Returns:True is reads map to the same chromosome, False otherwise
is_same_fragment()

Check if reads map to the same fragment.

Returns:True if reads map to the same fragment, False otherwise.
is_same_pair()

Check if reads face in the same direction.

Returns:True if reads are facing in the same direction, False otherwise.
class fanc.pairs.FragmentReadPairFilter(mask=None)

Bases: fanc.general.MaskFilter

Abstract class that provides filtering functionality for the FragmentReadPair object.

Extends MaskFilter and overrides valid(self, read) to make FragmentReadPair filtering more “natural”.

To create custom filters for the FragmentMappedReadPairs object, extend this class and override the valid_pair() method. valid_pair should return False for a specific FragmentReadPair object if the object is supposed to be filtered/masked and True otherwise. See InwardPairsFilter for an example.

Pass a custom filter to the filter method in FragmentMappedReadPairs to apply it.

valid(row)

Map validity check of rows to pairs.

class fanc.pairs.HicProPairGenerator(file_name)

Bases: fanc.pairs.TxtReadPairGenerator

Read pair generator for HiC-Pro “validPairs” files.

This generator is a subclass of TxtReadPairGenerator with presets for fields in HiC-Pro validPairs files.

add_filter(read_filter)

Add a ReadFilter to this object.

The filter will be applied during read pair generation.

Parameters:read_filterReadFilter
stats()

Return filter statistics collected during read pair generation.

The __iter__() filters reads based on the filters added using add_filter(). During filtering, it keeps track of the numbers of read pairs that were filtered. This function returns a dict of the form <filter_name>: <number of reads filtered out>.

Returns:dict
class fanc.pairs.InwardPairsFilter(minimum_distance=10000, mask=None)

Bases: fanc.pairs.FragmentReadPairFilter

Filter inward-facing read pairs at a distance less than a specified cutoff.

valid(row)

Map validity check of rows to pairs.

valid_pair(pair)

Check if a pair is inward-facing and <minimum_distance> apart.

class fanc.pairs.LazyFragment(row, pairs, ix=None, side='left')

Bases: genomic_regions.regions.GenomicRegion

GenomicRegion representing a fragment with lazy attribute loading.

chromosome

The reference sequence this region is located on

start

start position of this region on the reference (1-based, inclusive)

end

end position of this region on the reference (1-based, inclusive)

strand

strand of the reference this region is located on (-1, 1, 0, or None)

ix

Region index within the context of regions from the same object.

as_bed_line(score_field='score', name_field='name')

Return a representation of this object as line in a BED file.

Parameters:
  • score_field – name of the attribute to be used in the ‘score’ field of the BED line
  • name_field – name of the attribute to be used in the ‘name’ field of the BED line
Returns:

str

as_gff_line(source_field='source', feature_field='feature', score_field='score', frame_field='frame', float_format='.2e')

Return a representation of this object as line in a GFF file.

Parameters:
  • source_field – name of the attribute to be used in the ‘source’ field of the GFF line
  • feature_field – name of the attribute to be used in the ‘feature’ field of the GFF line
  • score_field – name of the attribute to be used in the ‘score’ field of the GFF line
  • frame_field – name of the attribute to be used in the ‘frame’ field of the GFF line
  • float_format – Formatting string for the float fields
Returns:

str

attributes

Return all visible attributes of this GenomicRegion.

Returns all attribute names that do not start with an underscore. :return: list of attribute names

center

Return the center coordinate of the GenomicRegion.

Returns:float
contains(region)

Check if the specified region is completely contained in this region.

Parameters:regionGenomicRegion object or string
copy()

Return a (shallow) copy of this GenomicRegion

Returns:GenomicRegion
expand(absolute=None, relative=None, absolute_left=0, absolute_right=0, relative_left=0.0, relative_right=0.0, copy=True, from_center=False)

Expand this region by a relative or an absolute amount.

Parameters:
  • absolute – Absolute amount in base pairs by which to expand the region represented by this GenomicRegion object on both sides. New region start will be <old start - absolute>, new region end will be <old end + absolute>
  • relative – Relative amount as fraction of region by which to expand the region represented by this GenomicRegion object on both sides. New region start will be <old start - relative*len(self)>, new region end will be <old end + relative*(len(self)>
  • absolute_left – Absolute amount in base pairs by which to expand the region represented by this GenomicRegion object on the left side
  • absolute_right – Absolute amount in base pairs by which to expand the region represented by this GenomicRegion object on the right side
  • relative_left – Relative amount in base pairs by which to expand the region represented by this GenomicRegion object on the left side
  • relative_right – Relative amount in base pairs by which to expand the region represented by this GenomicRegion object on the right side
  • copy – If True, return a copy of the original region, if False will modify the existing region in place
  • from_center – If True measures distance from center rather than start and end of the old region
Returns:

GenomicRegion

five_prime

Return the position of the 5’ end of this GenomicRegion on the reference.

Returns:int
fix_chromosome(copy=False)

Change chromosome representation from chr<NN> to <NN> or vice versa.

Parameters:copy – If True, make copy of region, otherwise will modify existing region in place.
Returns:GenomicRegion
classmethod from_string(region_string)

Convert a string into a GenomicRegion.

This is a very useful convenience function to quickly define a GenomicRegion object from a descriptor string. Numbers can be abbreviated as ‘12k’, ‘1.5M’, etc.

Parameters:region_string – A string of the form <chromosome>[:<start>-<end>[:<strand>]] (with square brackets indicating optional parts of the string). If any optional part of the string is omitted, intuitive defaults will be chosen.
Returns:GenomicRegion
is_forward()

Return True if this region is on the forward strand of the reference genome.

Returns:True if on ‘+’ strand, False otherwise.
is_reverse()

Return True if this region is on the reverse strand of the reference genome.

Returns:True if on ‘-‘ strand, False otherwise.
overlap(region)

Return the overlap in base pairs between this region and another region.

Parameters:regionGenomicRegion to find overlap for
Returns:overlap as int in base pairs
overlaps(region)

Check if this region overlaps with the specified region.

Parameters:regionGenomicRegion object or string
set_attribute(attribute, value)

Safely set an attribute on the GenomicRegion object.

This automatically decodes bytes objects into UTF-8 strings. If you do not care about this, you can also use region.<attribute> = <value> directly.

Parameters:
  • attribute – Name of the attribute to be set
  • value – Value of the attribute to be set
strand_string

Return the ‘strand’ attribute as string.

Returns:strand as str (‘+’, ‘-‘, or ‘.’)
three_prime

Return the position of the 3’ end of this GenomicRegion on the reference.

Returns:int
to_string()

Convert this GenomicRegion to its string representation.

Returns:str
class fanc.pairs.LazyFragmentRead(row, pairs, side='left')

Bases: fanc.pairs.FragmentRead

FragmentRead implementation with lazy attribute loading.

fragment

A GenomicRegion delineated by restriction sites.

position

The position of this read in base-pairs (1-based) from the start of the chromosome it maps to.

strand

The strand this read maps to (-1 or +1).

re_distance()

Get the distance of the alignment to the nearest restriction site. :return: int

class fanc.pairs.MinimalRead(chromosome, position, strand)

Bases: object

Minimal class representing an aligned read.

chromosome

Chromosome name

reference_name

Identical to chromosome, exists for compatibility with pysam

position

Position of aligned read on chromosome

pos

Identical to position, exists for compatibility with Pysam

strand

Strand of the alignment

flag

Flag representing the strandedness of a read. 0 if on forward strand, else -1

class fanc.pairs.Monitor(value=0, manager=None)

Bases: fanc.tools.general.WorkerMonitor

Class to monitor fragment info worker threads.

is_generating_pairs()

Check the pair generating status.

set_generating_pairs(value)

Set the pair generating status.

class fanc.pairs.OutwardPairsFilter(minimum_distance=10000, mask=None)

Bases: fanc.pairs.FragmentReadPairFilter

Filter outward-facing read pairs at a distance less than a specified cutoff.

valid(row)

Map validity check of rows to pairs.

class fanc.pairs.PCRDuplicateFilter(pairs, threshold=2, mask=None)

Bases: fanc.pairs.FragmentReadPairFilter

Masks alignments that are suspected to be PCR duplicates. In order to be considered duplicates, two pairs need to have identical start positions of their respective left alignments AND of their right alignments.

valid(row)

Map validity check of rows to pairs.

valid_pair(pair)

Check if a pair is duplicated.

class fanc.pairs.PairedSamBamReadPairGenerator(sam_file)

Bases: fanc.pairs.ReadPairGenerator

Generate read pairs from single SAM/BAM files from reads with identical qname.

Chimeric reads (mapping partially to multiple genomic locations), such as output by BWA, are handled as follows: If both mate pairs are chimeric, they are removed. If only one mate is chimeric, but it is split into more than 2 alignments, it is also removed. If one mate is chimeric and split into two alignments. If one part of the chimeric alignment maps within 100bp of the regular alignment, the read pair is kept and returned. In all other cases, the pair is removed.

add_filter(read_filter)

Add a ReadFilter to this object.

The filter will be applied during read pair generation.

Parameters:read_filterReadFilter
static resolve_chimeric(reads, max_dist_same_locus=500)
Returns:read1, read2, is_chimeric
stats()

Return filter statistics collected during read pair generation.

The __iter__() filters reads based on the filters added using add_filter(). During filtering, it keeps track of the numbers of read pairs that were filtered. This function returns a dict of the form <filter_name>: <number of reads filtered out>.

Returns:dict
class fanc.pairs.QualityFilter(cutoff=30, mask=None)

Bases: fanc.pairs.ReadFilter

Filter mapped reads based on mapping quality.

valid_read(read)

Check if a read has a mapq >= cutoff.

class fanc.pairs.ReDistanceFilter(maximum_distance=10000, mask=None)

Bases: fanc.pairs.FragmentReadPairFilter

Filters read pairs where the sum of distances of each read to its nearest restriction site is larger than a certain distance.

valid(row)

Map validity check of rows to pairs.

valid_pair(pair)

Check sum of restriction site distances < maximum_distance.

class fanc.pairs.ReadFilter(mask=None)

Bases: object

Abstract class that provides filtering functionality for MinimalRead, AlignedSegment or compatible.

To create a custom ReadFilter, extend this class and override the valid_read(self, read) method. valid_read should return False for a specific read object if the object is supposed to be filtered/masked and True otherwise. See QualityFilter for an example.

Pass a custom filter to the filter method in ReadPairGenerator to apply it.

valid_read(read)

Determine if a read is valid or should be filtered.

When implementing custom read filters this method must be overridden. It should return False for reads that are to be filtered and True otherwise.

Internally, the ReadPairs object will iterate over all Read instances to determine their validity on an individual basis.

Parameters:read – A Read object
Returns:True if Read is valid, False otherwise
class fanc.pairs.ReadPairGenerator

Bases: object

Base class for generating and filtering read pairs.

This class primarily provides filtering capabilities for read pairs generated by subclasses of ReadPairGenerator. You can add a ReadFilter using add_filter(), which will be used during read pair generation. Filtering statistics are collected during the run, and can be obtained via stats().

These generators are primarily meant as input for add_read_pairs(), but can also be used independently.

Subclasses of ReadPairGenerator must implement the _iter_read_pairs() function.

add_filter(read_filter)

Add a ReadFilter to this object.

The filter will be applied during read pair generation.

Parameters:read_filterReadFilter
stats()

Return filter statistics collected during read pair generation.

The __iter__() filters reads based on the filters added using add_filter(). During filtering, it keeps track of the numbers of read pairs that were filtered. This function returns a dict of the form <filter_name>: <number of reads filtered out>.

Returns:dict
class fanc.pairs.ReadPairs(file_name=None, mode='a', partition_strategy='auto', _group_name='fragment_map', _table_name_fragments='fragments', _table_name_pairs='pairs', tmpdir=None)

Bases: fanc.matrix.RegionPairsTable

Class representing a collection of read pairs mapped to restriction fragments.

This class is a RegionBased object, where each GenomicRegion represents a restriction fragment from a Hi-C experiment. A list of fragments can be obtained with the genome_regions() function, for example.

To create a ReadPairs object, you first have to add the restriction fragments before adding read pairs:

import fanc

re_fragments = fanc.genome_regions("hg19_chr18_19.fa", "HindIII")

rp = fanc.ReadPairs()
rp.add_regions(re_fragments.regions)

Read pairs can easily be generate fro different types of input using ReadPairGenerator implementations, e.g. HicProReadPairGenerator or SamBamReadPairGenerator.

rp_generator = fanc.SamBamReadPairGenerator("output/sam/SRR4271982_chr18_19_1_sort.bam",
                                            "output/sam/SRR4271982_chr18_19_2_sort.bam")
rp.add_read_pairs(rp_generator, threads=4)

You can query regions using the RegionBased interface:

chr1_fragments = rp.regions("chr1")

and you can iterate over read pairs using the pairs():

you can also use the :cls:Region

class ChromosomeDescription

Bases: tables.description.IsDescription

Description of the chromosomes in this object.

class MaskDescription

Bases: tables.description.IsDescription

class RegionDescription

Bases: tables.description.IsDescription

Description of a genomic region for PyTables Table

add_contact(contact, *args, **kwargs)

Alias for add_edge()

Parameters:
  • contactEdge
  • args – Positional arguments passed to _add_edge()
  • kwargs – Keyword arguments passed to _add_edge()
add_contacts(contacts, *args, **kwargs)

Alias for add_edges()

add_edge(edge, check_nodes_exist=True, *args, **kwargs)

Add an edge / contact between two regions to this object.

Parameters:
  • edgeEdge, dict with at least the attributes source and sink, optionally weight, or a list of length 2 (source, sink) or 3 (source, sink, weight).
  • check_nodes_exist – Make sure that there are nodes that match source and sink indexes
  • args – Positional arguments passed to _add_edge()
  • kwargs – Keyword arguments passed to _add_edge()
add_edge_from_dict(edge, *args, **kwargs)

Direct method to add an edge from dict input.

Parameters:edge – dict with at least the keys “source” and “sink”. Additional keys will be loaded as edge attributes
add_edge_from_edge(edge, *args, **kwargs)

Direct method to add an edge from Edge input.

Parameters:edgeEdge
add_edge_from_list(edge, *args, **kwargs)

Direct method to add an edge from list or tuple input.

Parameters:edge – List or tuple. Should be of length 2 (source, sink) or 3 (source, sink, weight)
add_edge_simple(source, sink, weight=None, *args, **kwargs)

Direct method to add an edge from Edge input.

Parameters:
  • source – Source region index
  • sink – Sink region index
  • weight – Weight of the edge
add_edges(edges, flush=True, *args, **kwargs)

Bulk-add edges from a list.

List items can be any of the supported edge types, list, tuple, dict, or Edge. Repeatedly calls add_edge(), so may be inefficient for large amounts of data.

Parameters:edges – List (or iterator) of edges. See add_edge() for details
add_mask_description(name, description)

Add a mask description to the _mask table and return its ID.

Parameters:
  • name (str) – name of the mask
  • description (str) – description of the mask
Returns:

id of the mask

Return type:

int

add_read_pairs(read_pairs, batch_size=1000000, threads=1)

Add read pairs to this object.

This function requires tuples of read pairs as input, for example from MinimalRead or AlignedSegment. Typically, you won’t have to construct these from scratch, but can use one of the ReadPairGenerator classes to generate read pairs from input files.

import fanc

re_fragments = fanc.genome_regions("hg19_chr18_19.fa", "HindIII")
rp = fanc.ReadPairs()
rp.add_regions(re_fragments.regions)

# read pairs are added here from BAM files
rp_generator = fanc.SamBamReadPairGenerator("output/sam/SRR4271982_chr18_19_1_sort.bam",
                                            "output/sam/SRR4271982_chr18_19_2_sort.bam")
rp.add_read_pairs(rp_generator, threads=4)
Parameters:
  • read_pairs – iterator over tuples of read pairs. Typically instances of ReadPairGenerator
  • batch_size – Batch size of read pairs sent to fragment info workers
  • threads – Number of threads for simultaneous fragment info finding
add_region(region, *args, **kwargs)

Add a genomic region to this object.

This method offers some flexibility in the types of objects that can be loaded. See parameters for details.

Parameters:region – Can be a GenomicRegion, a str in the form ‘<chromosome>:<start>-<end>[:<strand>], a dict with at least the fields ‘chromosome’, ‘start’, and ‘end’, optionally ‘ix’, or a list of length 3 (chromosome, start, end) or 4 (ix, chromosome, start, end).
add_regions(regions, *args, **kwargs)

Bulk insert multiple genomic regions.

Parameters:regions – List (or any iterator) with objects that describe a genomic region. See add_region for options.
static bin_intervals(intervals, bins, interval_range=None, smoothing_window=None, nan_replacement=None, zero_to_nan=False)

Bin a given set of intervals into a fixed number of bins.

Parameters:
  • intervals – iterator of tuples (start, end, score)
  • bins – Number of bins to divide the region into
  • interval_range – Optional. Tuple (start, end) in base pairs of range of interval to be binned. Useful if intervals argument does not cover to exact genomic range to be binned.
  • smoothing_window – Size of window (in bins) to smooth scores over
  • nan_replacement – NaN values in the scores will be replaced with this value
  • zero_to_nan – If True, will convert bins with score 0 to NaN
Returns:

iterator of tuples: (start, end, score)

static bin_intervals_equidistant(intervals, bin_size, interval_range=None, smoothing_window=None, nan_replacement=None, zero_to_nan=False)

Bin a given set of intervals into bins with a fixed size.

Parameters:
  • intervals – iterator of tuples (start, end, score)
  • bin_size – Size of each bin in base pairs
  • interval_range – Optional. Tuple (start, end) in base pairs of range of interval to be binned. Useful if intervals argument does not cover to exact genomic range to be binned.
  • smoothing_window – Size of window (in bins) to smooth scores over
  • nan_replacement – NaN values in the scores will be replaced with this value
  • zero_to_nan – If True, will convert bins with score 0 to NaN
Returns:

iterator of tuples: (start, end, score)

bin_size

Return the length of the first region in the dataset.

Assumes all bins have equal size.

Returns:int
binned_regions(region=None, bins=None, bin_size=None, smoothing_window=None, nan_replacement=None, zero_to_nan=False, *args, **kwargs)

Same as region_intervals, but returns GenomicRegion objects instead of tuples.

Parameters:
  • region – String or class:~GenomicRegion object denoting the region to be binned
  • bins – Number of bins to divide the region into
  • bin_size – Size of each bin (alternative to bins argument)
  • smoothing_window – Size of window (in bins) to smooth scores over
  • nan_replacement – NaN values in the scores will be replaced with this value
  • zero_to_nan – If True, will convert bins with score 0 to NaN
  • args – Arguments passed to _region_intervals
  • kwargs – Keyword arguments passed to _region_intervals
Returns:

iterator of GenomicRegion objects

bins_to_distance(bins)

Convert fraction of bins to base pairs

Parameters:bins – float, fraction of bins
Returns:int, base pairs
chromosome_bins

Returns a dictionary of chromosomes and the start and end index of the bins they cover.

Returned list is range-compatible, i.e. chromosome bins [0,5] cover chromosomes 1, 2, 3, and 4, not 5.

chromosome_lengths

Returns a dictionary of chromosomes and their length in bp.

chromosomes()

List all chromosomes in this regions table. :return: list of chromosome names.

close(copy_tmp=True, remove_tmp=True)

Close this HDF5 file and run exit operations.

If file was opened with tmpdir in read-only mode: close file and delete temporary copy.

If file was opened with tmpdir in write or append mode: Replace original file with copy and delete copy.

Parameters:
  • copy_tmp – If False, does not overwrite original with modified file.
  • remove_tmp – If False, does not delete temporary copy of file.
distance_to_bins(distance)

Convert base pairs to fraction of bins.

Parameters:distance – distance in base pairs
Returns:float, distance as fraction of bin size
downsample(n, file_name=None)

Sample edges from this object.

Sampling is always done on uncorrected Hi-C matrices.

Parameters:
  • n – Sample size or reference object. If n < 1 will be interpreted as a fraction of total reads in this object.
  • file_name – Output file name for down-sampled object.
Returns:

RegionPairsTable

edge_data(attribute, *args, **kwargs)

Iterate over specific edge attribute.

Parameters:
  • attribute – Name of the attribute, e.g. “weight”
  • args – Positional arguments passed to edges()
  • kwargs – Keyword arguments passed to edges()
Returns:

iterator over edge attribute

edge_subset(key=None, *args, **kwargs)

Get a subset of edges.

This is an alias for edges().

Returns:generator (Edge)
edges

Iterate over contacts / edges.

edges() is the central function of RegionPairsContainer. Here, we will use the Hic implementation for demonstration purposes, but the usage is exactly the same for all compatible objects implementing RegionPairsContainer, including JuicerHic and CoolerHic.

import fanc

# file from FAN-C examples
hic = fanc.load("output/hic/binned/fanc_example_1mb.hic")

We can easily find the number of edges in the sample Hic object:

len(hic.edges)  # 8695

When used in an iterator context, edges() iterates over all edges in the RegionPairsContainer:

for edge in hic.edges:
    # do something with edge
    print(edge)
    # 42--42; bias: 5.797788472650082e-05; sink_node: chr18:42000001-43000000; source_node: chr18:42000001-43000000; weight: 0.12291311562018173
    # 24--28; bias: 6.496381719803623e-05; sink_node: chr18:28000001-29000000; source_node: chr18:24000001-25000000; weight: 0.025205961072838057
    # 5--76; bias: 0.00010230955745211447; sink_node: chr18:76000001-77000000; source_node: chr18:5000001-6000000; weight: 0.00961709840049876
    # 66--68; bias: 8.248432587969082e-05; sink_node: chr18:68000001-69000000; source_node: chr18:66000001-67000000; weight: 0.03876763316345468
    # ...

Calling edges() as a method has the same effect:

# note the '()'
for edge in hic.edges():
    # do something with edge
    print(edge)
    # 42--42; bias: 5.797788472650082e-05; sink_node: chr18:42000001-43000000; source_node: chr18:42000001-43000000; weight: 0.12291311562018173
    # 24--28; bias: 6.496381719803623e-05; sink_node: chr18:28000001-29000000; source_node: chr18:24000001-25000000; weight: 0.025205961072838057
    # 5--76; bias: 0.00010230955745211447; sink_node: chr18:76000001-77000000; source_node: chr18:5000001-6000000; weight: 0.00961709840049876
    # 66--68; bias: 8.248432587969082e-05; sink_node: chr18:68000001-69000000; source_node: chr18:66000001-67000000; weight: 0.03876763316345468
    # ...

Rather than iterate over all edges in the object, we can select only a subset. If the key is a string or a GenomicRegion, all non-zero edges connecting the region described by the key to any other region are returned. If the key is a tuple of strings or GenomicRegion, only edges between the two regions are returned.

# select all edges between chromosome 19
# and any other region:
for edge in hic.edges("chr19"):
    print(edge)
    # 49--106; bias: 0.00026372303696871666; sink_node: chr19:27000001-28000000; source_node: chr18:49000001-50000000; weight: 0.003692122517562033
    # 6--82; bias: 0.00021923129703834945; sink_node: chr19:3000001-4000000; source_node: chr18:6000001-7000000; weight: 0.0008769251881533978
    # 47--107; bias: 0.00012820949175399097; sink_node: chr19:28000001-29000000; source_node: chr18:47000001-48000000; weight: 0.0015385139010478917
    # 38--112; bias: 0.0001493344481069762; sink_node: chr19:33000001-34000000; source_node: chr18:38000001-39000000; weight: 0.0005973377924279048
    # ...

# select all edges that are only on
# chromosome 19
for edge in hic.edges(('chr19', 'chr19')):
    print(edge)
    # 90--116; bias: 0.00021173151730025176; sink_node: chr19:37000001-38000000; source_node: chr19:11000001-12000000; weight: 0.009104455243910825
    # 135--135; bias: 0.00018003890596887822; sink_node: chr19:56000001-57000000; source_node: chr19:56000001-57000000; weight: 0.10028167062466517
    # 123--123; bias: 0.00011063368998965993; sink_node: chr19:44000001-45000000; source_node: chr19:44000001-45000000; weight: 0.1386240135570439
    # 92--93; bias: 0.00040851066434864896; sink_node: chr19:14000001-15000000; source_node: chr19:13000001-14000000; weight: 0.10090213409411629
    # ...

# select inter-chromosomal edges
# between chromosomes 18 and 19
for edge in hic.edges(('chr18', 'chr19')):
    print(edge)
    # 49--106; bias: 0.00026372303696871666; sink_node: chr19:27000001-28000000; source_node: chr18:49000001-50000000; weight: 0.003692122517562033
    # 6--82; bias: 0.00021923129703834945; sink_node: chr19:3000001-4000000; source_node: chr18:6000001-7000000; weight: 0.0008769251881533978
    # 47--107; bias: 0.00012820949175399097; sink_node: chr19:28000001-29000000; source_node: chr18:47000001-48000000; weight: 0.0015385139010478917
    # 38--112; bias: 0.0001493344481069762; sink_node: chr19:33000001-34000000; source_node: chr18:38000001-39000000; weight: 0.0005973377924279048
    # ...

By default, edges() will retrieve all edge attributes, which can be slow when iterating over a lot of edges. This is why all file-based FAN-C RegionPairsContainer objects support lazy loading, where attributes are only read on demand.

for edge in hic.edges('chr18', lazy=True):
    print(edge.source, edge.sink, edge.weight, edge)
    # 42 42 0.12291311562018173 <fanc.matrix.LazyEdge for row /edges/chrpair_0_0.row (Row), pointing to row #0>
    # 24 28 0.025205961072838057 <fanc.matrix.LazyEdge for row /edges/chrpair_0_0.row (Row), pointing to row #1>
    # 5 76 0.00961709840049876 <fanc.matrix.LazyEdge for row /edges/chrpair_0_0.row (Row), pointing to row #2>
    # 66 68 0.03876763316345468 <fanc.matrix.LazyEdge for row /edges/chrpair_0_0.row (Row), pointing to row #3>
    # ...

Warning

The lazy iterator reuses the LazyEdge object in every iteration, and overwrites the LazyEdge attributes. Therefore do not use lazy iterators if you need to store edge objects for later access. For example, the following code works as expected list(hic.edges()), with all Edge objects stored in the list, while this code list(hic.edges(lazy=True)) will result in a list of identical LazyEdge objects. Always ensure you do all edge processing in the loop when working with lazy iterators!

When working with normalised contact frequencies, such as obtained through matrix balancing in the example above, edges() automatically returns normalised edge weights. In addition, the bias attribute will (typically) have a value different from 1.

When you are interested in the raw contact frequency, use the norm=False parameter:

for edge in hic.edges('chr18', lazy=True, norm=False):
    print(edge.source, edge.sink, edge.weight)
    # 42 42 2120.0
    # 24 28 388.0
    # 5 76 94.0
    # 66 68 470.0
    # ...

You can also choose to omit all intra- or inter-chromosomal edges using intra_chromosomal=False or inter_chromosomal=False, respectively.

Returns:Iterator over Edge or equivalent.
edges_dict(*args, **kwargs)

Edges iterator with access by bracket notation.

This iterator always returns unnormalised edges.

Returns:dict or dict-like iterator
filter(pair_filter, queue=False, log_progress=True)

Apply a FragmentReadPairFilter to the read pairs in this object.

Parameters:
  • pair_filterFragmentReadPairFilter
  • queue – If True, does not do the filtering immediately, but queues this filter. All queued filters can then be run at the same time using run_queued_filters()
  • log_progress
Returns:

filter_inward(minimum_distance=None, queue=False, **kwargs)

Convenience function that applies an InwardPairsFilter.

Parameters:
  • minimum_distance – Minimum distance inward-facing read pairs must have to pass this filter
  • queue – If True, filter will be queued and can be executed along with other queued filters using run_queued_filters
  • kwargs – Additional arguments to pass to get_ligation_structure_biases()
filter_ligation_products(inward_threshold=None, outward_threshold=None, queue=False, **kwargs)

Convenience function that applies an OutwardPairsFilter and an InwardPairsFilter.

Parameters:
  • inward_threshold – Minimum distance inward-facing read pairs must have to pass this filter. If None, will be inferred from the data
  • outward_threshold – Minimum distance outward-facing read pairs must have to pass this filter. If None, will be inferred from the data
  • queue – If True, filter will be queued and can be executed along with other queued filters using run_queued_filters
  • kwargs – Additional arguments to pass to get_ligation_structure_biases()
filter_outward(minimum_distance=None, queue=False, **kwargs)

Convenience function that applies an OutwardPairsFilter.

Parameters:
  • minimum_distance – Minimum distance outward-facing read pairs must have to pass this filter
  • queue – If True, filter will be queued and can be executed along with other queued filters using run_queued_filters
  • kwargs – Additional arguments to pass to get_ligation_structure_biases()
filter_pcr_duplicates(threshold=3, queue=False)

Convenience function that applies an PCRDuplicateFilter.

Parameters:
  • threshold – If distance between two alignments is smaller or equal the threshold, the alignments are considered to be starting at the same position
  • queue – If True, filter will be queued and can be executed along with other queued filters using run_queued_filters
filter_re_dist(maximum_distance, queue=False)

Convenience function that applies an ReDistanceFilter.

Parameters:
  • maximum_distance – Maximum distance a read can have to the nearest region border (=restriction site)
  • queue – If True, filter will be queued and can be executed along with other queued filters using run_queued_filters
filter_self_ligated(queue=False)

Convenience function that applies an SelfLigationFilter.

Parameters:queue – If True, filter will be queued and can be executed along with other queued filters using run_queued_filters
filter_statistics()

Get filtering statistics for this object. :return: dict with {filter_type: count, …}

find_region(query_regions, _regions_dict=None, _region_ends=None, _chromosomes=None)

Find the region that is at the center of a region.

Parameters:query_regions – Region selector string, :class:~GenomicRegion, or list of the former
Returns:index (or list of indexes) of the region at the center of the query region
flush(silent=False)

Write buffered data to file and update indexes,

Parameters:silent – If True, does not use progressbars.
get_edge(item, *row_conversion_args, **row_conversion_kwargs)

Get an edge by index.

Parameters:
  • row_conversion_args – Arguments passed to RegionPairs._row_to_edge()
  • row_conversion_kwargs – Keyword arguments passed to RegionPairs._row_to_edge()
Returns:

Edge

get_ligation_structure_biases(sampling=None, skip_self_ligations=True, **kwargs)

Compute the ligation biases (inward and outward to same-strand) of this data set.

Parameters:
  • sampling – Approximate number of data points to average per point in the plot. If None (default), this will be determined on a best-guess basis.
  • skip_self_ligations – If True (default), will not consider self-ligated fragments for assessing the error rates.
  • unfiltered – If True, uses all read pairs, even those that do not pass filters, for the ligation error calculation
Returns:

tuple with (list of gap sizes between reads, list of matching le type ratios)

get_mask(key)

Search _mask table for key and return Mask.

Parameters:
  • key (int) – search by mask name
  • key – search by mask ID
Returns:

Mask

get_masks(ix)

Extract mask IDs encoded in parameter and return masks.

IDs are powers of 2, so a single int field in the table can hold multiple masks by simply adding up the IDs. Similar principle to UNIX chmod (although that uses base 8)

Parameters:ix (int) – integer that is the sum of powers of 2. Note that this value is not necessarily itself a power of 2.
Returns:list of Masks extracted from ix
Return type:list (Mask)
intervals(*args, **kwargs)

Alias for region_intervals.

mappable(region=None)

Get the mappability of regions in this object.

A “mappable” region has at least one contact to another region in the genome.

Returns:array where True means mappable and False unmappable
classmethod merge(pairs, *args, **kwargs)

Merge two or more RegionPairsTable objects.

Parameters:pairs – list of RegionPairsTable
Returns:merged RegionPairsTable
pairs(key=None, lazy=False, *args, **kwargs)

Iterate over the FragmentReadPair objects.

Parameters:
  • key – Region string of the form <chromosome>[:<start>-<end>], GenomicRegion or tuples thereof
  • lazy – If True, use lazy loading of objects and their attributes. Much faster, but can lead to unexpected results if one is not careful. For example, this: list(object.pairs()) is not the same as list(object.pairs(lazy=True))! In the latter case, all objects in the list will be identical due to the lazy loading process. Only use lazy loading to access attributes in an iterator!
  • args – Positional arguments passed to edges_dict()
  • kwargs – Keyword arguments passed to edges_dict()
Returns:

pairs_by_chromosomes(chromosome1, chromosome2, **kwargs)

Only iterate over read pairs in this combination of chromosomes. :param chromosome1: Name of first chromosome :param chromosome2: Name of second chromosome :param kwargs: Keyword arguments passed to pairs() :return:

region_bins(*args, **kwargs)

Return slice of start and end indices spanned by a region.

Parameters:args – provide a GenomicRegion here to get the slice of start and end bins of onlythis region. To get the slice over all regions leave this blank.
Returns:
region_data(key, value=None)

Retrieve or add vector-data to this object. If there is existing data in this object with the same name, it will be replaced

Parameters:
  • key – Name of the data column
  • value – vector with region-based data (one entry per region)
region_intervals(region, bins=None, bin_size=None, smoothing_window=None, nan_replacement=None, zero_to_nan=False, score_field='score', *args, **kwargs)

Return equally-sized genomic intervals and associated scores.

Use either bins or bin_size argument to control binning.

Parameters:
  • region – String or class:~GenomicRegion object denoting the region to be binned
  • bins – Number of bins to divide the region into
  • bin_size – Size of each bin (alternative to bins argument)
  • smoothing_window – Size of window (in bins) to smooth scores over
  • nan_replacement – NaN values in the scores will be replaced with this value
  • zero_to_nan – If True, will convert bins with score 0 to NaN
  • args – Arguments passed to _region_intervals
  • kwargs – Keyword arguments passed to _region_intervals
Returns:

iterator of tuples: (start, end, score)

region_subset(region, *args, **kwargs)

Takes a class:~GenomicRegion and returns all regions that overlap with the supplied region.

Parameters:region – String or class:~GenomicRegion object for which covered bins will be returned.
regions

Iterate over genomic regions in this object.

Will return a GenomicRegion object in every iteration. Can also be used to get the number of regions by calling len() on the object returned by this method.

Returns:RegionIter
regions_and_edges(key, *args, **kwargs)

Convenient access to regions and edges selected by key.

Parameters:
  • key – Edge selector, see edges()
  • args – Positional arguments passed to edges()
  • kwargs – Keyword arguments passed to edges()
Returns:

list of row regions, list of col regions, iterator over edges

regions_dict

Return a dictionary with region index as keys and regions as values.

Returns:dict {region.ix: region, …}
static regions_identical(pairs)

Check if the regions in all objects in the list are identical.

Parameters:pairslist of RegionBased objects
Returns:True if chromosome, start, and end are identical between all regions in the same list positions.
run_queued_filters(log_progress=True)

Run queued filters. See filter()

Parameters:log_progress – If true, process iterating through all edges will be continuously reported.
subset(*regions, **kwargs)

Subset a Hic object by specifying one or more subset regions.

Parameters:
  • regions – string or GenomicRegion object(s)
  • kwargs – Supports file_name: destination file name of subset Hic object; tmpdir: if True works in tmp until object is closed additional parameters are passed to edges()
Returns:

Hic

to_bed(file_name, subset=None, **kwargs)

Export regions as BED file

Parameters:
  • file_name – Path of file to write regions to
  • subset – optional GenomicRegion or str to write only regions overlapping this region
  • kwargs – Passed to write_bed()
to_bigwig(file_name, subset=None, **kwargs)

Export regions as BigWig file.

Parameters:
  • file_name – Path of file to write regions to
  • subset – optional GenomicRegion or str to write only regions overlapping this region
  • kwargs – Passed to write_bigwig()
to_gff(file_name, subset=None, **kwargs)

Export regions as GFF file

Parameters:
  • file_name – Path of file to write regions to
  • subset – optional GenomicRegion or str to write only regions overlapping this region
  • kwargs – Passed to write_gff()
to_hic(file_name=None, tmpdir=None, _hic_class=<class 'fanc.hic.Hic'>)

Convert this ReadPairs to a Hic object.

Parameters:
  • file_name – Path to the Hic output file
  • tmpdir – If True (or path to temporary directory) will work in temporary directory until closed
class fanc.pairs.SamBamReadPairGenerator(sam_file1, sam_file2, check_sorted=True)

Bases: fanc.pairs.ReadPairGenerator

Generate read pairs from paired-end SAM/BAM files.

This ReadPairGenerator iterates over two SAM or BAM files that have been sorted by qname (for example with samtools sort -n).

Chimeric reads (mapping partially to multiple genomic locations), such as output by BWA, are handled as follows: If both mate pairs are chimeric, they are removed. If only one mate is chimeric, but it is split into more than 2 alignments, it is also removed. If one mate is chimeric and split into two alignments. If one part of the chimeric alignment maps within 100bp of the regular alignment, the read pair is kept and returned. In all other cases, the pair is removed.

add_filter(read_filter)

Add a ReadFilter to this object.

The filter will be applied during read pair generation.

Parameters:read_filterReadFilter
filter_multi_mapping(strict=True)

Convenience function that registers a UniquenessFilter. The actual algorithm and rationale used for filtering will depend on the internal _mapper attribute.

Parameters:strict – If True will filter if XS tag is present. If False, will filter only when XS tag is not 0. This is applied if alignments are from bowtie2.
filter_quality(cutoff=30)

Convenience function that registers a QualityFilter. The actual algorithm and rationale used for filtering will depend on the internal _mapper attribute.

Parameters:cutoff – Minimum mapping quality (mapq) a read must have to pass the filter
filter_unmapped()

Convenience function that registers an UnmappedFilter.

stats()

Return filter statistics collected during read pair generation.

The __iter__() filters reads based on the filters added using add_filter(). During filtering, it keeps track of the numbers of read pairs that were filtered. This function returns a dict of the form <filter_name>: <number of reads filtered out>.

Returns:dict
class fanc.pairs.SelfLigationFilter(mask=None)

Bases: fanc.pairs.FragmentReadPairFilter

Filters read pairs where one or both reads map to the same restriction fragment.

valid(row)

Map validity check of rows to pairs.

valid_pair(pair)

Check if both reads are on the same fragment.

class fanc.pairs.TxtReadPairGenerator(valid_pairs_file, sep=None, chr1_field=1, pos1_field=2, strand1_field=3, chr2_field=4, pos2_field=5, strand2_field=6)

Bases: fanc.pairs.ReadPairGenerator

Generate read pairs from a plain text file.

This is an implementation of ReadPairGenerator for reading read pairs from an arbitrary text file. For specific text file formats have a look at FourDNucleomePairGenerator or HicProPairGenerator.

TxtReadPairGenerator iterates over lines in a file and splits them into fields at “sep”. It then extracts the chromosome, position and strand for each read in the pair, according to the fields specified by “chr<1|2>_field”, “pos<1|2>_field”, and “strand<1|2>_field”. If your file does not have strand fields, or if you don’t want to load them, you can simply set them to “None”.

add_filter(read_filter)

Add a ReadFilter to this object.

The filter will be applied during read pair generation.

Parameters:read_filterReadFilter
stats()

Return filter statistics collected during read pair generation.

The __iter__() filters reads based on the filters added using add_filter(). During filtering, it keeps track of the numbers of read pairs that were filtered. This function returns a dict of the form <filter_name>: <number of reads filtered out>.

Returns:dict
class fanc.pairs.UniquenessFilter(strict=True, mask=None)

Bases: fanc.pairs.ReadFilter

Filter reads that do not map uniquely to the reference sequence.

valid_read(read)

Check if a read has an XS tag.

If strict is enabled checks if a read has an XS tag. If not strict, XS has to be smaller than AS (alignment score) for a valid read.

class fanc.pairs.UnmappedFilter(mask=None)

Bases: fanc.pairs.ReadFilter

Filter reads that do not map to the reference sequence.

valid_read(read)

Check if the the flag bit 4 is set.

fanc.pairs.generate_pairs(sam1_file, sam2_file, regions, restriction_enzyme=None, output_file=None, read_filters=(), check_sorted=True, threads=1, batch_size=10000000)

Generate Pairs object from SAM/BAM files.

This is a convenience function that let’s you create a Pairs object from SAM/BAM data in a single step.

Parameters:
  • sam1_file – Path to a sorted SAM/BAM file (1st mate)
  • sam2_file – Path to a sorted SAM/BAM file (2nd mate)
  • regions – Path to file with restriction fragments (BED, GFF) or FASTA with genomic sequence
  • restriction_enzyme – Name of restriction enzyme (only when providing FASTA as regions)
  • read_filters – List of ReadFilter to filter reads while loading from SAM/BAM
  • output_file – Path to output file
  • check_sorted – Double-check that input SAM files are sorted if True (default)
  • threads – Number of threads used for finding restriction fragments for read pairs
  • batch_size – Number of read pairs sent to each restriction fragment worker
Returns:

ReadPairs

fanc.pairs.generate_pairs_split(sam1_file, sam2_file, regions, restriction_enzyme=None, output_file=None, read_filters=(), check_sorted=True, threads=1, batch_size=1000000)

Generate Pairs object from SAM/BAM files.

This is a convenience function that let’s you create a Pairs object from SAM/BAM data in a single step.

Parameters:
  • sam1_file – Path to a sorted SAM/BAM file (1st mate)
  • sam2_file – Path to a sorted SAM/BAM file (2nd mate)
  • regions – Path to file with restriction fragments (BED, GFF) or FASTA with genomic sequence
  • restriction_enzyme – Name of restriction enzyme (only when providing FASTA as regions)
  • read_filters – List of ReadFilter to filter reads while loading from SAM/BAM
  • output_file – Path to output file
  • check_sorted – Double-check that input SAM files are sorted if True (default)
  • threads – Number of threads used for finding restriction fragments for read pairs
  • batch_size – Number of read pairs sent to each restriction fragment worker
Returns:

ReadPairs