Pairs module¶

class fanc.pairs.BwaMemQualityFilter(cutoff=0.9, mask=None)¶

Bases: fanc.pairs.ReadFilter

Filters bwa mem generated alignments based on the alignment score (normalized by the length of the alignment).

valid_read(read)¶: Check if a read has a high alignment score.

class fanc.pairs.BwaMemUniquenessFilter(strict=False, mask=None)¶

Bases: fanc.pairs.ReadFilter

Filters bwa mem generated alignments based on whether they are unique or not.

valid_read(read)¶

Determine if a read is valid or should be filtered.

When implementing custom read filters this method must be overridden. It should return False for reads that are to be filtered and True otherwise.

Internally, the ReadPairs object will iterate over all Read instances to determine their validity on an individual basis.

Parameters:	read – A `Read` object
Returns:	True if Read is valid, False otherwise

class fanc.pairs.ContaminantFilter(contaminant_reads, mask=None)¶

Bases: fanc.pairs.ReadFilter

Filter reads that also map to a contaminant genome.

valid_read(read)¶: Check if a read also maps to a contaminant

class fanc.pairs.FourDNucleomePairGenerator(pairs_file)¶

Bases: fanc.pairs.TxtReadPairGenerator

Read pair generator that works on 4D Nucleome “.pairs” files.

For details on the 4D Nucleome pairs format see: https://github.com/4dn-dcic/pairix/blob/master/pairs_format_specification.md

add_filter(read_filter)¶

Add a ReadFilter to this object.

The filter will be applied during read pair generation.

Parameters:	read_filter – `ReadFilter`

stats()¶

Return filter statistics collected during read pair generation.

The __iter__() filters reads based on the filters added using add_filter(). During filtering, it keeps track of the numbers of read pairs that were filtered. This function returns a dict of the form <filter_name>: <number of reads filtered out>.

Returns:	dict

class fanc.pairs.FragmentRead(fragment=None, position=None, strand=0)¶

Bases: object

Class representing a fragment-mapped read.

fragment¶: A GenomicRegion delineated by restriction sites.

position¶: The position of this read in base-pairs (1-based) from the start of the chromosome it maps to.

strand¶: The strand this read maps to (-1 or +1).

re_distance()¶: Get the distance of the alignment to the nearest restriction site. :return: int

class fanc.pairs.FragmentReadPair(left_read, right_read, ix=None)¶

Bases: object

Container for two paired FragmentRead objects.

get_gap_size()¶

Get the gap size in base pairs between the fragments these reads map to.

Returns:	0 if reads map to the same fragment or neighboring fragments, the distance between fragments if they are on the same chromosome, None otherwise

is_inward_pair()¶

Check if reads form an “inward-facing” pair.

A pair is considered inward-facing if the left read maps to the plus the right read to the minus strand and both reads map to the same chromosome.

Returns:	True if reads are inward-facing, False otherwise

is_outward_pair()¶

Check if reads form an “outward-facing” pair.

A pair is considered outward-facing if the left read maps to the minus the right read to the plus strand and both reads map to the same chromosome.

Returns:	True if reads are outward-facing, False otherwise

is_same_chromosome()¶

Check if both reads are mapped to the same chromosome.

Returns:	True is reads map to the same chromosome, False otherwise

is_same_fragment()¶

Check if reads map to the same fragment.

Returns:	True if reads map to the same fragment, False otherwise.

is_same_pair()¶

Check if reads face in the same direction.

Returns:	True if reads are facing in the same direction, False otherwise.

class fanc.pairs.FragmentReadPairFilter(mask=None)¶

Bases: fanc.general.MaskFilter

Abstract class that provides filtering functionality for the FragmentReadPair object.

Extends MaskFilter and overrides valid(self, read) to make FragmentReadPair filtering more “natural”.

To create custom filters for the FragmentMappedReadPairs object, extend this class and override the valid_pair() method. valid_pair should return False for a specific FragmentReadPair object if the object is supposed to be filtered/masked and True otherwise. See InwardPairsFilter for an example.

Pass a custom filter to the filter method in FragmentMappedReadPairs to apply it.

valid(row)¶: Map validity check of rows to pairs.

class fanc.pairs.HicProPairGenerator(file_name)¶

Bases: fanc.pairs.TxtReadPairGenerator

Read pair generator for HiC-Pro “validPairs” files.

This generator is a subclass of TxtReadPairGenerator with presets for fields in HiC-Pro validPairs files.

add_filter(read_filter)¶

Add a ReadFilter to this object.

The filter will be applied during read pair generation.

Parameters:	read_filter – `ReadFilter`

stats()¶

Return filter statistics collected during read pair generation.

The __iter__() filters reads based on the filters added using add_filter(). During filtering, it keeps track of the numbers of read pairs that were filtered. This function returns a dict of the form <filter_name>: <number of reads filtered out>.

Returns:	dict

class fanc.pairs.InwardPairsFilter(minimum_distance=10000, mask=None)¶

Bases: fanc.pairs.FragmentReadPairFilter

Filter inward-facing read pairs at a distance less than a specified cutoff.

valid(row)¶: Map validity check of rows to pairs.

valid_pair(pair)¶: Check if a pair is inward-facing and <minimum_distance> apart.

class fanc.pairs.LazyFragment(row, pairs, ix=None, side='left')¶

Bases: genomic_regions.regions.GenomicRegion

GenomicRegion representing a fragment with lazy attribute loading.

chromosome¶: The reference sequence this region is located on

start¶: start position of this region on the reference (1-based, inclusive)

end¶: end position of this region on the reference (1-based, inclusive)

strand¶: strand of the reference this region is located on (-1, 1, 0, or None)

ix¶: Region index within the context of regions from the same object.

as_bed_line(score_field='score', name_field='name')¶

Return a representation of this object as line in a BED file.

Parameters:	score_field – name of the attribute to be used in the ‘score’ field of the BED line name_field – name of the attribute to be used in the ‘name’ field of the BED line
Returns:	str

as_gff_line(source_field='source', feature_field='feature', score_field='score', frame_field='frame', float_format='.2e')¶

Return a representation of this object as line in a GFF file.

Parameters:

source_field – name of the attribute to be used in the ‘source’ field of the GFF line
feature_field – name of the attribute to be used in the ‘feature’ field of the GFF line
score_field – name of the attribute to be used in the ‘score’ field of the GFF line
frame_field – name of the attribute to be used in the ‘frame’ field of the GFF line
float_format – Formatting string for the float fields

Returns:

str

attributes¶

Return all visible attributes of this GenomicRegion.

Returns all attribute names that do not start with an underscore. :return: list of attribute names

center¶

Return the center coordinate of the GenomicRegion.

Returns:	float

contains(region)¶

Check if the specified region is completely contained in this region.

Parameters:	region – `GenomicRegion` object or string

copy()¶

Return a (shallow) copy of this GenomicRegion

Returns:	`GenomicRegion`

expand(absolute=None, relative=None, absolute_left=0, absolute_right=0, relative_left=0.0, relative_right=0.0, copy=True, from_center=False)¶

Expand this region by a relative or an absolute amount.

Parameters:

absolute – Absolute amount in base pairs by which to expand the region represented by this GenomicRegion object on both sides. New region start will be <old start - absolute>, new region end will be <old end + absolute>
relative – Relative amount as fraction of region by which to expand the region represented by this GenomicRegion object on both sides. New region start will be <old start - relative*len(self)>, new region end will be <old end + relative*(len(self)>
absolute_left – Absolute amount in base pairs by which to expand the region represented by this GenomicRegion object on the left side
absolute_right – Absolute amount in base pairs by which to expand the region represented by this GenomicRegion object on the right side
relative_left – Relative amount in base pairs by which to expand the region represented by this GenomicRegion object on the left side
relative_right – Relative amount in base pairs by which to expand the region represented by this GenomicRegion object on the right side
copy – If True, return a copy of the original region, if False will modify the existing region in place
from_center – If True measures distance from center rather than start and end of the old region

Returns:

GenomicRegion

five_prime¶

Return the position of the 5’ end of this GenomicRegion on the reference.

Returns:	int

fix_chromosome(copy=False)¶

Change chromosome representation from chr<NN> to <NN> or vice versa.

Parameters:	copy – If True, make copy of region, otherwise will modify existing region in place.
Returns:	`GenomicRegion`

classmethod from_string(region_string)¶

Convert a string into a GenomicRegion.

This is a very useful convenience function to quickly define a GenomicRegion object from a descriptor string. Numbers can be abbreviated as ‘12k’, ‘1.5M’, etc.

Parameters:	region_string – A string of the form <chromosome>[:<start>-<end>[:<strand>]] (with square brackets indicating optional parts of the string). If any optional part of the string is omitted, intuitive defaults will be chosen.
Returns:	`GenomicRegion`

is_forward()¶

Return True if this region is on the forward strand of the reference genome.

Returns:	True if on ‘+’ strand, False otherwise.

is_reverse()¶

Return True if this region is on the reverse strand of the reference genome.

Returns:	True if on ‘-’ strand, False otherwise.

overlap(region)¶

Return the overlap in base pairs between this region and another region.

Parameters:	region – `GenomicRegion` to find overlap for
Returns:	overlap as int in base pairs

overlaps(region)¶

Check if this region overlaps with the specified region.

Parameters:	region – `GenomicRegion` object or string

set_attribute(attribute, value)¶

Safely set an attribute on the GenomicRegion object.

This automatically decodes bytes objects into UTF-8 strings. If you do not care about this, you can also use region.<attribute> = <value> directly.

Parameters:	attribute – Name of the attribute to be set value – Value of the attribute to be set

strand_string¶

Return the ‘strand’ attribute as string.

Returns:	strand as str (‘+’, ‘-’, or ‘.’)

three_prime¶

Return the position of the 3’ end of this GenomicRegion on the reference.

Returns:	int

to_string()¶

Convert this GenomicRegion to its string representation.

Returns:	str

class fanc.pairs.LazyFragmentRead(row, pairs, side='left')¶

Bases: fanc.pairs.FragmentRead

FragmentRead implementation with lazy attribute loading.

fragment¶: A GenomicRegion delineated by restriction sites.

position¶: The position of this read in base-pairs (1-based) from the start of the chromosome it maps to.

strand¶: The strand this read maps to (-1 or +1).

re_distance()¶: Get the distance of the alignment to the nearest restriction site. :return: int

class fanc.pairs.MinimalRead(chromosome, position, strand)¶

Bases: object

Minimal class representing an aligned read.

chromosome¶: Chromosome name

reference_name¶: Identical to chromosome, exists for compatibility with pysam

position¶: Position of aligned read on chromosome

pos¶: Identical to position, exists for compatibility with Pysam

strand¶: Strand of the alignment

flag¶: Flag representing the strandedness of a read. 0 if on forward strand, else -1

class fanc.pairs.Monitor(value=0, manager=None)¶

Bases: fanc.tools.general.WorkerMonitor

Class to monitor fragment info worker threads.

is_generating_pairs()¶: Check the pair generating status.

set_generating_pairs(value)¶: Set the pair generating status.

class fanc.pairs.OutwardPairsFilter(minimum_distance=10000, mask=None)¶

Bases: fanc.pairs.FragmentReadPairFilter

Filter outward-facing read pairs at a distance less than a specified cutoff.

valid(row)¶: Map validity check of rows to pairs.

class fanc.pairs.PCRDuplicateFilter(pairs, threshold=2, mask=None)¶

Bases: fanc.pairs.FragmentReadPairFilter

Masks alignments that are suspected to be PCR duplicates. In order to be considered duplicates, two pairs need to have identical start positions of their respective left alignments AND of their right alignments.

valid(row)¶: Map validity check of rows to pairs.

valid_pair(pair)¶: Check if a pair is duplicated.

class fanc.pairs.PairedSamBamReadPairGenerator(sam_file)¶

Bases: fanc.pairs.ReadPairGenerator

Generate read pairs from single SAM/BAM files from reads with identical qname.

Chimeric reads (mapping partially to multiple genomic locations), such as output by BWA, are handled as follows: If both mate pairs are chimeric, they are removed. If only one mate is chimeric, but it is split into more than 2 alignments, it is also removed. If one mate is chimeric and split into two alignments. If one part of the chimeric alignment maps within 100bp of the regular alignment, the read pair is kept and returned. In all other cases, the pair is removed.

add_filter(read_filter)¶

Add a ReadFilter to this object.

The filter will be applied during read pair generation.

Parameters:	read_filter – `ReadFilter`

static resolve_chimeric(reads, max_dist_same_locus=500)¶

Returns:	read1, read2, is_chimeric

stats()¶

Return filter statistics collected during read pair generation.

The __iter__() filters reads based on the filters added using add_filter(). During filtering, it keeps track of the numbers of read pairs that were filtered. This function returns a dict of the form <filter_name>: <number of reads filtered out>.

Returns:	dict

class fanc.pairs.QualityFilter(cutoff=30, mask=None)¶

Bases: fanc.pairs.ReadFilter

Filter mapped reads based on mapping quality.

valid_read(read)¶: Check if a read has a mapq >= cutoff.

class fanc.pairs.ReDistanceFilter(maximum_distance=10000, mask=None)¶

Bases: fanc.pairs.FragmentReadPairFilter

Filters read pairs where the sum of distances of each read to its nearest restriction site is larger than a certain distance.

valid(row)¶: Map validity check of rows to pairs.

valid_pair(pair)¶: Check sum of restriction site distances < maximum_distance.

class fanc.pairs.ReadFilter(mask=None)¶

Bases: object

Abstract class that provides filtering functionality for MinimalRead, AlignedSegment or compatible.

To create a custom ReadFilter, extend this class and override the valid_read(self, read) method. valid_read should return False for a specific read object if the object is supposed to be filtered/masked and True otherwise. See QualityFilter for an example.

Pass a custom filter to the filter method in ReadPairGenerator to apply it.

valid_read(read)¶

Determine if a read is valid or should be filtered.

When implementing custom read filters this method must be overridden. It should return False for reads that are to be filtered and True otherwise.

Internally, the ReadPairs object will iterate over all Read instances to determine their validity on an individual basis.

Parameters:	read – A `Read` object
Returns:	True if Read is valid, False otherwise

class fanc.pairs.ReadPairGenerator¶

Bases: object

Base class for generating and filtering read pairs.

This class primarily provides filtering capabilities for read pairs generated by subclasses of ReadPairGenerator. You can add a ReadFilter using add_filter(), which will be used during read pair generation. Filtering statistics are collected during the run, and can be obtained via stats().

These generators are primarily meant as input for add_read_pairs(), but can also be used independently.

Subclasses of ReadPairGenerator must implement the _iter_read_pairs() function.

add_filter(read_filter)¶

Add a ReadFilter to this object.

The filter will be applied during read pair generation.

Parameters:	read_filter – `ReadFilter`

stats()¶

Return filter statistics collected during read pair generation.

The __iter__() filters reads based on the filters added using add_filter(). During filtering, it keeps track of the numbers of read pairs that were filtered. This function returns a dict of the form <filter_name>: <number of reads filtered out>.

Returns:	dict

class fanc.pairs.ReadPairs(file_name=None, mode='a', partition_strategy='auto', _group_name='fragment_map', _table_name_fragments='fragments', _table_name_pairs='pairs', tmpdir=None)¶

Bases: fanc.matrix.RegionPairsTable

Class representing a collection of read pairs mapped to restriction fragments.

This class is a RegionBased object, where each GenomicRegion represents a restriction fragment from a Hi-C experiment. A list of fragments can be obtained with the genome_regions() function, for example.

To create a ReadPairs object, you first have to add the restriction fragments before adding read pairs:

import fanc

re_fragments = fanc.genome_regions("hg19_chr18_19.fa", "HindIII")

rp = fanc.ReadPairs()
rp.add_regions(re_fragments.regions)

Read pairs can easily be generate fro different types of input using ReadPairGenerator implementations, e.g. HicProReadPairGenerator or SamBamReadPairGenerator.

rp_generator = fanc.SamBamReadPairGenerator("output/sam/SRR4271982_chr18_19_1_sort.bam",
                                            "output/sam/SRR4271982_chr18_19_2_sort.bam")
rp.add_read_pairs(rp_generator, threads=4)

You can query regions using the RegionBased interface:

chr1_fragments = rp.regions("chr1")

and you can iterate over read pairs using the pairs():

you can also use the :cls:Region

class ChromosomeDescription¶

Bases: tables.description.IsDescription

Description of the chromosomes in this object.

class MaskDescription¶: Bases: tables.description.IsDescription

class RegionDescription¶

Bases: tables.description.IsDescription

Description of a genomic region for PyTables Table

add_contact(contact, *args, **kwargs)¶

Alias for add_edge()

Parameters:	contact – `Edge` args – Positional arguments passed to `_add_edge()` kwargs – Keyword arguments passed to `_add_edge()`

add_contacts(contacts, *args, **kwargs)¶: Alias for add_edges()

add_edge(edge, check_nodes_exist=True, *args, **kwargs)¶

Add an edge / contact between two regions to this object.

Parameters:	edge – `Edge`, dict with at least the attributes source and sink, optionally weight, or a list of length 2 (source, sink) or 3 (source, sink, weight). check_nodes_exist – Make sure that there are nodes that match source and sink indexes args – Positional arguments passed to `_add_edge()` kwargs – Keyword arguments passed to `_add_edge()`

add_edge_from_dict(edge, *args, **kwargs)¶

Direct method to add an edge from dict input.

Parameters:	edge – dict with at least the keys “source” and “sink”. Additional keys will be loaded as edge attributes

add_edge_from_edge(edge, *args, **kwargs)¶

Direct method to add an edge from Edge input.

Parameters:	edge – `Edge`

add_edge_from_list(edge, *args, **kwargs)¶

Direct method to add an edge from list or tuple input.

Parameters:	edge – List or tuple. Should be of length 2 (source, sink) or 3 (source, sink, weight)

add_edge_simple(source, sink, weight=None, *args, **kwargs)¶

Direct method to add an edge from Edge input.

Parameters:	source – Source region index sink – Sink region index weight – Weight of the edge

add_edges(edges, flush=True, *args, **kwargs)¶

Bulk-add edges from a list.

List items can be any of the supported edge types, list, tuple, dict, or Edge. Repeatedly calls add_edge(), so may be inefficient for large amounts of data.

Parameters:	edges – List (or iterator) of edges. See `add_edge()` for details

add_mask_description(name, description)¶

Add a mask description to the _mask table and return its ID.

Parameters:	name (str) – name of the mask description (str) – description of the mask
Returns:	id of the mask
Return type:	int

add_read_pairs(read_pairs, batch_size=1000000, threads=1)¶

Add read pairs to this object.

This function requires tuples of read pairs as input, for example from MinimalRead or AlignedSegment. Typically, you won’t have to construct these from scratch, but can use one of the ReadPairGenerator classes to generate read pairs from input files.

import fanc

re_fragments = fanc.genome_regions("hg19_chr18_19.fa", "HindIII")
rp = fanc.ReadPairs()
rp.add_regions(re_fragments.regions)

# read pairs are added here from BAM files
rp_generator = fanc.SamBamReadPairGenerator("output/sam/SRR4271982_chr18_19_1_sort.bam",
                                            "output/sam/SRR4271982_chr18_19_2_sort.bam")
rp.add_read_pairs(rp_generator, threads=4)

Parameters:	read_pairs – iterator over tuples of read pairs. Typically instances of `ReadPairGenerator` batch_size – Batch size of read pairs sent to fragment info workers threads – Number of threads for simultaneous fragment info finding

add_region(region, *args, **kwargs)¶

Add a genomic region to this object.

This method offers some flexibility in the types of objects that can be loaded. See parameters for details.

Parameters:	region – Can be a `GenomicRegion`, a str in the form ‘<chromosome>:<start>-<end>[:<strand>], a dict with at least the fields ‘chromosome’, ‘start’, and ‘end’, optionally ‘ix’, or a list of length 3 (chromosome, start, end) or 4 (ix, chromosome, start, end).

add_regions(regions, *args, **kwargs)¶

Bulk insert multiple genomic regions.

Parameters:	regions – List (or any iterator) with objects that describe a genomic region. See `add_region` for options.

static bin_intervals(intervals, bins, interval_range=None, smoothing_window=None, nan_replacement=None, zero_to_nan=False)¶

Bin a given set of intervals into a fixed number of bins.

Parameters:

intervals – iterator of tuples (start, end, score)
bins – Number of bins to divide the region into
interval_range – Optional. Tuple (start, end) in base pairs of range of interval to be binned. Useful if intervals argument does not cover to exact genomic range to be binned.
smoothing_window – Size of window (in bins) to smooth scores over
nan_replacement – NaN values in the scores will be replaced with this value
zero_to_nan – If True, will convert bins with score 0 to NaN

Returns:

iterator of tuples: (start, end, score)

static bin_intervals_equidistant(intervals, bin_size, interval_range=None, smoothing_window=None, nan_replacement=None, zero_to_nan=False)¶

Bin a given set of intervals into bins with a fixed size.

Parameters:

intervals – iterator of tuples (start, end, score)
bin_size – Size of each bin in base pairs
interval_range – Optional. Tuple (start, end) in base pairs of range of interval to be binned. Useful if intervals argument does not cover to exact genomic range to be binned.
smoothing_window – Size of window (in bins) to smooth scores over
nan_replacement – NaN values in the scores will be replaced with this value
zero_to_nan – If True, will convert bins with score 0 to NaN

Returns:

iterator of tuples: (start, end, score)

bin_size¶

Return the length of the first region in the dataset.

Assumes all bins have equal size.

Returns:	int

binned_regions(region=None, bins=None, bin_size=None, smoothing_window=None, nan_replacement=None, zero_to_nan=False, *args, **kwargs)¶

Same as region_intervals, but returns GenomicRegion objects instead of tuples.

Parameters:

region – String or class:~GenomicRegion object denoting the region to be binned
bins – Number of bins to divide the region into
bin_size – Size of each bin (alternative to bins argument)
smoothing_window – Size of window (in bins) to smooth scores over
nan_replacement – NaN values in the scores will be replaced with this value
zero_to_nan – If True, will convert bins with score 0 to NaN
args – Arguments passed to _region_intervals
kwargs – Keyword arguments passed to _region_intervals

Returns:

iterator of GenomicRegion objects

bins_to_distance(bins)¶

Convert fraction of bins to base pairs

Parameters:	bins – float, fraction of bins
Returns:	int, base pairs

chromosome_bins¶

Returns a dictionary of chromosomes and the start and end index of the bins they cover.

Returned list is range-compatible, i.e. chromosome bins [0,5] cover chromosomes 1, 2, 3, and 4, not 5.

chromosome_lengths¶: Returns a dictionary of chromosomes and their length in bp.

chromosomes()¶: List all chromosomes in this regions table. :return: list of chromosome names.

close(copy_tmp=True, remove_tmp=True)¶

Close this HDF5 file and run exit operations.

If file was opened with tmpdir in read-only mode: close file and delete temporary copy.

If file was opened with tmpdir in write or append mode: Replace original file with copy and delete copy.

Parameters:	copy_tmp – If False, does not overwrite original with modified file. remove_tmp – If False, does not delete temporary copy of file.

distance_to_bins(distance)¶

Convert base pairs to fraction of bins.

Parameters:	distance – distance in base pairs
Returns:	float, distance as fraction of bin size

downsample(n, file_name=None)¶

Sample edges from this object.

Sampling is always done on uncorrected Hi-C matrices.

Parameters:	n – Sample size or reference object. If n < 1 will be interpreted as a fraction of total reads in this object. file_name – Output file name for down-sampled object.
Returns:	`RegionPairsTable`

edge_data(attribute, *args, **kwargs)¶

Iterate over specific edge attribute.

Parameters:	attribute – Name of the attribute, e.g. “weight” args – Positional arguments passed to `edges()` kwargs – Keyword arguments passed to `edges()`
Returns:	iterator over edge attribute

edge_subset(key=None, *args, **kwargs)¶

Get a subset of edges.

This is an alias for edges().

Returns:	generator (`Edge`)

edges¶

Iterate over contacts / edges.

edges() is the central function of RegionPairsContainer. Here, we will use the Hic implementation for demonstration purposes, but the usage is exactly the same for all compatible objects implementing RegionPairsContainer, including JuicerHic and CoolerHic.

import fanc

# file from FAN-C examples
hic = fanc.load("output/hic/binned/fanc_example_1mb.hic")

We can easily find the number of edges in the sample Hic object:

len(hic.edges)  # 8695

When used in an iterator context, edges() iterates over all edges in the RegionPairsContainer:

for edge in hic.edges:
    # do something with edge
    print(edge)
    # 42--42; bias: 5.797788472650082e-05; sink_node: chr18:42000001-43000000; source_node: chr18:42000001-43000000; weight: 0.12291311562018173
    # 24--28; bias: 6.496381719803623e-05; sink_node: chr18:28000001-29000000; source_node: chr18:24000001-25000000; weight: 0.025205961072838057
    # 5--76; bias: 0.00010230955745211447; sink_node: chr18:76000001-77000000; source_node: chr18:5000001-6000000; weight: 0.00961709840049876
    # 66--68; bias: 8.248432587969082e-05; sink_node: chr18:68000001-69000000; source_node: chr18:66000001-67000000; weight: 0.03876763316345468
    # ...

Calling edges() as a method has the same effect:

# note the '()'
for edge in hic.edges():
    # do something with edge
    print(edge)
    # 42--42; bias: 5.797788472650082e-05; sink_node: chr18:42000001-43000000; source_node: chr18:42000001-43000000; weight: 0.12291311562018173
    # 24--28; bias: 6.496381719803623e-05; sink_node: chr18:28000001-29000000; source_node: chr18:24000001-25000000; weight: 0.025205961072838057
    # 5--76; bias: 0.00010230955745211447; sink_node: chr18:76000001-77000000; source_node: chr18:5000001-6000000; weight: 0.00961709840049876
    # 66--68; bias: 8.248432587969082e-05; sink_node: chr18:68000001-69000000; source_node: chr18:66000001-67000000; weight: 0.03876763316345468
    # ...

Rather than iterate over all edges in the object, we can select only a subset. If the key is a string or a GenomicRegion, all non-zero edges connecting the region described by the key to any other region are returned. If the key is a tuple of strings or GenomicRegion, only edges between the two regions are returned.

# select all edges between chromosome 19
# and any other region:
for edge in hic.edges("chr19"):
    print(edge)
    # 49--106; bias: 0.00026372303696871666; sink_node: chr19:27000001-28000000; source_node: chr18:49000001-50000000; weight: 0.003692122517562033
    # 6--82; bias: 0.00021923129703834945; sink_node: chr19:3000001-4000000; source_node: chr18:6000001-7000000; weight: 0.0008769251881533978
    # 47--107; bias: 0.00012820949175399097; sink_node: chr19:28000001-29000000; source_node: chr18:47000001-48000000; weight: 0.0015385139010478917
    # 38--112; bias: 0.0001493344481069762; sink_node: chr19:33000001-34000000; source_node: chr18:38000001-39000000; weight: 0.0005973377924279048
    # ...

# select all edges that are only on
# chromosome 19
for edge in hic.edges(('chr19', 'chr19')):
    print(edge)
    # 90--116; bias: 0.00021173151730025176; sink_node: chr19:37000001-38000000; source_node: chr19:11000001-12000000; weight: 0.009104455243910825
    # 135--135; bias: 0.00018003890596887822; sink_node: chr19:56000001-57000000; source_node: chr19:56000001-57000000; weight: 0.10028167062466517
    # 123--123; bias: 0.00011063368998965993; sink_node: chr19:44000001-45000000; source_node: chr19:44000001-45000000; weight: 0.1386240135570439
    # 92--93; bias: 0.00040851066434864896; sink_node: chr19:14000001-15000000; source_node: chr19:13000001-14000000; weight: 0.10090213409411629
    # ...

# select inter-chromosomal edges
# between chromosomes 18 and 19
for edge in hic.edges(('chr18', 'chr19')):
    print(edge)
    # 49--106; bias: 0.00026372303696871666; sink_node: chr19:27000001-28000000; source_node: chr18:49000001-50000000; weight: 0.003692122517562033
    # 6--82; bias: 0.00021923129703834945; sink_node: chr19:3000001-4000000; source_node: chr18:6000001-7000000; weight: 0.0008769251881533978
    # 47--107; bias: 0.00012820949175399097; sink_node: chr19:28000001-29000000; source_node: chr18:47000001-48000000; weight: 0.0015385139010478917
    # 38--112; bias: 0.0001493344481069762; sink_node: chr19:33000001-34000000; source_node: chr18:38000001-39000000; weight: 0.0005973377924279048
    # ...

By default, edges() will retrieve all edge attributes, which can be slow when iterating over a lot of edges. This is why all file-based FAN-C RegionPairsContainer objects support lazy loading, where attributes are only read on demand.

for edge in hic.edges('chr18', lazy=True):
    print(edge.source, edge.sink, edge.weight, edge)
    # 42 42 0.12291311562018173 <fanc.matrix.LazyEdge for row /edges/chrpair_0_0.row (Row), pointing to row #0>
    # 24 28 0.025205961072838057 <fanc.matrix.LazyEdge for row /edges/chrpair_0_0.row (Row), pointing to row #1>
    # 5 76 0.00961709840049876 <fanc.matrix.LazyEdge for row /edges/chrpair_0_0.row (Row), pointing to row #2>
    # 66 68 0.03876763316345468 <fanc.matrix.LazyEdge for row /edges/chrpair_0_0.row (Row), pointing to row #3>
    # ...

Warning

The lazy iterator reuses the LazyEdge object in every iteration, and overwrites the LazyEdge attributes. Therefore do not use lazy iterators if you need to store edge objects for later access. For example, the following code works as expected list(hic.edges()), with all Edge objects stored in the list, while this code list(hic.edges(lazy=True)) will result in a list of identical LazyEdge objects. Always ensure you do all edge processing in the loop when working with lazy iterators!

When working with normalised contact frequencies, such as obtained through matrix balancing in the example above, edges() automatically returns normalised edge weights. In addition, the bias attribute will (typically) have a value different from 1.

When you are interested in the raw contact frequency, use the norm=False parameter:

for edge in hic.edges('chr18', lazy=True, norm=False):
    print(edge.source, edge.sink, edge.weight)
    # 42 42 2120.0
    # 24 28 388.0
    # 5 76 94.0
    # 66 68 470.0
    # ...

You can also choose to omit all intra- or inter-chromosomal edges using intra_chromosomal=False or inter_chromosomal=False, respectively.

Returns:	Iterator over `Edge` or equivalent.

edges_dict(*args, **kwargs)¶

Edges iterator with access by bracket notation.

This iterator always returns unnormalised edges.

Returns:	dict or dict-like iterator

filter(pair_filter, queue=False, log_progress=True)¶

Apply a FragmentReadPairFilter to the read pairs in this object.

Parameters:	pair_filter – `FragmentReadPairFilter` queue – If True, does not do the filtering immediately, but queues this filter. All queued filters can then be run at the same time using `run_queued_filters()` log_progress –
Returns:

filter_inward(minimum_distance=None, queue=False, **kwargs)¶

Convenience function that applies an InwardPairsFilter.

Parameters:	minimum_distance – Minimum distance inward-facing read pairs must have to pass this filter queue – If True, filter will be queued and can be executed along with other queued filters using run_queued_filters kwargs – Additional arguments to pass to `get_ligation_structure_biases()`

filter_ligation_products(inward_threshold=None, outward_threshold=None, queue=False, **kwargs)¶

Convenience function that applies an OutwardPairsFilter and an InwardPairsFilter.

Parameters:

inward_threshold – Minimum distance inward-facing read pairs must have to pass this filter. If None, will be inferred from the data
outward_threshold – Minimum distance outward-facing read pairs must have to pass this filter. If None, will be inferred from the data
queue – If True, filter will be queued and can be executed along with other queued filters using run_queued_filters
kwargs – Additional arguments to pass to get_ligation_structure_biases()

filter_outward(minimum_distance=None, queue=False, **kwargs)¶

Convenience function that applies an OutwardPairsFilter.

Parameters:	minimum_distance – Minimum distance outward-facing read pairs must have to pass this filter queue – If True, filter will be queued and can be executed along with other queued filters using run_queued_filters kwargs – Additional arguments to pass to `get_ligation_structure_biases()`

filter_pcr_duplicates(threshold=3, queue=False)¶

Convenience function that applies an PCRDuplicateFilter.

Parameters:	threshold – If distance between two alignments is smaller or equal the threshold, the alignments are considered to be starting at the same position queue – If True, filter will be queued and can be executed along with other queued filters using run_queued_filters

filter_re_dist(maximum_distance, queue=False)¶

Convenience function that applies an ReDistanceFilter.

Parameters:	maximum_distance – Maximum distance a read can have to the nearest region border (=restriction site) queue – If True, filter will be queued and can be executed along with other queued filters using run_queued_filters

filter_self_ligated(queue=False)¶

Convenience function that applies an SelfLigationFilter.

Parameters:	queue – If True, filter will be queued and can be executed along with other queued filters using run_queued_filters

filter_statistics()¶: Get filtering statistics for this object. :return: dict with {filter_type: count, …}

find_region(query_regions, _regions_dict=None, _region_ends=None, _chromosomes=None)¶

Find the region that is at the center of a region.

Parameters:	query_regions – Region selector string, :class:~GenomicRegion, or list of the former
Returns:	index (or list of indexes) of the region at the center of the query region

flush(silent=False)¶

Write buffered data to file and update indexes,

Parameters:	silent – If True, does not use progressbars.

get_edge(item, *row_conversion_args, **row_conversion_kwargs)¶

Get an edge by index.

Parameters:	row_conversion_args – Arguments passed to `RegionPairs._row_to_edge()` row_conversion_kwargs – Keyword arguments passed to `RegionPairs._row_to_edge()`
Returns:	`Edge`

get_ligation_structure_biases(sampling=None, skip_self_ligations=True, **kwargs)¶

Compute the ligation biases (inward and outward to same-strand) of this data set.

Parameters:	sampling – Approximate number of data points to average per point in the plot. If None (default), this will be determined on a best-guess basis. skip_self_ligations – If True (default), will not consider self-ligated fragments for assessing the error rates. unfiltered – If True, uses all read pairs, even those that do not pass filters, for the ligation error calculation
Returns:	tuple with (list of gap sizes between reads, list of matching le type ratios)

get_mask(key)¶

Search _mask table for key and return Mask.

Parameters:	key (int) – search by mask name key – search by mask ID
Returns:	Mask

get_masks(ix)¶

Extract mask IDs encoded in parameter and return masks.

IDs are powers of 2, so a single int field in the table can hold multiple masks by simply adding up the IDs. Similar principle to UNIX chmod (although that uses base 8)

Parameters:	ix (int) – integer that is the sum of powers of 2. Note that this value is not necessarily itself a power of 2.
Returns:	list of Masks extracted from ix
Return type:	list (Mask)

intervals(*args, **kwargs)¶: Alias for region_intervals.

mappable(region=None)¶

Get the mappability of regions in this object.

A “mappable” region has at least one contact to another region in the genome.

Returns:	`array` where True means mappable and False unmappable

classmethod merge(pairs, *args, **kwargs)¶

Merge two or more RegionPairsTable objects.

Parameters:	pairs – list of `RegionPairsTable`
Returns:	merged `RegionPairsTable`

pairs(key=None, lazy=False, *args, **kwargs)¶

Iterate over the FragmentReadPair objects.

Parameters:

key – Region string of the form <chromosome>[:<start>-<end>], GenomicRegion or tuples thereof
lazy – If True, use lazy loading of objects and their attributes. Much faster, but can lead to unexpected results if one is not careful. For example, this: list(object.pairs()) is not the same as list(object.pairs(lazy=True))! In the latter case, all objects in the list will be identical due to the lazy loading process. Only use lazy loading to access attributes in an iterator!
args – Positional arguments passed to edges_dict()
kwargs – Keyword arguments passed to edges_dict()

Returns:

pairs_by_chromosomes(chromosome1, chromosome2, **kwargs)¶: Only iterate over read pairs in this combination of chromosomes. :param chromosome1: Name of first chromosome :param chromosome2: Name of second chromosome :param kwargs: Keyword arguments passed to pairs() :return:

region_bins(*args, **kwargs)¶

Return slice of start and end indices spanned by a region.

Parameters:	args – provide a `GenomicRegion` here to get the slice of start and end bins of onlythis region. To get the slice over all regions leave this blank.
Returns:

region_data(key, value=None)¶

Retrieve or add vector-data to this object. If there is existing data in this object with the same name, it will be replaced

Parameters:	key – Name of the data column value – vector with region-based data (one entry per region)

region_intervals(region, bins=None, bin_size=None, smoothing_window=None, nan_replacement=None, zero_to_nan=False, score_field='score', *args, **kwargs)¶

Return equally-sized genomic intervals and associated scores.

Use either bins or bin_size argument to control binning.

Parameters:

region – String or class:~GenomicRegion object denoting the region to be binned
bins – Number of bins to divide the region into
bin_size – Size of each bin (alternative to bins argument)
smoothing_window – Size of window (in bins) to smooth scores over
nan_replacement – NaN values in the scores will be replaced with this value
zero_to_nan – If True, will convert bins with score 0 to NaN
args – Arguments passed to _region_intervals
kwargs – Keyword arguments passed to _region_intervals

Returns:

iterator of tuples: (start, end, score)

region_subset(region, *args, **kwargs)¶

Takes a class:~GenomicRegion and returns all regions that overlap with the supplied region.

Parameters:	region – String or class:~GenomicRegion object for which covered bins will be returned.

regions¶

Iterate over genomic regions in this object.

Will return a GenomicRegion object in every iteration. Can also be used to get the number of regions by calling len() on the object returned by this method.

Returns:	RegionIter

regions_and_edges(key, *args, **kwargs)¶

Convenient access to regions and edges selected by key.

Parameters:	key – Edge selector, see `edges()` args – Positional arguments passed to `edges()` kwargs – Keyword arguments passed to `edges()`
Returns:	list of row regions, list of col regions, iterator over edges

regions_dict¶

Return a dictionary with region index as keys and regions as values.

Returns:	dict {region.ix: region, …}

static regions_identical(pairs)¶

Check if the regions in all objects in the list are identical.

Parameters:	pairs – `list` of `RegionBased` objects
Returns:	True if chromosome, start, and end are identical between all regions in the same list positions.

run_queued_filters(log_progress=True)¶

Run queued filters. See filter()

Parameters:	log_progress – If true, process iterating through all edges will be continuously reported.

subset(*regions, **kwargs)¶

Subset a Hic object by specifying one or more subset regions.

Parameters:	regions – string or GenomicRegion object(s) kwargs – Supports file_name: destination file name of subset Hic object; tmpdir: if True works in tmp until object is closed additional parameters are passed to `edges()`
Returns:	`Hic`

to_bed(file_name, subset=None, **kwargs)¶

Export regions as BED file

Parameters:	file_name – Path of file to write regions to subset – optional `GenomicRegion` or str to write only regions overlapping this region kwargs – Passed to `write_bed()`

to_bigwig(file_name, subset=None, **kwargs)¶

Export regions as BigWig file.

Parameters:	file_name – Path of file to write regions to subset – optional `GenomicRegion` or str to write only regions overlapping this region kwargs – Passed to `write_bigwig()`

to_gff(file_name, subset=None, **kwargs)¶

Export regions as GFF file

Parameters:	file_name – Path of file to write regions to subset – optional `GenomicRegion` or str to write only regions overlapping this region kwargs – Passed to `write_gff()`

to_hic(file_name=None, tmpdir=None, _hic_class=<class 'fanc.hic.Hic'>)¶

Convert this ReadPairs to a Hic object.

Parameters:	file_name – Path to the `Hic` output file tmpdir – If True (or path to temporary directory) will work in temporary directory until closed

class fanc.pairs.SamBamReadPairGenerator(sam_file1, sam_file2, check_sorted=True)¶

Bases: fanc.pairs.ReadPairGenerator

Generate read pairs from paired-end SAM/BAM files.

This ReadPairGenerator iterates over two SAM or BAM files that have been sorted by qname (for example with samtools sort -n).

Chimeric reads (mapping partially to multiple genomic locations), such as output by BWA, are handled as follows: If both mate pairs are chimeric, they are removed. If only one mate is chimeric, but it is split into more than 2 alignments, it is also removed. If one mate is chimeric and split into two alignments. If one part of the chimeric alignment maps within 100bp of the regular alignment, the read pair is kept and returned. In all other cases, the pair is removed.

add_filter(read_filter)¶

Add a ReadFilter to this object.

The filter will be applied during read pair generation.

Parameters:	read_filter – `ReadFilter`

filter_multi_mapping(strict=True)¶

Convenience function that registers a UniquenessFilter. The actual algorithm and rationale used for filtering will depend on the internal _mapper attribute.

Parameters:	strict – If True will filter if XS tag is present. If False, will filter only when XS tag is not 0. This is applied if alignments are from bowtie2.

filter_quality(cutoff=30)¶

Convenience function that registers a QualityFilter. The actual algorithm and rationale used for filtering will depend on the internal _mapper attribute.

Parameters:	cutoff – Minimum mapping quality (mapq) a read must have to pass the filter

filter_unmapped()¶: Convenience function that registers an UnmappedFilter.

stats()¶

Return filter statistics collected during read pair generation.

The __iter__() filters reads based on the filters added using add_filter(). During filtering, it keeps track of the numbers of read pairs that were filtered. This function returns a dict of the form <filter_name>: <number of reads filtered out>.

Returns:	dict

class fanc.pairs.SelfLigationFilter(mask=None)¶

Bases: fanc.pairs.FragmentReadPairFilter

Filters read pairs where one or both reads map to the same restriction fragment.

valid(row)¶: Map validity check of rows to pairs.

valid_pair(pair)¶: Check if both reads are on the same fragment.

class fanc.pairs.TxtReadPairGenerator(valid_pairs_file, sep=None, chr1_field=1, pos1_field=2, strand1_field=3, chr2_field=4, pos2_field=5, strand2_field=6)¶

Bases: fanc.pairs.ReadPairGenerator

Generate read pairs from a plain text file.

This is an implementation of ReadPairGenerator for reading read pairs from an arbitrary text file. For specific text file formats have a look at FourDNucleomePairGenerator or HicProPairGenerator.

TxtReadPairGenerator iterates over lines in a file and splits them into fields at “sep”. It then extracts the chromosome, position and strand for each read in the pair, according to the fields specified by “chr<1|2>_field”, “pos<1|2>_field”, and “strand<1|2>_field”. If your file does not have strand fields, or if you don’t want to load them, you can simply set them to “None”.

add_filter(read_filter)¶

Add a ReadFilter to this object.

The filter will be applied during read pair generation.

Parameters:	read_filter – `ReadFilter`

stats()¶

Return filter statistics collected during read pair generation.

The __iter__() filters reads based on the filters added using add_filter(). During filtering, it keeps track of the numbers of read pairs that were filtered. This function returns a dict of the form <filter_name>: <number of reads filtered out>.

Returns:	dict

class fanc.pairs.UniquenessFilter(strict=True, mask=None)¶

Bases: fanc.pairs.ReadFilter

Filter reads that do not map uniquely to the reference sequence.

valid_read(read)¶

Check if a read has an XS tag.

If strict is enabled checks if a read has an XS tag. If not strict, XS has to be smaller than AS (alignment score) for a valid read.

class fanc.pairs.UnmappedFilter(mask=None)¶

Bases: fanc.pairs.ReadFilter

Filter reads that do not map to the reference sequence.

valid_read(read)¶: Check if the the flag bit 4 is set.

fanc.pairs.generate_pairs(sam1_file, sam2_file, regions, restriction_enzyme=None, output_file=None, read_filters=(), check_sorted=True, threads=1, batch_size=10000000)¶

Generate Pairs object from SAM/BAM files.

This is a convenience function that let’s you create a Pairs object from SAM/BAM data in a single step.

Parameters:

sam1_file – Path to a sorted SAM/BAM file (1st mate)
sam2_file – Path to a sorted SAM/BAM file (2nd mate)
regions – Path to file with restriction fragments (BED, GFF) or FASTA with genomic sequence
restriction_enzyme – Name of restriction enzyme (only when providing FASTA as regions)
read_filters – List of ReadFilter to filter reads while loading from SAM/BAM
output_file – Path to output file
check_sorted – Double-check that input SAM files are sorted if True (default)
threads – Number of threads used for finding restriction fragments for read pairs
batch_size – Number of read pairs sent to each restriction fragment worker

Returns:

ReadPairs

fanc.pairs.generate_pairs_split(sam1_file, sam2_file, regions, restriction_enzyme=None, output_file=None, read_filters=(), check_sorted=True, threads=1, batch_size=1000000)¶

Generate Pairs object from SAM/BAM files.

This is a convenience function that let’s you create a Pairs object from SAM/BAM data in a single step.

Parameters:

sam1_file – Path to a sorted SAM/BAM file (1st mate)
sam2_file – Path to a sorted SAM/BAM file (2nd mate)
regions – Path to file with restriction fragments (BED, GFF) or FASTA with genomic sequence
restriction_enzyme – Name of restriction enzyme (only when providing FASTA as regions)
read_filters – List of ReadFilter to filter reads while loading from SAM/BAM
output_file – Path to output file
check_sorted – Double-check that input SAM files are sorted if True (default)
threads – Number of threads used for finding restriction fragments for read pairs
batch_size – Number of read pairs sent to each restriction fragment worker

Returns:

ReadPairs