Pairs module¶
-
class
fanc.pairs.
BwaMemQualityFilter
(cutoff=0.9, mask=None)¶ Bases:
fanc.pairs.ReadFilter
Filters
bwa mem
generated alignments based on the alignment score (normalized by the length of the alignment).-
valid_read
(read)¶ Check if a read has a high alignment score.
-
-
class
fanc.pairs.
BwaMemUniquenessFilter
(strict=False, mask=None)¶ Bases:
fanc.pairs.ReadFilter
Filters bwa mem generated alignments based on whether they are unique or not.
-
valid_read
(read)¶ Determine if a read is valid or should be filtered.
When implementing custom read filters this method must be overridden. It should return False for reads that are to be filtered and True otherwise.
Internally, the ReadPairs object will iterate over all Read instances to determine their validity on an individual basis.
Parameters: read – A Read
objectReturns: True if Read is valid, False otherwise
-
-
class
fanc.pairs.
ContaminantFilter
(contaminant_reads, mask=None)¶ Bases:
fanc.pairs.ReadFilter
Filter reads that also map to a contaminant genome.
-
valid_read
(read)¶ Check if a read also maps to a contaminant
-
-
class
fanc.pairs.
FourDNucleomePairGenerator
(pairs_file)¶ Bases:
fanc.pairs.TxtReadPairGenerator
Read pair generator that works on 4D Nucleome “.pairs” files.
For details on the 4D Nucleome pairs format see: https://github.com/4dn-dcic/pairix/blob/master/pairs_format_specification.md
-
add_filter
(read_filter)¶ Add a
ReadFilter
to this object.The filter will be applied during read pair generation.
Parameters: read_filter – ReadFilter
-
stats
()¶ Return filter statistics collected during read pair generation.
The
__iter__()
filters reads based on the filters added usingadd_filter()
. During filtering, it keeps track of the numbers of read pairs that were filtered. This function returns adict
of the form <filter_name>: <number of reads filtered out>.Returns: dict
-
-
class
fanc.pairs.
FragmentRead
(fragment=None, position=None, strand=0)¶ Bases:
object
Class representing a fragment-mapped read.
-
fragment
¶ A
GenomicRegion
delineated by restriction sites.
-
position
¶ The position of this read in base-pairs (1-based) from the start of the chromosome it maps to.
-
strand
¶ The strand this read maps to (-1 or +1).
-
re_distance
()¶ Get the distance of the alignment to the nearest restriction site. :return: int
-
-
class
fanc.pairs.
FragmentReadPair
(left_read, right_read, ix=None)¶ Bases:
object
Container for two paired
FragmentRead
objects.-
get_gap_size
()¶ Get the gap size in base pairs between the fragments these reads map to.
Returns: 0 if reads map to the same fragment or neighboring fragments, the distance between fragments if they are on the same chromosome, None otherwise
-
is_inward_pair
()¶ Check if reads form an “inward-facing” pair.
A pair is considered inward-facing if the left read maps to the plus the right read to the minus strand and both reads map to the same chromosome.
Returns: True if reads are inward-facing, False otherwise
-
is_outward_pair
()¶ Check if reads form an “outward-facing” pair.
A pair is considered outward-facing if the left read maps to the minus the right read to the plus strand and both reads map to the same chromosome.
Returns: True if reads are outward-facing, False otherwise
-
is_same_chromosome
()¶ Check if both reads are mapped to the same chromosome.
Returns: True is reads map to the same chromosome, False otherwise
-
is_same_fragment
()¶ Check if reads map to the same fragment.
Returns: True if reads map to the same fragment, False otherwise.
-
is_same_pair
()¶ Check if reads face in the same direction.
Returns: True if reads are facing in the same direction, False otherwise.
-
-
class
fanc.pairs.
FragmentReadPairFilter
(mask=None)¶ Bases:
fanc.general.MaskFilter
Abstract class that provides filtering functionality for the
FragmentReadPair
object.Extends MaskFilter and overrides valid(self, read) to make
FragmentReadPair
filtering more “natural”.To create custom filters for the
FragmentMappedReadPairs
object, extend this class and override thevalid_pair()
method. valid_pair should return False for a specificFragmentReadPair
object if the object is supposed to be filtered/masked and True otherwise. SeeInwardPairsFilter
for an example.Pass a custom filter to the filter method in
FragmentMappedReadPairs
to apply it.-
valid
(row)¶ Map validity check of rows to pairs.
-
-
class
fanc.pairs.
HicProPairGenerator
(file_name)¶ Bases:
fanc.pairs.TxtReadPairGenerator
Read pair generator for HiC-Pro “validPairs” files.
This generator is a subclass of
TxtReadPairGenerator
with presets for fields in HiC-Pro validPairs files.-
add_filter
(read_filter)¶ Add a
ReadFilter
to this object.The filter will be applied during read pair generation.
Parameters: read_filter – ReadFilter
-
stats
()¶ Return filter statistics collected during read pair generation.
The
__iter__()
filters reads based on the filters added usingadd_filter()
. During filtering, it keeps track of the numbers of read pairs that were filtered. This function returns adict
of the form <filter_name>: <number of reads filtered out>.Returns: dict
-
-
class
fanc.pairs.
InwardPairsFilter
(minimum_distance=10000, mask=None)¶ Bases:
fanc.pairs.FragmentReadPairFilter
Filter inward-facing read pairs at a distance less than a specified cutoff.
-
valid
(row)¶ Map validity check of rows to pairs.
-
valid_pair
(pair)¶ Check if a pair is inward-facing and <minimum_distance> apart.
-
-
class
fanc.pairs.
LazyFragment
(row, pairs, ix=None, side='left')¶ Bases:
genomic_regions.regions.GenomicRegion
GenomicRegion
representing a fragment with lazy attribute loading.-
chromosome
¶ The reference sequence this region is located on
-
start
¶ start position of this region on the reference (1-based, inclusive)
-
end
¶ end position of this region on the reference (1-based, inclusive)
-
strand
¶ strand of the reference this region is located on (-1, 1, 0, or None)
-
ix
¶ Region index within the context of regions from the same object.
-
as_bed_line
(score_field='score', name_field='name')¶ Return a representation of this object as line in a BED file.
Parameters: - score_field – name of the attribute to be used in the ‘score’ field of the BED line
- name_field – name of the attribute to be used in the ‘name’ field of the BED line
Returns: str
-
as_gff_line
(source_field='source', feature_field='feature', score_field='score', frame_field='frame', float_format='.2e')¶ Return a representation of this object as line in a GFF file.
Parameters: - source_field – name of the attribute to be used in the ‘source’ field of the GFF line
- feature_field – name of the attribute to be used in the ‘feature’ field of the GFF line
- score_field – name of the attribute to be used in the ‘score’ field of the GFF line
- frame_field – name of the attribute to be used in the ‘frame’ field of the GFF line
- float_format – Formatting string for the float fields
Returns: str
-
attributes
¶ Return all visible attributes of this
GenomicRegion
.Returns all attribute names that do not start with an underscore. :return: list of attribute names
-
center
¶ Return the center coordinate of the
GenomicRegion
.Returns: float
-
contains
(region)¶ Check if the specified region is completely contained in this region.
Parameters: region – GenomicRegion
object or string
-
copy
()¶ Return a (shallow) copy of this
GenomicRegion
Returns: GenomicRegion
-
expand
(absolute=None, relative=None, absolute_left=0, absolute_right=0, relative_left=0.0, relative_right=0.0, copy=True, from_center=False)¶ Expand this region by a relative or an absolute amount.
Parameters: - absolute – Absolute amount in base pairs by which to
expand the region represented by this
GenomicRegion
object on both sides. New region start will be <old start - absolute>, new region end will be <old end + absolute> - relative – Relative amount as fraction of region by which to
expand the region represented by this
GenomicRegion
object on both sides. New region start will be <old start - relative*len(self)>, new region end will be <old end + relative*(len(self)> - absolute_left – Absolute amount in base pairs by which to
expand the region represented by this
GenomicRegion
object on the left side - absolute_right – Absolute amount in base pairs by which to
expand the region represented by this
GenomicRegion
object on the right side - relative_left – Relative amount in base pairs by which to
expand the region represented by this
GenomicRegion
object on the left side - relative_right – Relative amount in base pairs by which to
expand the region represented by this
GenomicRegion
object on the right side - copy – If True, return a copy of the original region, if False will modify the existing region in place
- from_center – If True measures distance from center rather than start and end of the old region
Returns: GenomicRegion
- absolute – Absolute amount in base pairs by which to
expand the region represented by this
-
five_prime
¶ Return the position of the 5’ end of this
GenomicRegion
on the reference.Returns: int
-
fix_chromosome
(copy=False)¶ Change chromosome representation from chr<NN> to <NN> or vice versa.
Parameters: copy – If True, make copy of region, otherwise will modify existing region in place. Returns: GenomicRegion
-
classmethod
from_string
(region_string)¶ Convert a string into a
GenomicRegion
.This is a very useful convenience function to quickly define a
GenomicRegion
object from a descriptor string. Numbers can be abbreviated as ‘12k’, ‘1.5M’, etc.Parameters: region_string – A string of the form <chromosome>[:<start>-<end>[:<strand>]] (with square brackets indicating optional parts of the string). If any optional part of the string is omitted, intuitive defaults will be chosen. Returns: GenomicRegion
-
is_forward
()¶ Return True if this region is on the forward strand of the reference genome.
Returns: True if on ‘+’ strand, False otherwise.
-
is_reverse
()¶ Return True if this region is on the reverse strand of the reference genome.
Returns: True if on ‘-’ strand, False otherwise.
-
overlap
(region)¶ Return the overlap in base pairs between this region and another region.
Parameters: region – GenomicRegion
to find overlap forReturns: overlap as int in base pairs
-
overlaps
(region)¶ Check if this region overlaps with the specified region.
Parameters: region – GenomicRegion
object or string
-
set_attribute
(attribute, value)¶ Safely set an attribute on the
GenomicRegion
object.This automatically decodes bytes objects into UTF-8 strings. If you do not care about this, you can also use region.<attribute> = <value> directly.
Parameters: - attribute – Name of the attribute to be set
- value – Value of the attribute to be set
-
strand_string
¶ Return the ‘strand’ attribute as string.
Returns: strand as str (‘+’, ‘-’, or ‘.’)
-
three_prime
¶ Return the position of the 3’ end of this
GenomicRegion
on the reference.Returns: int
-
to_string
()¶ Convert this
GenomicRegion
to its string representation.Returns: str
-
-
class
fanc.pairs.
LazyFragmentRead
(row, pairs, side='left')¶ Bases:
fanc.pairs.FragmentRead
FragmentRead
implementation with lazy attribute loading.-
fragment
¶ A
GenomicRegion
delineated by restriction sites.
-
position
¶ The position of this read in base-pairs (1-based) from the start of the chromosome it maps to.
-
strand
¶ The strand this read maps to (-1 or +1).
-
re_distance
()¶ Get the distance of the alignment to the nearest restriction site. :return: int
-
-
class
fanc.pairs.
MinimalRead
(chromosome, position, strand)¶ Bases:
object
Minimal class representing an aligned read.
-
chromosome
¶ Chromosome name
-
reference_name
¶ Identical to chromosome, exists for compatibility with pysam
-
position
¶ Position of aligned read on chromosome
-
pos
¶ Identical to position, exists for compatibility with Pysam
-
strand
¶ Strand of the alignment
-
flag
¶ Flag representing the strandedness of a read. 0 if on forward strand, else -1
-
-
class
fanc.pairs.
Monitor
(value=0, manager=None)¶ Bases:
fanc.tools.general.WorkerMonitor
Class to monitor fragment info worker threads.
-
is_generating_pairs
()¶ Check the pair generating status.
-
set_generating_pairs
(value)¶ Set the pair generating status.
-
-
class
fanc.pairs.
OutwardPairsFilter
(minimum_distance=10000, mask=None)¶ Bases:
fanc.pairs.FragmentReadPairFilter
Filter outward-facing read pairs at a distance less than a specified cutoff.
-
valid
(row)¶ Map validity check of rows to pairs.
-
-
class
fanc.pairs.
PCRDuplicateFilter
(pairs, threshold=2, mask=None)¶ Bases:
fanc.pairs.FragmentReadPairFilter
Masks alignments that are suspected to be PCR duplicates. In order to be considered duplicates, two pairs need to have identical start positions of their respective left alignments AND of their right alignments.
-
valid
(row)¶ Map validity check of rows to pairs.
-
valid_pair
(pair)¶ Check if a pair is duplicated.
-
-
class
fanc.pairs.
PairedSamBamReadPairGenerator
(sam_file)¶ Bases:
fanc.pairs.ReadPairGenerator
Generate read pairs from single SAM/BAM files from reads with identical qname.
Chimeric reads (mapping partially to multiple genomic locations), such as output by BWA, are handled as follows: If both mate pairs are chimeric, they are removed. If only one mate is chimeric, but it is split into more than 2 alignments, it is also removed. If one mate is chimeric and split into two alignments. If one part of the chimeric alignment maps within 100bp of the regular alignment, the read pair is kept and returned. In all other cases, the pair is removed.
-
add_filter
(read_filter)¶ Add a
ReadFilter
to this object.The filter will be applied during read pair generation.
Parameters: read_filter – ReadFilter
-
static
resolve_chimeric
(reads, max_dist_same_locus=500)¶ Returns: read1, read2, is_chimeric
-
stats
()¶ Return filter statistics collected during read pair generation.
The
__iter__()
filters reads based on the filters added usingadd_filter()
. During filtering, it keeps track of the numbers of read pairs that were filtered. This function returns adict
of the form <filter_name>: <number of reads filtered out>.Returns: dict
-
-
class
fanc.pairs.
QualityFilter
(cutoff=30, mask=None)¶ Bases:
fanc.pairs.ReadFilter
Filter mapped reads based on mapping quality.
-
valid_read
(read)¶ Check if a read has a mapq >= cutoff.
-
-
class
fanc.pairs.
ReDistanceFilter
(maximum_distance=10000, mask=None)¶ Bases:
fanc.pairs.FragmentReadPairFilter
Filters read pairs where the sum of distances of each read to its nearest restriction site is larger than a certain distance.
-
valid
(row)¶ Map validity check of rows to pairs.
-
valid_pair
(pair)¶ Check sum of restriction site distances < maximum_distance.
-
-
class
fanc.pairs.
ReadFilter
(mask=None)¶ Bases:
object
Abstract class that provides filtering functionality for
MinimalRead
,AlignedSegment
or compatible.To create a custom
ReadFilter
, extend this class and override the valid_read(self, read) method. valid_read should return False for a specific read object if the object is supposed to be filtered/masked and True otherwise. SeeQualityFilter
for an example.Pass a custom filter to the filter method in
ReadPairGenerator
to apply it.-
valid_read
(read)¶ Determine if a read is valid or should be filtered.
When implementing custom read filters this method must be overridden. It should return False for reads that are to be filtered and True otherwise.
Internally, the ReadPairs object will iterate over all Read instances to determine their validity on an individual basis.
Parameters: read – A Read
objectReturns: True if Read is valid, False otherwise
-
-
class
fanc.pairs.
ReadPairGenerator
¶ Bases:
object
Base class for generating and filtering read pairs.
This class primarily provides filtering capabilities for read pairs generated by subclasses of
ReadPairGenerator
. You can add aReadFilter
usingadd_filter()
, which will be used during read pair generation. Filtering statistics are collected during the run, and can be obtained viastats()
.These generators are primarily meant as input for
add_read_pairs()
, but can also be used independently.Subclasses of
ReadPairGenerator
must implement the_iter_read_pairs()
function.-
add_filter
(read_filter)¶ Add a
ReadFilter
to this object.The filter will be applied during read pair generation.
Parameters: read_filter – ReadFilter
-
stats
()¶ Return filter statistics collected during read pair generation.
The
__iter__()
filters reads based on the filters added usingadd_filter()
. During filtering, it keeps track of the numbers of read pairs that were filtered. This function returns adict
of the form <filter_name>: <number of reads filtered out>.Returns: dict
-
-
class
fanc.pairs.
ReadPairs
(file_name=None, mode='a', partition_strategy='auto', _group_name='fragment_map', _table_name_fragments='fragments', _table_name_pairs='pairs', tmpdir=None)¶ Bases:
fanc.matrix.RegionPairsTable
Class representing a collection of read pairs mapped to restriction fragments.
This class is a
RegionBased
object, where eachGenomicRegion
represents a restriction fragment from a Hi-C experiment. A list of fragments can be obtained with thegenome_regions()
function, for example.To create a
ReadPairs
object, you first have to add the restriction fragments before adding read pairs:import fanc re_fragments = fanc.genome_regions("hg19_chr18_19.fa", "HindIII") rp = fanc.ReadPairs() rp.add_regions(re_fragments.regions)
Read pairs can easily be generate fro different types of input using
ReadPairGenerator
implementations, e.g.HicProReadPairGenerator
orSamBamReadPairGenerator
.rp_generator = fanc.SamBamReadPairGenerator("output/sam/SRR4271982_chr18_19_1_sort.bam", "output/sam/SRR4271982_chr18_19_2_sort.bam") rp.add_read_pairs(rp_generator, threads=4)
You can query regions using the
RegionBased
interface:chr1_fragments = rp.regions("chr1")
and you can iterate over read pairs using the
pairs()
:you can also use the :cls:Region
-
class
ChromosomeDescription
¶ Bases:
tables.description.IsDescription
Description of the chromosomes in this object.
-
class
MaskDescription
¶ Bases:
tables.description.IsDescription
-
class
RegionDescription
¶ Bases:
tables.description.IsDescription
Description of a genomic region for PyTables Table
-
add_contact
(contact, *args, **kwargs)¶ Alias for
add_edge()
Parameters: - contact –
Edge
- args – Positional arguments passed to
_add_edge()
- kwargs – Keyword arguments passed to
_add_edge()
- contact –
-
add_contacts
(contacts, *args, **kwargs)¶ Alias for
add_edges()
-
add_edge
(edge, check_nodes_exist=True, *args, **kwargs)¶ Add an edge / contact between two regions to this object.
Parameters: - edge –
Edge
, dict with at least the attributes source and sink, optionally weight, or a list of length 2 (source, sink) or 3 (source, sink, weight). - check_nodes_exist – Make sure that there are nodes that match source and sink indexes
- args – Positional arguments passed to
_add_edge()
- kwargs – Keyword arguments passed to
_add_edge()
- edge –
-
add_edge_from_dict
(edge, *args, **kwargs)¶ Direct method to add an edge from dict input.
Parameters: edge – dict with at least the keys “source” and “sink”. Additional keys will be loaded as edge attributes
-
add_edge_from_edge
(edge, *args, **kwargs)¶ Direct method to add an edge from
Edge
input.Parameters: edge – Edge
-
add_edge_from_list
(edge, *args, **kwargs)¶ Direct method to add an edge from list or tuple input.
Parameters: edge – List or tuple. Should be of length 2 (source, sink) or 3 (source, sink, weight)
-
add_edge_simple
(source, sink, weight=None, *args, **kwargs)¶ Direct method to add an edge from
Edge
input.Parameters: - source – Source region index
- sink – Sink region index
- weight – Weight of the edge
-
add_edges
(edges, flush=True, *args, **kwargs)¶ Bulk-add edges from a list.
List items can be any of the supported edge types, list, tuple, dict, or
Edge
. Repeatedly callsadd_edge()
, so may be inefficient for large amounts of data.Parameters: edges – List (or iterator) of edges. See add_edge()
for details
-
add_mask_description
(name, description)¶ Add a mask description to the _mask table and return its ID.
Parameters: - name (str) – name of the mask
- description (str) – description of the mask
Returns: id of the mask
Return type: int
-
add_read_pairs
(read_pairs, batch_size=1000000, threads=1)¶ Add read pairs to this object.
This function requires tuples of read pairs as input, for example from
MinimalRead
orAlignedSegment
. Typically, you won’t have to construct these from scratch, but can use one of theReadPairGenerator
classes to generate read pairs from input files.import fanc re_fragments = fanc.genome_regions("hg19_chr18_19.fa", "HindIII") rp = fanc.ReadPairs() rp.add_regions(re_fragments.regions) # read pairs are added here from BAM files rp_generator = fanc.SamBamReadPairGenerator("output/sam/SRR4271982_chr18_19_1_sort.bam", "output/sam/SRR4271982_chr18_19_2_sort.bam") rp.add_read_pairs(rp_generator, threads=4)
Parameters: - read_pairs – iterator over tuples of read pairs. Typically
instances of
ReadPairGenerator
- batch_size – Batch size of read pairs sent to fragment info workers
- threads – Number of threads for simultaneous fragment info finding
- read_pairs – iterator over tuples of read pairs. Typically
instances of
-
add_region
(region, *args, **kwargs)¶ Add a genomic region to this object.
This method offers some flexibility in the types of objects that can be loaded. See parameters for details.
Parameters: region – Can be a GenomicRegion
, a str in the form ‘<chromosome>:<start>-<end>[:<strand>], a dict with at least the fields ‘chromosome’, ‘start’, and ‘end’, optionally ‘ix’, or a list of length 3 (chromosome, start, end) or 4 (ix, chromosome, start, end).
-
add_regions
(regions, *args, **kwargs)¶ Bulk insert multiple genomic regions.
Parameters: regions – List (or any iterator) with objects that describe a genomic region. See add_region
for options.
-
static
bin_intervals
(intervals, bins, interval_range=None, smoothing_window=None, nan_replacement=None, zero_to_nan=False)¶ Bin a given set of intervals into a fixed number of bins.
Parameters: - intervals – iterator of tuples (start, end, score)
- bins – Number of bins to divide the region into
- interval_range – Optional. Tuple (start, end) in base pairs of range of interval to be binned. Useful if intervals argument does not cover to exact genomic range to be binned.
- smoothing_window – Size of window (in bins) to smooth scores over
- nan_replacement – NaN values in the scores will be replaced with this value
- zero_to_nan – If True, will convert bins with score 0 to NaN
Returns: iterator of tuples: (start, end, score)
-
static
bin_intervals_equidistant
(intervals, bin_size, interval_range=None, smoothing_window=None, nan_replacement=None, zero_to_nan=False)¶ Bin a given set of intervals into bins with a fixed size.
Parameters: - intervals – iterator of tuples (start, end, score)
- bin_size – Size of each bin in base pairs
- interval_range – Optional. Tuple (start, end) in base pairs of range of interval to be binned. Useful if intervals argument does not cover to exact genomic range to be binned.
- smoothing_window – Size of window (in bins) to smooth scores over
- nan_replacement – NaN values in the scores will be replaced with this value
- zero_to_nan – If True, will convert bins with score 0 to NaN
Returns: iterator of tuples: (start, end, score)
-
bin_size
¶ Return the length of the first region in the dataset.
Assumes all bins have equal size.
Returns: int
-
binned_regions
(region=None, bins=None, bin_size=None, smoothing_window=None, nan_replacement=None, zero_to_nan=False, *args, **kwargs)¶ Same as region_intervals, but returns
GenomicRegion
objects instead of tuples.Parameters: - region – String or class:~GenomicRegion object denoting the region to be binned
- bins – Number of bins to divide the region into
- bin_size – Size of each bin (alternative to bins argument)
- smoothing_window – Size of window (in bins) to smooth scores over
- nan_replacement – NaN values in the scores will be replaced with this value
- zero_to_nan – If True, will convert bins with score 0 to NaN
- args – Arguments passed to _region_intervals
- kwargs – Keyword arguments passed to _region_intervals
Returns: iterator of
GenomicRegion
objects
-
bins_to_distance
(bins)¶ Convert fraction of bins to base pairs
Parameters: bins – float, fraction of bins Returns: int, base pairs
-
chromosome_bins
¶ Returns a dictionary of chromosomes and the start and end index of the bins they cover.
Returned list is range-compatible, i.e. chromosome bins [0,5] cover chromosomes 1, 2, 3, and 4, not 5.
-
chromosome_lengths
¶ Returns a dictionary of chromosomes and their length in bp.
-
chromosomes
()¶ List all chromosomes in this regions table. :return: list of chromosome names.
-
close
(copy_tmp=True, remove_tmp=True)¶ Close this HDF5 file and run exit operations.
If file was opened with tmpdir in read-only mode: close file and delete temporary copy.
If file was opened with tmpdir in write or append mode: Replace original file with copy and delete copy.
Parameters: - copy_tmp – If False, does not overwrite original with modified file.
- remove_tmp – If False, does not delete temporary copy of file.
-
distance_to_bins
(distance)¶ Convert base pairs to fraction of bins.
Parameters: distance – distance in base pairs Returns: float, distance as fraction of bin size
-
downsample
(n, file_name=None)¶ Sample edges from this object.
Sampling is always done on uncorrected Hi-C matrices.
Parameters: - n – Sample size or reference object. If n < 1 will be interpreted as a fraction of total reads in this object.
- file_name – Output file name for down-sampled object.
Returns: RegionPairsTable
-
edge_data
(attribute, *args, **kwargs)¶ Iterate over specific edge attribute.
Parameters: - attribute – Name of the attribute, e.g. “weight”
- args – Positional arguments passed to
edges()
- kwargs – Keyword arguments passed to
edges()
Returns: iterator over edge attribute
-
edge_subset
(key=None, *args, **kwargs)¶ Get a subset of edges.
This is an alias for
edges()
.Returns: generator ( Edge
)
-
edges
¶ Iterate over contacts / edges.
edges()
is the central function ofRegionPairsContainer
. Here, we will use theHic
implementation for demonstration purposes, but the usage is exactly the same for all compatible objects implementingRegionPairsContainer
, includingJuicerHic
andCoolerHic
.import fanc # file from FAN-C examples hic = fanc.load("output/hic/binned/fanc_example_1mb.hic")
We can easily find the number of edges in the sample
Hic
object:len(hic.edges) # 8695
When used in an iterator context,
edges()
iterates over all edges in theRegionPairsContainer
:for edge in hic.edges: # do something with edge print(edge) # 42--42; bias: 5.797788472650082e-05; sink_node: chr18:42000001-43000000; source_node: chr18:42000001-43000000; weight: 0.12291311562018173 # 24--28; bias: 6.496381719803623e-05; sink_node: chr18:28000001-29000000; source_node: chr18:24000001-25000000; weight: 0.025205961072838057 # 5--76; bias: 0.00010230955745211447; sink_node: chr18:76000001-77000000; source_node: chr18:5000001-6000000; weight: 0.00961709840049876 # 66--68; bias: 8.248432587969082e-05; sink_node: chr18:68000001-69000000; source_node: chr18:66000001-67000000; weight: 0.03876763316345468 # ...
Calling
edges()
as a method has the same effect:# note the '()' for edge in hic.edges(): # do something with edge print(edge) # 42--42; bias: 5.797788472650082e-05; sink_node: chr18:42000001-43000000; source_node: chr18:42000001-43000000; weight: 0.12291311562018173 # 24--28; bias: 6.496381719803623e-05; sink_node: chr18:28000001-29000000; source_node: chr18:24000001-25000000; weight: 0.025205961072838057 # 5--76; bias: 0.00010230955745211447; sink_node: chr18:76000001-77000000; source_node: chr18:5000001-6000000; weight: 0.00961709840049876 # 66--68; bias: 8.248432587969082e-05; sink_node: chr18:68000001-69000000; source_node: chr18:66000001-67000000; weight: 0.03876763316345468 # ...
Rather than iterate over all edges in the object, we can select only a subset. If the key is a string or a
GenomicRegion
, all non-zero edges connecting the region described by the key to any other region are returned. If the key is a tuple of strings orGenomicRegion
, only edges between the two regions are returned.# select all edges between chromosome 19 # and any other region: for edge in hic.edges("chr19"): print(edge) # 49--106; bias: 0.00026372303696871666; sink_node: chr19:27000001-28000000; source_node: chr18:49000001-50000000; weight: 0.003692122517562033 # 6--82; bias: 0.00021923129703834945; sink_node: chr19:3000001-4000000; source_node: chr18:6000001-7000000; weight: 0.0008769251881533978 # 47--107; bias: 0.00012820949175399097; sink_node: chr19:28000001-29000000; source_node: chr18:47000001-48000000; weight: 0.0015385139010478917 # 38--112; bias: 0.0001493344481069762; sink_node: chr19:33000001-34000000; source_node: chr18:38000001-39000000; weight: 0.0005973377924279048 # ... # select all edges that are only on # chromosome 19 for edge in hic.edges(('chr19', 'chr19')): print(edge) # 90--116; bias: 0.00021173151730025176; sink_node: chr19:37000001-38000000; source_node: chr19:11000001-12000000; weight: 0.009104455243910825 # 135--135; bias: 0.00018003890596887822; sink_node: chr19:56000001-57000000; source_node: chr19:56000001-57000000; weight: 0.10028167062466517 # 123--123; bias: 0.00011063368998965993; sink_node: chr19:44000001-45000000; source_node: chr19:44000001-45000000; weight: 0.1386240135570439 # 92--93; bias: 0.00040851066434864896; sink_node: chr19:14000001-15000000; source_node: chr19:13000001-14000000; weight: 0.10090213409411629 # ... # select inter-chromosomal edges # between chromosomes 18 and 19 for edge in hic.edges(('chr18', 'chr19')): print(edge) # 49--106; bias: 0.00026372303696871666; sink_node: chr19:27000001-28000000; source_node: chr18:49000001-50000000; weight: 0.003692122517562033 # 6--82; bias: 0.00021923129703834945; sink_node: chr19:3000001-4000000; source_node: chr18:6000001-7000000; weight: 0.0008769251881533978 # 47--107; bias: 0.00012820949175399097; sink_node: chr19:28000001-29000000; source_node: chr18:47000001-48000000; weight: 0.0015385139010478917 # 38--112; bias: 0.0001493344481069762; sink_node: chr19:33000001-34000000; source_node: chr18:38000001-39000000; weight: 0.0005973377924279048 # ...
By default,
edges()
will retrieve all edge attributes, which can be slow when iterating over a lot of edges. This is why all file-based FAN-CRegionPairsContainer
objects support lazy loading, where attributes are only read on demand.for edge in hic.edges('chr18', lazy=True): print(edge.source, edge.sink, edge.weight, edge) # 42 42 0.12291311562018173 <fanc.matrix.LazyEdge for row /edges/chrpair_0_0.row (Row), pointing to row #0> # 24 28 0.025205961072838057 <fanc.matrix.LazyEdge for row /edges/chrpair_0_0.row (Row), pointing to row #1> # 5 76 0.00961709840049876 <fanc.matrix.LazyEdge for row /edges/chrpair_0_0.row (Row), pointing to row #2> # 66 68 0.03876763316345468 <fanc.matrix.LazyEdge for row /edges/chrpair_0_0.row (Row), pointing to row #3> # ...
Warning
The lazy iterator reuses the
LazyEdge
object in every iteration, and overwrites theLazyEdge
attributes. Therefore do not use lazy iterators if you need to store edge objects for later access. For example, the following code works as expectedlist(hic.edges())
, with allEdge
objects stored in the list, while this codelist(hic.edges(lazy=True))
will result in a list of identicalLazyEdge
objects. Always ensure you do all edge processing in the loop when working with lazy iterators!When working with normalised contact frequencies, such as obtained through matrix balancing in the example above,
edges()
automatically returns normalised edge weights. In addition, thebias
attribute will (typically) have a value different from 1.When you are interested in the raw contact frequency, use the
norm=False
parameter:for edge in hic.edges('chr18', lazy=True, norm=False): print(edge.source, edge.sink, edge.weight) # 42 42 2120.0 # 24 28 388.0 # 5 76 94.0 # 66 68 470.0 # ...
You can also choose to omit all intra- or inter-chromosomal edges using
intra_chromosomal=False
orinter_chromosomal=False
, respectively.Returns: Iterator over Edge
or equivalent.
-
edges_dict
(*args, **kwargs)¶ Edges iterator with access by bracket notation.
This iterator always returns unnormalised edges.
Returns: dict or dict-like iterator
-
filter
(pair_filter, queue=False, log_progress=True)¶ Apply a
FragmentReadPairFilter
to the read pairs in this object.Parameters: - pair_filter –
FragmentReadPairFilter
- queue – If True, does not do the filtering immediately, but
queues this filter. All queued filters can then be run
at the same time using
run_queued_filters()
- log_progress –
Returns: - pair_filter –
-
filter_inward
(minimum_distance=None, queue=False, **kwargs)¶ Convenience function that applies an
InwardPairsFilter
.Parameters: - minimum_distance – Minimum distance inward-facing read pairs must have to pass this filter
- queue – If True, filter will be queued and can be executed along with other queued filters using run_queued_filters
- kwargs – Additional arguments to pass
to
get_ligation_structure_biases()
-
filter_ligation_products
(inward_threshold=None, outward_threshold=None, queue=False, **kwargs)¶ Convenience function that applies an
OutwardPairsFilter
and anInwardPairsFilter
.Parameters: - inward_threshold – Minimum distance inward-facing read pairs must have to pass this filter. If None, will be inferred from the data
- outward_threshold – Minimum distance outward-facing read pairs must have to pass this filter. If None, will be inferred from the data
- queue – If True, filter will be queued and can be executed along with other queued filters using run_queued_filters
- kwargs – Additional arguments to pass
to
get_ligation_structure_biases()
-
filter_outward
(minimum_distance=None, queue=False, **kwargs)¶ Convenience function that applies an
OutwardPairsFilter
.Parameters: - minimum_distance – Minimum distance outward-facing read pairs must have to pass this filter
- queue – If True, filter will be queued and can be executed along with other queued filters using run_queued_filters
- kwargs – Additional arguments to pass
to
get_ligation_structure_biases()
-
filter_pcr_duplicates
(threshold=3, queue=False)¶ Convenience function that applies an
PCRDuplicateFilter
.Parameters: - threshold – If distance between two alignments is smaller or equal the threshold, the alignments are considered to be starting at the same position
- queue – If True, filter will be queued and can be executed along with other queued filters using run_queued_filters
-
filter_re_dist
(maximum_distance, queue=False)¶ Convenience function that applies an
ReDistanceFilter
.Parameters: - maximum_distance – Maximum distance a read can have to the nearest region border (=restriction site)
- queue – If True, filter will be queued and can be executed along with other queued filters using run_queued_filters
-
filter_self_ligated
(queue=False)¶ Convenience function that applies an
SelfLigationFilter
.Parameters: queue – If True, filter will be queued and can be executed along with other queued filters using run_queued_filters
-
filter_statistics
()¶ Get filtering statistics for this object. :return: dict with {filter_type: count, …}
-
find_region
(query_regions, _regions_dict=None, _region_ends=None, _chromosomes=None)¶ Find the region that is at the center of a region.
Parameters: query_regions – Region selector string, :class:~GenomicRegion, or list of the former Returns: index (or list of indexes) of the region at the center of the query region
-
flush
(silent=False)¶ Write buffered data to file and update indexes,
Parameters: silent – If True, does not use progressbars.
-
get_edge
(item, *row_conversion_args, **row_conversion_kwargs)¶ Get an edge by index.
Parameters: - row_conversion_args – Arguments passed to
RegionPairs._row_to_edge()
- row_conversion_kwargs – Keyword arguments passed to
RegionPairs._row_to_edge()
Returns: Edge
- row_conversion_args – Arguments passed to
-
get_ligation_structure_biases
(sampling=None, skip_self_ligations=True, **kwargs)¶ Compute the ligation biases (inward and outward to same-strand) of this data set.
Parameters: - sampling – Approximate number of data points to average per point in the plot. If None (default), this will be determined on a best-guess basis.
- skip_self_ligations – If True (default), will not consider self-ligated fragments for assessing the error rates.
- unfiltered – If True, uses all read pairs, even those that do not pass filters, for the ligation error calculation
Returns: tuple with (list of gap sizes between reads, list of matching le type ratios)
-
get_mask
(key)¶ Search _mask table for key and return Mask.
Parameters: - key (int) – search by mask name
- key – search by mask ID
Returns: Mask
-
get_masks
(ix)¶ Extract mask IDs encoded in parameter and return masks.
IDs are powers of 2, so a single int field in the table can hold multiple masks by simply adding up the IDs. Similar principle to UNIX chmod (although that uses base 8)
Parameters: ix (int) – integer that is the sum of powers of 2. Note that this value is not necessarily itself a power of 2. Returns: list of Masks extracted from ix Return type: list (Mask)
-
intervals
(*args, **kwargs)¶ Alias for region_intervals.
-
mappable
(region=None)¶ Get the mappability of regions in this object.
A “mappable” region has at least one contact to another region in the genome.
Returns: array
where True means mappable and False unmappable
-
classmethod
merge
(pairs, *args, **kwargs)¶ Merge two or more
RegionPairsTable
objects.Parameters: pairs – list of RegionPairsTable
Returns: merged RegionPairsTable
-
pairs
(key=None, lazy=False, *args, **kwargs)¶ Iterate over the
FragmentReadPair
objects.Parameters: - key – Region string of the form <chromosome>[:<start>-<end>],
GenomicRegion
or tuples thereof - lazy – If True, use lazy loading of objects and their attributes.
Much faster, but can lead to unexpected results if one is
not careful. For example, this:
list(object.pairs())
is not the same aslist(object.pairs(lazy=True))
! In the latter case, all objects in the list will be identical due to the lazy loading process. Only use lazy loading to access attributes in an iterator! - args – Positional arguments passed to
edges_dict()
- kwargs – Keyword arguments passed to
edges_dict()
Returns: - key – Region string of the form <chromosome>[:<start>-<end>],
-
pairs_by_chromosomes
(chromosome1, chromosome2, **kwargs)¶ Only iterate over read pairs in this combination of chromosomes. :param chromosome1: Name of first chromosome :param chromosome2: Name of second chromosome :param kwargs: Keyword arguments passed to
pairs()
:return:
-
region_bins
(*args, **kwargs)¶ Return slice of start and end indices spanned by a region.
Parameters: args – provide a GenomicRegion
here to get the slice of start and end bins of onlythis region. To get the slice over all regions leave this blank.Returns:
-
region_data
(key, value=None)¶ Retrieve or add vector-data to this object. If there is existing data in this object with the same name, it will be replaced
Parameters: - key – Name of the data column
- value – vector with region-based data (one entry per region)
-
region_intervals
(region, bins=None, bin_size=None, smoothing_window=None, nan_replacement=None, zero_to_nan=False, score_field='score', *args, **kwargs)¶ Return equally-sized genomic intervals and associated scores.
Use either bins or bin_size argument to control binning.
Parameters: - region – String or class:~GenomicRegion object denoting the region to be binned
- bins – Number of bins to divide the region into
- bin_size – Size of each bin (alternative to bins argument)
- smoothing_window – Size of window (in bins) to smooth scores over
- nan_replacement – NaN values in the scores will be replaced with this value
- zero_to_nan – If True, will convert bins with score 0 to NaN
- args – Arguments passed to _region_intervals
- kwargs – Keyword arguments passed to _region_intervals
Returns: iterator of tuples: (start, end, score)
-
region_subset
(region, *args, **kwargs)¶ Takes a class:~GenomicRegion and returns all regions that overlap with the supplied region.
Parameters: region – String or class:~GenomicRegion object for which covered bins will be returned.
-
regions
¶ Iterate over genomic regions in this object.
Will return a
GenomicRegion
object in every iteration. Can also be used to get the number of regions by calling len() on the object returned by this method.Returns: RegionIter
-
regions_and_edges
(key, *args, **kwargs)¶ Convenient access to regions and edges selected by key.
Parameters: - key – Edge selector, see
edges()
- args – Positional arguments passed to
edges()
- kwargs – Keyword arguments passed to
edges()
Returns: list of row regions, list of col regions, iterator over edges
- key – Edge selector, see
-
regions_dict
¶ Return a dictionary with region index as keys and regions as values.
Returns: dict {region.ix: region, …}
-
static
regions_identical
(pairs)¶ Check if the regions in all objects in the list are identical.
Parameters: pairs – list
ofRegionBased
objectsReturns: True if chromosome, start, and end are identical between all regions in the same list positions.
-
run_queued_filters
(log_progress=True)¶ Run queued filters. See
filter()
Parameters: log_progress – If true, process iterating through all edges will be continuously reported.
-
subset
(*regions, **kwargs)¶ Subset a Hic object by specifying one or more subset regions.
Parameters: - regions – string or GenomicRegion object(s)
- kwargs – Supports
file_name: destination file name of subset Hic object;
tmpdir: if True works in tmp until object is closed
additional parameters are passed to
edges()
Returns: Hic
-
to_bed
(file_name, subset=None, **kwargs)¶ Export regions as BED file
Parameters: - file_name – Path of file to write regions to
- subset – optional
GenomicRegion
or str to write only regions overlapping this region - kwargs – Passed to
write_bed()
-
to_bigwig
(file_name, subset=None, **kwargs)¶ Export regions as BigWig file.
Parameters: - file_name – Path of file to write regions to
- subset – optional
GenomicRegion
or str to write only regions overlapping this region - kwargs – Passed to
write_bigwig()
-
to_gff
(file_name, subset=None, **kwargs)¶ Export regions as GFF file
Parameters: - file_name – Path of file to write regions to
- subset – optional
GenomicRegion
or str to write only regions overlapping this region - kwargs – Passed to
write_gff()
-
class
-
class
fanc.pairs.
SamBamReadPairGenerator
(sam_file1, sam_file2, check_sorted=True)¶ Bases:
fanc.pairs.ReadPairGenerator
Generate read pairs from paired-end SAM/BAM files.
This
ReadPairGenerator
iterates over two SAM or BAM files that have been sorted by qname (for example withsamtools sort -n
).Chimeric reads (mapping partially to multiple genomic locations), such as output by BWA, are handled as follows: If both mate pairs are chimeric, they are removed. If only one mate is chimeric, but it is split into more than 2 alignments, it is also removed. If one mate is chimeric and split into two alignments. If one part of the chimeric alignment maps within 100bp of the regular alignment, the read pair is kept and returned. In all other cases, the pair is removed.
-
add_filter
(read_filter)¶ Add a
ReadFilter
to this object.The filter will be applied during read pair generation.
Parameters: read_filter – ReadFilter
-
filter_multi_mapping
(strict=True)¶ Convenience function that registers a UniquenessFilter. The actual algorithm and rationale used for filtering will depend on the internal _mapper attribute.
Parameters: strict – If True will filter if XS tag is present. If False, will filter only when XS tag is not 0. This is applied if alignments are from bowtie2.
-
filter_quality
(cutoff=30)¶ Convenience function that registers a QualityFilter. The actual algorithm and rationale used for filtering will depend on the internal _mapper attribute.
Parameters: cutoff – Minimum mapping quality (mapq) a read must have to pass the filter
-
filter_unmapped
()¶ Convenience function that registers an UnmappedFilter.
-
stats
()¶ Return filter statistics collected during read pair generation.
The
__iter__()
filters reads based on the filters added usingadd_filter()
. During filtering, it keeps track of the numbers of read pairs that were filtered. This function returns adict
of the form <filter_name>: <number of reads filtered out>.Returns: dict
-
-
class
fanc.pairs.
SelfLigationFilter
(mask=None)¶ Bases:
fanc.pairs.FragmentReadPairFilter
Filters read pairs where one or both reads map to the same restriction fragment.
-
valid
(row)¶ Map validity check of rows to pairs.
-
valid_pair
(pair)¶ Check if both reads are on the same fragment.
-
-
class
fanc.pairs.
TxtReadPairGenerator
(valid_pairs_file, sep=None, chr1_field=1, pos1_field=2, strand1_field=3, chr2_field=4, pos2_field=5, strand2_field=6)¶ Bases:
fanc.pairs.ReadPairGenerator
Generate read pairs from a plain text file.
This is an implementation of
ReadPairGenerator
for reading read pairs from an arbitrary text file. For specific text file formats have a look atFourDNucleomePairGenerator
orHicProPairGenerator
.TxtReadPairGenerator
iterates over lines in a file and splits them into fields at “sep”. It then extracts the chromosome, position and strand for each read in the pair, according to the fields specified by “chr<1|2>_field”, “pos<1|2>_field”, and “strand<1|2>_field”. If your file does not have strand fields, or if you don’t want to load them, you can simply set them to “None”.-
add_filter
(read_filter)¶ Add a
ReadFilter
to this object.The filter will be applied during read pair generation.
Parameters: read_filter – ReadFilter
-
stats
()¶ Return filter statistics collected during read pair generation.
The
__iter__()
filters reads based on the filters added usingadd_filter()
. During filtering, it keeps track of the numbers of read pairs that were filtered. This function returns adict
of the form <filter_name>: <number of reads filtered out>.Returns: dict
-
-
class
fanc.pairs.
UniquenessFilter
(strict=True, mask=None)¶ Bases:
fanc.pairs.ReadFilter
Filter reads that do not map uniquely to the reference sequence.
-
valid_read
(read)¶ Check if a read has an XS tag.
If strict is enabled checks if a read has an XS tag. If not strict, XS has to be smaller than AS (alignment score) for a valid read.
-
-
class
fanc.pairs.
UnmappedFilter
(mask=None)¶ Bases:
fanc.pairs.ReadFilter
Filter reads that do not map to the reference sequence.
-
valid_read
(read)¶ Check if the the flag bit 4 is set.
-
-
fanc.pairs.
generate_pairs
(sam1_file, sam2_file, regions, restriction_enzyme=None, output_file=None, read_filters=(), check_sorted=True, threads=1, batch_size=10000000)¶ Generate Pairs object from SAM/BAM files.
This is a convenience function that let’s you create a Pairs object from SAM/BAM data in a single step.
Parameters: - sam1_file – Path to a sorted SAM/BAM file (1st mate)
- sam2_file – Path to a sorted SAM/BAM file (2nd mate)
- regions – Path to file with restriction fragments (BED, GFF) or FASTA with genomic sequence
- restriction_enzyme – Name of restriction enzyme (only when providing FASTA as regions)
- read_filters – List of
ReadFilter
to filter reads while loading from SAM/BAM - output_file – Path to output file
- check_sorted – Double-check that input SAM files are sorted if True (default)
- threads – Number of threads used for finding restriction fragments for read pairs
- batch_size – Number of read pairs sent to each restriction fragment worker
Returns:
-
fanc.pairs.
generate_pairs_split
(sam1_file, sam2_file, regions, restriction_enzyme=None, output_file=None, read_filters=(), check_sorted=True, threads=1, batch_size=1000000)¶ Generate Pairs object from SAM/BAM files.
This is a convenience function that let’s you create a Pairs object from SAM/BAM data in a single step.
Parameters: - sam1_file – Path to a sorted SAM/BAM file (1st mate)
- sam2_file – Path to a sorted SAM/BAM file (2nd mate)
- regions – Path to file with restriction fragments (BED, GFF) or FASTA with genomic sequence
- restriction_enzyme – Name of restriction enzyme (only when providing FASTA as regions)
- read_filters – List of
ReadFilter
to filter reads while loading from SAM/BAM - output_file – Path to output file
- check_sorted – Double-check that input SAM files are sorted if True (default)
- threads – Number of threads used for finding restriction fragments for read pairs
- batch_size – Number of read pairs sent to each restriction fragment worker
Returns: