Pairs module¶
-
class
fanc.pairs.BwaMemQualityFilter(cutoff=0.9, mask=None)¶ Bases:
fanc.pairs.ReadFilterFilters
bwa memgenerated alignments based on the alignment score (normalized by the length of the alignment).-
valid_read(read)¶ Check if a read has a high alignment score.
-
-
class
fanc.pairs.BwaMemUniquenessFilter(strict=False, mask=None)¶ Bases:
fanc.pairs.ReadFilterFilters bwa mem generated alignments based on whether they are unique or not.
-
valid_read(read)¶ Determine if a read is valid or should be filtered.
When implementing custom read filters this method must be overridden. It should return False for reads that are to be filtered and True otherwise.
Internally, the ReadPairs object will iterate over all Read instances to determine their validity on an individual basis.
Parameters: read – A ReadobjectReturns: True if Read is valid, False otherwise
-
-
class
fanc.pairs.ContaminantFilter(contaminant_reads, mask=None)¶ Bases:
fanc.pairs.ReadFilterFilter reads that also map to a contaminant genome.
-
valid_read(read)¶ Check if a read also maps to a contaminant
-
-
class
fanc.pairs.FourDNucleomePairGenerator(pairs_file)¶ Bases:
fanc.pairs.TxtReadPairGeneratorRead pair generator that works on 4D Nucleome “.pairs” files.
For details on the 4D Nucleome pairs format see: https://github.com/4dn-dcic/pairix/blob/master/pairs_format_specification.md
-
add_filter(read_filter)¶ Add a
ReadFilterto this object.The filter will be applied during read pair generation.
Parameters: read_filter – ReadFilter
-
stats()¶ Return filter statistics collected during read pair generation.
The
__iter__()filters reads based on the filters added usingadd_filter(). During filtering, it keeps track of the numbers of read pairs that were filtered. This function returns adictof the form <filter_name>: <number of reads filtered out>.Returns: dict
-
-
class
fanc.pairs.FragmentRead(fragment=None, position=None, strand=0)¶ Bases:
objectClass representing a fragment-mapped read.
-
fragment¶ A
GenomicRegiondelineated by restriction sites.
-
position¶ The position of this read in base-pairs (1-based) from the start of the chromosome it maps to.
-
strand¶ The strand this read maps to (-1 or +1).
-
re_distance()¶ Get the distance of the alignment to the nearest restriction site. :return: int
-
-
class
fanc.pairs.FragmentReadPair(left_read, right_read, ix=None)¶ Bases:
objectContainer for two paired
FragmentReadobjects.-
get_gap_size()¶ Get the gap size in base pairs between the fragments these reads map to.
Returns: 0 if reads map to the same fragment or neighboring fragments, the distance between fragments if they are on the same chromosome, None otherwise
-
is_inward_pair()¶ Check if reads form an “inward-facing” pair.
A pair is considered inward-facing if the left read maps to the plus the right read to the minus strand and both reads map to the same chromosome.
Returns: True if reads are inward-facing, False otherwise
-
is_outward_pair()¶ Check if reads form an “outward-facing” pair.
A pair is considered outward-facing if the left read maps to the minus the right read to the plus strand and both reads map to the same chromosome.
Returns: True if reads are outward-facing, False otherwise
-
is_same_chromosome()¶ Check if both reads are mapped to the same chromosome.
Returns: True is reads map to the same chromosome, False otherwise
-
is_same_fragment()¶ Check if reads map to the same fragment.
Returns: True if reads map to the same fragment, False otherwise.
-
is_same_pair()¶ Check if reads face in the same direction.
Returns: True if reads are facing in the same direction, False otherwise.
-
-
class
fanc.pairs.FragmentReadPairFilter(mask=None)¶ Bases:
fanc.general.MaskFilterAbstract class that provides filtering functionality for the
FragmentReadPairobject.Extends MaskFilter and overrides valid(self, read) to make
FragmentReadPairfiltering more “natural”.To create custom filters for the
FragmentMappedReadPairsobject, extend this class and override thevalid_pair()method. valid_pair should return False for a specificFragmentReadPairobject if the object is supposed to be filtered/masked and True otherwise. SeeInwardPairsFilterfor an example.Pass a custom filter to the filter method in
FragmentMappedReadPairsto apply it.-
valid(row)¶ Map validity check of rows to pairs.
-
-
class
fanc.pairs.HicProPairGenerator(file_name)¶ Bases:
fanc.pairs.TxtReadPairGeneratorRead pair generator for HiC-Pro “validPairs” files.
This generator is a subclass of
TxtReadPairGeneratorwith presets for fields in HiC-Pro validPairs files.-
add_filter(read_filter)¶ Add a
ReadFilterto this object.The filter will be applied during read pair generation.
Parameters: read_filter – ReadFilter
-
stats()¶ Return filter statistics collected during read pair generation.
The
__iter__()filters reads based on the filters added usingadd_filter(). During filtering, it keeps track of the numbers of read pairs that were filtered. This function returns adictof the form <filter_name>: <number of reads filtered out>.Returns: dict
-
-
class
fanc.pairs.InwardPairsFilter(minimum_distance=10000, mask=None)¶ Bases:
fanc.pairs.FragmentReadPairFilterFilter inward-facing read pairs at a distance less than a specified cutoff.
-
valid(row)¶ Map validity check of rows to pairs.
-
valid_pair(pair)¶ Check if a pair is inward-facing and <minimum_distance> apart.
-
-
class
fanc.pairs.LazyFragment(row, pairs, ix=None, side='left')¶ Bases:
genomic_regions.regions.GenomicRegionGenomicRegionrepresenting a fragment with lazy attribute loading.-
chromosome¶ The reference sequence this region is located on
-
start¶ start position of this region on the reference (1-based, inclusive)
-
end¶ end position of this region on the reference (1-based, inclusive)
-
strand¶ strand of the reference this region is located on (-1, 1, 0, or None)
-
ix¶ Region index within the context of regions from the same object.
-
as_bed_line(score_field='score', name_field='name')¶ Return a representation of this object as line in a BED file.
Parameters: - score_field – name of the attribute to be used in the ‘score’ field of the BED line
- name_field – name of the attribute to be used in the ‘name’ field of the BED line
Returns: str
-
as_gff_line(source_field='source', feature_field='feature', score_field='score', frame_field='frame', float_format='.2e')¶ Return a representation of this object as line in a GFF file.
Parameters: - source_field – name of the attribute to be used in the ‘source’ field of the GFF line
- feature_field – name of the attribute to be used in the ‘feature’ field of the GFF line
- score_field – name of the attribute to be used in the ‘score’ field of the GFF line
- frame_field – name of the attribute to be used in the ‘frame’ field of the GFF line
- float_format – Formatting string for the float fields
Returns: str
-
attributes¶ Return all visible attributes of this
GenomicRegion.Returns all attribute names that do not start with an underscore. :return: list of attribute names
-
center¶ Return the center coordinate of the
GenomicRegion.Returns: float
-
contains(region)¶ Check if the specified region is completely contained in this region.
Parameters: region – GenomicRegionobject or string
-
copy()¶ Return a (shallow) copy of this
GenomicRegionReturns: GenomicRegion
-
expand(absolute=None, relative=None, absolute_left=0, absolute_right=0, relative_left=0.0, relative_right=0.0, copy=True, from_center=False)¶ Expand this region by a relative or an absolute amount.
Parameters: - absolute – Absolute amount in base pairs by which to
expand the region represented by this
GenomicRegionobject on both sides. New region start will be <old start - absolute>, new region end will be <old end + absolute> - relative – Relative amount as fraction of region by which to
expand the region represented by this
GenomicRegionobject on both sides. New region start will be <old start - relative*len(self)>, new region end will be <old end + relative*(len(self)> - absolute_left – Absolute amount in base pairs by which to
expand the region represented by this
GenomicRegionobject on the left side - absolute_right – Absolute amount in base pairs by which to
expand the region represented by this
GenomicRegionobject on the right side - relative_left – Relative amount in base pairs by which to
expand the region represented by this
GenomicRegionobject on the left side - relative_right – Relative amount in base pairs by which to
expand the region represented by this
GenomicRegionobject on the right side - copy – If True, return a copy of the original region, if False will modify the existing region in place
- from_center – If True measures distance from center rather than start and end of the old region
Returns: GenomicRegion- absolute – Absolute amount in base pairs by which to
expand the region represented by this
-
five_prime¶ Return the position of the 5’ end of this
GenomicRegionon the reference.Returns: int
-
fix_chromosome(copy=False)¶ Change chromosome representation from chr<NN> to <NN> or vice versa.
Parameters: copy – If True, make copy of region, otherwise will modify existing region in place. Returns: GenomicRegion
-
classmethod
from_string(region_string)¶ Convert a string into a
GenomicRegion.This is a very useful convenience function to quickly define a
GenomicRegionobject from a descriptor string. Numbers can be abbreviated as ‘12k’, ‘1.5M’, etc.Parameters: region_string – A string of the form <chromosome>[:<start>-<end>[:<strand>]] (with square brackets indicating optional parts of the string). If any optional part of the string is omitted, intuitive defaults will be chosen. Returns: GenomicRegion
-
is_forward()¶ Return True if this region is on the forward strand of the reference genome.
Returns: True if on ‘+’ strand, False otherwise.
-
is_reverse()¶ Return True if this region is on the reverse strand of the reference genome.
Returns: True if on ‘-’ strand, False otherwise.
-
overlap(region)¶ Return the overlap in base pairs between this region and another region.
Parameters: region – GenomicRegionto find overlap forReturns: overlap as int in base pairs
-
overlaps(region)¶ Check if this region overlaps with the specified region.
Parameters: region – GenomicRegionobject or string
-
set_attribute(attribute, value)¶ Safely set an attribute on the
GenomicRegionobject.This automatically decodes bytes objects into UTF-8 strings. If you do not care about this, you can also use region.<attribute> = <value> directly.
Parameters: - attribute – Name of the attribute to be set
- value – Value of the attribute to be set
-
strand_string¶ Return the ‘strand’ attribute as string.
Returns: strand as str (‘+’, ‘-’, or ‘.’)
-
three_prime¶ Return the position of the 3’ end of this
GenomicRegionon the reference.Returns: int
-
to_string()¶ Convert this
GenomicRegionto its string representation.Returns: str
-
-
class
fanc.pairs.LazyFragmentRead(row, pairs, side='left')¶ Bases:
fanc.pairs.FragmentReadFragmentReadimplementation with lazy attribute loading.-
fragment¶ A
GenomicRegiondelineated by restriction sites.
-
position¶ The position of this read in base-pairs (1-based) from the start of the chromosome it maps to.
-
strand¶ The strand this read maps to (-1 or +1).
-
re_distance()¶ Get the distance of the alignment to the nearest restriction site. :return: int
-
-
class
fanc.pairs.MinimalRead(chromosome, position, strand)¶ Bases:
objectMinimal class representing an aligned read.
-
chromosome¶ Chromosome name
-
reference_name¶ Identical to chromosome, exists for compatibility with pysam
-
position¶ Position of aligned read on chromosome
-
pos¶ Identical to position, exists for compatibility with Pysam
-
strand¶ Strand of the alignment
-
flag¶ Flag representing the strandedness of a read. 0 if on forward strand, else -1
-
-
class
fanc.pairs.Monitor(value=0, manager=None)¶ Bases:
fanc.tools.general.WorkerMonitorClass to monitor fragment info worker threads.
-
is_generating_pairs()¶ Check the pair generating status.
-
set_generating_pairs(value)¶ Set the pair generating status.
-
-
class
fanc.pairs.OutwardPairsFilter(minimum_distance=10000, mask=None)¶ Bases:
fanc.pairs.FragmentReadPairFilterFilter outward-facing read pairs at a distance less than a specified cutoff.
-
valid(row)¶ Map validity check of rows to pairs.
-
-
class
fanc.pairs.PCRDuplicateFilter(pairs, threshold=2, mask=None)¶ Bases:
fanc.pairs.FragmentReadPairFilterMasks alignments that are suspected to be PCR duplicates. In order to be considered duplicates, two pairs need to have identical start positions of their respective left alignments AND of their right alignments.
-
valid(row)¶ Map validity check of rows to pairs.
-
valid_pair(pair)¶ Check if a pair is duplicated.
-
-
class
fanc.pairs.PairedSamBamReadPairGenerator(sam_file)¶ Bases:
fanc.pairs.ReadPairGeneratorGenerate read pairs from single SAM/BAM files from reads with identical qname.
Chimeric reads (mapping partially to multiple genomic locations), such as output by BWA, are handled as follows: If both mate pairs are chimeric, they are removed. If only one mate is chimeric, but it is split into more than 2 alignments, it is also removed. If one mate is chimeric and split into two alignments. If one part of the chimeric alignment maps within 100bp of the regular alignment, the read pair is kept and returned. In all other cases, the pair is removed.
-
add_filter(read_filter)¶ Add a
ReadFilterto this object.The filter will be applied during read pair generation.
Parameters: read_filter – ReadFilter
-
static
resolve_chimeric(reads, max_dist_same_locus=500)¶ Returns: read1, read2, is_chimeric
-
stats()¶ Return filter statistics collected during read pair generation.
The
__iter__()filters reads based on the filters added usingadd_filter(). During filtering, it keeps track of the numbers of read pairs that were filtered. This function returns adictof the form <filter_name>: <number of reads filtered out>.Returns: dict
-
-
class
fanc.pairs.QualityFilter(cutoff=30, mask=None)¶ Bases:
fanc.pairs.ReadFilterFilter mapped reads based on mapping quality.
-
valid_read(read)¶ Check if a read has a mapq >= cutoff.
-
-
class
fanc.pairs.ReDistanceFilter(maximum_distance=10000, mask=None)¶ Bases:
fanc.pairs.FragmentReadPairFilterFilters read pairs where the sum of distances of each read to its nearest restriction site is larger than a certain distance.
-
valid(row)¶ Map validity check of rows to pairs.
-
valid_pair(pair)¶ Check sum of restriction site distances < maximum_distance.
-
-
class
fanc.pairs.ReadFilter(mask=None)¶ Bases:
objectAbstract class that provides filtering functionality for
MinimalRead,AlignedSegmentor compatible.To create a custom
ReadFilter, extend this class and override the valid_read(self, read) method. valid_read should return False for a specific read object if the object is supposed to be filtered/masked and True otherwise. SeeQualityFilterfor an example.Pass a custom filter to the filter method in
ReadPairGeneratorto apply it.-
valid_read(read)¶ Determine if a read is valid or should be filtered.
When implementing custom read filters this method must be overridden. It should return False for reads that are to be filtered and True otherwise.
Internally, the ReadPairs object will iterate over all Read instances to determine their validity on an individual basis.
Parameters: read – A ReadobjectReturns: True if Read is valid, False otherwise
-
-
class
fanc.pairs.ReadPairGenerator¶ Bases:
objectBase class for generating and filtering read pairs.
This class primarily provides filtering capabilities for read pairs generated by subclasses of
ReadPairGenerator. You can add aReadFilterusingadd_filter(), which will be used during read pair generation. Filtering statistics are collected during the run, and can be obtained viastats().These generators are primarily meant as input for
add_read_pairs(), but can also be used independently.Subclasses of
ReadPairGeneratormust implement the_iter_read_pairs()function.-
add_filter(read_filter)¶ Add a
ReadFilterto this object.The filter will be applied during read pair generation.
Parameters: read_filter – ReadFilter
-
stats()¶ Return filter statistics collected during read pair generation.
The
__iter__()filters reads based on the filters added usingadd_filter(). During filtering, it keeps track of the numbers of read pairs that were filtered. This function returns adictof the form <filter_name>: <number of reads filtered out>.Returns: dict
-
-
class
fanc.pairs.ReadPairs(file_name=None, mode='a', partition_strategy='auto', _group_name='fragment_map', _table_name_fragments='fragments', _table_name_pairs='pairs', tmpdir=None)¶ Bases:
fanc.matrix.RegionPairsTableClass representing a collection of read pairs mapped to restriction fragments.
This class is a
RegionBasedobject, where eachGenomicRegionrepresents a restriction fragment from a Hi-C experiment. A list of fragments can be obtained with thegenome_regions()function, for example.To create a
ReadPairsobject, you first have to add the restriction fragments before adding read pairs:import fanc re_fragments = fanc.genome_regions("hg19_chr18_19.fa", "HindIII") rp = fanc.ReadPairs() rp.add_regions(re_fragments.regions)
Read pairs can easily be generate fro different types of input using
ReadPairGeneratorimplementations, e.g.HicProReadPairGeneratororSamBamReadPairGenerator.rp_generator = fanc.SamBamReadPairGenerator("output/sam/SRR4271982_chr18_19_1_sort.bam", "output/sam/SRR4271982_chr18_19_2_sort.bam") rp.add_read_pairs(rp_generator, threads=4)
You can query regions using the
RegionBasedinterface:chr1_fragments = rp.regions("chr1")
and you can iterate over read pairs using the
pairs():you can also use the :cls:Region
-
class
ChromosomeDescription¶ Bases:
tables.description.IsDescriptionDescription of the chromosomes in this object.
-
class
MaskDescription¶ Bases:
tables.description.IsDescription
-
class
RegionDescription¶ Bases:
tables.description.IsDescriptionDescription of a genomic region for PyTables Table
-
add_contact(contact, *args, **kwargs)¶ Alias for
add_edge()Parameters: - contact –
Edge - args – Positional arguments passed to
_add_edge() - kwargs – Keyword arguments passed to
_add_edge()
- contact –
-
add_contacts(contacts, *args, **kwargs)¶ Alias for
add_edges()
-
add_edge(edge, check_nodes_exist=True, *args, **kwargs)¶ Add an edge / contact between two regions to this object.
Parameters: - edge –
Edge, dict with at least the attributes source and sink, optionally weight, or a list of length 2 (source, sink) or 3 (source, sink, weight). - check_nodes_exist – Make sure that there are nodes that match source and sink indexes
- args – Positional arguments passed to
_add_edge() - kwargs – Keyword arguments passed to
_add_edge()
- edge –
-
add_edge_from_dict(edge, *args, **kwargs)¶ Direct method to add an edge from dict input.
Parameters: edge – dict with at least the keys “source” and “sink”. Additional keys will be loaded as edge attributes
-
add_edge_from_edge(edge, *args, **kwargs)¶ Direct method to add an edge from
Edgeinput.Parameters: edge – Edge
-
add_edge_from_list(edge, *args, **kwargs)¶ Direct method to add an edge from list or tuple input.
Parameters: edge – List or tuple. Should be of length 2 (source, sink) or 3 (source, sink, weight)
-
add_edge_simple(source, sink, weight=None, *args, **kwargs)¶ Direct method to add an edge from
Edgeinput.Parameters: - source – Source region index
- sink – Sink region index
- weight – Weight of the edge
-
add_edges(edges, flush=True, *args, **kwargs)¶ Bulk-add edges from a list.
List items can be any of the supported edge types, list, tuple, dict, or
Edge. Repeatedly callsadd_edge(), so may be inefficient for large amounts of data.Parameters: edges – List (or iterator) of edges. See add_edge()for details
-
add_mask_description(name, description)¶ Add a mask description to the _mask table and return its ID.
Parameters: - name (str) – name of the mask
- description (str) – description of the mask
Returns: id of the mask
Return type: int
-
add_read_pairs(read_pairs, batch_size=1000000, threads=1)¶ Add read pairs to this object.
This function requires tuples of read pairs as input, for example from
MinimalReadorAlignedSegment. Typically, you won’t have to construct these from scratch, but can use one of theReadPairGeneratorclasses to generate read pairs from input files.import fanc re_fragments = fanc.genome_regions("hg19_chr18_19.fa", "HindIII") rp = fanc.ReadPairs() rp.add_regions(re_fragments.regions) # read pairs are added here from BAM files rp_generator = fanc.SamBamReadPairGenerator("output/sam/SRR4271982_chr18_19_1_sort.bam", "output/sam/SRR4271982_chr18_19_2_sort.bam") rp.add_read_pairs(rp_generator, threads=4)
Parameters: - read_pairs – iterator over tuples of read pairs. Typically
instances of
ReadPairGenerator - batch_size – Batch size of read pairs sent to fragment info workers
- threads – Number of threads for simultaneous fragment info finding
- read_pairs – iterator over tuples of read pairs. Typically
instances of
-
add_region(region, *args, **kwargs)¶ Add a genomic region to this object.
This method offers some flexibility in the types of objects that can be loaded. See parameters for details.
Parameters: region – Can be a GenomicRegion, a str in the form ‘<chromosome>:<start>-<end>[:<strand>], a dict with at least the fields ‘chromosome’, ‘start’, and ‘end’, optionally ‘ix’, or a list of length 3 (chromosome, start, end) or 4 (ix, chromosome, start, end).
-
add_regions(regions, *args, **kwargs)¶ Bulk insert multiple genomic regions.
Parameters: regions – List (or any iterator) with objects that describe a genomic region. See add_regionfor options.
-
static
bin_intervals(intervals, bins, interval_range=None, smoothing_window=None, nan_replacement=None, zero_to_nan=False)¶ Bin a given set of intervals into a fixed number of bins.
Parameters: - intervals – iterator of tuples (start, end, score)
- bins – Number of bins to divide the region into
- interval_range – Optional. Tuple (start, end) in base pairs of range of interval to be binned. Useful if intervals argument does not cover to exact genomic range to be binned.
- smoothing_window – Size of window (in bins) to smooth scores over
- nan_replacement – NaN values in the scores will be replaced with this value
- zero_to_nan – If True, will convert bins with score 0 to NaN
Returns: iterator of tuples: (start, end, score)
-
static
bin_intervals_equidistant(intervals, bin_size, interval_range=None, smoothing_window=None, nan_replacement=None, zero_to_nan=False)¶ Bin a given set of intervals into bins with a fixed size.
Parameters: - intervals – iterator of tuples (start, end, score)
- bin_size – Size of each bin in base pairs
- interval_range – Optional. Tuple (start, end) in base pairs of range of interval to be binned. Useful if intervals argument does not cover to exact genomic range to be binned.
- smoothing_window – Size of window (in bins) to smooth scores over
- nan_replacement – NaN values in the scores will be replaced with this value
- zero_to_nan – If True, will convert bins with score 0 to NaN
Returns: iterator of tuples: (start, end, score)
-
bin_size¶ Return the length of the first region in the dataset.
Assumes all bins have equal size.
Returns: int
-
binned_regions(region=None, bins=None, bin_size=None, smoothing_window=None, nan_replacement=None, zero_to_nan=False, *args, **kwargs)¶ Same as region_intervals, but returns
GenomicRegionobjects instead of tuples.Parameters: - region – String or class:~GenomicRegion object denoting the region to be binned
- bins – Number of bins to divide the region into
- bin_size – Size of each bin (alternative to bins argument)
- smoothing_window – Size of window (in bins) to smooth scores over
- nan_replacement – NaN values in the scores will be replaced with this value
- zero_to_nan – If True, will convert bins with score 0 to NaN
- args – Arguments passed to _region_intervals
- kwargs – Keyword arguments passed to _region_intervals
Returns: iterator of
GenomicRegionobjects
-
bins_to_distance(bins)¶ Convert fraction of bins to base pairs
Parameters: bins – float, fraction of bins Returns: int, base pairs
-
chromosome_bins¶ Returns a dictionary of chromosomes and the start and end index of the bins they cover.
Returned list is range-compatible, i.e. chromosome bins [0,5] cover chromosomes 1, 2, 3, and 4, not 5.
-
chromosome_lengths¶ Returns a dictionary of chromosomes and their length in bp.
-
chromosomes()¶ List all chromosomes in this regions table. :return: list of chromosome names.
-
close(copy_tmp=True, remove_tmp=True)¶ Close this HDF5 file and run exit operations.
If file was opened with tmpdir in read-only mode: close file and delete temporary copy.
If file was opened with tmpdir in write or append mode: Replace original file with copy and delete copy.
Parameters: - copy_tmp – If False, does not overwrite original with modified file.
- remove_tmp – If False, does not delete temporary copy of file.
-
distance_to_bins(distance)¶ Convert base pairs to fraction of bins.
Parameters: distance – distance in base pairs Returns: float, distance as fraction of bin size
-
downsample(n, file_name=None)¶ Sample edges from this object.
Sampling is always done on uncorrected Hi-C matrices.
Parameters: - n – Sample size or reference object. If n < 1 will be interpreted as a fraction of total reads in this object.
- file_name – Output file name for down-sampled object.
Returns: RegionPairsTable
-
edge_data(attribute, *args, **kwargs)¶ Iterate over specific edge attribute.
Parameters: - attribute – Name of the attribute, e.g. “weight”
- args – Positional arguments passed to
edges() - kwargs – Keyword arguments passed to
edges()
Returns: iterator over edge attribute
-
edge_subset(key=None, *args, **kwargs)¶ Get a subset of edges.
This is an alias for
edges().Returns: generator ( Edge)
-
edges¶ Iterate over contacts / edges.
edges()is the central function ofRegionPairsContainer. Here, we will use theHicimplementation for demonstration purposes, but the usage is exactly the same for all compatible objects implementingRegionPairsContainer, includingJuicerHicandCoolerHic.import fanc # file from FAN-C examples hic = fanc.load("output/hic/binned/fanc_example_1mb.hic")
We can easily find the number of edges in the sample
Hicobject:len(hic.edges) # 8695
When used in an iterator context,
edges()iterates over all edges in theRegionPairsContainer:for edge in hic.edges: # do something with edge print(edge) # 42--42; bias: 5.797788472650082e-05; sink_node: chr18:42000001-43000000; source_node: chr18:42000001-43000000; weight: 0.12291311562018173 # 24--28; bias: 6.496381719803623e-05; sink_node: chr18:28000001-29000000; source_node: chr18:24000001-25000000; weight: 0.025205961072838057 # 5--76; bias: 0.00010230955745211447; sink_node: chr18:76000001-77000000; source_node: chr18:5000001-6000000; weight: 0.00961709840049876 # 66--68; bias: 8.248432587969082e-05; sink_node: chr18:68000001-69000000; source_node: chr18:66000001-67000000; weight: 0.03876763316345468 # ...
Calling
edges()as a method has the same effect:# note the '()' for edge in hic.edges(): # do something with edge print(edge) # 42--42; bias: 5.797788472650082e-05; sink_node: chr18:42000001-43000000; source_node: chr18:42000001-43000000; weight: 0.12291311562018173 # 24--28; bias: 6.496381719803623e-05; sink_node: chr18:28000001-29000000; source_node: chr18:24000001-25000000; weight: 0.025205961072838057 # 5--76; bias: 0.00010230955745211447; sink_node: chr18:76000001-77000000; source_node: chr18:5000001-6000000; weight: 0.00961709840049876 # 66--68; bias: 8.248432587969082e-05; sink_node: chr18:68000001-69000000; source_node: chr18:66000001-67000000; weight: 0.03876763316345468 # ...
Rather than iterate over all edges in the object, we can select only a subset. If the key is a string or a
GenomicRegion, all non-zero edges connecting the region described by the key to any other region are returned. If the key is a tuple of strings orGenomicRegion, only edges between the two regions are returned.# select all edges between chromosome 19 # and any other region: for edge in hic.edges("chr19"): print(edge) # 49--106; bias: 0.00026372303696871666; sink_node: chr19:27000001-28000000; source_node: chr18:49000001-50000000; weight: 0.003692122517562033 # 6--82; bias: 0.00021923129703834945; sink_node: chr19:3000001-4000000; source_node: chr18:6000001-7000000; weight: 0.0008769251881533978 # 47--107; bias: 0.00012820949175399097; sink_node: chr19:28000001-29000000; source_node: chr18:47000001-48000000; weight: 0.0015385139010478917 # 38--112; bias: 0.0001493344481069762; sink_node: chr19:33000001-34000000; source_node: chr18:38000001-39000000; weight: 0.0005973377924279048 # ... # select all edges that are only on # chromosome 19 for edge in hic.edges(('chr19', 'chr19')): print(edge) # 90--116; bias: 0.00021173151730025176; sink_node: chr19:37000001-38000000; source_node: chr19:11000001-12000000; weight: 0.009104455243910825 # 135--135; bias: 0.00018003890596887822; sink_node: chr19:56000001-57000000; source_node: chr19:56000001-57000000; weight: 0.10028167062466517 # 123--123; bias: 0.00011063368998965993; sink_node: chr19:44000001-45000000; source_node: chr19:44000001-45000000; weight: 0.1386240135570439 # 92--93; bias: 0.00040851066434864896; sink_node: chr19:14000001-15000000; source_node: chr19:13000001-14000000; weight: 0.10090213409411629 # ... # select inter-chromosomal edges # between chromosomes 18 and 19 for edge in hic.edges(('chr18', 'chr19')): print(edge) # 49--106; bias: 0.00026372303696871666; sink_node: chr19:27000001-28000000; source_node: chr18:49000001-50000000; weight: 0.003692122517562033 # 6--82; bias: 0.00021923129703834945; sink_node: chr19:3000001-4000000; source_node: chr18:6000001-7000000; weight: 0.0008769251881533978 # 47--107; bias: 0.00012820949175399097; sink_node: chr19:28000001-29000000; source_node: chr18:47000001-48000000; weight: 0.0015385139010478917 # 38--112; bias: 0.0001493344481069762; sink_node: chr19:33000001-34000000; source_node: chr18:38000001-39000000; weight: 0.0005973377924279048 # ...
By default,
edges()will retrieve all edge attributes, which can be slow when iterating over a lot of edges. This is why all file-based FAN-CRegionPairsContainerobjects support lazy loading, where attributes are only read on demand.for edge in hic.edges('chr18', lazy=True): print(edge.source, edge.sink, edge.weight, edge) # 42 42 0.12291311562018173 <fanc.matrix.LazyEdge for row /edges/chrpair_0_0.row (Row), pointing to row #0> # 24 28 0.025205961072838057 <fanc.matrix.LazyEdge for row /edges/chrpair_0_0.row (Row), pointing to row #1> # 5 76 0.00961709840049876 <fanc.matrix.LazyEdge for row /edges/chrpair_0_0.row (Row), pointing to row #2> # 66 68 0.03876763316345468 <fanc.matrix.LazyEdge for row /edges/chrpair_0_0.row (Row), pointing to row #3> # ...
Warning
The lazy iterator reuses the
LazyEdgeobject in every iteration, and overwrites theLazyEdgeattributes. Therefore do not use lazy iterators if you need to store edge objects for later access. For example, the following code works as expectedlist(hic.edges()), with allEdgeobjects stored in the list, while this codelist(hic.edges(lazy=True))will result in a list of identicalLazyEdgeobjects. Always ensure you do all edge processing in the loop when working with lazy iterators!When working with normalised contact frequencies, such as obtained through matrix balancing in the example above,
edges()automatically returns normalised edge weights. In addition, thebiasattribute will (typically) have a value different from 1.When you are interested in the raw contact frequency, use the
norm=Falseparameter:for edge in hic.edges('chr18', lazy=True, norm=False): print(edge.source, edge.sink, edge.weight) # 42 42 2120.0 # 24 28 388.0 # 5 76 94.0 # 66 68 470.0 # ...
You can also choose to omit all intra- or inter-chromosomal edges using
intra_chromosomal=Falseorinter_chromosomal=False, respectively.Returns: Iterator over Edgeor equivalent.
-
edges_dict(*args, **kwargs)¶ Edges iterator with access by bracket notation.
This iterator always returns unnormalised edges.
Returns: dict or dict-like iterator
-
filter(pair_filter, queue=False, log_progress=True)¶ Apply a
FragmentReadPairFilterto the read pairs in this object.Parameters: - pair_filter –
FragmentReadPairFilter - queue – If True, does not do the filtering immediately, but
queues this filter. All queued filters can then be run
at the same time using
run_queued_filters() - log_progress –
Returns: - pair_filter –
-
filter_inward(minimum_distance=None, queue=False, **kwargs)¶ Convenience function that applies an
InwardPairsFilter.Parameters: - minimum_distance – Minimum distance inward-facing read pairs must have to pass this filter
- queue – If True, filter will be queued and can be executed along with other queued filters using run_queued_filters
- kwargs – Additional arguments to pass
to
get_ligation_structure_biases()
-
filter_ligation_products(inward_threshold=None, outward_threshold=None, queue=False, **kwargs)¶ Convenience function that applies an
OutwardPairsFilterand anInwardPairsFilter.Parameters: - inward_threshold – Minimum distance inward-facing read pairs must have to pass this filter. If None, will be inferred from the data
- outward_threshold – Minimum distance outward-facing read pairs must have to pass this filter. If None, will be inferred from the data
- queue – If True, filter will be queued and can be executed along with other queued filters using run_queued_filters
- kwargs – Additional arguments to pass
to
get_ligation_structure_biases()
-
filter_outward(minimum_distance=None, queue=False, **kwargs)¶ Convenience function that applies an
OutwardPairsFilter.Parameters: - minimum_distance – Minimum distance outward-facing read pairs must have to pass this filter
- queue – If True, filter will be queued and can be executed along with other queued filters using run_queued_filters
- kwargs – Additional arguments to pass
to
get_ligation_structure_biases()
-
filter_pcr_duplicates(threshold=3, queue=False)¶ Convenience function that applies an
PCRDuplicateFilter.Parameters: - threshold – If distance between two alignments is smaller or equal the threshold, the alignments are considered to be starting at the same position
- queue – If True, filter will be queued and can be executed along with other queued filters using run_queued_filters
-
filter_re_dist(maximum_distance, queue=False)¶ Convenience function that applies an
ReDistanceFilter.Parameters: - maximum_distance – Maximum distance a read can have to the nearest region border (=restriction site)
- queue – If True, filter will be queued and can be executed along with other queued filters using run_queued_filters
-
filter_self_ligated(queue=False)¶ Convenience function that applies an
SelfLigationFilter.Parameters: queue – If True, filter will be queued and can be executed along with other queued filters using run_queued_filters
-
filter_statistics()¶ Get filtering statistics for this object. :return: dict with {filter_type: count, …}
-
find_region(query_regions, _regions_dict=None, _region_ends=None, _chromosomes=None)¶ Find the region that is at the center of a region.
Parameters: query_regions – Region selector string, :class:~GenomicRegion, or list of the former Returns: index (or list of indexes) of the region at the center of the query region
-
flush(silent=False)¶ Write buffered data to file and update indexes,
Parameters: silent – If True, does not use progressbars.
-
get_edge(item, *row_conversion_args, **row_conversion_kwargs)¶ Get an edge by index.
Parameters: - row_conversion_args – Arguments passed to
RegionPairs._row_to_edge() - row_conversion_kwargs – Keyword arguments passed to
RegionPairs._row_to_edge()
Returns: Edge- row_conversion_args – Arguments passed to
-
get_ligation_structure_biases(sampling=None, skip_self_ligations=True, **kwargs)¶ Compute the ligation biases (inward and outward to same-strand) of this data set.
Parameters: - sampling – Approximate number of data points to average per point in the plot. If None (default), this will be determined on a best-guess basis.
- skip_self_ligations – If True (default), will not consider self-ligated fragments for assessing the error rates.
- unfiltered – If True, uses all read pairs, even those that do not pass filters, for the ligation error calculation
Returns: tuple with (list of gap sizes between reads, list of matching le type ratios)
-
get_mask(key)¶ Search _mask table for key and return Mask.
Parameters: - key (int) – search by mask name
- key – search by mask ID
Returns: Mask
-
get_masks(ix)¶ Extract mask IDs encoded in parameter and return masks.
IDs are powers of 2, so a single int field in the table can hold multiple masks by simply adding up the IDs. Similar principle to UNIX chmod (although that uses base 8)
Parameters: ix (int) – integer that is the sum of powers of 2. Note that this value is not necessarily itself a power of 2. Returns: list of Masks extracted from ix Return type: list (Mask)
-
intervals(*args, **kwargs)¶ Alias for region_intervals.
-
mappable(region=None)¶ Get the mappability of regions in this object.
A “mappable” region has at least one contact to another region in the genome.
Returns: arraywhere True means mappable and False unmappable
-
classmethod
merge(pairs, *args, **kwargs)¶ Merge two or more
RegionPairsTableobjects.Parameters: pairs – list of RegionPairsTableReturns: merged RegionPairsTable
-
pairs(key=None, lazy=False, *args, **kwargs)¶ Iterate over the
FragmentReadPairobjects.Parameters: - key – Region string of the form <chromosome>[:<start>-<end>],
GenomicRegionor tuples thereof - lazy – If True, use lazy loading of objects and their attributes.
Much faster, but can lead to unexpected results if one is
not careful. For example, this:
list(object.pairs())is not the same aslist(object.pairs(lazy=True))! In the latter case, all objects in the list will be identical due to the lazy loading process. Only use lazy loading to access attributes in an iterator! - args – Positional arguments passed to
edges_dict() - kwargs – Keyword arguments passed to
edges_dict()
Returns: - key – Region string of the form <chromosome>[:<start>-<end>],
-
pairs_by_chromosomes(chromosome1, chromosome2, **kwargs)¶ Only iterate over read pairs in this combination of chromosomes. :param chromosome1: Name of first chromosome :param chromosome2: Name of second chromosome :param kwargs: Keyword arguments passed to
pairs():return:
-
region_bins(*args, **kwargs)¶ Return slice of start and end indices spanned by a region.
Parameters: args – provide a GenomicRegionhere to get the slice of start and end bins of onlythis region. To get the slice over all regions leave this blank.Returns:
-
region_data(key, value=None)¶ Retrieve or add vector-data to this object. If there is existing data in this object with the same name, it will be replaced
Parameters: - key – Name of the data column
- value – vector with region-based data (one entry per region)
-
region_intervals(region, bins=None, bin_size=None, smoothing_window=None, nan_replacement=None, zero_to_nan=False, score_field='score', *args, **kwargs)¶ Return equally-sized genomic intervals and associated scores.
Use either bins or bin_size argument to control binning.
Parameters: - region – String or class:~GenomicRegion object denoting the region to be binned
- bins – Number of bins to divide the region into
- bin_size – Size of each bin (alternative to bins argument)
- smoothing_window – Size of window (in bins) to smooth scores over
- nan_replacement – NaN values in the scores will be replaced with this value
- zero_to_nan – If True, will convert bins with score 0 to NaN
- args – Arguments passed to _region_intervals
- kwargs – Keyword arguments passed to _region_intervals
Returns: iterator of tuples: (start, end, score)
-
region_subset(region, *args, **kwargs)¶ Takes a class:~GenomicRegion and returns all regions that overlap with the supplied region.
Parameters: region – String or class:~GenomicRegion object for which covered bins will be returned.
-
regions¶ Iterate over genomic regions in this object.
Will return a
GenomicRegionobject in every iteration. Can also be used to get the number of regions by calling len() on the object returned by this method.Returns: RegionIter
-
regions_and_edges(key, *args, **kwargs)¶ Convenient access to regions and edges selected by key.
Parameters: - key – Edge selector, see
edges() - args – Positional arguments passed to
edges() - kwargs – Keyword arguments passed to
edges()
Returns: list of row regions, list of col regions, iterator over edges
- key – Edge selector, see
-
regions_dict¶ Return a dictionary with region index as keys and regions as values.
Returns: dict {region.ix: region, …}
-
static
regions_identical(pairs)¶ Check if the regions in all objects in the list are identical.
Parameters: pairs – listofRegionBasedobjectsReturns: True if chromosome, start, and end are identical between all regions in the same list positions.
-
run_queued_filters(log_progress=True)¶ Run queued filters. See
filter()Parameters: log_progress – If true, process iterating through all edges will be continuously reported.
-
subset(*regions, **kwargs)¶ Subset a Hic object by specifying one or more subset regions.
Parameters: - regions – string or GenomicRegion object(s)
- kwargs – Supports
file_name: destination file name of subset Hic object;
tmpdir: if True works in tmp until object is closed
additional parameters are passed to
edges()
Returns: Hic
-
to_bed(file_name, subset=None, **kwargs)¶ Export regions as BED file
Parameters: - file_name – Path of file to write regions to
- subset – optional
GenomicRegionor str to write only regions overlapping this region - kwargs – Passed to
write_bed()
-
to_bigwig(file_name, subset=None, **kwargs)¶ Export regions as BigWig file.
Parameters: - file_name – Path of file to write regions to
- subset – optional
GenomicRegionor str to write only regions overlapping this region - kwargs – Passed to
write_bigwig()
-
to_gff(file_name, subset=None, **kwargs)¶ Export regions as GFF file
Parameters: - file_name – Path of file to write regions to
- subset – optional
GenomicRegionor str to write only regions overlapping this region - kwargs – Passed to
write_gff()
-
class
-
class
fanc.pairs.SamBamReadPairGenerator(sam_file1, sam_file2, check_sorted=True)¶ Bases:
fanc.pairs.ReadPairGeneratorGenerate read pairs from paired-end SAM/BAM files.
This
ReadPairGeneratoriterates over two SAM or BAM files that have been sorted by qname (for example withsamtools sort -n).Chimeric reads (mapping partially to multiple genomic locations), such as output by BWA, are handled as follows: If both mate pairs are chimeric, they are removed. If only one mate is chimeric, but it is split into more than 2 alignments, it is also removed. If one mate is chimeric and split into two alignments. If one part of the chimeric alignment maps within 100bp of the regular alignment, the read pair is kept and returned. In all other cases, the pair is removed.
-
add_filter(read_filter)¶ Add a
ReadFilterto this object.The filter will be applied during read pair generation.
Parameters: read_filter – ReadFilter
-
filter_multi_mapping(strict=True)¶ Convenience function that registers a UniquenessFilter. The actual algorithm and rationale used for filtering will depend on the internal _mapper attribute.
Parameters: strict – If True will filter if XS tag is present. If False, will filter only when XS tag is not 0. This is applied if alignments are from bowtie2.
-
filter_quality(cutoff=30)¶ Convenience function that registers a QualityFilter. The actual algorithm and rationale used for filtering will depend on the internal _mapper attribute.
Parameters: cutoff – Minimum mapping quality (mapq) a read must have to pass the filter
-
filter_unmapped()¶ Convenience function that registers an UnmappedFilter.
-
stats()¶ Return filter statistics collected during read pair generation.
The
__iter__()filters reads based on the filters added usingadd_filter(). During filtering, it keeps track of the numbers of read pairs that were filtered. This function returns adictof the form <filter_name>: <number of reads filtered out>.Returns: dict
-
-
class
fanc.pairs.SelfLigationFilter(mask=None)¶ Bases:
fanc.pairs.FragmentReadPairFilterFilters read pairs where one or both reads map to the same restriction fragment.
-
valid(row)¶ Map validity check of rows to pairs.
-
valid_pair(pair)¶ Check if both reads are on the same fragment.
-
-
class
fanc.pairs.TxtReadPairGenerator(valid_pairs_file, sep=None, chr1_field=1, pos1_field=2, strand1_field=3, chr2_field=4, pos2_field=5, strand2_field=6)¶ Bases:
fanc.pairs.ReadPairGeneratorGenerate read pairs from a plain text file.
This is an implementation of
ReadPairGeneratorfor reading read pairs from an arbitrary text file. For specific text file formats have a look atFourDNucleomePairGeneratororHicProPairGenerator.TxtReadPairGeneratoriterates over lines in a file and splits them into fields at “sep”. It then extracts the chromosome, position and strand for each read in the pair, according to the fields specified by “chr<1|2>_field”, “pos<1|2>_field”, and “strand<1|2>_field”. If your file does not have strand fields, or if you don’t want to load them, you can simply set them to “None”.-
add_filter(read_filter)¶ Add a
ReadFilterto this object.The filter will be applied during read pair generation.
Parameters: read_filter – ReadFilter
-
stats()¶ Return filter statistics collected during read pair generation.
The
__iter__()filters reads based on the filters added usingadd_filter(). During filtering, it keeps track of the numbers of read pairs that were filtered. This function returns adictof the form <filter_name>: <number of reads filtered out>.Returns: dict
-
-
class
fanc.pairs.UniquenessFilter(strict=True, mask=None)¶ Bases:
fanc.pairs.ReadFilterFilter reads that do not map uniquely to the reference sequence.
-
valid_read(read)¶ Check if a read has an XS tag.
If strict is enabled checks if a read has an XS tag. If not strict, XS has to be smaller than AS (alignment score) for a valid read.
-
-
class
fanc.pairs.UnmappedFilter(mask=None)¶ Bases:
fanc.pairs.ReadFilterFilter reads that do not map to the reference sequence.
-
valid_read(read)¶ Check if the the flag bit 4 is set.
-
-
fanc.pairs.generate_pairs(sam1_file, sam2_file, regions, restriction_enzyme=None, output_file=None, read_filters=(), check_sorted=True, threads=1, batch_size=10000000)¶ Generate Pairs object from SAM/BAM files.
This is a convenience function that let’s you create a Pairs object from SAM/BAM data in a single step.
Parameters: - sam1_file – Path to a sorted SAM/BAM file (1st mate)
- sam2_file – Path to a sorted SAM/BAM file (2nd mate)
- regions – Path to file with restriction fragments (BED, GFF) or FASTA with genomic sequence
- restriction_enzyme – Name of restriction enzyme (only when providing FASTA as regions)
- read_filters – List of
ReadFilterto filter reads while loading from SAM/BAM - output_file – Path to output file
- check_sorted – Double-check that input SAM files are sorted if True (default)
- threads – Number of threads used for finding restriction fragments for read pairs
- batch_size – Number of read pairs sent to each restriction fragment worker
Returns:
-
fanc.pairs.generate_pairs_split(sam1_file, sam2_file, regions, restriction_enzyme=None, output_file=None, read_filters=(), check_sorted=True, threads=1, batch_size=1000000)¶ Generate Pairs object from SAM/BAM files.
This is a convenience function that let’s you create a Pairs object from SAM/BAM data in a single step.
Parameters: - sam1_file – Path to a sorted SAM/BAM file (1st mate)
- sam2_file – Path to a sorted SAM/BAM file (2nd mate)
- regions – Path to file with restriction fragments (BED, GFF) or FASTA with genomic sequence
- restriction_enzyme – Name of restriction enzyme (only when providing FASTA as regions)
- read_filters – List of
ReadFilterto filter reads while loading from SAM/BAM - output_file – Path to output file
- check_sorted – Double-check that input SAM files are sorted if True (default)
- threads – Number of threads used for finding restriction fragments for read pairs
- batch_size – Number of read pairs sent to each restriction fragment worker
Returns: