Generating and filtering read pairs

Once we have mapped and sorted BAM files (see Mapping FASTQ files) we can load the paired reads into a ReadPairs object.

Obtaining fragments

ReadPairs map reads to restriction fragments, so first of all we are going to need a list of those. The fanc.regions.genome_regions() function has been made for that purpose. If you provide it with a FASTA file and a restriction enzyme name, it will build the list of fragments for you:

from fanc.regions import genome_regions
genome_file = 'hg19_chr18_19.fa'
fragments = genome_regions(genome_file, restriction_enzyme='HindIII')

You can also give it a RegionBased compatible file with fragments directly (BED, GFF and similar work). Note that the fragments should cover the entire genome, so don’t just use a list of restriction sites!

The returned object is of class RegionBased. See the main article if you want to learn how to interact with objects of that class.

Setting up filters at the read level

Filters can be used to remove problematic reads, such as those with poor quality (QualityFilter) or with multiple mapping locations throughout the genome (UniquenessFilter). These filters (of class ReadFilter) have to be added already at the read pair import step, as the necessary information is otherwise lost in later processing stages:

from fanc.pairs import QualityFilter, UniquenessFilter
quality_filter = QualityFilter(30, mask='MAPQ')
uniqueness_filter = UniquenessFilter(strict=True, mask='unique')

Warning

Both filters are specific to reads mapped with bowtie2, and there are versions of these filters specific for BWA-mapped reads (BwaMemQualityFilter and BwaMemUniquenessFilter). Make sure you use the appropriate filter for your chosen mapper, as the quality and uniqueness definitions used by each application differ substantially!

The QualityFilter removes all read pairs where one or both reads have a MAPQ value lower than 30 (in this case). The UniquenessFilter has two settings. If strict=True it will remove any read pair where either of the two reads has the XS tag, which indicates the presence of a secondary alignment. If strict=False, it further requires that the secondary alignment score is higher or equal to the primary alignment score (AS). The latter setting is therefore more permissive.

Now we have set up the necessary filters, we can start importing read pairs and match them to restriction fragments.

Importing and accessing read pairs

For importing read pairs from SAM files we can use generate_pairs_split(). It requires the two paired-end SAM (or BAM) files and fragment definitions. In addition, we provide it with the read filters we created above, so that only reads meeting our requirements are imported into the object. If we do not pass a file name with output_file, the resulting object is created in memory, which is useful for testing, but highly inadvisable for actual Hi-C libraries. Here, we request four threads to be used for the loading in parallel:

from fanc.pairs import generate_pairs_split as generate_pairs
pairs_folder = mkdir(os.path.join(output_folder, 'pairs'))
pairs = generate_pairs(sam_1_file, sam_2_file, fragments,
                       read_filters=(quality_filter, uniqueness_filter),
                       output_file=os.path.join(pairs_folder, 'example.pairs'),
                       check_sorted=True, threads=4)

The resulting object is of the type ReadPairs, which implements RegionPairsContainer. It contains all valid read pairs matched to their respective restriction fragments.

Let’s look at some basic information:

chromosomes = pairs.chromosomes()
print(chromosomes)
# ['chr18', 'chr19']

pair = pairs[0]
print(pair)
# chr18: 3187827-(3191583[1])-3192106 -- chr18: 3187827-(3192073[-1])-3192106
type(pair)
# fanc.pairs.FragmentReadPair

print(pair.left)
# chr18: 3187827-(3191583[1])-3192106

print(pair.right)
# chr18: 3187827-(3192073[-1])-3192106
type(pair.right)
# fanc.pairs.FragmentRead
print(pair.right.fragment)
# chr18:3187827-3192106
type(pair.right.fragment)
# genomic_regions.regions.GenomicRegion
print(pair.right.position)
# 3192073
print(pair.right.re_distance())
# 33
print(pair.right.strand)
# -1

As you can see, ReadPairs is a container for FragmentReadPair objects. Each object contains information about its two associated reads in the left and right properties, respectively, which return a FragmentRead object.

FragmentRead objects contain information about the read’s mapping location (pair.right.position and pair.right.strand), and the fragment it maps to (pair.right.fragment). For convenience, you can also directly calculate the distance of the read to the end of the fragment (which is the nearest restriction site) using pair.right.re_distance(). The pair.right.fragment attribute is of type GenomicRegion, which has lots of useful properties which you can find in the genomic_regions documentation. Incidentally, the above pair is a self-ligated” fragment, which will be filtered out in post-processing.

Typically you would not access each pair individually, but iterate over all pairs, or pair subsets with the fanc.pairs.ReadPairs.pairs() function:

# all pairs
for pair in pairs.pairs:
    print(pair)
# pair subset (only pairs with both fragments on chromosome 18)
for pair in pairs.pairs(('chr18', 'chr18')):
    print(pair)

The fanc.pairs.ReadPairs.pairs() function supports lazy evaluation with the lazy=True argument, but make sure to read the caveats in the main article!

Read/fragment pair filters

In addition to the quality and uniqueness filters, which had to be used during read pair import, FAN-C offers a number of FragmentReadPairFilter to remove problematic pairs. First of all, however, we’ll introduce one little function that you can use to reset all filters (and restore the original pairs) in case you found your filtering t be too harsh or if you want a fresh start, which is called reset_filters():

pairs.reset_filters()

We will illustrate how filters are used with the SelfLigationFilter.

from fanc.pairs import SelfLigationFilter
sl_filter = SelfLigationFilter(mask='self-ligation')
# alternative:
# from fanc.general import Mask
# sl_filter = SelfLigationFilter(mask=Mask(name='self-ligation', description="Filter for self-ligated fragments")
pairs.filter(sl_filter)

In general, we first create a filter instance. The mask parameter allows us to specify a filter name. We can also pass it a more elaborate Mask object with properties for name and description (see commented code). Name and description of the mask will be stored in the pairs object upon filtering. The filter is run simply by calling filter() with the corresponding filter instance.

If you want to run multiple filters, it is more efficient to “queue” filters and then execute them all in one go:

from fanc.pairs import SelfLigationFilter, ReDistanceFilter
sl_filter = SelfLigationFilter(mask='self-ligation')
rd_filter = ReDistanceFilter(500, mask='re-site-distance')
pairs.filter(sl_filter, queue=True)
pairs.filter(rd_filter, queue=True)
pairs.run_queued_filters()

Once the filter() command has completed, the filtered pairs will appear to have been deleted from the object.

pair = pairs[0]
print(pair)
# chr18: 899140-(899308[-1])-899476 -- chr18: 1509911-(1510021[1])-1510076

In reality, filtering only masks (“hides”) edges, so we have the opportunity to reset filters or selectively disable them when iterating:

for pair in pairs.pairs(excluded_filters=['self-ligation']):
    print(pair)

To obtain a dictionary with the filter statistics, use filter_statistics()

statistics = pairs.filter_statistics()

Filter types

Here are the filters available in FAN-C for read/fragment pairs:

  • SelfLigationFilter removes all pairs where both reads map to the same fragment
  • ReDistanceFilter can filter for an expected DNA (not restriction) fragment size in Hi-C libraries that result from fragmentation (typically sonication). It sums up the distance of both reads to their nearest restriction site. If that sum exceeds the cutoff set at filter creation, it will be marked as invalid
  • PCRDuplicateFilter will find pairs mapping to the exact same genomic locations (both reads), and will only retain one copy of the exact duplicates.
  • InwardPairsFilter and OutwardPairsFilter are removing pairs where both reads map within a specific distance on the same chromosome and are oriented towards or away from each other (in terms of strand), respectively. This is designed to remove ligation products that have likely arisen from uncut restriction sites or that are unligated.

Next, we will convert the filtered pairs to a Hic object.