TADs and TAD boundaries¶

To follow this tutorial, download the FAN-C example data, for example through our Keeper library. Then run the example fanc auto command in Example analysis to generate all the necessary files, and set up your Python session like this:

import fanc
import fanc.plotting as fancplot
import logging

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")

hic_100kb = fanc.load("output/hic/binned/fanc_example_100kb.hic")

If you want to try the tutorial with an equivalent Cooler file, load the Hi-C file like this instead:

hic_100kb = fanc.load("architecture/other-hic/fanc_example.mcool@100kb")

or like this if you want to work with a Juicer file built from the same data:

hic_100kb = fanc.load("architecture/other-hic/fanc_example.juicer.hic@100kb")

Note that there may be minor differences in the results due to the “zooming” and balancing applied by the different tools.

Also have a look at the command line documentation at Hi-C domain analysis, which explains a lot of the principles behind insulation scores and directionality indexes with a couple of helpful illustrations.

Insulation scores¶

Insulation scores are calculated and saved to disk using the InsulationScores class. To run a basic insulation score calculation on a Hi-C object, use from_hic:

insulation = fanc.InsulationScores.from_hic(hic_100kb,
                                            [5000000, 1000000, 1500000,
                                             2000000, 2500000, 3000000,
                                             3500000, 4000000],
                                            file_name="architecture/domains/fanc_example_100kb.insulation")

The file created in file_name can later be easily loaded with load().

The resulting object stores all insulation score tracks for the window_sizes provided. It is derived from the RegionsTable class, which is RegionBased. There is a lot of convenient functionality bundled in these objects, which you can read about in RegionBased.

You can access the calculated scores by using the scores() function, which requires the window size of the scores you want to retrieve, and returns a list of scores, one for each genomic region in the object:

scores = insulation.scores(1000000)

However, a more flexible way to retrieve scores, which allows things like subsetting and retrieving region information at the same time, is the regions() iterator. Scores are stored as a region attribute insulation_<window size>:

for region in insulation.regions('chr18'):
    score = region.insulation_1000000
    # do something with score
    # ...

If you prefer to have a dedicated object for a particular window size, you can use score_regions(). The score is then accessible via the score attribute:

insulation_1mb = insulation.score_regions(1000000)
for r in insulation_1mb.regions:
    score = r.score

Insulation calculation parameters¶

from_hic has several parameters that allow you to customise the insulation score calculation. You can read everything them in the linked API reference, and we will highlight the most important ones here.

impute_missing allows you to replace missing or masked pixels with their expected values prior to the insulation score calculation. While this avoids dealing with missing pixels, this can be misleading in case larger regions are unmappable or have been filtered out for some other reason, so enable this setting with caution. The related parameter na_threshold instead controls how much of a sliding window can be covered by missing pixels before assigning it NaN by default. This is set to 0.5 by default - by increasing this value you will obtain more robust scores, but also the number of NaNs will increase in your results. Whatever value you choose, it is a tradeoff between information content and accuracy. In most cases you should be okay with the default setting.

normalise, which is enabled by default, divides insulation scores by their chromosome mean, so every score is expressed relative to the expected score on the same chromosome. By disabling this option, you will get raw sums of weights in the sliding window. Normalisation is stringly recommended, however. You can also normalise to a more local region instead of the whole chromosome, which may account for local variations in contact intensity other than insulation. For this purpose, set normalisation_window to a number of bins, for example 300, which is then the number of scores surrounding each region used for normalisation by their mean.

After their computation and normalisation, insulation scores are log2-transformed. This makes scores (roughly) symmetrical around 0. To disable this, for example if you have also disabled normalise, use log=False.

The original insulation score definition uses the arithmetic mean to normalise the scores. If you expect insulation score outliers in your data, which might affect the normalisation by the mean, you can use a trimmed arithmetic mean instead by setting trim_mean_proportion to a value between 0 and 1. The top fraction of insulation scores is then discarded for calculating the mean.

If you intend to compare scores from different samples (and have the normalise and log options enabled), a mathematically more accurate normalisation would be to use the geometric mean instead. You can enable this by setting geometric_mean=True.

Insulation score export¶

You can export scores to a human-readable file format using one of to_bed(), or to_gff(). Export to a fast BigWig track using to_bigwig().

insulation.to_bed("architecture/domains/fanc_example_100kb.insulation_1mb.bed", 1000000)

Insulation score plots¶

You can plot all calculated insulation scores using GenomicVectorArrayPlot:

p = fancplot.GenomicVectorArrayPlot(insulation, colormap='RdBu_r', vmin=-1, vmax=1,
                                    genomic_format=True)
p.plot('chr18:18mb-28mb')

If you only want to plot a single score track, use LinePlot:

p = fancplot.LinePlot(insulation, ylim=(-1, 1), colors=['darkturquoise'], style="mid",
                      attribute="insulation_1000000")
p.plot('chr18:18mb-28mb')

You can read more about different plot types in Plotting (API).

Boundary calls¶

Once you have the insulation object, you can call insulating boundaries in the genome using Boundaries, specifically from_insulation_score():

boundaries = fanc.Boundaries.from_insulation_score(insulation, window_size=1000000)

This finds all minima in the insulation score vector and returns the corresponding regions as a RegionsTable object, which is RegionBased. The boundary strength is stored in the score attribute of each region:

for boundary in boundaries.regions:
    score = boundary.score

By default, all minima are reported, including very weak and likely false-positive boundaries. We recommend plotting the boundaries and their strength alongside the Hi-C matrix and then decide on a score cutoff to only select a robust set of boundaries!

ph = fancplot.TriangularMatrixPlot(hic_100kb, max_dist=5000000, vmin=0, vmax=0.05)
pb = fancplot.BarPlot(boundaries)
f = fancplot.GenomicFigure([ph, pb])
fig, axes = f.plot('chr18:18mb-28mb')

You can filter out false-positives manually like this

min_score = 1.0
robust_boundaries = []
for boundary in boundaries.regions:
    score = boundary.score
    if score >= min_score:
        robust_boundaries.append(boundary)

or by simply using the min_score argument:

robust_boundaries = fanc.Boundaries.from_insulation_score(insulation, window_size=1000000,
                                                          min_score=1.0)

Directionality index¶

For a short explanation of the directionality index, please read Directionality Index. The principle to compute the directionality index is highly similar to the insulation score. The dedicated class DirectionalityIndexes handles all relevant actions. Use from_hic() to compute it for multiple window sizes:

directionality = fanc.DirectionalityIndexes.from_hic(hic_100kb,
                                                     [1000000, 1500000,
                                                      2000000, 2500000, 3000000,
                                                      3500000, 4000000],
                                                     file_name="architecture/domains/fanc_example_100kb.di")

Plotting also works the same way, for all indexes:

p = fancplot.GenomicVectorArrayPlot(directionality, colormap='RdBu_r', vmin=-0.1, vmax=0.1,
                                    genomic_format=True)
p.plot('chr18:18mb-28mb')

And for single a single directionality index:

p = fancplot.LinePlot(directionality, ylim=(-0.1, 0.1), colors=['darkturquoise'], style="mid",
                      attribute="directionality_4000000")
p.plot('chr18:18mb-28mb')