Hi-C domain analysis

Note

The following examples use the matrix files in FAN-C format. If you want to try the same commands using Juicer .hic files, replace output/hic/binned/fanc_example_100kb.hic with architecture/other-hic/fanc_example.juicer.hic@100kb. You will also need to adjust the -vmax value in the triangular matrix plot to 50. If you want to work with Cooler files in this tutorial, use architecture/other-hic/fanc_example.mcool@100kb. The results will be minimally different due to the “zooming” and balancing applied by each package.

Like compartments, topologically associating domains, or TADs, for a fundamental level of genome organisation.

fancplot -o architecture/domains/fanc_example_100kb_tads.png chr18:18mb-28mb \
     -p triangular output/hic/binned/fanc_example_100kb.hic -m 4000000 \
     -vmin 0 -vmax 0.05
../../_images/fanc_example_100kb_tads.png

FAN-C provides multiple “scores” that are designed to find the boundaries between domains.

Insulation Score

The insulation score (Crane et al. 2015) adds up contacts in a sliding window align the Hi-C matrix diagonal.

../../_images/fanc_example_100kb_tads_insulation_score_example.png

Regions with low score are “insulating”, i.e. regions between domains. Regions with high scores are most likely found inside domains.

Use fanc insulation to calculate the insulation score from the command line:

usage: fanc insulation [-h] [-o OUTPUT_FORMAT]
                       [-w WINDOW_SIZES [WINDOW_SIZES ...]] [-r REGION] [-i]
                       [--offset OFFSET] [-L] [-N]
                       [--normalisation-window NORMALISATION_WINDOW] [-s] [-g]
                       [--trim-mean TRIM_MEAN] [-tmp]
                       input [output]

Positional Arguments

input Input matrix (Hi-C, fold-change map, …)
output Output file. Format will be determined by “-o”. By default, this is a FAN-C InsulationScores object, for maximum compatibility with other analyses. If you choose a text-based output format (BED, GFF, BigWig), this parameter will be the file prefix, and the window size will be appended.If not specified and output format is one of bed, gff, or bigwig, the input file name forms the output file prefix.

Named Arguments

-o, --output-format
 Format of the output file. By default, this is a FAN-C InsulationScore object, for maximum compatibility with other analyses. Other options are “bed”, “bigwig”, and “gff”
-w, --window-sizes
 Window sizes in base pairs. You can also use abbreviated number format (i.e. 1.5M, 250kb, etc). If not specified, will choose the window sizes based on the matrix resolution r when calculating scores. Specifically: r*3, r*5, r*7, r*10, and r*15
-r, --region Region selector (<chr>:<start>-<end>) to only calculate II for this region.
-i, --impute Impute missing values in matrix. If set, missing matrix values (where an entire Hi-C bin has 0 contacts) will be replaced by the expected value at the given distance.
--offset Window offset in base pairs from the diagonal.
-L, --no-log Do not log2-transform insulation index after normalisation. Log-transformation roughly centers values around 0, but if you need this to be exactly centered, use the “–geom-mean” option.
-N, --no-norm Do not normalise index to insulation average Default is whole chromosome normalisation - to normalise to smaller regions, use –normalisation-window.
--normalisation-window
 Size of the normalisation window (moving average) in bins. Default: whole chromosome.
-s, --subtract-mean
 Subtract mean instead of dividing by it when “-n” is enabled. You probably do not want this, unless you are working with log-transformed matrices (e.g. fold-change matrices)
-g, --geom-mean
 Use geometric mean for normalisation (rather than arithmetic mean). Useful in conjunction with –log to center the distribution at 0. This is very important when comparing insulation scores, for example using the “fanc compare” command!
--trim-mean Use a trimmed mean for insulation index normalisation with this cutoff (fraction of scores)
-tmp, --work-in-tmp
 Work in temporary directory

Example

fanc insulation is typically used to calculate insulation scores with multiple window sizes at the same time, as a single window size might be prone to local matrix differences:

fanc insulation output/hic/binned/fanc_example_100kb.hic \
                architecture/domains/fanc_example_100kb.insulation \
                -w 1000000 1500000 2000000 2500000 3000000 3500000 4000000

Window sizes are chosen using the -w parameter.

We can easily plot all insulation scores at the same time using fancplot:

fancplot -o architecture/domains/fanc_example_50kb_tads_insulation.png chr18:18mb-28mb \
     -p triangular output/hic/binned/fanc_example_100kb.hic -m 4000000 -vmin 0 -vmax 0.05 \
     -p scores architecture/domains/fanc_example_100kb.insulation
../../_images/fanc_example_50kb_tads_insulation.png

Output formats

The default output format has maximum compatibility within FAN-C, but other tools like genome browsers won’t be able to read it. Export of insulation scores to a different format is simple. Just choose -output-format bed to export to BED file, -output-format gff to export to GFF format, or -output-format bigwig to export to an indexed BigWig file. The latter is a binary file, and hence not readable as text format, but is the fastest for accessing subsets of scores.

When using an output format other than the default, the second positional argument (output) becomes an output file prefix. It is appended by the window size (in abbreviated form, i.e. 1000000 becomes “1mb”) and gets the file ending of the chosen format. Example:

fanc insulation output/hic/binned/fanc_example_100kb.hic \
                architecture/domains/fanc_example_100kb.insulation \
                -w 1000000 1500000 2000000 2500000 3000000 3500000 4000000 \
                -o bed

This produces the output files:

architecture/domains/fanc_example_100kb.insulation_1.5mb.bed
architecture/domains/fanc_example_100kb.insulation_1mb.bed
architecture/domains/fanc_example_100kb.insulation_2.5mb.bed
architecture/domains/fanc_example_100kb.insulation_2mb.bed
architecture/domains/fanc_example_100kb.insulation_3.5mb.bed
architecture/domains/fanc_example_100kb.insulation_3mb.bed
architecture/domains/fanc_example_100kb.insulation_4mb.bed

Of course, you can also simply convert existing insulation scores to another format without having to recalculate everything. Simply run:

fanc insulation architecture/domains/fanc_example_100kb.insulation \
                -o bed

and the insulation scores for all window sizes in the object will be converted to BED files using the input file name as prefix. If you only want to convert specific window sizes, use the -w parameter. To find out which window sizes are available in a previously calculated scores object, simply run fanc insulation without any parameters:

fanc insulation architecture/domains/fanc_example_100kb.insulation

This prints:

Window sizes available in object:
1mb 1.5mb 2mb 2.5mb 3mb 3.5mb 4mb

You can plot scores from one or more window sizes using the line plot in fancplot:

fancplot --width 6 -o architecture/domains/fanc_example_100kb_tads_insulation_1mb.png \
               chr18:18mb-28mb \
               -p triangular output/hic/binned/fanc_example_100kb.hic -m 4000000 \
               -vmin 0 -vmax 0.05 \
               -p line architecture/domains/fanc_example_100kb.insulation_1mb.bed \
               architecture/domains/fanc_example_100kb.insulation_2mb.bed \
               -l "1mb" "2mb"
../../_images/fanc_example_50kb_tads_insulation_1mb.png

Normalisation

By default, fanc insulation will normalise the insulation scores to the chromosomal average and the log-transform them. You can get raw, untransformed scores using -N and --L, respectively. If you want to normalise the scores, but to a smaller region on the chromosome (to take into account local variability in insulation), you can choose the normalisation window size with --normalisation-window. The window is specified in bins.

Normally, fanc insulation will use the arythmetic mean of the chromosomal scores to normalise. This has the effect that scores upon log2-transformation are not perfectly centred around 0. To remedy this, you can use the geometric mean instead, with the -g option.

When you are working with matrices that are already log2-transformed, you may want to use the -s option to normalise the scores by subtracting, instead of dividing the chromosomal average.

If you have a lot of outliers and sharp score changes, you may use a trimmed mean to calculate average scores with --trim-mean <f>, which will ignore the top and bottom fraction f of scores for calculating the average.

Impute missing values

In the above examples, you will notice the region on the left that is unmappable in the Hi-C matrix. In the insulation score calculation, if the insulation window is covered by more than 50% of unmappable regions, the score will be NaN. fanc insulation offers the option to impute the unmappable values from the expected values of the chromosome using --impute.

fanc insulation output/hic/binned/fanc_example_100kb.hic \
                architecture/domains/fanc_example_100kb.insulation_imputed \
                -w 1000000 1500000 2000000 2500000 3000000 3500000 4000000 \
                --impute

This will result in score without NaN (at least in the center of chromosomes), but can also be misleading if the region of interest happens to lie in an unmappable region. Therefore use this capability with caution!

fancplot -o architecture/domains/fanc_example_100kb_tads_insulation.png chr18:18mb-28mb \
     -p triangular output/hic/binned/fanc_example_100kb.hic -m 4000000 -vmin 0 -vmax 0.05 \
     -p scores architecture/domains/fanc_example_100kb.insulation_imputed -vmin -1 -vmax 1
../../_images/fanc_example_50kb_tads_insulation_imputed.png

Insulating boundaries (TAD boundaries)

Regions in the genome where the insulation score reaches a local minimum represent the region between two self-interacting domains, or TADs. You can use fanc boundaries to identify these regions:

usage: fanc boundaries [-h] [-w WINDOW] [-d DELTA] [-s MIN_SCORE] [-x] [-l]
                       [-m]
                       input output

Positional Arguments

input Input InsulationScores or regions file
output Path for boundary BED file. When specifying multiple window sizes or if input file has multiple scores in it, this forms the output file prefix and will be appended by ‘<window size>.bed’

Named Arguments

-w, --window-sizes
 Insulation index window size to calculate boundaries on. Separate multiple window sizes with comma, e.g. 1mb,500kb,100kb
-d, --delta Window size for calculating the delta vector (in bins). Calculation takes into account d bins upstream and d bins downstream for a total window size of 2*d + 1 bins. Default 3.
-s, --min-score
 Report only peaks where the two surrounding extrema of the delta vector have at least this difference in height. Default: no threshold.
-x, --sub-bin-precision
 Report boundary positions with sub-bin precision. This works because the minimum or the the insulation score track can be determined with sub-bin precision. Default: False
-l, --log log-transform index values before boundary calling.
-m, --maxima Call maxima of the insulation score instead of minima.

When we run fanc boundaries on the above example using 1mb and 2mb as the window sizes:

fanc boundaries architecture/domains/fanc_example_100kb.insulation \
                architecture/domains/fanc_example_100kb.insulation_boundaries \
                -w 1mb 2mb

We get two output files with all insulation score minima and associated scores (the depth of the minimum compared to the two neighboring maxima):

fanc_example_100kb.insulation_boundaries_1mb.bed
fanc_example_100kb.insulation_boundaries_2mb.bed

Let’s plot the boundaries from the 1mb scores:

fancplot --width 6 -o architecture/domains/fanc_example_100kb_tads_insulation_1mb_boundaries.png \
     chr18:18mb-28mb \
     -p triangular output/hic/binned/fanc_example_100kb.hic -m 4000000 -vmin 0 -vmax 0.05 \
     -p line architecture/domains/fanc_example_100kb.insulation_1mb.bed -l "1mb" \
     -p bar architecture/domains/fanc_example_100kb.insulation_boundaries_1mb.bed
../../_images/fanc_example_50kb_tads_insulation_1mb_boundaries.png

As you can see, lower minima get higher scores. By default, fanc boundaries outputs all minima, but you may set a threshold using --min-score <s> to report only boundaries with scores greater than s.

fanc boundaries architecture/domains/fanc_example_100kb.insulation \
                architecture/domains/fanc_example_100kb.insulation_boundaries_score0.7 \
                -w 1mb 2mb --min-score 0.7

By default, fanc boundaries will return minima as matrix bins. However, since the boundary calls rely on a smoothed insulation score track, it can attempt to identify the boundary location with sub-bin resolution. Use -x to try this, but be aware that this is not precise.

Directionality Index

The directionality index (Dixon et al. 2012) measures the bias in contact frequency up- and downstream of an Hi-C region. When inside TADs, this measure tends towards zero, as interactions in either direction are equally frequent. However, when approaching a TAD boundary this measure changes drastically, as one direction will remain inside the TAD, where there is a high contact intensity, whereas the other direction will lie in a low intensity region outside the TAD.

Use fanc directionality to calculate the directionality index from the command line:

usage: fanc directionality [-h] [-o OUTPUT_FORMAT]
                           [-w WINDOW_SIZES [WINDOW_SIZES ...]] [-r REGION]
                           [-tmp]
                           input [output]

Positional Arguments

input Input matrix (Hi-C, fold-change map, …)
output Output file. Format will be determined by “-o”. By default, this is a FAN-C DirectionalityIndexes object, for maximum compatibility with other analyses. If you choose a text-based output format (BED, GFF, BigWig), this parameter will be the file prefix, and the window size will be appended.If not specified and output format is one of bed, gff, or bigwig, the input file name forms the output file prefix.

Named Arguments

-o, --output-format
 Format of the output file. By default, this is a FAN-C DirectionalityIndex object, for maximum compatibility with other analyses. Other options are “bed”, “bigwig”, and “gff”
-w, --window-sizes
 Window sizes in base pairs. You can also use abbreviated number format (i.e. 1.5M, 250kb, etc). If not specified, will choose the window sizes based on the matrix resolution r when calculating scores. Specifically: r*3, r*5, r*7, r*10, and r*15
-r, --region Region selector (<chr>:<start>-<end>) to only calculate directionality index for this region.
-tmp, --work-in-tmp
 Work in temporary directory

Example

fanc directionality is very similar in syntax to fanc insulation. It is typically used to calculate directionality indexes with multiple window sizes at the same time, as a single window size might be prone to local matrix differences:

fanc directionality output/hic/binned/fanc_example_100kb.hic \
                    architecture/domains/fanc_example_100kb.directionality \
                    -w 1000000 1500000 2000000 2500000 3000000 3500000 4000000

Window sizes are chosen using the -w parameter.

We can easily plot all directionality indexes at the same time using fancplot:

fancplot -o architecture/domains/fanc_example_100kb_tads_directionality.png chr18:18mb-28mb \
     -p triangular output/hic/binned/fanc_example_100kb.hic -m 4000000 -vmin 0 -vmax 0.05 \
     -p scores architecture/domains/fanc_example_100kb.directionality
../../_images/fanc_example_50kb_tads_directionality.png

To export the directionality index to other genomic formats using fanc directionality follow the instructions as for fanc insulation.

A note on TAD calling

There are a lot of tools available for calling TADs in Hi-C matrices, including one that we have written called TADtool. However, and this is a point we are also making with TADtool specifically, TAD calling algorithms often depend critically on their input parameters, and different TAD callers can lead to very different results. We are therefore currently not bundling a TAD calling tool with FAN-C, and refer the user to one of the many available tools for TAD calling that offer a wide range of features.