RegionMatrixContainer

This interface simplifies and unifies working with matrix data in the context of genomic region pairs, such as you would find in a Hi-C matrix. It builds on the RegionPairsContainer (see previous section RegionPairsContainer), which dealt with scores and other attributes between genomic regions, and extends it by adding functions for representing the scores in a numeric matrix.

After loading a dataset using load(), you can check for support of the RegionMatrixContainer interface with:

hic = fanc.load("examples/output/hic/binned/fanc_example_500kb.hic")
isinstance(hic, fanc.matrix.RegionMatrixContainer)  # True if interface supported

The current list of FAN-C classes supporting RegionMatrixContainer is: CoolerHic, JuicerHic, Hic, ABCompartmentMatrix, DifferenceMatrix, FoldChangeMatrix, PeakInfo, and RaoPeakInfo.

The matrix function

To obtain the whole-genome, normalised matrix from an object, use the matrix() function:

m = hic.matrix()

Of course, the matrix() function supports matrix subsets:

m_chr19 = hic.matrix(('chr19', 'chr19'))

When using tuples as keys, the first entry will select the rows, and the second entry the columns of the matrix:

m_inter1 = hic.matrix(('chr18', 'chr19'))
m_inter1.shape  # (157, 119)
m_inter2 = hic.matrix(('chr19', 'chr18'))
m_inter2.shape  # (119, 157)

The returned object is of type RegionMatrix, which is a subclass of Numpy’s masked array with added perks for genomic region handling.

A RegionMatrix can be used like any other numpy matrix, for example calculating marginals by summing up values in rows or columns:

m_chr19.shape  # (119, 119)
marginals = np.sum(m_chr19, axis=0)
marginals.shape  # (119,)
marginals[:5]
# [1.0000000074007467, 0.9999999585562779,
# 1.0000000102533806, 0.999999987196381, 1.0000000140165086]

(this Hi-C object is normalised on a per-chromosome basis, so each marginal will be close to 1)

Rows and columns in a matrix can be masked, i.e. their entries are considered invalid and are ignored for most downstream analysis to prevent artifacts. By default, FAN-C masks regions that have no edges (after filtering), typically due to mappability issues. You can turn off masking using the mask=False parameter:

m_unmasked = hic.matrix(mask=False)

However, we recommend working with masked matrices to ensure no unwanted edges are part of your analyses.

RegionMatrix objects also keep track of the regions corresponding to columns and rows of a matrix:

m_inter1.row_regions
# [chr18:1-500000,
#  chr18:500001-1000000,
#  chr18:1000001-1500000,
#  chr18:1500001-2000000,
# ...
m_inter1.col_regions
# [chr19:1-500000,
#  chr19:500001-1000000,
#  chr19:1000001-1500000,
#  chr19:1500001-2000000,
# ...

You can subset a RegionMatrix using indexes or region intervals:

# subset by index
m_chr19_sub1 = m_chr19[0:3, 0:3]
m_chr19_sub1.row_regions
# [chr19:1-500000, chr19:500001-1000000, chr19:1000001-1500000]
m_chr19_sub1.col_regions
# [chr19:1-500000, chr19:500001-1000000, chr19:1000001-1500000]

# subset by region interval
m_chr19_sub2 = m_chr19['chr19:2mb-3mb', 'chr19:500kb-1mb']
m_chr19_sub2.row_regions
# [chr19:1500001-2000000, chr19:2000001-2500000, chr19:2500001-3000000]
m_chr19_sub2.col_regions

Note that region interval definitions are always interpreted as 1-based, inclusive, and any overlapping region is returned (in the above example the region chr19:150001-200000 has a 1 base overlap with the requested interval).

matrix() supports all arguments also available for edges(), but it is not necessary to use lazy loading. You can, for example, output an uncorrected matrix with

m_chr19_uncorr = hic.matrix(('chr19', 'chr19'), norm=False)

In addition, there are several parameters specific to matrix(). Most notably, you can use the oe=True parameter to return an observed/expected (O/E) matrix:

m_chr19_oe = hic.matrix(('chr19', 'chr19'), oe=True)

Internally, oe=True uses expected_values to calculate the expected (average) weight of all edges connecting regions at a certain distance. The matrix is then simply divided by the expected matrix. You may want to log2-transform the matrix for a symmetric scale of values:

m_chr19_log_oe = hic.matrix(('chr19', 'chr19'), oe=True, log=True)