This interface simplifies and unifies working with matrix data in the context of genomic
region pairs, such as you would find in a Hi-C matrix. It builds on the
RegionPairsContainer (see previous section RegionPairsContainer),
which dealt with scores and other attributes between genomic regions, and extends it by
adding functions for representing the scores in a numeric matrix.
hic = fanc.load("examples/output/hic/binned/fanc_example_500kb.hic") isinstance(hic, fanc.matrix.RegionMatrixContainer) # True if interface supported
The matrix function¶
To obtain the whole-genome, normalised matrix from an object, use the
m = hic.matrix()
Of course, the
matrix() function supports matrix subsets:
m_chr19 = hic.matrix(('chr19', 'chr19'))
When using tuples as keys, the first entry will select the rows, and the second entry the columns of the matrix:
m_inter1 = hic.matrix(('chr18', 'chr19')) m_inter1.shape # (157, 119) m_inter2 = hic.matrix(('chr19', 'chr18')) m_inter2.shape # (119, 157)
The returned object is of type
RegionMatrix, which is a subclass
of Numpy’s masked
array with added perks for genomic region handling.
RegionMatrix can be used like any other numpy matrix,
for example calculating marginals by summing up values in rows or columns:
m_chr19.shape # (119, 119) marginals = np.sum(m_chr19, axis=0) marginals.shape # (119,) marginals[:5] # [1.0000000074007467, 0.9999999585562779, # 1.0000000102533806, 0.999999987196381, 1.0000000140165086]
(this Hi-C object is normalised on a per-chromosome basis, so each marginal will be close to 1)
Rows and columns in a matrix can be masked, i.e. their entries are considered invalid and
are ignored for most downstream analysis to prevent artifacts. By default, FAN-C masks
regions that have no edges (after filtering), typically due to mappability issues.
You can turn off masking using the
m_unmasked = hic.matrix(mask=False)
However, we recommend working with masked matrices to ensure no unwanted edges are part of your analyses.
RegionMatrix objects also keep track of the regions corresponding to
columns and rows of a matrix:
m_inter1.row_regions # [chr18:1-500000, # chr18:500001-1000000, # chr18:1000001-1500000, # chr18:1500001-2000000, # ... m_inter1.col_regions # [chr19:1-500000, # chr19:500001-1000000, # chr19:1000001-1500000, # chr19:1500001-2000000, # ...
You can subset a
RegionMatrix using indexes or region intervals:
# subset by index m_chr19_sub1 = m_chr19[0:3, 0:3] m_chr19_sub1.row_regions # [chr19:1-500000, chr19:500001-1000000, chr19:1000001-1500000] m_chr19_sub1.col_regions # [chr19:1-500000, chr19:500001-1000000, chr19:1000001-1500000] # subset by region interval m_chr19_sub2 = m_chr19['chr19:2mb-3mb', 'chr19:500kb-1mb'] m_chr19_sub2.row_regions # [chr19:1500001-2000000, chr19:2000001-2500000, chr19:2500001-3000000] m_chr19_sub2.col_regions
Note that region interval definitions are always interpreted as 1-based, inclusive, and any
overlapping region is returned (in the above example the region
has a 1 base overlap with the requested interval).
m_chr19_uncorr = hic.matrix(('chr19', 'chr19'), norm=False)
In addition, there are several parameters specific to
matrix(). Most notably, you can use the
oe=True parameter to return an observed/expected (O/E) matrix:
m_chr19_oe = hic.matrix(('chr19', 'chr19'), oe=True)
expected_values to calculate the expected
(average) weight of all edges connecting regions at a certain distance. The matrix
is then simply divided by the expected matrix. You may want to log2-transform the
matrix for a symmetric scale of values:
m_chr19_log_oe = hic.matrix(('chr19', 'chr19'), oe=True, log=True)