PCA analysis

When working with multiple Hi-C libraries, it is often useful to assess the variability between replicates and samples from different conditions. This can provide valuable information about potential experimental biases and whether samples from different replicates can be safely merged.

usage: fanc pca [-h] [-p PLOT] [-s SAMPLE_SIZE] [--inter-chromosomal]
                [-r REGION] [-e EXPECTED_FILTER] [-b BACKGROUND_FILTER]
                [--min-distance MIN_DISTANCE] [--max-distance MAX_DISTANCE]
                [-n NAMES [NAMES ...]] [--strategy STRATEGY]
                [-c COLORS [COLORS ...]] [-m MARKERS [MARKERS ...]]
                [-v EIGENVECTORS EIGENVECTORS] [-Z] [-S] [-f] [-tmp]
                input [input ...] output

Positional Arguments

input Input Hic files
output Output file with PCA results.

Named Arguments

-p, --plot Output plot. Path to PDF file where the PCA plot will be saved.
-s, --sample-size
 Sample size for contacts to do the PCA on. Default: 50000
--inter-chromosomal
 Also include inter-chromosomal contacts in PCA. By default, only intra-chromosomal contacts are considered.
-r, --region Region to do PCA on. You could put a specific chromosome here, for example. By default, the whole genome is considered. Comma-separate multiple regions.
-e, --expected-filter
 Minimum fold-enrichment over expected value. Contacts with a strength lower than <b>*E(d), where d is the distance between two loci and E is the corresponding expected contact strength, are filtered out before PCA. Default: no filter.
-b, --background-filter
 Minimum fold-enrichment over average inter-chromosomal contacts. Default: no filter.
--min-distance Minimum distance of matrix bins in base pairs. You can use abbreviated formats such as 1mb, 10k, etc.
--max-distance Maximum distance of matrix bins in base pairs. You can use abbreviated formats such as 1mb, 10k, etc.
-n, --names Sample names for plot labelling.
--strategy Mechanism to select pairs from Hi-C matrix. Default: variance. Possible values are: variance (select contacts with the largest variance in strength across samples first), fold-change (select pairs with the largest fold-change across samples first), and passthrough (no preference on pairs).
-c, --colors Colors for plotting.
-m, --markers Markers for plotting. Follows Matplotlib marker definitions: http://matplotlib.org/api/markers_api.html
-v, --eigenvectors
 Which eigenvectors to plot. Default: 1 2
-Z, --no-zeros Ignore pixels with no contacts in any sample.
-S, --no-scaling
 Do not scale input matrices to the same number of valid pairs. Use this only if you are sure matrices are directly comparable.
-f, --force-overwrite
 If the specified output file exists, it will be overwritten without warning.
-tmp, --work-in-tmp
 Work in temporary directory

Example

As an example, we ran a PCA analysis on the 1mb resolution mESC Hi-C matrices from our Low-C paper using different restriction enzymes (MboI and HindIII), as well as different input cell numbers.

fanc pca -n "HindIII 100k" "HindIII 5M" "MboI 100k" "MboI 1M" "MboI 50k" \
         -Z -s 100000 -r chr19 -p architecture/pca/lowc.pca.png \
         architecture/other-hic/lowc_hindiii_100k_1mb.hic \
         architecture/other-hic/lowc_hindiii_5M_1mb.hic \
         architecture/other-hic/lowc_mboi_100k_1mb.hic \
         architecture/other-hic/lowc_mboi_1M_1mb.hic \
         architecture/other-hic/lowc_mboi_50k_1mb.hic \
         architecture/pca/lowc.pca

The result looks like this, where you can clearly see the division between the different restriction enzymes:

[Figure: PCA plot of the Low-C samples (lowc.pca.png)]

Filters

By default, PCA is run on the whole genome. In the example above, we have restricted the analysis to chromosome 19 using the -r chr19 argument. -Z instructs fanc pca to use only non-zero matrix entries for the PCA - this can help mitigate the effect of very weak contacts on the variability.

In the example, we limit the number of contacts used for the PCA to 100,000 using -s 100000. Which contacts end up in the sample is determined by the --strategy option. By default, the contacts with the largest variance in strength across samples are chosen, but you can also use --strategy fold-change to choose contacts with the largest fold-change or --strategy passthrough to make no prior selection of contacts.
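
The three strategies can be pictured with a small pure-Python sketch. It is not FAN-C's implementation: the function name select_contacts, the toy data, the handling of zeros in the fold-change score, and the reading of passthrough as "keep contacts in their original order" are all assumptions made for illustration.

```python
from statistics import pvariance


def select_contacts(samples, sample_size, strategy="variance"):
    """Return the indices of the contacts that would be kept for PCA.

    samples: list of equal-length lists, one per Hi-C sample; position i
    holds the strength of contact i in that sample.
    """
    n_contacts = len(samples[0])
    # One column per contact: its strength in every sample.
    columns = [[sample[i] for sample in samples] for i in range(n_contacts)]

    if strategy == "passthrough":
        # Assumption: no ranking, simply keep the first contacts encountered.
        return list(range(min(sample_size, n_contacts)))
    if strategy == "variance":
        scores = [pvariance(col) for col in columns]
    elif strategy == "fold-change":
        # Assumption: fold-change is max/min; contacts that are zero in
        # every sample score 0, contacts with some zeros score infinity.
        scores = [
            0.0 if max(col) == 0
            else float("inf") if min(col) == 0
            else max(col) / min(col)
            for col in columns
        ]
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")

    # Highest-scoring contacts first, then truncate to the sample size.
    ranked = sorted(range(n_contacts), key=lambda i: scores[i], reverse=True)
    return ranked[:sample_size]


# Made-up strengths for four contacts in three samples.
samples = [
    [1.0, 5.0, 2.0, 4.0],
    [1.0, 1.0, 2.0, 8.0],
    [1.0, 3.0, 2.0, 2.0],
]
by_variance = select_contacts(samples, 2)                   # [3, 1]
by_fold_change = select_contacts(samples, 2, "fold-change")  # [1, 3]
unranked = select_contacts(samples, 2, "passthrough")        # [0, 1]
```

In this toy data, contacts 0 and 2 are identical in all samples and are never picked by the ranking strategies, which is the point of ranking: contacts that do not differ between samples carry no information for the PCA.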

If you only want to include contacts up to (or above) a certain distance, you can specify that distance using the --max-distance (or --min-distance) option.
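
As a rough illustration of what a distance restriction does, the sketch below converts abbreviated distances such as 1mb or 10k into base pairs and checks whether a pair of bins falls inside the requested range. The helper names parse_distance and within_distance, the accepted suffixes, and the inclusive bounds are assumptions for this example and may differ from how fanc pca handles these values internally.

```python
import re

# Assumed suffixes; the real tool may accept a different set.
_MULTIPLIERS = {
    "": 1,
    "k": 1_000, "kb": 1_000,
    "m": 1_000_000, "mb": 1_000_000,
    "g": 1_000_000_000, "gb": 1_000_000_000,
}


def parse_distance(text):
    """Convert strings such as '1mb' or '10k' to a number of base pairs."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)\s*([a-z]*)", text.strip().lower())
    if match is None or match.group(2) not in _MULTIPLIERS:
        raise ValueError(f"cannot parse distance: {text!r}")
    return int(float(match.group(1)) * _MULTIPLIERS[match.group(2)])


def within_distance(start_a, start_b, min_distance=None, max_distance=None):
    """Return True if two bins on the same chromosome are within range.

    start_a, start_b: bin start coordinates in base pairs.
    Bounds are treated as inclusive (an assumption of this sketch).
    """
    distance = abs(start_b - start_a)
    if min_distance is not None and distance < min_distance:
        return False
    if max_distance is not None and distance > max_distance:
        return False
    return True


max_bp = parse_distance("2mb")
near = within_distance(0, 1_000_000, max_distance=max_bp)  # True
far = within_distance(0, 5_000_000, max_distance=max_bp)   # False
```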

Finally, fanc pca offers two filters designed to remove uninformative contacts before PCA. The first, --expected-filter <f>, removes all contacts in which all samples have a signal below f x E(d), where E is the expected value function depending on the distance d. The second, --background-filter <f>, removes contacts in which all samples have less than an f-fold enrichment over the average inter-chromosomal contact strength.
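
Both filters follow the same pattern: a contact is kept as soon as at least one sample reaches a minimum fold-enrichment over a reference value. The sketch below illustrates that logic in plain Python. The function keep_contact, the expected values, the background level, and the contact strengths are all invented for the example and are not taken from FAN-C.

```python
def keep_contact(strengths, reference, fold):
    """Return True unless every sample is below fold * reference."""
    return any(s >= fold * reference for s in strengths)


# Made-up expected strength E(d), keyed by distance d in bins.
expected = {1: 10.0, 2: 4.0}
# Made-up average inter-chromosomal contact strength.
background = 0.5

# Contact name -> (distance in bins, strength in each of three samples).
contacts = {
    "a": (1, [25.0, 8.0, 9.0]),
    "b": (1, [12.0, 11.0, 9.0]),
    "c": (2, [2.0, 1.5, 0.8]),
}

# Analogue of --expected-filter 2: keep contacts where some sample >= 2 x E(d).
kept_expected = [
    name for name, (dist, vals) in contacts.items()
    if keep_contact(vals, expected[dist], 2.0)
]

# Analogue of --background-filter 5: keep contacts where some sample
# is at least 5-fold above the inter-chromosomal background.
kept_background = [
    name for name, (dist, vals) in contacts.items()
    if keep_contact(vals, background, 5.0)
]
```

Here only contact "a" survives the expected filter, because it is the only one where a sample reaches twice the expected strength at its distance, while "a" and "b" both survive the background filter.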