Generative modeling of multi-mapping reads with mHi-C advances analysis of Hi-C studies

Abstract

Current Hi-C analysis approaches are unable to account for reads that align to multiple locations, and hence underestimate biological signal from repetitive regions of genomes. We developed and validated mHi-C, a multi-read mapping strategy to probabilistically allocate Hi-C multi-reads. On benchmarks, mHi-C outperformed both analyses restricted to uni-reads and heuristic approaches aimed at rescuing multi-reads. Specifically, mHi-C increased the sequencing depth by an average of 20%, resulting in higher reproducibility of contact matrices and detected interactions across biological replicates. The impact of the multi-reads on the detection of significant interactions is influenced marginally by the relative contribution of multi-reads to the sequencing depth compared to uni-reads, the cis-to-trans ratio of contacts, and the broad data quality as reflected by the proportion of mappable reads in each dataset. Computational experiments highlighted that in Hi-C studies with short read lengths, mHi-C-rescued multi-reads can emulate the effect of longer reads. mHi-C also revealed biologically supported bona fide promoter-enhancer interactions and topologically associating domains involving repetitive genomic regions, thereby unlocking a previously masked portion of the genome for conformation capture studies.
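To make the idea of probabilistic allocation concrete, the following is a minimal toy sketch and not the mHi-C model itself. It assumes each multi-read comes with a list of candidate bin pairs, and it shares the read among those candidates in proportion to the current evidence (uni-read counts plus expected multi-read counts), iterating in an EM-like fashion. The function name `allocate_multi_reads`, the `pseudo` pseudo-count, and the input layout are all hypothetical choices made for illustration.

```python
from collections import defaultdict


def allocate_multi_reads(uni_counts, multi_reads, n_iter=20, pseudo=1.0):
    """Toy EM-style allocation of multi-reads (illustrative; NOT the mHi-C model).

    uni_counts:  dict mapping a bin pair to its uni-read count.
    multi_reads: list of lists; each inner list holds the distinct candidate
                 bin pairs of one multi-read.
    Returns one dict per multi-read mapping candidate -> allocation probability.
    """
    weights = defaultdict(float, uni_counts)
    probs = []
    for _ in range(n_iter):
        # E-step: split each multi-read across its candidates in proportion
        # to the current weights (pseudo-count keeps every candidate possible).
        probs = []
        for candidates in multi_reads:
            scores = [weights[c] + pseudo for c in candidates]
            total = sum(scores)
            probs.append({c: s / total for c, s in zip(candidates, scores)})
        # M-step: weights become uni-read counts plus expected multi-read counts.
        weights = defaultdict(float, uni_counts)
        for allocation in probs:
            for pair, p in allocation.items():
                weights[pair] += p
    return probs


# Hypothetical example: one multi-read with two candidate bin pairs.
uni = {("binA", "binB"): 9, ("binA", "binC"): 0}
reads = [[("binA", "binB"), ("binA", "binC")]]
result = allocate_multi_reads(uni, reads)
```

In this toy setting the candidate supported by more uni-reads receives the larger share of the multi-read, while the pseudo-count prevents any candidate from being assigned probability zero.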

Data availability

GEO and ENCODE accession codes for all the data analyzed in this manuscript are provided in the manuscript. Source data files have been provided for Figures 1, 3, 4, and 5 (some via Dryad, http://dx.doi.org/10.5061/dryad.v7k3140). The mHiC software is available on GitHub at https://github.com/keleslab/mHiC with proper documentation.

The following data sets were generated

1. Zheng Y
2. Ay F
3. Keles S
(2018) Data from: Generative Modeling of Multi-mapping Reads with mHi-C Advances Analysis of High Throughput Genome-wide Conformation Capture Studies
Dryad Digital Repository, doi:10.5061/dryad.v7k3140.

http://dx.doi.org/10.5061/dryad.v7k3140

The following previously published data sets were used

1. Jin F
2. Li Y
3. Dixon JR
4. Selvaraj S
5. Ye Z
6. Lee AY
7. Yen CA
8. Schmitt AD
9. Espinoza C
10. Ren B
(2013) IMR90 Hi-C Dataset
NCBI Gene Expression Omnibus, GSE43070.

https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE43070

1. Ay F
2. Bunnik EM
3. Varoquaux N
4. Bol SM
5. Prudhomme J
6. Vert JP
7. Noble WS
8. Le Roch KG
(2014) Plasmodium Hi-C Dataset
NCBI Gene Expression Omnibus, GSE50199.

https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE50199

1. Rao SSP
2. Huntley MH
3. Durand NC
4. Stamenova EK
5. Bochkov ID
6. Robinson JT
7. Sanborn AL
8. Machol I
9. Omer AD
10. Lander ES
11. Aiden EL
(2014) GM12878 Hi-C Dataset
NCBI Gene Expression Omnibus, GSE63525.

https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE63525

1. Dixon JR
2. Selvaraj S
3. Yue F
4. Kim A
5. Li Y
6. Shen Y
7. Hu M
8. Liu JS
9. Ren B
(2012) ESC(2012) Hi-C Dataset
NCBI Gene Expression Omnibus, GSE35156.

https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE35156

1. Dixon JR
2. Xu J
3. Dileep V
4. Zhan Y
5. Song F
6. Le VT
7. Galip Gurkan Yardımcı AC
8. Bann DV
9. Wang Y
10. Clark R
11. Zhang L
12. Yang H
13. Liu T
14. Iyyanki S
15. An L
16. Pool C
17. Sasaki T
18. Rivera-Mulia JC
19. Özadam H
20. Lajoie BR
21. et al
(2018) A549 Hi-C Dataset
NCBI Gene Expression Omnibus, GSE92819.

https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE92819

1. Bonev B
2. Cohen NM
3. Szabo Q
4. Fritsch L
5. Papadopoulos GL
6. Lubling Y
7. Xu X
8. Lv X
9. Hugnot JP
10. Tanay A
11. et al
(2017) ESC(2017) & Cortex Hi-C Datasets
NCBI Gene Expression Omnibus, GSE96107.

https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE96107

Article and author information

Author details

Ye Zheng

Department of Statistics, University of Wisconsin-Madison, Madison, United States

Competing interests
The authors declare that no competing interests exist.

ORCID iD: 0000-0002-8806-2761

Ferhat Ay

La Jolla Institute for Allergy and Immunology, La Jolla, United States

Competing interests
The authors declare that no competing interests exist.

Sunduz Keles

Department of Statistics, University of Wisconsin-Madison, Madison, United States

For correspondence
keles@stat.wisc.edu

Competing interests
The authors declare that no competing interests exist.

ORCID iD: 0000-0001-9048-0922

Funding

National Human Genome Research Institute (HG009744)

Sunduz Keles

La Jolla Institute for Allergy and Immunology (Institute Leadership Funds)

Ferhat Ay

National Human Genome Research Institute (HG007019)

Sunduz Keles

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Copyright

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.