Generative modeling of multi-mapping reads with mHi-C advances analysis of Hi-C studies

  1. Ye Zheng
  2. Ferhat Ay
  3. Sunduz Keles (corresponding author)
  1. University of Wisconsin-Madison, United States
  2. La Jolla Institute for Allergy and Immunology, United States

Abstract

Current Hi-C analysis approaches are unable to account for reads that align to multiple locations and hence underestimate biological signal from repetitive regions of genomes. We developed and validated mHi-C, a multi-read mapping strategy that probabilistically allocates Hi-C multi-reads. On benchmarks, mHi-C outperformed both using only uni-reads and heuristic approaches aimed at rescuing multi-reads. Specifically, mHi-C increased the sequencing depth by an average of 20%, resulting in higher reproducibility of contact matrices and detected interactions across biological replicates. The impact of multi-reads on the detection of significant interactions is influenced marginally by the relative contribution of multi-reads to the sequencing depth compared to uni-reads, the cis-to-trans ratio of contacts, and the broad data quality as reflected by the proportion of mappable reads of datasets. Computational experiments highlighted that in Hi-C studies with short read lengths, mHi-C-rescued multi-reads can emulate the effect of longer reads. mHi-C also revealed biologically supported bona fide promoter-enhancer interactions and topologically associating domains involving repetitive genomic regions, thereby unlocking a previously masked portion of the genome for conformation capture studies.
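
To make "probabilistically allocate" concrete, here is a minimal toy sketch of the general idea, not the mHi-C model itself. It assumes each multi-read comes with a list of distinct candidate bin pairs and that a bin pair's weight is simply its uni-read count plus a pseudocount. The function name `allocate_multi_reads`, the `pseudo` parameter, and the example counts are all hypothetical; the generative model in the paper may use different or additional terms.

```python
def allocate_multi_reads(uni_counts, multi_reads, n_iter=50, pseudo=1.0):
    """Toy EM-style allocation (illustrative assumption, not mHi-C's model).

    uni_counts:  dict mapping a bin pair to its uni-read count
    multi_reads: list of candidate lists, one list of distinct bin pairs
                 per multi-read
    Returns one dict per multi-read mapping each candidate bin pair to
    an allocation probability.
    """
    bins = set(uni_counts) | {c for cands in multi_reads for c in cands}
    # Start from uni-read evidence only; the pseudocount keeps weights positive.
    weight = {b: uni_counts.get(b, 0) + pseudo for b in bins}
    posteriors = []
    for _ in range(n_iter):
        # E-step: split each multi-read across its candidates by current weight.
        posteriors = []
        for cands in multi_reads:
            total = sum(weight[c] for c in cands)
            posteriors.append({c: weight[c] / total for c in cands})
        # M-step: recompute weights from uni-reads plus fractional multi-reads.
        weight = {b: uni_counts.get(b, 0) + pseudo for b in bins}
        for post in posteriors:
            for c, p in post.items():
                weight[c] += p
    return posteriors


# Hypothetical example: one multi-read with two candidate bin pairs.
pair_a = ("chr1:bin0", "chr1:bin1")
pair_b = ("chr1:bin0", "chr2:bin5")
posteriors = allocate_multi_reads({pair_a: 9, pair_b: 0}, [[pair_a, pair_b]])
print(posteriors[0])
```

In this toy setting the multi-read is assigned mostly to the bin pair that already has uni-read support, which is the intuition behind rescuing multi-reads rather than discarding them.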

Data availability

GEO and ENCODE accession codes for all the data analyzed in this manuscript are provided in the manuscript. Source data files have been provided for Figures 1, 3, 4, and 5 (some via Dryad, http://dx.doi.org/10.5061/dryad.v7k3140). The mHiC software is available on GitHub at https://github.com/keleslab/mHiC with proper documentation.

Article and author information

Author details

  1. Ye Zheng

    Department of Statistics, University of Wisconsin-Madison, Madison, United States
    Competing interests
    The authors declare that no competing interests exist.
    ORCID: 0000-0002-8806-2761
  2. Ferhat Ay

    La Jolla Institute for Allergy and Immunology, La Jolla, United States
    Competing interests
    The authors declare that no competing interests exist.
  3. Sunduz Keles

    Department of Statistics, University of Wisconsin-Madison, Madison, United States
    For correspondence
    keles@stat.wisc.edu
    Competing interests
    The authors declare that no competing interests exist.
    ORCID: 0000-0001-9048-0922

Funding

National Human Genome Research Institute (HG009744)

  • Sunduz Keles

La Jolla Institute for Allergy and Immunology (Institute Leadership Funds)

  • Ferhat Ay

National Human Genome Research Institute (HG007019)

  • Sunduz Keles

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Copyright

© 2019, Zheng et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

Cite this article

Ye Zheng, Ferhat Ay, Sunduz Keles (2019) Generative modeling of multi-mapping reads with mHi-C advances analysis of Hi-C studies. eLife 8:e38070. https://doi.org/10.7554/eLife.38070
