Regulatory genome annotation of 33 insect species

  1. Program in Genetics, Genomics, and Bioinformatics, University at Buffalo-State University of New York, Buffalo NY 14203
  2. Department of Biology, Miami University, Oxford OH 45056
  3. Department of Biochemistry, University at Buffalo-State University of New York, Buffalo NY 14203
  4. Department of Biomedical Informatics, University at Buffalo-State University of New York, Buffalo NY 14203
  5. Department of Biological Sciences, University at Buffalo-State University of New York, Buffalo NY 14260

Peer review process

Not revised: This Reviewed Preprint includes the authors’ original preprint (without revision), an eLife assessment, public reviews, and a provisional response from the authors.


Editors

  • Reviewing Editor
    Yamini Dalal
    National Cancer Institute, Bethesda, United States of America
  • Senior Editor
    Yamini Dalal
    National Cancer Institute, Bethesda, United States of America

Reviewer #1 (Public Review):

Summary:

The authors provide a genome annotation resource of 33 insects using a motif-blind prediction method for tissue-specific cis-regulatory modules. This is a welcome addition that may facilitate further research in new laboratory systems, and the approach seems to be relatively accurate, although it should be combined with other sources of evidence to be practical.

Strengths:

The paper clearly presents the resource, including the testing in Drosophila of candidate enhancers identified from various insects. This cross-species analysis, and the inherent suggestion that training datasets generated in flies can predict cis-regulatory activity in distant insects, is interesting. While I cannot be sure this approach will prevail in the future, for example over approaches that leverage the prediction of TF binding motifs, the SCRMshaw tool is certainly useful and worth consideration for the large community of genome scientists working on insects.

Weaknesses:

While the authors made the effort to provide access to the SCRMshaw annotations via the REDfly database, the usefulness of this resource is somewhat limited at the moment. First, it is possible to generate tables of annotated elements with coordinates, but it would be more useful to allow downloads of the 33 genome annotations in GFF (or equivalent) format, with SCRMshaw predictions appearing as a new feature. Second, the annotations for some species seem to have issues in the current REDfly implementation. For example, queries for Vcar and Jcoen return empty results.
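To make the GFF suggestion concrete, the following is a minimal sketch of how a table of predicted elements with coordinates could be serialized as GFF3 records. It is not the authors' or REDfly's export code. The function names, the `training_set` attribute, the `SCRMshaw` source label, and the choice of `regulatory_region` as the feature type are assumptions made for illustration, and coordinates are assumed to already be 1-based and inclusive.

```python
from urllib.parse import quote


def gff3_line(seqid, start, end, score, pred_id, training_set,
              source="SCRMshaw", feature_type="regulatory_region"):
    """Format one predicted element as a 9-column GFF3 record.

    Coordinates are assumed to be 1-based and inclusive, as GFF3 requires.
    All argument names here are hypothetical, not taken from the manuscript.
    """
    if start < 1 or end < start:
        raise ValueError(f"invalid coordinates: {start}-{end}")
    # Percent-encode attribute values so that ';', '=', ',' and whitespace
    # cannot break the attribute column.
    attributes = ";".join([
        f"ID={quote(str(pred_id), safe='')}",
        f"training_set={quote(str(training_set), safe='')}",
    ])
    # Strand and phase are '.' because a predicted enhancer is unstranded
    # and has no coding phase.
    fields = [seqid, source, feature_type, str(start), str(end),
              f"{score:.3f}", ".", ".", attributes]
    return "\t".join(fields)


def write_gff3(rows, handle):
    """Write a GFF3 header followed by one record per prediction dict."""
    handle.write("##gff-version 3\n")
    for row in rows:
        handle.write(gff3_line(**row) + "\n")
```

Each row passed to `write_gff3` is a dict whose keys match the arguments of `gff3_line`; the resulting file could then be loaded alongside an existing gene annotation in a genome browser.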

Reviewer #2 (Public Review):

Summary:

The ability of researchers to identify and compare enhancers across different species is an important facet of understanding gene regulation across development and evolution. Many traditional methods of enhancer identification involve sequence alignments and manual annotations, limiting the ability to expand the scope of regulatory investigations into many species. In order to overcome this obstacle, the authors apply a previously published machine learning method called SCRMshaw to predict enhancers across 33 insect species, using D. melanogaster as a reference. SCRMshaw operates through the selection of a few dozen training loci in a reference genome, marking genomic loci in other species that are significantly enriched with similar k-mer distributions relative to randomly selected genomic backgrounds. Upon identification of predicted enhancer regions, the authors perform a post-processing filtering step and identify the most likely predicted enhancer candidates based on the proximity of an orthologous target gene. They then perform reporter gene analysis to validate selected predicted enhancers from other species in D. melanogaster. The analysis of the expression patterns returned variable results across the selected predicted regions.
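The k-mer enrichment idea summarized above can be illustrated with a generic sketch. This is not SCRMshaw's actual scoring scheme, which is described in the authors' earlier publications; it is a simple log-odds score, with the function names, the pseudocount, and the treatment of unseen k-mers all chosen here as assumptions.

```python
import math
from collections import Counter


def kmer_counts(seqs, k):
    """Count all overlapping k-mers across a list of DNA sequences."""
    counts = Counter()
    for seq in seqs:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def log_odds_table(training, background, k, pseudocount=1.0):
    """Log ratio of training to background frequency for each observed k-mer.

    The pseudocount assumes a 4-letter alphabet (4**k possible k-mers).
    """
    fg = kmer_counts(training, k)
    bg = kmer_counts(background, k)
    vocab = 4 ** k
    fg_total = sum(fg.values()) + pseudocount * vocab
    bg_total = sum(bg.values()) + pseudocount * vocab
    return {
        kmer: math.log((fg[kmer] + pseudocount) / fg_total)
        - math.log((bg[kmer] + pseudocount) / bg_total)
        for kmer in set(fg) | set(bg)
    }


def score_window(window, table, k):
    """Sum the log-odds of every k-mer in a window.

    K-mers absent from the table are skipped, which is a simplification.
    """
    window = window.upper()
    return sum(table.get(window[i:i + k], 0.0)
               for i in range(len(window) - k + 1))
```

In a scheme of this kind, a candidate region in another species scores highly when its k-mer content resembles the training loci more than the background, which is why the choice of training and background sets matters for the predictions.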

Strengths:

The authors provide annotations of predicted regions across dozens of insect species, with the intention of expanding and refining the annotations for use by the scientific field. This is useful, as researchers will be able to use the identified annotations for their own work or as a benchmark for future methods. This work also showcases the flexible and versatile nature of SCRMshaw, which can readily obtain predictions using training sets of genomic loci requiring only a few dozen annotations as input. SCRMshaw does not require sequence alignments of the enhancers and can operate without prior knowledge of cis-regulatory sequence rules such as transcription factor binding motifs, making it a useful tool to explore the evolution of enhancers in more distantly related and less well-studied species.

Weaknesses:

This work provides predicted enhancer annotations across many insect species, with reporter gene analysis being conducted on selected regions to test the predictions. However, the code for the SCRMshaw analysis pipeline used in this work is not made available, making reproducibility of this work difficult. Additionally, while the authors claim the predicted enhancers are available within the REDfly database, the predicted enhancer coordinates are currently not downloadable as Supplementary Material or from a linked resource.

The authors do not validate or benchmark the application of SCRMshaw against other published methods, nor do they seek to apply SCRMshaw under a variety of conditions to confirm the robustness of the returned predicted enhancers across species. Since SCRMshaw relies on an established k-mer enrichment of the training loci, its performance is presumably highly sensitive to the selection of training regions as well as the statistical power of the given k-mer counts. The authors do not justify their selection of the training regions used to generate the predictions.

While there is an attempt made to report and validate the annotated predicted enhancers using previously published data and tools, the validation lacks the depth to conclude with confidence that the predicted set of regions across each species is of high quality. In vivo reporter assays were conducted to anecdotally confirm the validity of a few selected regions experimentally, but even these results are difficult to interpret. There is no large-scale attempt to assess the conservation of enhancer function across all annotated species.

Lastly, it is suggested that predicted regions are derived from the shared presence of sequence features such as transcription factor binding motifs, detected through k-mer enrichment via SCRMshaw. This assumption has not been examined, although there are public motif discovery tools that would be appropriate to discover whether SCRMshaw is assigning predicted regions based on previously understood motif grammar, or due to other sequence patterns captured by k-mer count distributions. Understanding the sequence-derived nature of what drives predictions is within the scope of this work and would boost confidence in the predicted enhancers, even if it is limited to a few training examples for the sake of clarity of interpretation.

Reviewer #3 (Public Review):

Summary:

In this ambitious paper, the authors develop an unparalleled community resource of insect genome regulatory annotations spanning five insect orders. They employ their previously developed SCRMshaw method for computational cross-species enhancer prediction, drawing on available training datasets of validated enhancer sequence and expression from Drosophila melanogaster; the method had previously been shown to perform well across select holometabolous insects (representing 160-345 MY of divergence). In this work, they expand regulatory sequence annotation to 33 insect genomes spanning Holometabola and Hemiptera, the latter being even more distantly related to the fly model. They perform multiple downstream analyses of sets of predicted enhancers to assess the true-positive rate of predictions. The independent comparisons of real predictions with simulated predictions and with chromatin accessibility data, as well as the functional validation through reporter gene analysis, strengthen their conclusion that the annotation pipeline achieves a high true-positive rate and can be used across long divergence times to computationally annotate regulatory genome regions. This ability was previously inaccessible for non-model insects and is now possible across the many newly sequenced scaffold-level insect genomes.

Strengths:

This work fills a large gap in current methods and resources for predicting regulatory regions of the genome, a task that has long lagged behind that of coding region prediction and analysis.

Despite technical constraints in working outside of well-developed model insect systems, the authors creatively draw on existing resources to scaffold a pipeline and independently assess the likelihood of prediction validity.

The established database will be a welcome community resource in its current state, and even more so as the authors continue to expand their annotations to more insect genomes as they indicate. Their available analysis pipeline itself will be useful to the community as well for research groups that may want to undertake their own regulatory genome annotation.

Weaknesses:

Based on the simulations and the comparison to datasets of accessible chromatin, the rates of predicted true-positive enhancer identification vary widely across the genomes included here, in a manner that doesn't map neatly onto phylogenetic distance. At this point, it is unclear why these patterns arise, although this may become clearer as regulatory annotation is undertaken for more genomes.

Functional assessment of predicted enhancers was performed through reporter gene assays primarily in Drosophila melanogaster imaginal discs, a system amenable to transgenics. Unfortunately, this mode of canonical imaginal disc development is only representative of a subset of all holometabolous insects; therefore, it is difficult to interpret reporter gene expression in a fly imaginal disc as evidence of a true positive enhancer that would be active in its native species whose adult appendages develop differently through the larval stage (for example, Coleopteran and Lepidopteran legs). However, the reporter gene assays from other tissues do offer strong evidence of true positive enhancer detection, and constraints on transgenic experiments in other systems mean that this approach is the best available.

Author response:

We thank the reviewers for their thoughtful and insightful comments. We were pleased to see that the reviewers and editors consider our work a “welcome addition” that “fills a large gap” in comparative genomics methods and provides “an unparalleled community resource of insect genome regulatory annotations.”

Many of the reviewers’ comments reflect weaknesses in our description of the methodology. As the basic SCRMshaw methodology has been published previously, we had opted for brevity over detail in the current manuscript. We recognize now that we went too far in that direction, and we will include more methodological detail in our revised submission, along with easier access to the code we used. The reviewers also offered some helpful suggestions regarding data availability which we intend to address, including direct download of the results in GFF format and adding to the results database several species that were inadvertently omitted.

Reviewer 2 expressed concerns about benchmarking SCRMshaw against other methods. We respectfully feel this lies outside the scope of the current study, which focuses on application of SCRMshaw to generate a multi-species annotation resource rather than on an attempt to show that SCRMshaw is superior to other approaches. We provide evidence in this manuscript, as well as in previous publications, that supports the effectiveness of SCRMshaw as an approach for regulatory element discovery that is suitable for the task at hand. Benchmarking for regulatory element discovery brings many challenges, as there are no comprehensive “truth” sets to serve as a comparison baseline. We therefore do not attempt strong claims here about the relative merits of SCRMshaw vs. other methods (although we have explored this in previous publications). Note that we also previously demonstrated commonality of transcription factor binding sites in cross-species SCRMshaw predictions, in particular in Kazemian et al. 2014 (Genome Biol. Evol. 6:2301).

Finally, because it has important implications for understanding our results, we would like to point out a small misconception in Reviewer 2’s Summary of our study. The reviewer states that we “identify the most likely predicted enhancer candidates based on the proximity of an orthologous target gene.” We stress, however, that putative target gene assignments and identities have no impact at all on our prediction of regulatory sequences. Predictions are solely based on sequence-dependent SCRMshaw scores, with no regard to the nature or identities of nearby annotated features. Putative target genes are mapped to Drosophila orthologs purely as a convenience to aid in interpreting and prioritizing the predicted regulatory elements. We will take care to clarify this important point in our revised submission.
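The separation described above, in which scoring comes first and target-gene labeling is added afterwards purely as an annotation, can be illustrated with a small sketch. The nearest-gene rule, the function names, and the dictionary keys are assumptions made for illustration only; the assignment rule actually used in the manuscript is not specified in this response.

```python
def nearest_gene(pred_start, pred_end, genes):
    """Return (gene_name, distance) for the gene closest to a prediction.

    `genes` is a list of (name, start, end) tuples on the same scaffold.
    A gene overlapping the prediction has distance 0.
    """
    def distance(gene):
        _, g_start, g_end = gene
        if g_end < pred_start:
            return pred_start - g_end
        if g_start > pred_end:
            return g_start - pred_end
        return 0

    best = min(genes, key=distance)
    return best[0], distance(best)


def annotate(predictions, genes):
    """Attach a putative target to each prediction without altering it.

    Scores and coordinates are copied through unchanged, so the gene
    labels cannot influence which regions were predicted.
    """
    annotated = []
    for pred in predictions:
        name, dist = nearest_gene(pred["start"], pred["end"], genes)
        annotated.append({**pred, "putative_target": name, "distance": dist})
    return annotated
```

Because `annotate` only adds fields to predictions that already exist, removing or changing the gene list would alter the labels but not the set of predicted regions or their scores.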
