The SCRMshaw method and analysis pipeline.

(A) Supervised motif-blind CRM discovery (SCRMshaw). (a) SCRMshaw uses a training set of known D. melanogaster enhancers (“training sequences”), drawn from REDfly and defined by a common functional characterization, and a 10-fold larger background set of similarly sized non-enhancer sequences (“background sequences”). (b) The short DNA subsequence (kmer) count distributions of these sequences are then used to train a statistical model. (c) The trained model is used to score overlapping windows in the “target genome.” (d) High-scoring regions are predicted to be functional regulatory sequences (asterisks). Figure adapted from (73). (B) The workflow used for the regulatory genome annotation described in this paper. The left side shows pre-processing steps; the right side shows post-processing steps. Input to SCRMshaw consists of the genome sequence and gene annotation. A protein sequence annotation is supplied later for the orthology mapping step. Final results are made available as part of the REDfly regulatory annotation knowledgebase.
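The train-then-scan logic of panel A can be illustrated with a minimal sketch. It is not SCRMshaw's actual scoring code: the log-odds kmer model, the pseudocount, the window and step sizes, and all function names here are assumptions chosen only to show the general idea of training on kmer counts and scoring overlapping windows.

```python
import math
from collections import Counter
from itertools import product


def kmer_counts(seqs, k):
    """Count every overlapping kmer across a list of sequences."""
    counts = Counter()
    for s in seqs:
        s = s.upper()
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts


def train_log_odds(training, background, k=3, pseudocount=1.0):
    """Hypothetical model: log ratio of kmer frequency in training vs. background."""
    fg = kmer_counts(training, k)
    bg = kmer_counts(background, k)
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    fg_total = sum(fg[m] for m in kmers) + pseudocount * len(kmers)
    bg_total = sum(bg[m] for m in kmers) + pseudocount * len(kmers)
    return {
        m: math.log(((fg[m] + pseudocount) / fg_total)
                    / ((bg[m] + pseudocount) / bg_total))
        for m in kmers
    }


def score_windows(genome, model, k=3, window=12, step=4):
    """Score overlapping windows; returns (start, end, score) tuples."""
    genome = genome.upper()
    scores = []
    for start in range(0, len(genome) - window + 1, step):
        win = genome[start:start + window]
        # kmers containing non-ACGT characters contribute zero
        total = sum(model.get(win[i:i + k], 0.0)
                    for i in range(window - k + 1))
        scores.append((start, start + window, total))
    return scores


def top_windows(scores, n=1):
    """Return the n highest-scoring windows (the candidate predictions)."""
    return sorted(scores, key=lambda t: t[2], reverse=True)[:n]


# Toy data, invented for illustration only.
training = ["GATAGATAGATA", "AGATAAGATAAG"]
background = ["CCCCGGGGCCCC", "GCGCGCGCGCGC"]
genome = "CCCCCCCCCCCC" + "GATAGATAGATAGATA" + "CCCCCCCCCCCC"
model = train_log_odds(training, background)
best = top_windows(score_windows(genome, model))[0]
```

In this toy case the best window falls inside the GATA-rich stretch, mirroring step (d), where high-scoring regions become predictions.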

Annotation of 33 insect genomes.

(A) Genomes from five insect orders were annotated in this study (more are ongoing). (B) Percentage of genes with Drosophila orthologs as mapped via our orthology pipeline (see Methods), for the 15 mapped species. For complete species names, see Table 1. (C) Total SCRMshaw predictions for each species. For each species, the lefthand column shows cumulative results for each SCRMshaw sub-method summed over each of the 48 training sets. The righthand column shows the number of unique predictions after merging overlapping predictions from both sub-methods and training sets. Species are displayed alphabetically by taxonomic order. (D) Size distribution of SCRMshaw predictions, prior to merging overlapping predictions but after removing outlier predictions > 2 kb in length. Species are ordered identically to panel C.
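The merging of overlapping predictions behind the righthand columns of panel C can be sketched as a standard interval merge. This is an illustration under stated assumptions, not the pipeline's own code: coordinates are taken to be half-open (start, end) on a named scaffold, only strictly overlapping intervals are merged, and the optional `max_len` filter (mirroring the 2 kb cutoff of panel D) is a hypothetical parameter.

```python
def merge_predictions(predictions, max_len=None):
    """Merge overlapping (scaffold, start, end) intervals.

    Assumes half-open coordinates. If max_len is given, predictions
    longer than max_len are dropped before merging.
    """
    kept = [
        (chrom, start, end)
        for chrom, start, end in predictions
        if max_len is None or end - start <= max_len
    ]
    kept.sort()  # by scaffold, then start, then end
    merged = []
    for chrom, start, end in kept:
        if merged and merged[-1][0] == chrom and start < merged[-1][2]:
            # Overlaps the previous interval: extend it.
            prev = merged[-1]
            merged[-1] = (chrom, prev[1], max(prev[2], end))
        else:
            merged.append((chrom, start, end))
    return merged


# Invented example coordinates.
preds = [
    ("chr2L", 100, 600),
    ("chr2L", 500, 900),
    ("chr2L", 900, 1200),   # touches but does not overlap the previous one
    ("chr3R", 100, 400),
    ("chr2L", 5000, 8000),  # 3 kb, removed when max_len=2000
]
unique = merge_predictions(preds, max_len=2000)
```

The count of entries in `unique` corresponds to what a "unique predictions" column would report for this toy input.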

Species used in this study

SCRMshaw makes multiple predictions per locus.

The numbers of SCRMshaw predictions per locus (y-axis) are shown as boxplots for loci falling within the given size ranges (x-axis). Black boxes cover the 25th-75th percentiles, bars indicate median values, and dots indicate values exceeding 1.5 times the interquartile range (boxes are not visible for all bins due to very low degrees of variation). Values in pink represent expected values drawn from randomization, while values in blue represent observed values from SCRMshaw. All values are from results with the training set “mapping1.visceral_mesoderm”; results from other training sets were similar (see Supplemental Table S3). Shown are results from the genomes of (A) D. melanogaster, (B) C. pipiens, and (C) A. aegypti, representing small, medium, and large genomes, respectively.
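The boxplot elements described above can be computed as in the following sketch. It assumes the common Tukey convention, in which outlier dots lie more than 1.5 times the interquartile range beyond the box edges, and it uses the "inclusive" quantile method of Python's standard library; the plotting software used for the figure may compute quartiles slightly differently.

```python
import statistics


def box_summary(values):
    """Quartiles, IQR, and Tukey-style outliers for one size bin."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    outliers = [v for v in values if v < low_fence or v > high_fence]
    return {"q1": q1, "median": median, "q3": q3,
            "iqr": iqr, "outliers": outliers}


# Invented predictions-per-locus counts for a single size bin.
summary = box_summary([1, 2, 2, 3, 3, 3, 4, 10])
```

When nearly all loci in a bin have the same count, `iqr` approaches zero, which is why a box can collapse to a line for some bins.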

SCRMshaw predicts CRMs in orthologous loci across species.

(A) The number of loci in common that contain at least one SCRMshaw prediction, for 10 or more species. (B) z-scores demonstrating that the number of loci in common with one or more SCRMshaw predictions is significantly higher than expectation, based on 360 randomizations. The small number of common predictions for 14-16 species makes these statistics unreliable. Dotted lines indicate z-score values representing significance at the (unadjusted) P<0.005 and P<0.05 levels. (C) Fold enrichment values illustrating the excess of loci in common with one or more SCRMshaw predictions compared to expectation. Dotted line shows 1.5x enrichment.
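The statistics in panels B and C follow from comparing an observed count with counts obtained from randomizations. The sketch below shows one conventional way to compute them; the use of the sample standard deviation, the one-sided normal approximation for the P value, and the function name are assumptions, and the numbers are invented.

```python
import statistics


def enrichment_stats(observed, randomized):
    """z-score, fold enrichment, and approximate one-sided P value.

    observed   -- count of shared loci with predictions in the real data
    randomized -- the same count from each randomization
    """
    mean = statistics.fmean(randomized)
    sd = statistics.stdev(randomized)  # sample standard deviation (assumption)
    if sd == 0 or mean == 0:
        raise ValueError("randomized counts have zero mean or zero variance")
    z = (observed - mean) / sd
    fold = observed / mean
    # Normal approximation for the upper tail (assumption).
    p_one_sided = 1.0 - statistics.NormalDist().cdf(z)
    return z, fold, p_one_sided


# Invented counts; the study used 360 randomizations.
z, fold, p = enrichment_stats(20, [8, 10, 12, 10])
```

With few shared loci, as for the 14-16 species bins, the randomized counts are small integers and the z-score becomes unstable, consistent with the caveat in panel B.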

Overlap of SCRMshaw predictions with FAIRE-seq and ATAC-seq peaks

Previously described gene expression and enhancer activity for select D. melanogaster sequences predicted by SCRMshaw.

The lefthand column shows native D. melanogaster gene expression in imaginal discs (green), while the righthand column shows described enhancer activity (magenta). Gray shading indicates that expression has not been described. Moving clockwise from the left side of each panel are the wing, haltere, leg, and eye-antennal discs. The enhancers whose activities are described in the table are: (B) ex_BCDE (37), (F) hth_GMR46D04 (42), (H) Ubx_GMR39A02 (42), (J) psq_GMR41E12 (42).

Gene loci chosen for in vivo validation

SCRMshaw predictions chosen for in vivo validation

Reporter gene expression for tested ex, klu, and ush predicted enhancer sequences.

Each row shows expression for the indicated construct in (i) wing discs, (ii) haltere discs, (iii) T1 (prothoracic) leg discs, (iv) T2 (mesothoracic) leg discs, (v) T3 (metathoracic) leg discs, and (vi) eye-antennal discs (with eye portion to the left). Enhancer activities were visualized by UAS-tdTomato that was included in the reporter construct. Scale bar is 50 µm for each column.

Reporter gene expression for tested hth, Ubx, and psq predicted enhancer sequences.

Each row shows expression for the indicated construct in (i) wing discs, (ii) haltere discs, (iii) T1 (prothoracic) leg discs, (iv) T2 (mesothoracic) leg discs, (v) T3 (metathoracic) leg discs, and (vi) eye-antennal discs (with eye portion to the left). Enhancer activities were detected by the G-TRACE system; magenta represents direct enhancer activity detected by dsRed expression, while green indicates lineage-based GFP expression. Scale bar is 50 µm for each column.

Summary of in vivo validation results