HOT loci are prevalent in the genome.

A) Distribution of the number of loci by the number of overlapping peaks in 400bp loci. Loci are binned on a logarithmic scale (Table 1, Methods). The shaded region represents the HOT loci. B) Prevalence of DAPs in HOT loci. Each dot represents a DAP. X-axis: percentage of HOT loci in which the DAP is present (e.g. MAX is present in 80% of HOT loci). Y-axis: percentage of the DAP's total peaks that are located in HOT loci (e.g. 45% of all ChIP-seq peaks of MAX are located in HOT loci). Dot color and size are proportional to the total number of ChIP-seq peaks of the DAP. C) Breakdown of HepG2 HOT loci into promoter, intronic, and intergenic regions. D) Fractions of HOT enhancer and promoter loci located in ATAC-seq regions. E) Overlaps between the HOT enhancer, HOT promoter, super-enhancer, regular enhancer, H3K27ac, and H3K4me1 regions. Horizontal bars on the bottom left represent the total number of loci in the corresponding class. All of the visualized data is generated from the HepG2 cell line.

Schema of classifying loci according to the number of bound DAPs.

The initial 4 bins contain loci bound by a number of DAPs that increases linearly from 1 to 5 (gray fields). The remaining 10 bins are defined by 11 edge values increasing on a logarithmic scale from 5 to the maximum number of available DAPs in each cell line (orange and red fields), computed with the NumPy formula np.logspace(np.log10(5), np.log10(max_tfs), 11, dtype=int). HOT loci correspond to the last 5 bin edges (red fields).
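The binning schema can be sketched in a few lines. This is a minimal illustration built around the formula quoted in the caption, not the authors' pipeline: the function name dap_bin_edges, the value max_tfs = 500, and the dap_counts array are hypothetical and chosen only for the example.

```python
import numpy as np

def dap_bin_edges(max_tfs):
    """Return 15 bin edges (14 bins) for the number of bound DAPs."""
    # Linear part: edges 1, 2, 3, 4 (the next edge, 5, opens the log part).
    linear_edges = np.arange(1, 5)
    # Logarithmic part, as quoted in the caption: 11 edges from 5 to max_tfs.
    # dtype=int truncates, so an endpoint may land one below its nominal
    # value because of floating-point rounding.
    log_edges = np.logspace(np.log10(5), np.log10(max_tfs), 11, dtype=int)
    return np.concatenate([linear_edges, log_edges])

# Hypothetical maximum number of DAPs; not a value taken from the paper.
edges = dap_bin_edges(500)

# Hypothetical per-locus counts of bound DAPs.
dap_counts = np.array([1, 3, 8, 40, 120, 450])

# Number of loci falling into each of the 14 bins.
loci_per_bin, _ = np.histogram(dap_counts, bins=edges)
print(edges)
print(loci_per_bin)
```

Because the edges are truncated to integers, it is worth printing them for the actual max_tfs of each cell line and checking that the first and last edges are the intended ones.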

PCA plots of HOT loci based on the DAP presence vectors.

Each dot represents a HOT locus. A) PC1 and PC2, with promoters and enhancers marked. B) PC1 and PC2, with p300-bound HOT loci marked. C) PC1 and PC4, with CTCF-bound HOT loci marked. The dashed lines in A, B, and C are logistic regression lines, and the auROC values are those of the corresponding logistic regression. D) DAPs hierarchically clustered by their involvement in HOT promoters and HOT enhancers. Heatmap colors indicate the % of HOT enhancers or promoters that a given DAP overlaps with. All of the visualized data is generated from the HepG2 cell line.

A) Densities of long-range Hi-C chromatin contacts between the DAP-bound loci.

Each horizontal and vertical bin represents the loci with the number of bound DAPs between the edge values. The density values of each cell are normalized by the maximum value across all pairwise bins. Green boxes represent HOT loci. B) Distribution of HOT loci in Hi-C contact regions. X-axis is the number of Hi-C contacts. Numbers in the top row indicate the total number of genomic loci engaging in the given number of Hi-C contacts. Bars indicate the % of Hi-C loci that contain at least one HOT locus. C) Distribution of the number of HOT loci in regions with a given number of Hi-C contacts. X-axis is the same as B. All of the visualized data is generated from the HepG2 cell line.
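The normalization used in panel A amounts to dividing every cell of the pairwise bin matrix by the largest cell. A minimal sketch, assuming the densities are held in a square NumPy array; the function name normalize_by_max and the example matrix are hypothetical.

```python
import numpy as np

def normalize_by_max(density):
    """Scale a matrix of contact densities so that its largest cell is 1."""
    density = np.asarray(density, dtype=float)
    peak = density.max()
    if peak == 0:
        # Nothing to scale: every pairwise bin has zero density.
        return density
    return density / peak

# Hypothetical 3x3 matrix of contact densities between DAP-count bins.
raw = np.array([[0.2, 0.5, 1.0],
                [0.5, 1.5, 2.0],
                [1.0, 2.0, 4.0]])
normalized = normalize_by_max(raw)
print(normalized)
```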

HOT regions induce strong ChIP-seq signals.

A) Distribution of the signal values of the ChIP-seq peaks by the number of bound DAPs. The shaded region represents the HOT loci. B,C) DAPs sorted by the ratio of the ChIP-seq signal strength of the peaks located in HOT loci to that of the peaks in non-HOT loci. The 20 most HOT-specific (red bars) and 20 most non-HOT-specific (blue bars) DAPs are depicted. B) Fold change (log2) of the HOT and non-HOT loci ChIP-seq signals. C) Distribution of the average ChIP-seq signal in the loci binned by the number of bound DAPs. Rows represent the loci with the number of bound DAPs indicated by the edge values (y-axis). Green boxes demarcate the HOT regions. D) Signal values of ssDAPs, nssDAPs (see the text for description), H3K27ac, CTCF, and p300 peaks in HOT promoters and enhancers. All of the visualized data is generated from the HepG2 cell line.

Sequence features of HOT loci.

A) Distribution of conservation scores in loci bound by DAPs in HepG2 and K562. The logarithmic part of the bins is expressed in terms of the percentages of loci that each bin covers, averaged over the two cell lines. The shaded region represents HOT loci. B) phastCons conservation scores of regular enhancer, HOT loci, and exon regions. The values are normalized by the average score of regular enhancers. C) Classification performance (auROC) of HOT loci against the backgrounds of DHS, promoter, and regular enhancer regions. The x-axis values are the methods used for classification. Methods starting with “seq-” are based on sequences (CNNs and gkmSVM). Methods starting with “feat-” use all of the sequence features (GC, CpG, GpC, CpG island).

HOT promoters are ubiquitous and HOT enhancers are tissue-specific.

A) Fractions of housekeeping genes regulated by the given category of loci (blue) and fractions of the loci that regulate housekeeping genes (orange). B) Tissue-specificity (tau) scores of the target genes of different types of regulatory regions. C) Enriched GO terms of HOT promoters and enhancers of HepG2. A value of 0 in the p-value columns indicates that the GO term was not present in the top 50 enriched terms as reported by the GREAT tool. All of the visualized data is generated from the HepG2 cell line.
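For readers unfamiliar with the tau score in panel B, the sketch below computes the commonly used tau index for a single gene: tau = sum(1 - x_i / x_max) / (n - 1), where x_i is the expression in tissue i and n is the number of tissues. The caption does not state which variant or expression normalization was used, so this is an assumption; the function name tau_score and the example profiles are hypothetical.

```python
import numpy as np

def tau_score(expression):
    """Tau tissue-specificity index of one gene across n tissues.

    0 means uniform expression (housekeeping-like);
    1 means expression in a single tissue.
    """
    x = np.asarray(expression, dtype=float)
    if x.size < 2:
        raise ValueError("tau needs expression values from at least 2 tissues")
    x_max = x.max()
    if x_max <= 0:
        raise ValueError("tau is undefined when the gene is not expressed")
    # Sum of (1 - x_i / x_max) over tissues, divided by (n - 1).
    return float(np.sum(1.0 - x / x_max) / (x.size - 1))

# Hypothetical expression profiles across four tissues.
print(tau_score([5, 5, 5, 5]))   # uniform expression
print(tau_score([0, 0, 0, 10]))  # expression in one tissue only
```

Under this definition, low tau values for the targets of a class of loci would be consistent with ubiquitous expression, and high values with tissue-specific expression.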

H1-hESC HOT loci A) Overlaps between the HOT loci of three cell lines.

B) Overlaps between the HOT loci of the cell lines, defined using the set of DAPs available in all three cell lines. C) Fractions of H1 HOT loci overlapping those of HepG2 and K562 using the complete set of DAPs, the common DAPs, and DAPs randomly subsampled in HepG2/K562 to match the size of the H1 DAP set. D) phastCons scores of HOT loci in HepG2, K562, and H1.

Densities of variants A) common INDELs (MAF>5%)

B) common SNPs (MAF>5%), C) eQTLs, D) caQTLs, E) raQTLs, and F) GWAS and LD (r²>0.8) variants in HOT loci and regular promoters and enhancers. G) Enriched GWAS traits in HOT enhancers and promoters. All of the visualized data is generated from the HepG2 cell line.

HOT loci as transcriptional condensates.

A) Fraction of DAPs annotated as LLPS proteins in the CD-CODE database. B) (upper) Distribution of DAPs in HOT loci binned by the % of HOT loci they overlap with. (lower) % of DAPs in the bins annotated as LLPS. Green points are the expected percentage values obtained by randomly shuffling the peaks in HOT loci 10 times. C) Z-scores of ChIP-seq signal values of LLPS proteins and the rest of the DAPs in HOT loci. D) % of the protein length predicted as IDRs (MobiDB) in LLPS proteins and the rest of the DAPs. E) Enrichment of ChIP-seq peaks of RNA-binding proteins and the rest of the DAPs. F) Enrichment of FANTOM, PINTS, and CAGE regions in HOT loci, regular enhancers, and regular promoters. G) Enrichment of eCLIP RBP-RNA interactions in HOT loci, exons, regular enhancers, and regular promoters. E,F,G) Enrichment values are quantified as log2(fold-change) with ATAC-seq regions as the background. C,D,E,G) Red dots represent the mean values of the boxplots.