Systematic analysis of naturally occurring insertions and deletions that alter transcription factor spacing identifies tolerant and sensitive transcription factor pairs

  1. Zeyang Shen
  2. Rick Z Li
  3. Thomas A Prohaska
  4. Marten A Hoeksema
  5. Nathan J Spann
  6. Jenhan Tao
  7. Gregory J Fonseca
  8. Thomas Le
  9. Lindsey K Stolze
  10. Mashito Sakai
  11. Casey E Romanoski
  12. Christopher K Glass  Is a corresponding author
  1. Department of Cellular and Molecular Medicine, School of Medicine, University of California San Diego, United States
  2. Department of Bioengineering, Jacobs School of Engineering, University of California San Diego, United States
  3. Department of Medicine, School of Medicine, University of California San Diego, United States
  4. Department of Medical Biochemistry, Experimental Vascular Biology, Amsterdam Infection and Immunity, Amsterdam Cardiovascular Sciences, Amsterdam UMC, University of Amsterdam, Netherlands
  5. Department of Medicine, McGill University, Canada
  6. Division of Biological Sciences, University of California San Diego, United States
  7. Department of Cellular and Molecular Medicine, College of Medicine, University of Arizona, United States
  8. Department of Biochemistry and Molecular Biology, Nippon Medical School, Japan

Abstract

Regulation of gene expression requires the combinatorial binding of sequence-specific transcription factors (TFs) at promoters and enhancers. Prior studies showed that alterations in the spacing between TF binding sites can influence promoter and enhancer activity. However, the relative importance of TF spacing alterations resulting from naturally occurring insertions and deletions (InDels) has not been systematically analyzed. To address this question, we first characterized the genome-wide spacing relationships of 73 TFs in human K562 cells as determined by ChIP-seq (chromatin immunoprecipitation sequencing). We found a dominant pattern of a relaxed range of spacing between collaborative factors, including 45 TFs exclusively exhibiting relaxed spacing with their binding partners. Next, we exploited millions of InDels provided by genetically diverse mouse strains and human individuals to investigate the effects of altered spacing on TF binding and local histone acetylation. These analyses suggested that spacing alterations resulting from naturally occurring InDels are generally tolerated in comparison to genetic variants directly affecting TF binding sites. To experimentally validate this prediction, we introduced synthetic spacing alterations between PU.1 and C/EBPβ binding sites at six endogenous genomic loci in a macrophage cell line. Remarkably, collaborative binding of PU.1 and C/EBPβ at these locations tolerated changes in spacing ranging from 5 bp increase to >30 bp decrease. Collectively, these findings have implications for understanding mechanisms underlying enhancer selection and for the interpretation of non-coding genetic variation.

Editor's evaluation

Transcription factors (TFs) bind to the DNA in a sequence-specific manner at TF binding sites (TFBSs) to control gene transcription. Hence, characterizing how TFs interact with DNA is key to uncover how gene regulation occurs and how this process can be disrupted in diseases. While the binding properties of a large portion of human TFs are well characterized, a remaining challenge lies in our knowledge of how TFs interact cooperatively at regulatory elements, either forming dimers or co-binding the same regions. In this manuscript, Shen et al. explored spacing patterns between TFBSs using previously published data sets and revealed that the dominant pattern is a relaxed range of spacing between collaborative factors and tolerance for InDels that change the TFBS spacing.

https://doi.org/10.7554/eLife.70878.sa0

Introduction

Genome-wide association studies (GWASs) have identified thousands of genetic variants associated with diseases and other traits (MacArthur et al., 2017; Visscher et al., 2017). Single nucleotide polymorphisms (SNPs) and short insertions and deletions (InDels) represent common forms of these variants. The majority of GWAS variants fall at non-protein-coding regions of the genome, suggesting their effects on gene regulation (Farh et al., 2015; Ward and Kellis, 2012). Gene expression is regulated by transcription factors (TFs) in a cell-type-specific manner. A TF can bind to a specific set of short, degenerate DNA sequences at promoters and enhancers, often referred to as TF binding motif. Active promoters and enhancers are selected by combinations of sequence-specific TFs that bind in an inter-dependent manner to closely spaced motifs. SNPs and InDels can create or disrupt TF binding sites by mutating motifs and are a well-established mechanism for altering gene expression and biological function (Behera et al., 2018; Deplancke et al., 2016; Grossman et al., 2017; Heinz et al., 2013). InDels can additionally change spacing between TF binding sites, but it remains unknown the extent to which altered spacing is relevant for interpreting genetic variation in human populations or between animal species.

Previous studies reported two major categories of motif spacing between inter-dependent TFs (Slattery et al., 2014). One category refers to the enhanceosome model (Slattery et al., 2014) that requires specific or ‘constrained’ spacing. It is mainly provided by TFs that form ternary complexes recognizing composite binding motifs, exemplified by GATA, Ets, and E-box TFs in mouse hematopoietic cells (Ng et al., 2014), MyoD and other cell-type-specific factors in muscle cells (Nandi et al., 2013), and Sox2 and Oct4 in embryonic stem cells (Rodda et al., 2005). In vitro studies of the binding of pairwise combinations of ~100 TFs to a diverse library of DNA sequences identified 315 out of 9400 possible interactive TF pairs that select composite elements with constrained positions of the respective recognition motifs (Jolma et al., 2015). Constrained spacing required for the optimal binding and function of interacting TFs can also occur between independent motifs, such as occurs at the interferon-β enhanceosome (Panne, 2008). In comparison to constrained spacing, another category of motif spacing allows TFs to interact over a relatively broad range (e.g., 100–200 bp), which we call ‘relaxed’ spacing and is equivalent to the billboard model (Slattery et al., 2014). This type of spacing relationship is observed in collaborative or co-occupied TFs that do not target promoters or enhancers as a ternary complex (Heinz et al., 2010; Jiang and Singh, 2014; Sönmezer et al., 2021).

Substantial evidence indicates that the two categories of spacing requirement can experience a different level of impact from genetic variation. Reporter assays examining synthetic alterations of motif spacing revealed examples of TFs that require constrained spacing and have high sensitivity of TF binding and gene expression on spacing (Farley et al., 2015; Ng et al., 2014; Panne, 2008). On the contrary, flexibility in motif spacing has been demonstrated using reporter assays in Drosophila (Menoret et al., 2013) and HepG2 cells (Smith et al., 2013). However, these studies did not distinguish the impact of altered spacing on TF binding or subsequent recruitment of co-activators required for gene activation. Moreover, it remains unknown the extent to which these findings are relevant to spacing alterations resulting from naturally occurring genetic variation.

To investigate the effects of altered spacing on TF binding and function, we first characterized the genome-wide binding patterns of 73 TFs based on their binding sites determined by chromatin immunoprecipitation sequencing (ChIP-seq). We developed a computational framework that assigned each spacing relationship to ‘constrained’ or ‘relaxed’ category and associated spacings to the naturally occurring InDels observed in human populations to study the selective constraints of different spacing relationships. As specific case studies, we leveraged natural genetic variation in numerous human samples and from five strains of mice to study the effect size of spacing alterations on TF binding activity and local histone acetylation. These findings suggested that InDels altering spacing are generally less constrained and well tolerated when they occur between TF pairs with relaxed spacing relationships. Finally, we experimentally validated substantial tolerance in spacing for macrophage lineage-determining TFs (LDTFs), PU.1 and C/EBPβ, by introducing a wide range of InDels between their respective binding sites at representative endogenous genomic loci using CRISPR/Cas9 mutagenesis in mouse macrophages.

Results

TFs primarily co-bind with relaxed spacing

We characterized spacing relationships for 73 TFs of K562 cells covering diverse TF families (Hu et al., 2019) based on the ChIP-seq data from ENCODE data portal (Davis et al., 2018). After obtaining reproducible ChIP-seq peaks, we used the corresponding position weight matrix (PWM) of each TF (Supplementary file 1) to identify the locations of high-affinity binding sites that are less than 50 bp from peak centers (Figure 1A; Figure 1—figure supplement 1); 42% of peaks on average contain at least one binding site of corresponding TF (Supplementary file 1). The peaks of every pair of TFs were then merged, and at the overlapping peaks indicating co-binding events, the edge-to-edge spacings were calculated between TF binding sites and then aggregated to show a distribution within ±100 bp. To categorize spacing relationships, we used Monte Carlo procedures to obtain an empirical p-value to find significant spacing constraints and used Kolmogorov–Smirnov (KS) test to test for a relaxed spacing relationship against random distribution.

Figure 1 with 7 supplements see all
Characterization of spacing relationships for transcription factor (TF) pairs.

(A) Schematic of data analysis pipeline for characterizing the spacing relationships based on TF chromatin immunoprecipitation sequencing (ChIP-seq) data. (B) Dissection of TF binding sites for TFs in K562 cells based on spacing relationships with co-binding TFs. Each dot represents a TF pair. The bar heights indicate medians. (C) Circos plot summarizing spacing relationships for all the TF pairs analyzed. Orange and blue bands represent significant constrained and relaxed spacing relationships, respectively. Color opacity indicates the level of significance. TFs are grouped and colored by TF family. (D) The spacing distributions of example TF pairs with constrained spacing or relaxed spacing relationships. Dashed lines indicate the significant constrained spacings. Since TAL1 motif is completely palindromic, the motif orientation is only differentiated by its co-binding partners.

Figure 1—source data 1

The numbers of co-binding sites for every pair of 73 transcription factors (TFs).

A number represents chromatin immunoprecipitation sequencing (ChIP-seq) peaks of the TF on row that overlap with at least one ChIP-seq peak of the TF on column. Therefore, the number for (TF1, TF2) may not equal but should be close to the number for (TF2, TF1).

https://cdn.elifesciences.org/articles/70878/elife-70878-fig1-data1-v2.csv
Figure 1—source data 2

Statistical test results for significant transcription factors (TF) pairs.

https://cdn.elifesciences.org/articles/70878/elife-70878-fig1-data2-v2.txt

We applied this computational framework to all possible pairs of TFs. By dissecting each TF’s binding sites based on their spacing relationships with co-binding TFs, we found that 45 of the 73 TFs examined exclusively exhibited relaxed spacing relationships with other TFs (Figure 1B). Twenty-five factors could participate in either relaxed or constrained interactions, depending on the specific co-binding TFs. Only three TFs interacted with only constrained spacing, some of which might show additional relaxed spacing relationships by expanding the current set of TFs. The significant pairwise patterns of relaxed and constrained spacing relationships are illustrated in Figure 1C. Among 29 TF pairs with constrained spacing relationships, most bind closely to each other within 15 bp spacing (Figure 1—figure supplement 2). Some of these TF pairs have been reported to recognize composite motifs such as GATA1-TAL1 and NFATC3-FOSL1 (Macián et al., 2001; Ng et al., 2014; Figure 1D; Figure 1—figure supplement 3), and some are novel constrained spacing patterns discovered by our analysis such as MEF2A-JUND and CEBPB-TEAD4 (Figure 1—figure supplement 3). TFs exhibiting relaxed spacing are exemplified by ETV1-TAL1 and JUND-KLF16, in which the frequency of co-binding progressively declines with distance from the center of the reference TF (Figure 1D). We also saw frequent relaxed spacing between TFs in the same family. For instance, despite the similar motifs recognized by AP-1 factors, many of these TFs were found to co-localize at non-overlapping nearby positions. In addition, the same type of spacing relationship is usually observed in different motif orientations (Figure 1D), consistent with previous findings (Lis and Walther, 2016).

We downloaded the ChIP-seq data of HepG2 cells from ENCODE and processed them with the same pipeline as for K562 cells. The same TF pairs can have similar spacing relationships in different cell types, exemplified by CEBPB and JUND in K562 and HepG2 cells (Figure 1—figure supplement 4). Despite more frequent binding events occurring at specific spacings for constrained TF pairs or at closer spacings for relaxed TF pairs, the binding activities quantified by ChIP-seq tags were indifferent at various spacings, suggesting that the spacing preference is not a determinant of TF binding activity (Figure 1—figure supplement 5).

Since DNA repetitive regions such as transposable elements are known to harbor TF binding sites and specific TF co-binding (Bourque et al., 2008; Kunarso et al., 2010), we further examined whether the spacing relationships of TFs could be different in repetitive and nonrepetitive regions. To study this, we applied the same pipeline to the subsets of TF ChIP-seq peaks in repetitive and nonrepetitive regions. As a result, most of the relaxed spacing relationships remained regardless of repetitive or nonrepetitive regions (Figure 1—figure supplement 6). Some constrained TF pairs, however, showed constrained spacing only in repetitive regions and not in nonrepetitive regions (Figure 1—source data 2). For example, EGR1 and JUND exhibited a constrained spacing at 29 bp (Figure 1D), but this relationship is observed specifically in SINEs (Figure 1—figure supplement 7). Such observation is consistent with previous studies that discovered specific motif pairs in repetitive regions (Wang et al., 2012).

Natural genetic variants altering spacing between relaxed TFs are associated with less deleteriousness in human populations

Based on a global view of the TF spacing relationships, we then studied whether these relationships associate with different levels of sensitivity to spacing alterations. Here, we leveraged more than 60 million InDels from gnomAD data (Karczewski et al., 2020), which were based on more than 75,000 genomes from unrelated individuals. We mapped these InDels to the TF binding sites of representative TF pairs with constrained and relaxed spacing identified in K562 cells. We found that InDels between TF binding sites have similar sizes compared to those at binding sites and those in background regions, the majority of which are less than 5 bp (Figure 2A). Next, we divided these InDels based on their allele frequency (AF) and allele count (AC) into high-frequency variants (AF>0.01%), rare variants (AF<0.01%, AC>1), and singletons (AC = 1). Most of the InDels are singletons or rare variants (Figure 2—figure supplement 1; Figure 2—source data 1). We compared the enrichment of different categories of InDels between or at TF binding sites (Figure 2B; Figure 2—figure supplement 2). The InDel compositions at TF binding sites were not significantly different between constrained and relaxed spacing groups. On the contrary, singletons were significantly more enriched between the binding sites of TFs with constrained spacing, whereas high-frequency variants were significantly more depleted between these binding sites. We also computed for several TF pairs with random spacing relationships as negative controls and found similar enrichments of InDels like those with relaxed spacing. Since common variants are associated with less deleteriousness and rare variants with more deleteriousness (Lek et al., 2016), these findings suggest that InDels between TF binding sites with constrained spacing could be just as damaging as those at binding sites, whereas InDels between TF binding sites with relaxed spacing might have a much weaker effect. This observation is consistent with prior studies that validated significant effects of spacing alterations between TFs with constrained spacing relationships (Ng et al., 2014). However, few studies have discussed the effects of InDels on TFs with relaxed spacing, so we specifically focused on relaxed spacing relationships in the rest of the current study.

Figure 2 with 2 supplements see all
Naturally occurring insertions and deletions (InDels) in human populations.

(A) Size distributions of human InDels within different regions. (B) Log2 odds ratios for different categories of InDels. Each dot represents a transcription factor (TF) pair with corresponding spacing relationship. Mann–Whitney U test was used to compare the odds ratios between different spacing relationships. Non-significant (n.s.) if p-value is larger than 0.01.

Figure 2—source data 1

The numbers and odds ratios of different categories of insertions and deletions (InDels) at or between transcription factor (TF) binding sites.

https://cdn.elifesciences.org/articles/70878/elife-70878-fig2-data1-v2.txt

Spacing alterations across mouse strains are generally tolerated by relaxed TF binding and promoter and enhancer function

To investigate the regulatory effects of naturally occurring InDels that alter spacing between TFs with relaxed spacing relationships, we leveraged more than 50 million SNPs and 5 million InDels from five genetically diverse mouse strains, including C57BL/6J (C57), BALB/cJ (BALB), NOD/ShiLtJ (NOD), PWK/PhJ (PWK), and SPRET/EiJ (SPRET). The ChIP-seq data of key TFs and histone acetylation and genome-wide transcriptional run-on (GRO-seq) data are available for the bone marrow-derived macrophages (BMDMs) from every mouse strain (Link et al., 2018a). We first characterized the spacing relationship between the macrophage LDTFs, PU.1 (encoded by Spi1) and C/EBPβ (encoded by Cebpb), which have been found to bind in a collaborative manner at the regulatory regions of macrophage-specific genes (Heinz et al., 2010). Based on our computational framework for characterizing spacing relationships (Figure 1A), these two TFs follow a relaxed spacing relationship independent of their motif orientations (Figure 3A; KS p-value < 1e-6). Moreover, both PU.1 and C/EBPβ binding activities quantified by the ChIP-seq tags were indifferent at various spacings (Figure 3B).

Figure 3 with 8 supplements see all
Effects of spacing alterations resulting from natural genetic variation across mouse strains.

(A) Spacing distributions of PU.1 and C/EBPβ binding sites at co-binding sites. (B) Density plots showing the relationship between transcription factor (TF) binding activity and motif spacing for the co-binding sites. Log2 chromatin immunoprecipitation sequencing (ChIP-seq) tags were calculated within 300 bp to quantify the binding activity of PU.1 and C/EBPβ. The color gradients represent the number of sites. Spearman’s correlation coefficients together with p-values are displayed. (C, E, G) Absolute log2 fold changes of ChIP-seq tags between C57 and another strain for (C) PU.1 binding, (E) C/EBPβ binding, or (G) nascent transcripts measured by GRO-seq. Boxplots show the median and quartiles of every distribution. Cohen’s d effect sizes comparing against variant-free regions are displayed on top. (D, F, H) Correlations between change of spacing or position weight matrix (PWM) score and change of (D) PU.1 binding, (F) C/EBPβ binding, or (H) nascent transcript level. Spearman’s correlation coefficients together with p-values are displayed.

Figure 3—source data 1

Tag fold changes at individual sites for PU.1 chromatin immunoprecipitation sequencing (ChIP-seq).

https://cdn.elifesciences.org/articles/70878/elife-70878-fig3-data1-v2.csv
Figure 3—source data 2

Tag fold changes at individual sites for C/EBPβ chromatin immunoprecipitation sequencing (ChIP-seq).

https://cdn.elifesciences.org/articles/70878/elife-70878-fig3-data2-v2.csv
Figure 3—source data 3

Tag fold changes at individual sites for GRO-seq.

https://cdn.elifesciences.org/articles/70878/elife-70878-fig3-data3-v2.csv
Figure 3—source data 4

Tag fold changes at individual sites for H3K27ac chromatin immunoprecipitation sequencing (ChIP-seq).

https://cdn.elifesciences.org/articles/70878/elife-70878-fig3-data4-v2.csv

We then conducted independent comparisons between C57 and one of the other four strains to investigate the effects of spacing alterations caused by natural genetic variation. Most of the natural InDels are less than 5 bp similar to those found in the human population (Figure 3—figure supplement 1). We first identified the co-binding sites of PU.1 and C/EBPβ for every strain and then, for each pairwise analysis, pooled the co-binding sites of C57 and a comparison strain to obtain the testing set of regions. Based on the impacts of SNPs and InDels on binding affinity quantified by PWM score or the impacts of InDels on spacing, we categorized the testing regions into the following independent groups: (1) mutated PU.1 motif, (2) mutated C/EBPβ motif, (3) mutated other potentially functional motifs, (4) altered spacing, (5) no motif/spacing effect, and (6) variant free. Potentially functional motifs were identified from PU.1 and C/EBPβ binding sites using MAGGIE (Shen et al., 2020), which is a computational tool that finds motifs associated with changes in TF binding (Figure 3—figure supplement 2). Considering that PU.1 and C/EBPβ binding could experience changes due to genetic variation mutating other motifs, we grouped these genetic variations to examine their overall effects and simultaneously reach a cleaner group of spacing alterations. The effect of genetic variation was quantified by the log2 fold difference of ChIP-seq tag counts between strains at orthogonal sites (Figure 3C). All the four independent comparisons showed that PU.1 binding is most strongly affected by PU.1 motif mutation, followed by C/EBPβ motif mutation and other motif mutation. Spacing alterations have a smaller effect size than any of these motif mutations, but still a relatively larger effect than variants affecting neither binding affinity nor spacing. Despite the moderate effect size of spacing alterations, we found such effect was independent of the size or direction of InDels (Figure 3D). On the contrary, changes of PU.1 ChIP-seq tags are strongly correlated with changes of binding affinity measured by changes of PWM scores (Figure 3D). In addition, the effects of motif mutation and spacing alteration are not varied by the initial spacing between PU.1 and C/EBPβ motifs (Figure 3—figure supplement 3). Similar findings were observed in C/EBPβ binding, except that C/EBPβ motif mutation had the largest effect size and the strongest correlation with C/EBPβ binding activity as expected (Figure 3E and F; Figure 3—figure supplement 3). Despite that most of the informative genetic variants are located at enhancers and relatively few within promoters, we saw consistent relationships in promoters and enhancers (Figure 3—figure supplement 4).

To investigate whether the effects of altered spacing on PU.1 and C/EBPβ binding can be generalized to hierarchical interactions with signal-dependent TFs (SDTFs), we leveraged the ChIP-seq data of PU.1, the NFκB subunit p65 (encoded by Rela), and the AP-1 subunit c-Jun (encoded by Jun) for BMDMs treated with the TLR4-specific ligand Kdo2 lipid A (KLA) in the same five strains of mice (Link et al., 2018a). Upon macrophage activation with KLA, p65 enters the nucleus and primarily binds to poised enhancer elements that are selected by LDTFs including PU.1 and AP-1 factors (Heinz et al., 2015). We observed a relaxed spacing relationship between PU.1 and p65 and between c-Jun and p65 (Figure 3—figure supplement 5). In addition, InDels altering motif spacing had a much smaller effect size on TF binding than motif mutations (Figure 3—figure supplement 6), consistent with our findings from PU.1 and C/EBPβ.

Although alterations in motif spacing had generally weak effects at the level of DNA binding, it remained possible that changes in motif spacing could influence subsequent steps in enhancer and promoter activation. To examine this, we extended our analysis to nascent transcription measured by GRO-seq (Core et al., 2008). Importantly, nascent transcription occurs both at active promoters and enhancers, with enhancer transcription serving as an indicator of enhancer activity (De Santa et al., 2010; Kim et al., 2019). We leveraged GRO-seq data of untreated BMDMs from the five strains of mice (Link et al., 2018a) and calculated the log fold changes of tags at the PU.1 and C/EBPβ co-binding sites for the same pairwise comparisons of strains. Like for TF binding, altered spacing demonstrated weaker effects on nascent transcription than motif mutations (Figure 3G), which is consistent with the significant correlations between changes in TF binding and changes in the level of nascent transcripts (Figure 3—figure supplement 7). The relative tolerance of spacing alteration was further supported by a weak correlation between changes in GRO-seq tags and the size of InDels, in contrast with a much stronger correlation with changes in binding affinity (Figure 3H). Thus, these findings extend the concept of spatial tolerance to the entire ensemble of factors that must be assembled to mediate nascent transcription. Similar relationships were observed for effects of InDels on local acetylation of histone H3 lysine 27 (H3K27ac) (Figure 3—figure supplement 7; Figure 3—figure supplement 8), which provides an alternative surrogate for enhancer and promoter activity (Creyghton et al., 2010).

Human quantitative trait loci altering spacing between relaxed TFs have small effect sizes

To study the effects of spacing alteration on TF binding and local histone acetylation in human cells, we leveraged the ChIP-seq data of ERG, p65, and H3K27ac in endothelial cells from dozens of individuals (Stolze et al., 2020). ERG is an ETS factor that functions as an LDTF in endothelial cells that selects poised enhancers where p65 binds in a hierarchical manner upon interleukin-1β (IL-1β) stimulation (Hogan et al., 2017). ERG and p65 follow a relaxed spacing relationship according to our method (Figure 4A). Next, we obtained 557 TF binding quantitative trait loci (bQTLs) for ERG, 5,791 bQTLs for p65, 25,621 histone modification QTLs (hQTLs) for H3K27ac in untreated cells, and 21,635 hQTLs for H3K27ac in IL-1β-treated cells (Stolze et al., 2020). We further classified bQTLs and hQTLs based on their impacts on binding affinity or spacing: (1) mutated both ERG and p65 (i.e., RELA) motif, (2) mutated ERG motif only, (3) mutated p65 motif only, (4) mutated other potentially functional motifs identified by MAGGIE (Shen et al., 2020), (5) altered spacing between ERG and p65 motif, (6) none of the above. To find potentially functional motifs, we fed MAGGIE with 100 bp sequences around QTLs before and after swapping alleles at the center (Figure 4—figure supplement 1). As a result, only a small portion of bQTLs and hQTLs directly mutates an ERG or p65 motif (Figure 4B; Figure 4—figure supplement 2). However, such motif mutations are enriched in bQTLs compared to non-QTLs (Fisher’s exact p < 1e-4). On the contrary, InDels that alter motif spacing are significantly depleted in p65 bQTLs (Fisher’s exact p = 1.3e-15). These InDels from the dozens of individuals are predominantly shorter than 5 bp by following a similar size distribution of those in human populations (Figure 4—figure supplement 3). A large proportion of QTLs affect other motifs, implicating the complexity of TF interactions. More than a quarter of the QTLs affect neither binding affinity nor spacing, which can be explained by the high correlation of non-functional variants with functional variants due to linkage disequilibrium.

Figure 4 with 4 supplements see all
Effects of chromatin quantitative trait loci (QTLs) in human endothelial cells.

(A) Spacing distributions of ERG and p65 binding sites at co-binding sites. (B) Classification of chromatin QTLs based on the impacts on motif and spacing. (C) Absolute correlation coefficients of different QTLs. Cohen’s d and Mann–Whitney U test p-values comparing against the ‘other’ group are displayed on top. *p < 0.01, **p < 0.001, ***p < 0.0001. (D) Example QTLs for large effect size due to ERG motif mutation (upper) and trivial effect due to spacing alteration (lower).

Figure 4—source data 1

Effect sizes and categorization of p65 binding quantitative trait loci (bQTLs).

https://cdn.elifesciences.org/articles/70878/elife-70878-fig4-data1-v2.csv
Figure 4—source data 2

Effect sizes and categorization of H3K27ac histone modification quantitative trait loci (hQTLs) at IL-1β.

https://cdn.elifesciences.org/articles/70878/elife-70878-fig4-data2-v2.csv
Figure 4—source data 3

Effect sizes and categorization of ERG binding quantitative trait loci (bQTLs).

https://cdn.elifesciences.org/articles/70878/elife-70878-fig4-data3-v2.csv
Figure 4—source data 4

Effect sizes and categorization of H3K27ac histone modification quantitative trait loci (hQTLs) at basal.

https://cdn.elifesciences.org/articles/70878/elife-70878-fig4-data4-v2.csv

We further compared the effect sizes of different categories of QTLs. Despite being the minority among QTLs, variants that mutate both ERG and p65 motifs have the strongest effects on both p65 binding and histone acetylation in IL-1β-treated endothelial cells (Figure 4C). In comparison, ERG binding and the basal level of histone acetylation are significantly affected by ERG motif mutations in untreated endothelial cells and not by p65 motif mutations, consistent with the hierarchical interaction of p65 only upon IL-1β stimulation (Figure 4—figure supplement 4). In both conditions of endothelial cells, spacing alterations have the smaller effect size than motif mutation categories and are not significantly different from likely non-functional variants in the ‘other’ group. The examples showed a variant being both a p65 bQTL and a H3K27ac hQTL under the IL-1β state due to its impact on an ERG motif, and a 4 bp insertion between ERG and p65 motifs associated with no change in p65 binding or H3K27ac (Figure 4D).

Relaxed TF binding is highly tolerant to synthetic spacing alterations

The generally small effects of InDels occurring between TF pairs exhibiting relaxed spacing relationships raised the question of the robustness and the extent of such tolerance at genomic locations lacking such variation. We addressed this question by using CRISPR/Cas9 editing to introduce synthetic InDels between binding sites identified for the LDTFs PU.1 and C/EBPβ in mouse macrophages (Figure 5A). We used lentiviral transduction in Cas9-expressing ER-HoxB8 cells, which are conditionally immortalized monocyte progenitors, to introduce gRNAs targeting genomic sequences between the locations of PU.1 and C/EBPβ co-binding. The successfully transduced ER-HoxB8 cells were then sorted and differentiated into macrophages. Since non-homologous DNA repair resulting from the Cas9 nuclease activity would generate a spectrum of InDels in a population of transduced cells, we first measured input DNAs to obtain the distribution of InDels and then compared with TF ChIP-seq tags from deep sequencing, in which the effect of an InDel is reported as the odds ratio of ChIP tags to the input tags. Importantly, the ChIP-seq libraries were prepared by selective amplification of ChIP tags containing the targeted region of interest. Thus, for each region-specific sequence tag that was immunoprecipitated, we could simultaneously determine whether an InDel had been created and its specific length. Each tag is thus cell- and allele-specific.

Figure 5 with 2 supplements see all
Effects of variable sizes of synthetic spacing alterations.

(A) Schematic for generating and analyzing synthetic spacing alterations. (B) The distributions of valid read counts from the input sample based on the InDel sizes of the reads. Negative InDel size indicates deletion, and positive size means insertion. (C) Log2 odds ratios by comparing C/EBPβ chromatin immunoprecipitation sequencing (ChIP-seq) reads and input sample reads. Y = 0 indicates where transcription factor (TF) binding has an expected amount of activity. p-Values were based on two-sample t-tests by comparing the InDel groups of each test region. (D) Sequencing data of ER-HoxB8 cells at co-binding site of PU.1 and C/EBPβ. Highlighted is test region #6 whose DNA sequence from PU.1 binding site to C/EBPβ binding site is shown. (E) Log2 odds ratios of test regions #6 as a function of InDel size.

Figure 5—source data 1

Raw chromatin immunoprecipitation sequencing (ChIP-seq) tag counts associated with different sizes of insertions and deletions (InDels).

https://cdn.elifesciences.org/articles/70878/elife-70878-fig5-data1-v2.txt

We tested six PU.1 and C/EBPβ co-binding sites with their original spacing ranging from 26 to 55 bp (Supplementary file 1) and quantified the effects of InDels on C/EBPβ binding. Among the six test regions, three of them have supportive evidence from naturally occurring InDels of mouse strains (regions #1, #3, #5) and the other three don’t (regions #2, #4, #6). Based on the bioinformatic analysis of the ultra-deep sequencing reads from the input DNA samples, we saw that the CRISPR/Cas9 system generated a wide range of InDels with most deletions being <30 bp and short insertions usually less than 5 bp (Figure 5B). It provides longer deletions than natural genetic variations found across mouse strains (Figure 3—figure supplement 1) and in human populations (Figure 2A). After classifying ChIP-seq reads based on the InDel size and whether the InDel overlaps with any of the PU.1 and C/EBP binding sites, we estimated the effect size of InDels on C/EBPβ binding by calculating the odds ratio between C/EBPβ ChIP-seq reads and input DNA sample reads for every InDel group. We found that InDels altering spacing have significantly weaker effects on C/EBPβ binding in comparison to those overlapping with at least one of the binding sites (Figure 5C). For some test regions, the effects of pure spacing alterations are almost negligible, exemplified by test region #6 (Figure 5D and E) and test region #1 (Figure 5—figure supplement 1). Test region #6 is located near a highly expressed gene Prdx1 and has strong binding of PU.1 and C/EBPβ binding and strong signals of H3K27ac and chromatin accessibility indicated by ATAC-seq in ER-Hoxb8 cells, which all support its potential regulatory function (Figure 5D). The PU.1 and C/EBPβ binding sites at this region are 26 bp apart. In general, spacing alterations ranging from 5 bp increase to 22 bp decrease did not have a strong effect on TF binding, indicated by a log2 odds ratio close to 0 (Figure 5E). A small number of outliers were observed at each region where specific InDels resulted in substantial loss of binding (e.g., –20 bp, Figure 5E). C/EBPβ binding at these specific InDels was generally discontinuous with 1 bp increments (e.g., –19 and –21 bp, Figure 5E). The basis for these highly localized changes in the odds ratio in a small fraction of InDels that alter spacing is unclear. On the contrary, deletions overlapping with the TF binding sites resulted in a general decrease in TF binding activity. Similar results were found at test region #1 where PU.1 and C/EBPβ binding sites are 41 bp apart (Figure 5—figure supplement 1A). This Ly9 enhancer also has a 5 bp insertion between PU.1 and C/EBPβ binding sites in BALB, NOD, and PWK mice, and shows unaffected binding of PU.1 and C/EBPβ in the BMDMs of these strains (Figure 5—figure supplement 1B). As a result of the synthetic InDels, the C/EBPβ binding activity was generally unaffected by spacing alterations only, whereas deletions overlapping TF binding sites substantially diminished TF binding (Figure 5—figure supplement 1C). We further measured PU.1 binding using ChIP-seq at three out of six test regions and saw general tolerance of synthetic spacing alterations in contrast with significantly weaker PU.1 binding resulted from motif alterations (Figure 5—figure supplement 2).

Discussion

By classifying the genome-wide spacing relationships of 73 co-binding TFs as ‘constrained’ or ‘relaxed’, we revealed that relaxed spacing relationships were the dominant pattern of interaction for majority of these factors. Among these factors, approximately half could also participate in constrained spacing relationships with specific TF partners. We confirmed TF pairs known to exhibit constrained relationships (e.g., GATA1-TAL1) and identified previously unreported constrained relationships for additional pairs, including EGR1 and JUND. Overall, this finding of a subset of constrained TF interactions on a genome-wide level is consistent with the locus-specific examples provided by functional and structural studies of the interferon-β enhanceosome (Panne, 2008) and in vivo studies of synthetically modified enhancer elements in Ciona (Farley et al., 2015). Each of these examples represents genomic regulatory elements in which key TF binding sites are tightly spaced in their native contexts (i.e., 0–9 bp between binding sites). Direct protein-protein interactions are observed between bound TFs at the interferon-β enhanceosome, analogous to interactions defined for cooperative TFs that form ternary complexes (Morgunova and Taipale, 2017; Reményi et al., 2003). However, unlike the previous in vitro study that identified over 300 TF-TF interactions (Jolma et al., 2015), the spacing analyses in our study did not directly consider the possible overlap between TF binding sites. Thus, we are not able to discover constrained TFs that recognize overlapping motifs or distinguish effects of spacing alterations from effects of InDels on overlapping composite motifs.

Our findings based on ChIP-seq data were consistent with the recent in vivo profiling of TF co-occupancy on single DNA molecules, which discovered a lack of association between TF co-occupancy and precise spacing or orientation of motifs (Sönmezer et al., 2021). The observation that most TF pairs exhibited relaxed spacing relationships has intriguing implications for the mechanisms by which functional enhancers and promoters are selected from chromatinized DNA. In contrast to ternary complexes of TFs that cooperatively bind to composite elements as a unit, relaxed spacing relationships appear to not require specific protein-protein interactions between TFs for collaborative binding at most genomic locations. Although pioneering TFs necessary for selection of cell-specific enhancers have been reported to recognize their motifs within the context of nucleosomal DNA (Zaret and Carroll, 2011), the basis for collaborative binding interactions between TFs with relaxed spacing remains poorly understood.

While the current studies relying on natural genetic variation and mutagenesis experiments concluded clear tolerance of spacing alterations between binding sites of TFs with relaxed spacings, the extent to which this set of binding sites is representative of all regulatory elements is unclear. For example, we observed outliers in which significant differences in TF binding between mouse strains were associated with InDels occurring between TF binding sites. However, the proportion of outliers was generally similar to that observed at genomic regions lacking such InDels, and such strain differences may be driven by distal effects of genetic variation on interacting enhancer or promoter regions (Hoeksema et al., 2021; Link et al., 2018a). The remarkable tolerance of synthetic InDels at two independent endogenous genomic locations between PU.1 and C/EBPβ binding sites strongly support the generality of relaxed binding interactions for these two proteins. Intriguingly, while the densities of C/EBP binding sites increase with decreasing distance to PU.1 binding sites over a 100 bp range (Figure 3A), deletions from 1 to >30 bp between PU.1-C/EBPβ pairs did not result in improved binding. Instead, relatively constant binding was observed with progressive deletions bringing two binding sites close together until the deletions started to cause mutations in one or both motifs. The lack of requirement for exact spacing and remarkable tolerance of spacing alterations by TFs with relaxed spacing could potentially associate with the high turnover of TF binding sites found by previous studies (Vierstra et al., 2014), although further investigation would be needed to establish this association. A limitation of our studies is that few and relatively short insertions were obtained, preventing conclusions as to the extent to which increases in spacing are tolerated.

In concert, the present studies provide a basis for estimation of the potential phenotypic consequences of naturally occurring InDels in non-coding regions of the genome. The majority of naturally occurring InDels are less than 5 bp in length. In nearly all cases, InDels of this size range between binding sites for TFs that have relaxed binding relationships are unlikely to alter TF binding and function, and InDels of much greater length are frequently tolerated. In contrast, InDels between binding sites for TFs that have constrained binding relationships have the potential to result in biological consequences. Application of these findings to the interpretation of non-coding InDels that are associated with disease risk will require knowledge of the relevant cell type in which the InDel exerts its phenotypic effect and the types of TF interactions driving the selection and function of the affected regulatory elements.

Materials and methods

Key resources table
Reagent type (species) or resourceDesignationSource or referenceIdentifiersAdditional information
Strain, strain background (Mus musculus, male)B6(C)-Gt(ROSA)
26Sorem1.1(CAG-cas9*,-EGFP)Rsky/J
Jackson LaboratoryStock No: 028555RRID:IMSR_JAX:028555
Cell line (Mus musculus)Cas9-expressing
ER-HoxB8 cells
This paperGifted from Dr David Sykes
Cell line (human)Lenti-X 293T cellsClontechCat#: 632180RRID:CVCL_4401
Transfected construct (retrovirus)Murine stem cell
virus-based vector
for ER-HoxB8
Massachusetts General Hospital, Boston, MAGifted from Dr David Sykes
Transfected construct (retrovirus)lentiGuide-puroAddgeneCat#: 52963
Transfected construct (retrovirus)psPAX2AddgeneCat#: 12260
Transfected construct (retrovirus)pVSVGAddgeneCat#: 138479
AntibodyPU.1/Spi1 (rabbit polyclonal)Santa CruzCat#: sc-352XRRID:AB_632289(1 µL)
AntibodyC/EBPβ (rabbit polyclonal)Santa CruzCat#: sc-150RRID:AB_2260363(10 µL)
AntibodyH3K27ac (rabbit polyclonal)Active MotifCat#: 39135RRID:AB_2614979(2 µL)
Recombinant DNA reagentNEBNext 2× High Fidelity PCR Master MixNEBCat#: M0541
Sequence-based reagentLocus-specific Nextera hybrid
primer
This paperPCR primersSequences included in Supplementary file 1
Sequence-based reagentNextera index
primer
This paperPCR primersSequences included in Supplementary file 1
Peptide, recombinant proteinRecombinant Mouse IL-3PeprotechCat#: 213–13
Peptide, recombinant proteinRecombinant Mouse IL-6PeprotechCat#: 216–16
Peptide, recombinant proteinRecombinant
Mouse SCF
PeprotechCat#: 250–03
Peptide, recombinant proteinRecombinant
Mouse GM-CSF
PeprotechCat#: 315–03
Peptide, recombinant proteinMouse M-CSFShenandoah BiotechCat#: 200–08
Commercial assay or kitDirect-zol RNA
MicroPrep kit
Zymo ResearchCat#: R2062
Commercial assay or kitQubit dsDNA HS Assay KitThermo Fisher ScientificCat#: Q32851
Commercial assay or kitNextera DNA Library Preparation KitIlluminaCat#: 15028212
Commercial assay or kitChIP DNA Clean & ConcentratorZymo ResearchCat#: D5205
Commercial assay or kitNEBNext Ultra II Library Preparation KitNEBCat#: E7645L
Chemical compound, drugLentiBlast Transduction ReagentOZ BiosciencesCat#: LB00500
Chemical compound, drugFicoll-Paque-PlusSigma-AldrichCat#: GE17-1440-02
Chemical compound, drugRPMI-1640CorningCat#: 10–014-CV
Chemical compound, drugDMEM high glucoseCorningCat#: 10–013-CV
Chemical compound, drugFBSOmega BiosciencesCat#: FB-12
Chemical compound, drug100× Penicillin/
Streptomycin + L-glutamine
GibcoCat#: 10378–016
Chemical compound, drugβ-EstradiolSigma-AldrichCat#: E2758
Chemical compound, drugG418Thermo FisherCat#: 10131035
Chemical compound, drugPolybreneSigma-AldrichCat#: H9268
Chemical compound, drugFibronectinSigma-AldrichCat#: F0895
Chemical compound, drugPoly-D-lysinSigma-AldrichCat#: DLW354210
Chemical compound, drugX-tremeGENE HP DNA Transfection ReagentSigma-AldrichCat#: 6366546001
Chemical compound, drugFormaldehydeThermo Fisher ScientificCat#: BP531-500
Chemical compound, drugDynabeads Protein AInvitrogenCat#: 10002D
Chemical compound, drugSpeedBeads
magnetic carboxylate modified
particles
Sigma-AldrichCat#: GE65152
105050250
Chemical compound, drugDynabeads MyOne Streptavidin T1InvitrogenCat#: 65602
Software, algorithmCHOPCHOPCHOPCHOP
(https://chopchop.cbu.uib.no/)
RRID:SCR_015723
Software, algorithmBowtie2Bowtie2
(http://bowtie-bio.sourceforge.net/bowtie2/index.shtml)
RRID:SCR_016368Version 2.3.5.1
Software, algorithmSTARSTAR
(https://github.com/alexdobin/STAR)
RRID:SCR_004463Version 2.5.3
Software, algorithmHOMERHOMER
(https://homer.ucsd.edu/homer/)
RRID:SCR_010881Version 4.9.1
Software, algorithmMAGGIEMAGGIE
(https://github.com/zeyang-shen/maggie)
RRID:SCR_021903Version 1.1
Software, algorithmIDRIDR
(https://www.encodeproject.org/software/idr/)
RRID:SCR_017237Version 2.0.3
Software, algorithmMMARGEMMARGE
(https://github.com/vlink/marge)
RRID:SCR_021902Version 1.0

Sequencing data processing

Request a detailed protocol

We downloaded two replicates for each TF ChIP-seq data from ENCODE data portal (Davis et al., 2018). The mouse BMDM data and the human endothelial cell data were downloaded from the GEO database with accession number GSE109965 (Link et al., 2018a) and GSE139377 (Stolze et al., 2020), respectively. We mapped ChIP-seq and ATAC-seq reads using Bowtie2 v2.3.5.1 with default parameters (Langmead and Salzberg, 2012) and mapped RNA-seq reads using STAR v2.5.3 (Dobin et al., 2013). All the human data downloaded from ENCODE were mapped to the hg38 genome. Data from C57BL/6J mice were mapped to the mm10 genome. Data from other mouse strains and endothelial cell data from different individuals were mapped to their respective genomes built by MMARGE v1.0 (Link et al., 2018b). More details are described below.

Based on the mapped ChIP-seq data, we called peaks using HOMER v4.9.1 (Heinz et al., 2010). For data with replicates including ENCODE data and mouse data, we first called unfiltered 200 bp peaks using HOMER ‘findPeaks’ function using parameters ‘-style factor -L 0 C 0 -fdr 0.9 -size 200’ and then ran IDR v2.0.3 with default parameters (Li et al., 2011) to obtain reproducible peaks. For data without replicates including human endothelial cell data and ER-HoxB8 ChIP-seq data, we called peaks using HOMER ‘findPeaks’ with the default setting and parameters ‘-style factor -size 200’.

Activity of TF binding and nascent transcription was quantified by the ChIP-seq and GRO-seq tag counts, respectively, within 300 bp around peak centers and normalized by library size using HOMER ‘annotatePeaks.pl’ script with parameters ‘-norm 1e7 -size –150,150’. Activity of promoter and enhancer was quantified by normalized H3K27ac ChIP-seq tags within 1000 bp regions around TF peak centers using parameters ‘-norm 1e7 -size –500,500’.

TF binding site identification

Request a detailed protocol

Given a DNA sequence with the same length of a TF binding motif, we calculated a PWM score, or sometimes called motif score, by the dot product between the motif PWM and the sequence vector using Biopython package (Cock et al., 2009). The PWMs of all the TFs in this study were obtained from either the JASPAR database (Fornes et al., 2020) or de novo motif analysis using HOMER ‘findMotifsGenome.pl -size 200 -mask’ with random backgrounds (Heinz et al., 2010) if unavailable in the JASPAR database. For de novo motifs, the ultimate motif of each TF used for identifying TF binding sites was manually selected from the top three significant motifs that look like motifs of other TFs within the same TF family available in JASPAR. The final motifs are listed in Supplementary file 1. TFs without a confident motif based on the criteria above or ending up with less than 2000 ChIP-seq peaks with at least one confident TF binding site based on the pipeline below have been excluded from the current study.

The original PWMs were first trimmed to keep only the core motifs starting from the first position of information content greater than 0.3 to the last position of information content greater than 0.3 (Ng et al., 2014). A valid TF binding site was identified by having a PWM score passing a false positive rate (FPR) of 0.1% based on a score distribution generated from 25% composition of A, C, G, T using Biopython (Cock et al., 2009) and a location within 50 bp close to the ChIP-seq peak center. Spacing is computed edge-to-edge between two TF binding sites at co-binding sites. If there are multiple valid motifs for one or both TFs, we computed the spacing between all possible combinations of valid motifs. For TFs that share a core motif, we applied the same pipeline to each TF separately, which may or may not identify the same binding sites, but we only included non-overlapping binding sites in our downstream calculation of spacing.

Characterization of spacing relationships

Request a detailed protocol

To test for the constrained spacing relationship between a given pair of TFs, we first generated their spacing distribution at single-base-pair resolution ranging from –100 to +100 bp and then used Monte Carlo procedures to identify significant ‘spikes’ in the spacing distribution based on point-to-point slopes. The slope of position i is computed using the following formula:

Si=2Ni-Ni-1-Ni+12, i-99,99

Si is the average of single-step forward and backward slope at spacing equal to i bp, and Ni represents the value of position i on the spacing distribution (i.e., the number of TF binding sites with spacing i bp). We then used Monte Carlo procedures to obtain an empirical p-value to represent the statistical significance of a slope to be considered as a ‘spike’. Specifically, we randomly sampled 1000 integers between 0 and 100, calculated a distribution, and obtained all the slopes using the formula above. We repeated this process by 1000 times and summarized all the slopes in an aggregated distribution. p-Value is determined based on the percentile rank of testing value on the aggregated distribution. p-Value smaller than 6.25e-05 (familywise error rate = 0.05/200/4) is called significant, indicating a ‘spike’ is found.

To test for the relaxed spacing relationship, we used KS test to compare a spacing distribution to the random distribution. We randomly sampled integers between –100 and 100 to match the same size of the testing spacing distribution and then tested the spacing distribution against the distribution of the random integers to obtain a p-value. We repeated the above process 100 times and reported the average p-value. To compare spacing relationships in repetitive versus nonrepetitive regions, we first used repeat annotations from HOMER to divide our ChIP-seq peaks into repetitive and nonrepetitive regions and then repeated the above procedures.

Categorization of gnomAD variants based on AF

Request a detailed protocol

We obtained InDels from gnomAD v3.1 (Karczewski et al., 2020). These gnomAD variants were mapped to TF co-binding sites, specifically with two TF binding sites and their intermediate sequences. For TF pairs with constrained spacing relationships, we only kept the co-binding sites with the identified constrained spacing ±2 bp. To account for region-by-region variation in selective pressure, we also overlapped variants with 100 bp upstream and 100 bp downstream background regions outside of TF co-binding sites. For each co-binding site, we categorized InDels into high-frequency variants (AF>0.01%), rare variants (AF <0.01%, AC >1), and singletons (AC = 1) and computed the odds ratios to find associations between a certain category of InDels (e.g., singletons) and certain regions (e.g., between motifs with constrained spacing).

Genetic variation processing and genome building

Request a detailed protocol

Genetic variation of the five mouse strains was obtained from Keane et al., 2011, and that of the human individuals from which endothelial cell data were generated was derived from Stolze et al., 2020. We used MMARGE v1.0 with default variant filters (Link et al., 2018b) to build separate genomes for each mouse strain and human individual. The sequencing data from different samples were respectively mapped to the corresponding genomes and were then shifted to a common reference genome using MMARGE ‘shift’ function to facilitate comparison at homologous regions. The reference genome is mm10 for mouse strains and hg19 for human individuals.

Motif mutation analysis

Request a detailed protocol

We used MAGGIE v1.0 (Shen et al., 2020) to identify functional motifs for TF binding. To prepare the inputs into MAGGIE based on the mouse strains data, we adapted a similar strategy as described in Shen et al., 2020. In brief, we conducted pairwise orthogonal comparisons of TF peaks between each possible pair of the five strains to find strain-differential peaks. We then extracted pairs of 200 bp sequences around the centers of the differential peaks from the genomes of two comparative strains, the ones with TF binding as positive sequences paired with those without TF binding as negative sequences. For the QTLs of human endothelial cells, MAGGIE can directly work with a VCF file of QTLs with effect size and effect direction indicated in a column of the file. We ran MAGGIE separately for each type of QTLs and reported the significant motifs together with their p-values, which passed false discovery rate (FDR) < 0.05 after the Benjamini–Hochberg controlling procedure.

Categorization of genetic variation based on impacts on motif or spacing

Request a detailed protocol

We categorized genetic variation based on its impact on binding affinity or spacing. Motif mutations caused by SNPs or InDels were defined by at least two-bit difference in the PWM score, which is equivalent to approximately 4-fold difference in the binding likelihood. To classify genetic variation that mutates other functional motifs identified by MAGGIE, we required at least one of the MAGGIE motifs differed by at least two bits in the PWM score. InDels were first classified into motif mutation categories if eligible before being considered for spacing alteration. Therefore, spacing alterations were InDels between target motifs without any motif mutations. Variants fitting neither motif mutation nor spacing alteration were gathered in a separate group as a control. Another control category during analysis of mouse strains data was defined by ChIP-seq peaks that have no genetic variation between strains.

Statistical testing of effect size

Request a detailed protocol

To estimate the effects of genetic variation on different readouts, we computed the log2 fold changes of corresponding sequencing tags for every pairwise comparison between C57 and one of the other strains. The effect sizes of QTLs were directly pulled from the source literature. We conducted Mann–Whitney U tests between the control variant category and another category to find significant differences in their distributions. Cohen’s d (Sullivan and Feinn, 2012) was further calculated between two distributions. For an easier comparison of general effect size, we took the absolute values before calculating Cohen’s d. Spearman’s correlation coefficient together with statistical significance was calculated to find correlations between log2 fold changes of tags and TF features.

ER-HoxB8 cell-derived macrophage culture and CRISPR knockout

Request a detailed protocol

Bone marrow cells were isolated from femurs and tibias of a Cas9-expressing transgenic mouse (Jackson Laboratory, No. 028555). Murine stem cell virus-based expression vector for ER-HoxB8 was gifted from Dr David Sykes (Massachusetts General Hospital, Boston, MA). Cas9-expressing ER-HoxB8 conditionally immortalized myoid progenitor cells were generated following established protocols (Wang et al., 2006). In brief, bone marrow cells were purified with a Ficoll gradient (Ficoll-Paque-Plus, Sigma-Aldrich) and resuspended in RPMI 1640 containing 10% FBS, 1% penicillin/streptomycin, and 10 ng/ml each of murine SCF, IL-3, and IL-6 (PeproTech). After 48 hr culture, 2.5 × 105 cells in 1 ml were transduced with 2 ml of ER-HoxB8 retrovirus (in DMEM with 30% FBS) containing 0.5 μl/ml LentiBlast A (OZ Biosciences), 2.5 μl/ml LentiBlast B (OZ Biosciences) and 8 μg/ml polybrene (Sigma-Aldrich) in a well of fibronectin (Sigma-Aldrich)-coated six-well culture plates and centrifuged at 1000 g for 90 min at 22°C. After transduction, 6 ml of ER-HoxB8 cell culture media (RPMI 1640 supplemented with 10% FBS, 1% penicillin/streptomycin, 0.5 μM β-estradiol (Sigma-Aldrich), and 20 ng/ml murine GM-CSF (PeproTech)) were added and an additional half-media exchange with ER-Hoxb8 media performed the next day. Transduced cells were selected with G418 (Thermo Fisher) at 1 mg/ml for 48 hr. Thereafter, cells were maintained in ER-HoxB8 cell culture media and confirmed its identity by RNA-seq. For baseline ATAC-seq and ChIP-seq of ER-HoxB8 cells prior to gRNA transduction, cells were washed twice with PBS, plated at a density of 3 × 106 cells per 10 cm culture plate, and differentiated into macrophages in DMEM with 10% FBS, 1% pencillin/streptomycin, and 17 ng/ml M-CSF (Shenandoah) for 7 days with two culture media exchanges. Differentiated cells were washed twice with PBS and collected for sequencing experiments.

Guide RNA lentiviruses were prepared as previously described (Fonseca et al., 2019) with modifications as follows. LentiGuide-mCherry was generated by modifying lentiGuide-puro (Addgene) to remove a puromycin-resistant gene and replace it with mCherry. gRNA sequences directed between the PU.1 and C/EBP binding sites (and one each directed toward the binding site itself) were designed with CHOPCHOP web tool for genome engineering (Labun et al., 2019). One CRISPR gRNA oligonucleotide was inserted for each target via PCR into a BsmBI cleavage site. A list of gRNA targets used in this article is shown in Supplementary file 1. Lenti-X 293T cells (Clontech) were seeded in poly-D-lysin (Sigma-Aldrich) coated 10 cm tissue culture plates at a density of 3.5 million cells per plate in 10 ml of DMEM containing 10% FBS and 1% penicillin/streptomycin, and then incubated overnight at 37°C. After replacement of the media to 6 ml of DMEM containing 30% FBS, plasmid DNAs (5 μg of LentiGuide-mCherry, 3.75 μg of psPAX2, and 1.25 μg of pVSVG) were transfected into LentiX-293T cells using 20 μl of X-tremeGENE HP DNA Transfection Reagent (Roche) at 37°C overnight. The media was replaced with DMEM containing 30% FBS and 1% penicillin/streptomycin, and then cultured at 37°C overnight. The supernatant was filtrated with 0.45 μm syringe filters and used as lentivirus media. Cell culture media was replaced, and virus was collected again after 24 hr. 1 × 106 Cas9-expressing ER-HoxB8 cells were transduced with virus in 2 ml of lentivirus media and 1 ml of ER-HoxB8 cell media containing 0.5 μl/ml LentiBlast A, 2.5 μl/ml LentiBlast B, and 8 μg/ml polybrene in a well of fibronectin-coated six-well culture plates and centrifuged at 1000 g for 90 min at 22°C. After the transduction, 6 ml of ER-HoxB8 cell media was added to each well. Half of the media was exchanged the next day and in the following days, cells were expanded and passaged. After 5 days, 250,000 successfully transduced cells (indicated by mCherry fluorescence) for each gRNA were sorted by FACS using a Sony MA900. After FACS, cells were expanded in ER-HoxB8 culture media. Differentiation into macrophages was carried out as above in DMEM supplemented with M-CSF.

RNA-seq library preparation

Request a detailed protocol

Total RNA was isolated from cells and purified using Direct-zol RNA Microprep columns according to the manufacturer’s instructions (Zymo Research); 500 ng of total RNA were used to prepare sequencing libraries from polyA enriched mRNA as previously described (Link et al., 2018a). Libraries were PCR-amplified for 14 cycles, size selected using SpeedBeads (Sigma-Aldrich), quantified by Qubit dsDNA HS Assay Kit (Thermo Fisher Scientific), and 75 bp single-end sequenced on a HiSeq 4000 (Illumina).

ATAC-seq library preparation

Request a detailed protocol

ATAC-seq libraries were prepared as previously described (Hoeksema et al., 2021). In brief, 5 × 105 cells were lysed at room temperature in 50 µl ATAC lysis buffer (10 mM Tris-HCl, pH 7.4, 10 mM NaCl, 3 mM MgCl2, 0.1% IGEPAL CA-630) and 2.5 µl DNA Tagmentation Enzyme mix (Nextera DNA Library Preparation Kit, Illumina) was added. The mixture was incubated at 37°C for 30 min and subsequently purified using the ChIP DNA Clean & Concentrator kit (Zymo Research) as described by the manufacturer. DNA was amplified using the Nextera Primer Ad1 and a unique Ad2.n barcoding primers using NEBNext High-Fidelity 2× PCR MM for 8–14 cycles. PCRs were size selected using TBE gels for 175–350 bp and DNA eluted using gel diffusion buffer (500 mM ammonium acetate, pH 8.0, 0.1% SDS, 1 mM EDTA, 10 mM magnesium acetate) and purified using ChIP DNA Clean & Concentrator (Zymo Research). Samples were quantified by Qubit dsDNA HS Assay Kit (Thermo Fisher Scientific) and 75 bp single-end sequenced on HiSeq 4000 (Illumina).

Crosslinking for ChIP-seq

Request a detailed protocol

For PU.1, C/EBPβ, and H3K27ac ChIP-seq, culture media was removed, and plates were washed once with PBS and then fixed for 10 min with 1% formaldehyde (Thermo Fisher Scientific) in PBS at room temperature. Reaction was then quenched by adding glycine (Thermo Fisher Scientific) to 0.125 M. After fixation, cells were washed once with cold PBS and then scraped into supernatant using a rubber policeman, pelleted for 5 min at 400× g at 4°C. Cells were transferred to Eppendorf DNA LoBind tubes and pelleted at 700× g for 5 min at 4°C, snap-frozen in liquid nitrogen, and stored at –80°C until ready for ChIP-seq protocol preparation.

Chromatin immunoprecipitation

Request a detailed protocol

ChIP was performed in biological replicates as described previously (Hoeksema et al., 2021). Samples were sonicated using a probe sonicator in 500 µl lysis buffer (10 mM Tris/HCl pH 7.5, 100 mM NaCl, 1 mM EDTA, 0.5 mM EGTA, 0.1% deoxycholate, 0.5% sarkozyl, 1× protease inhibitor cocktail). After sonication, 10% Triton X-100 was added to 1% final concentration and lysates were spun at full speed for 10 min; 1% was taken as input DNA, and immunoprecipitation was carried out overnight with 20 µl Protein A Dynabeads (Invitrogen) and 2 µg specific antibodies for C/EBPβ (Santa Cruz, sc-150), PU.1 (Santa Cruz, sc-352X), and H3K27ac (Active Motif, 39135). Beads were washed three times each with wash buffer I (20 mM Tris/HCl, 150 mM NaCl, 0.1% SDS, 1% Triton X-100, 2 mM EDTA), wash buffer II (10 mM Tris/HCl, 250 mM LiCl, 1% IGEPAL CA-630, 0.7% Na-deoxycholate, 1 mM EDTA), TE 0.2% Triton X-100 and TE 50 mM NaCl and subsequently resuspended 25 µl 10 mM Tris/HCl pH 8.0% and 0.05% Tween-20. ChIP-seq libraries were prepared on the Dynabeads as described below. For locus-specific enrichment ChIP-seq, bead complex was resuspended in 50 µl 1% SDS-TE. Four µl ProtK, 4 µl RNase A, 3 µl 5 M NaCl were added to these and the input samples and incubated at 50°C for 1 hr, reverse crosslinked at 65°C overnight and then eluted from the beads.

ChIP-seq library preparation

Request a detailed protocol

ChIP libraries were prepared while bound to Dynabeads using NEBNext Ultra II Library preparation kit (NEB) using half reactions. DNA was polished, polyA-tailed, and ligated after which dual UDI (IDT) or single (Bioo Scientific) barcodes were ligated to it. Libraries were eluted and crosslinks reversed by adding to the 46.5 µl NEB reaction 16 µl water, 4 µl 10% SDS, 4.5 µl 5 M NaCl, 3 µl 0.5 M EDTA, 4 µl 0.2 M EGTA, 1 µl RNAse (10 mg/ml), and 1 µl 20 mg/ml proteinase K, followed by incubation at 55°C for 1 hr and 75°C for 30 min in a thermal cycler. Dynabeads were removed from the library using a magnet and libraries were cleaned up by adding 2 µl SpeedBeads (Sigma-Aldrich) in 124 µl 20% PEG 8000/1.5 M NaCl, mixing well, then incubating at room temperature for 10 min. SpeedBeads were collected on a magnet and washed two times with 150 µl 80% ethanol for 30 s. Beads were collected and ethanol removed following each wash. After the second ethanol wash, beads were air dried and DNA eluted in 12.25 µl 10 mM Tris/HCl pH 8.0% and 0.05% Tween-20. DNA was amplified by PCR for 14 cycles in a 25 µl reaction volume using NEBNext Ultra II PCR master mix and 0.5 µM each Solexa 1GA and Solexa 1 GB primers. Libraries were size selected using TBE gels for 200–500 bp and DNA eluted using gel diffusion buffer (500 mM ammonium acetate, pH 8.0, 0.1% SDS, 1 mM EDTA, 10 mM magnesium acetate) and purified using ChIP DNA Clean & Concentrator (Zymo Research). Sample concentrations were quantified by Qubit dsDNA HS Assay Kit (Thermo Fisher Scientific) and 75 bp single-end sequenced on HiSeq 4000.

Biotin-mediated locus-specific enrichment ChIP-seq library preparation

Request a detailed protocol

After performing the target-specific ChIPs, we performed an initial PCR for locus-specific amplicon enrichment using NEBNext 2× High Fidelity PCR MM (NEB) and 5’-biotinylated stub adapter primers specific to appropriate genomic regions to be interrogated (Supplementary file 1). Initial hotstart/denaturation at 98°C for 30 s was followed by 10 cycles of amplification (98°C for 15 s, 65–67°C for 15 s, 72°C for 30 s) and then a final elongation at 72°C for 5 min. After this, we performed a 0.7× AmpureXP clean-up and eluted in 20 µl 0.5× TT (5 mM Tris pH 8.0 + 0.025% Tween-20). Dynabeads MyOne Streptavidin T1 beads were then washed in 1× Wash Binding Buffer (WBB, 2× WBB: 10 mM Tris-HCl [pH 7.5], 1 mM EDTA, 2 M NaCl, 0.1% Tween) and resuspended beads at 20 µl per sample in 2× WBB. Twenty µl prepared Dynabeads MyOne Streptavidin T1 beads (in 2× WBB) were then added to clean up 20 µl 0.5× TT PCR fragments, mixed and incubated for 60 min at room temperature with mild shaking. After this, beads were collected on a magnet and washed twice with 150 µl 1× WBB and once with 180 µl TET (TE +0.05% Tween-20). Finally, beads were resuspended in 25 µl 0.5× TT and on bead PCR for addition of Illumina-specific adapters and 10 bp Unique Dual Indexes (UDIs) using NEBNext 2× High Fidelity PCR MM (NEB) and 25 PCR cycles was performed (Supplementary file 1). Libraries were size selected using TBE gels for 300–500 bp and DNA eluted using gel diffusion buffer and purified using ChIP DNA Clean & Concentrator (Zymo Research). Samples were quantified by Qubit dsDNA HS Assay Kit (Thermo Fisher Scientific) and 150 bp paired-end sequenced on NextSeq 500 (Illumina).

Analysis of variable InDels from CRISPR experiments

Request a detailed protocol

We mapped the reads to the target regions using the local alignment mode of Bowtie2 v2.3.5.1 (Langmead and Salzberg, 2012). To allow for InDels with tens of bases, we reduced the gap extend penalty and increased the gap open penalty so that the gaps could be long but not occur at multiple locations. Here are the adjusted parameters used in our mapping process: --local --rdg 10,1 --rfg 10,1. The mapped reads with gaps or InDels at unexpected locations rather than the Cas9 cut sites were removed. This step filtered out approximately 1% of the total reads (Supplementary file 1). The remaining reads were grouped based on the InDel size and whether the InDel overlaps with any of the PU.1 and C/EBPβ binding sites. Tag counts were used to quantify binding activity. InDel groups taking up less than 0.05% of the input sample reads were filtered out. TF binding associated with each InDel group was computed by the odds ratio between TF ChIP-seq tags and input DNA sample tags: (TF tags for an InDel group/rest of TF tags)/(input tags for the same InDel group/rest of input tags).

Cell lines

Request a detailed protocol

These studies made use of two cell lines. The major cell line used was Cas9-expressing ER-HoxB8 cells. This cell line was generated in our laboratory from primary bone marrow progenitor cells and its identity and phenotype are confirmed by RNA-seq. Given the origin of these cells and continuous validation by RNA-seq, we do not routinely test them for mycoplasma. In addition, because of the nature of the experimental design for the use of these cells, in which we are analyzing ChIP tags within a single population of cells, the presence of mycoplasma infection would not alter the outcome unless it changed the expression of the TFs being ChIP'd. This is clearly not the case based on our RNA-seq and ChIP-seq results.

The second cell line corresponded to 293T cells. These cells were used to generate lentivirus for transduction of the ER-HoxB8 cells. We acknowledge that these cells have the potential to be infected with mycoplasma and for viral supernatants to transmit this to the ER-HoxB8 cells. However, even if this were to be the case, it would not alter the conclusions of the experiment using the ER-HoxB8 cells for the reasons noted above.

Data availability

All sequencing data generated during this study have been deposited in GEO under accession code GSE178080. The codes for data analysis and the processed files of the ENCODE data are available on our Github repository: https://github.com/zeyang-shen/spacing_pipeline. copy archived at swh:1:rev:e9a2a65795b8349843f1d2b7128395eb8e365015.

The following data sets were generated
    1. Prohaska TA
    2. Hoeksema MA
    3. Spann NJ
    4. Sakai M
    5. Shen Z
    6. Glass CK
    (2021) NCBI Gene Expression Omnibus
    ID GSE178080. Systematic analysis of effects of naturally occurring insertions and deletions that alter transcription factor spacing identifies tolerant and sensitive transcription factor pairs.
The following previously published data sets were used
    1. Link VM
    2. Glass CK
    (2018) NCBI Gene Expression Omnibus
    ID GSE109965. Analysis of genetically diverse macrophages reveals local and domain-wide mechanisms that control transcription factor binding and function.
    1. Stolze LK
    2. Romanoski CE
    (2020) NCBI Gene Expression Omnibus
    ID GSE139377. Molecular Quantitative Trait Locus Mapping In Human Endothelial Cells Identifies Regulatory SNPs Underlying Gene Expression and Complex Disease Traits.

References

    1. Panne D
    (2008) The enhanceosome
    Current Opinion in Structural Biology 18:236–242.
    https://doi.org/10.1016/j.sbi.2007.12.002

Decision letter

  1. Jessica K Tyler
    Senior and Reviewing Editor; Weill Cornell Medicine, United States
  2. Jeff Vierstra
    Reviewer

In the interests of transparency, eLife publishes the most substantive revision requests and the accompanying author responses.

Decision letter after peer review:

[Editors’ note: the authors submitted for reconsideration following the decision after peer review. What follows is the decision letter after the first round of review.]

Thank you for submitting your work entitled "Natural genetic variation affecting transcription factor spacing at regulatory regions is generally well tolerated" for consideration by eLife. Your article has been reviewed by 2 peer reviewers, including Chris P Ponting as the Reviewing Editor and Reviewer #1, and the evaluation has been overseen by a Reviewing Editor and a Senior Editor.

Our decision has been reached after consultation between the reviewers. Based on these discussions and the individual reviews below, we regret to inform you that your work will not be considered further for publication in eLife.

The question under study without doubt is important and the natural experiment approach has several strengths. Nevertheless, both reviewers would have been more persuaded of its conclusions if large-scale experimental validations could have been reported and if these conclusions were generalized to other TFs and systems. A large number of methodological and statistical questions were also raised that together reduced the enthusiasm of the reviewers for publication. We hope that the comments provided below are helpful to you when you consider how and when to submit this work for publication.

Reviewer #1:

This manuscript by Shen et al. exploits pre-existing data (Link et al., 2018a; Keane et al., 2011) to explore whether InDels observed among 5 diverse mouse strains alter transcription factor binding affinity, using the PU.1 and C/EBPβ LDTFs as paradigm.

These wholly computational analyses provide evidence that affinities of "co-binding sites" correlate poorly with their spacing. Variants lying in PWMs alter affinity for their cognate factor, but also the co-binding factor, but the authors propose that there is little change in affinity resulting from spacing changes between them. This conclusion is drawn from indirect observations: (a) "co-binding" is never demonstrated, rather inferred from close proximity of ChIP-seq peaks, (b) changes in effect sizes (Table 1) following filtering with collaborative or unrelated factors, although whether these results are statistically robust requires further clarity.

The study would have benefited from the results drawn from the "natural experiments" being then validated experimentally, for example using reporter assays in cells whose TF spacing has been changed. Overall, its results were highly correlative, and provided little in the way of substantive novel observations.

Issues:

(i) line 201 "The remaining sites with InDels between PU.1 and C/EBPβ motifs, which should represent a clean set of spacing alterations, showed a diminished effect on TF binding (Table 1 "filtered by collabor. factors"; Figure 3D)." I'm unsure whether this is true, or else whether the change in significance (p-value Table 1) reflects the change in number of binding regions being considered. Similarly, the H3K27ac analysis referred to on line 232. Please comment on whether these effect size differences are significant.

(ii) Figure 5 does not provide new findings that are sufficient to merit having a Main Figure. Figure 5A. This analysis should be repeated with a more sensitive local aligner, e.g. LASTZ. Panel A simply reflects a technical effect (that of altered gap penalties) rather than any biological phenomenon (similarly Figure Supplements) and panels 5B,C illustrate one example, rather than a generalizable phenomenon.

Reviewer #3:

In this study, Shen et al. perform a meta-analysis of previously published data to unravel the effects of SNPs and InDels in the binding of CEBPB and PU.1 in macrophages. The authors find/confirm that alterations in the TF motif have an important impact on TF binding, while alterations on the spacing between the motifs does not have an effect on their binding. While this is an intriguing question, the study is rather straightforward, mostly confirms previous observations of the impact of PU.1 binding site alterations (although the presented analysis techniques are state-of-the-art and inspiring; and InDels are often not included, while here they are carefully assessed). The study could be enhanced with experimental validation (e.g. enhancer reporter assays); comparison with other computational techniques (e.g., deep learning); inclusion of additional data layers (e.g., gene expression); and expanding it to other TFs and systems to investigate whether these findings are generally true for other cell types. As it currently stands, the conclusions in the paper are a bit too general from the (biased) analyses performed.

1. Are LDTFs pioneer TFs and SDTFs non-pioneer TFs? We have only found this terminology in papers from this group.

2. The basis of the study are ChIP-seq peaks, but ChIP-seq peaks without the TF motif exist (sometimes a large fraction), can this be discussed? Particularly how false positive (i.e., indirect binding sites, or phantom peaks) impact the results.

3. Calculating spacing between motifs is a big challenge, because often a CRM contains multiple matches to the same TF. If TF1 has 3 matches, and TF2 has 4 matches, then there are a lot of distances between TF1 and TF2 motifs possible. The authors calculate only the distance between the two best scoring matches. Can this decision be justified by a thorough analysis? (e.g., do other distances between the best match of PU.1 and the 2nd best match of CEBP destroy or maintain the correlations?). In conclusion, a landscape overview of possible distances vis-a-vis genomic variation would be more informative than only using the distance between the two best matches.

4. Related – in Figure 1c. How is the distance between CEBPB and PU.1 motifs calculated on the singly PU.1 binding sites, do they also have CEBPB binding sites (and vice versa)? If yes, what are the differences between these regions and regions bound by CEBPB (e.g. is it because CEBPB motifs are weaker in these regions compared to regions bound by CEBPB)?

5. Also related – how much does spacing change with indels (e.g. from ~12bp, what is the final size distribution with indels)?

6. Figure Suppl 4 (Figure 2). In regards to the effect of the spacing, it seems that when the initial spacing between the motifs is around 60 bp, effects are weaker compared to when the initial size is between 20-40 and 80-100 bp; what could be the reason?

7. Figure 1d. Have sites with log reads CEBPB/PU.1 < 4 been filtered out?

8. Line 115 – the negative correlation between PU.1 and CEBP motifs 'implicates' synergistic binding and degenerative motifs, but this is speculation (it is not known which ChIP-seq events reflect true enhancers).

9. If CEBPB ChIP-seq signal is negatively correlated with the PU.1 motif score (Figure 1d), mutation of the PU.1 motif should increase CEBPB binding (Figure 2). Based on panel D, it seems that mutation of the PU.1 motif actually decreases CEBPB binding.

10. The quality of the SNPs and InDels are not discussed. InDels are more prone to false positives, this was not taken into account?

11. The quality of the ChIP-seq peaks is also not taken into account. Can a FRIP analysis be provided, for example plot FRIP versus the number of peaks across the samples.

12. PU.1 and CEBP were somehow chosen. An unbiased analysis starting from a larger motif collection would have been more interesting, to see if the properties of PU.1, CEBP, and their distance, emerges from the background of other motifs. Later on there is a JASPAR analysis – what were other high-scoring hits?

13. For the twelve collaborative factors predicted, some kind of (computational) validation would be useful.

14. For identifying cofactors exploiting other ChIP-seq data (Figure 3A), which ChIP-seq tracks were used and in which tissue? In general, data/code availability sections are missing.

[Editors’ note: further revisions were suggested prior to acceptance, as described below.]

Thank you for submitting your article "Systematic analysis of naturally occurring insertions and deletions that alter transcription factor spacing identifies tolerant and sensitive transcription factor pairs" for consideration by eLife. Your article has been reviewed by 3 peer reviewers, and the evaluation has been overseen by a Reviewing Editor and Jessica Tyler as the Senior Editor. The following individuals involved in review of your submission have agreed to reveal their identity: Jeff Vierstra (Reviewer #2).

The reviewers are in agreement that this paper is of potential interest for readers studying transcription factor (TF) binding site grammar/enhancer regulation. The work provides insight into the role of motif spacing alterations in TF binding/co-binding in the context of naturally occurring genetic variation. Overall, the data are properly analyzed and validated, although aspects of data analysis and presentation should be improved as outlined below, in a revised manuscript.

Reviewer #1:

Shen and colleagues investigated the role of motif spacing in regulating transcription factor (TF) binding or co-binding, specifically in the context of naturally occurring InDels in both human individuals and mouse strains, using previously published data (Hu et al., 2019; Link, Duttke, et al., 2018; Stolze et al., 2020).

The authors classified 75 TFs in K562 cells into constrained and relaxed spacing relationships with respect to their co-binding TFs, and found most of the TF pairs fall in the relaxed category. To illustrate whether spacing alterations affect TF binding and promoter-enhancer function, the authors analyzed the previously published data in mouse macrophages and human endothelial cells. They find that relaxed TF binding is highly tolerant to naturally occurring spacing alterations, which was further supported by CRISPR/Cas9 induced InDels in mouse macrophages.

The study would benefit from including other types of data available (i.e. gene expression and 3D interactions), to better examine the effects of spacing alterations and whether they are indeed related to promoter-enhancer functions.

1. The authors defined constrained and relaxed spacing relationships using public dataset including 75 TF ChIP-seq in K562 cells. K562 and HepG2 (analyzed in this study) are cancer cell lines. The karyotype of cancer cell lines, their own genetic variations and specific TF networks need to be carefully examined/ruled out before applying this to other data from healthy individuals.

2. Some TFs, especially the ones belong to the same families, share core motif sequences, and usually these TFs can co-localize to regulate gene expression. It is not clear how this case was handled in the pipeline.

3. (Figure 2A and 2B) The number of TF pairs considered/affected, the number of InDels at motifs/between motifs/in backgrounds, and the number of high-frequency/rare variants/singletons, need to be listed, which could further help illustrate the significance of any enrichment.

4. Line 165 – 168, "Since common variants are associated with less deleteriousness and rare variants with more deleteriousness (Lek et al., 2016), our data suggest that InDels between motifs of TFs with constrained spacing could be just as damaging as those at motifs whereas InDels between motifs of TFs with relaxed spacing might have a much weaker effect". This is speculation, and it would be better supported by at least some examples if not analysis/statistics, to show the deleterious effects. Probably this could be validated using CRISPR experiments as for relaxed ones.

5. The authors attempted to explore if these InDels eventually affect enhancer-promoter activity/function, but it's not clear whether enhancers and promoters were considered and whether they were considered separately in the analysis. Also, it would be great if the authors could assess whether spacing alterations investigated here in mouse macrophages affect gene expression, since RNA-seq/GRO-seq and PLAC-seq data are available from the published research. This information may help clarify the analysis, strengthen the conclusion and would be useful for readers interested in enhancer regulation.

Reviewer #2:

This paper attempts to address a long-standing question of how TFs collaborate to instantiate and maintain accessible and functional regulatory DNA. The authors make use of ENCODE data the investigate the extent TF spacing constraints in the genome and then integrate both human genetics data (gnomAD) and a compendium of ChIP-seq experiments performed in diverse mouse genetic backgrounds to test whether motif spacing has a significant effect on TF binding. While I appreciated the authors utilization of many datasets to test for spacing effects (large ENCODE data to identify motif pairs with spacing constraints, human genetic to look for signals of negative selection of natural variation effecting spacing and mouse TF binding data to 'test' for spacing effects), I find that computational analysis within this paper is quite shallow, leading to mostly obvious conclusion that variation within the TF-DNA interaction interface are critical for TF binding. Furthermore, While the editing experiments are a nice addition in the revision, I am not sure that they provide much in the way of validating or generalizing their claims. Finally, the authors should more thoroughly place their work in the context of previous studies.

1. The constrained vs. relaxed spacing analysis has a high likelihood to be confounded by latent genome architecture. Specific class of retrotransposable elements are known to 'template' regulatory DNA (see Bourque, G. et al. 2008 Genome Res. (10.1101/gr.139105.112), Kunarso et al. 2010 Nat. Genetics (10.1038/ng.600), and an analysis of co-binding/spacing constraints from ENCODE data: Wang et al., 2012 Genome Res.(10.1101/gr.139105.112)). The authors should perform an analysis that accounts for repetitive DNA that encode competent cis-regulatory DNA elements that have been templated across the genome. At a minimum these previous works should be cited.

2. The authors should comment on how rigid DNA-encoded TF spacing is not supported by evolutionary studies, which have shown an excess of TF motif turnover within regulatory DNA (see work from Duncan Odom's group, Vierstra el., 2014 Science for a direct mouse-human comparison).

3. While Leveraging CRISR/Cas9 editing to generate a broad spectrum of 'spacing' alleles is a clever approach to tackle the experimentally test the effect of motif spacing on TF (co-)binding, the experiment is very underpowered to test the generalizability of the authors claims. Did the authors select single cell clones from the editing experiment or just look at bulk edited populations? If the former, it is unclear how any conclusions can be made from a mixture of edited cells that have a spectrum of indels (also likely carrying two different alleles).

Reviewer #3:

Transcription factors (TFs) bind to the DNA in a sequence-specific manner at TF binding sites (TFBSs) to control gene transcription. Hence, characterizing how TFs interact with DNA is key to uncover how gene regulation occurs and how this process can be disrupted in diseases. While the binding properties of a large portion of human TFs are well characterized, a remaining challenge lies in our knowledge of how TFs interact cooperatively at regulatory elements, either forming dimers or co-binding the same regions. In this manuscript, Shen et al. explored spacing patterns between TFBSs. Relying on ChIP-seq data, they developed a new methodology to predict TF pairs harbouring constrained or relaxed spacing patterns between their TFBSs. The authors made their code available, which allows reproducibility and exploration; this should be a requirement in the field but is not always complied with so we thank the authors for this. When applied to a limited set of TFs with ChIP-seq data in K562 cells, the authors predicted that TF pairs primarily bind to DNA with relaxed spacing between their TFBSs. Nevertheless, they were able to highlight already known as well as novel specific pairs of TF harboring constrained spacing. Next, the authors leveraged naturally occurring small insertions and deletions in the human population and mouse strains to confirm that altering spacing between TFBSs of TF pairs with relaxed spacing patterns has limited effect. This observation was further supported by synthetic spacing alterations induced by CRISPR-Cas9 experiments. The study is overall well designed and addresses an important challenge in our understanding of TF-DNA interactions and TF cooperation.

Nevertheless, we believe that there are some methodological limitations that favor the identification of relaxed spacing patterns, which should be better outlined in the manuscript to allow the reader to fully comprehend the results. From the title and first sections of the manuscript the readers are given the impression that relaxed and constrained spacing instances are about to be described and analysed with an equal importance. However, more focus is given to the relaxed spacing with both the mouse and CRISPR analyses exclusively dedicated to this with no clear explanation why. It would be useful to the readers to have this explicitly outlined by the authors. Finally, the terminology associated with TF-DNA interactions is very often incorrect, which confuses the readers and should be addressed. Please see below for our detailed comments.

1. The terminology associated with TF binding events is inappropriate. The authors use "ChIP-seq peaks", "TFBSs", and "motifs" almost interchangeably, which is not correct. The inconsistency in the terminology makes it difficult to fully comprehend what the authors meant/did.

An example is one of the first sentences in the Introduction: "TFs bind to short, degenerate sequences at promoters and enhancers, often referred to as TF binding motifs." The sequences bound by TFs in promoters and enhancers are TFBSs while TF binding motifs are computational representations of TFBS sets, which can be represented in many ways such as consensus motifs, PFMs, etc.

The next sentence claims that "TFs bind in an inter-dependent manner to closely spaced motifs." Motifs cannot be closely spaced but TFBSs are. Another example is the subsection "Motif identification" in the Methods section while the authors describe the prediction of TFBSs (using motifs).

2. More details should be provided to the Methods section. We acknowledge that the authors provide their code for inspection but outlining all methodological details in the manuscript would help in the clear understanding of the methods used. For instance:

a. The authors do not describe how they selected de novo motifs using HOMER (only best motif?, any p-value threshold?, any specific background used?)

b. For the TFBS predictions, the authors used a FPR threshold of 0.1%, but which was the specific tool that they used for that? FPR computation depends on background expectation, what was used (e.g. 25% A, C, G, or T or nucleotide composition of the genome)?

c. P. 23, lines 466-470. The authors described that they conducted a permutation test but then described that the null distribution was obtained using random spacing values between 0 and 100. If the null distribution is obtained by randomly selecting values between 0 and 100, it does not correspond to a permutation. A permutation test would imply permuting the observed spacing values.

d. P. 25, lines 513-516. It is not clear to us why the authors considered subsets of mutations when overlapping TFBSs predicted for the ChIP'ed TFs (only if 2-bit difference in motif score) but not for mutations overlapping TFBSs predicted by MAGGIE (all mutations). Why not consider all mutations in all cases?

3. The methodology used has several limitations that are not described by the authors. We encourage the authors to clearly outline them to the readers. Furthermore, we believe that these limitations favor the identification of relaxed spacing, which should be acknowledged, especially since the majority of the work focuses on alterations of spacing for TF pairs with relaxed spacing patterns.

a. It is well documented that TF binding preferences for TFs binding in close proximity (e.g. as dimers) can be altered. For instance, Jolma et al. (https://www.nature.com/articles/nature15518) used CAP-SELEX to reveal that "Most TF pair sites identified involved a large overlap between individual TF recognition motifs, and resulted in recognition of composite sites that were markedly different from the individual TF's motifs." As the authors relied on TF binding profiles (or motifs) corresponding to the binding preferences of TFs binding as monomers, it is possible that they will miss cooperative binding inducing a change in binding preferences. Furthermore, the authors did not consider overlapping motifs, which is again precluding the identification of constrained spacing.

b. The authors rely on ChIP-seq data to identify TF cooperation. While this is fine overall, this data does not allow the authors to know whether two identified TFs bind to their TFBSs on the same molecules. Indeed, ChIP-seq being a bulk experiment, it does not allow to discriminate between true co-occupancy on the same molecule or not. This should be discussed to put the work in a larger context to the readers. See https://www.cell.com/molecular-cell/pdf/S1097-2765(20)30793-0.pdf for a reference.

c. The de novo motif enrichment does not ensure that the motif found is actually the one bound by the ChIP'ed TF. Indeed, the motifs found for ARID1B and ARID2 correspond to GATA motifs while ARID TFs bind to more general A+T rich motifs. It is unclear whether the signal observed for these TFs is due to their direct interaction with DNA.

4. It would have been nice to perform similar experiments as the mouse and CRISPR ones but considering TF pairs with fixed spacing of their TFBSs. If the authors decide not to, they should at least clearly state to the readers that they more specifically focused on relaxed spacing and why.

5. It would have been interesting to provide information about the prevalence of cobinding events in the different ChIP-seq datasets. For instance, what is the percentage of ChIP-seq peaks that contain a predicted TFBS for each TF? What is the percentage of such peaks that contain cobinding events? Is there a difference in numbers b/w constrained or relaxed spacing. Providing this information would help the readers to put these observations into a more general context of TF binding regions.

6. It would be nice to have similar plots as in Figure 3B for pairs of TFs with constrained spacing to show how it contrasts.

7. In all analyses, it seems that more deletions than insertions were observed (see Figure 2A for instance). It would be interesting to see if the results recapitulate when considering insertions and deletions independently.

8. It is unclear to us what is the analysis of MAGGIE-predicted TFBSs adding to the story. Moreover, the authors claim that MAGGIE identifies "functional" TFBSs but the authors do not provide any specific evidence of function (and which function?) in our opinion.

9. It seems that there is a periodic pattern in Figure 5E with log-odds ratio periodically equal to 0 (at least with indel sizes < -45). Could this correspond to the minor groove width periodicity of ~10bp (see for instance https://www.cell.com/cell/pdf/S0092-8674(18)31312-6.pdf for periodicity of mutations)? Could the authors comment on that?

10. In several figures, the authors should provide all points instead of summarizing with boxplots, which are frowned upon as they hide data distribution.

11. P. 9 line 178: the authors mentioned 50 millions SNPs but we do not find where this data is used as observing SNPs would not alter spacing.

12. P. 2 line 46: "implicating their effect…" we would rather write "suggesting their effect…".

13. The authors seem to have ignored homodimers. Maybe their methodology could be extended to consider spacing b/w TFBSs for the same TF.

14. TF naming is sometimes inconsistent in the text and figures. For example, SPI1 and PU.1 are both used in Figure 3. Similarly, we recommend revisiting the text for p65 and RELA, as well as C/EBPβ and CEBPB.

15. It is not clear where the authors retrieved HepG2 cell line data from (used in the Figure 1 —figure supplement 4). Was this data processed in the same way as K562 cell line data?

16. P. 8 line 152: we suggest replacing "we overlaid" with "we mapped".

17. P. 1 line 30: "ChIP-sequencing" should be replaced with "ChIP-seq" for consistency.

18. Figure 3D,F,H : instead of motif score this should be TFBS score.

https://doi.org/10.7554/eLife.70878.sa1

Author response

[Editors’ note: the authors resubmitted a revised version of the paper for consideration. What follows is the authors’ response to the first round of review.]

Reviewer #1:

This manuscript by Shen et al. exploits pre-existing data (Link et al., 2018a; Keane et al., 2011) to explore whether InDels observed among 5 diverse mouse strains alter transcription factor binding affinity, using the PU.1 and C/EBPβ LDTFs as paradigm.

These wholly computational analyses provide evidence that affinities of "co-binding sites" correlate poorly with their spacing. Variants lying in PWMs alter affinity for their cognate factor, but also the co-binding factor, but the authors propose that there is little change in affinity resulting from spacing changes between them. This conclusion is drawn from indirect observations: (a) "co-binding" is never demonstrated, rather inferred from close proximity of ChIP-seq peaks, (b) changes in effect sizes (Table 1) following filtering with collaborative or unrelated factors, although whether these results are statistically robust requires further clarity.

The study would have benefited from the results drawn from the "natural experiments" being then validated experimentally, for example using reporter assays in cells whose TF spacing has been changed. Overall, its results were highly correlative, and provided little in the way of substantive novel observations.

We thank Reviewer #1 for these comments, which led us to perform additional experiments and analyses. To examine the robustness of our conclusion in another system, we extended the findings in mouse macrophages to investigate the spacing relationships of ERG and p65 in a previously studied cohort of human endothelial cells. These studies strongly reinforced the findings in macrophages with respect to tolerance of ERG and p65 binding on intervening InDels (New Figure 4). The major general criticism of Reviewer #1 is lack of experimental validation. Rather than using reporter assays which may not reflect natural chromatin environments, we performed new experiments in which we used CRISPR/Cas9 methodology to introduce a range of InDels between endogenous PU.1 and C/EBPβ binding sites at 6 different genomic locations (three containing natural InDels and three without) and measured the effects of different InDels altering spacing on transcription factor binding. To our knowledge, this is the first application of CRISPR technology for this purpose. These studies demonstrated remarkable stability of C/EBPβ binding at deletions up to the point of destruction of either the CEBP or PU.1 motif, strongly supporting the original conclusions of our manuscript (New Figure 5).

Issues:

(i) line 201 "The remaining sites with InDels between PU.1 and C/EBPβ motifs, which should represent a clean set of spacing alterations, showed a diminished effect on TF binding (Table 1 "filtered by collabor. factors"; Figure 3D)." I'm unsure whether this is true, or else whether the change in significance (p-value Table 1) reflects the change in number of binding regions being considered. Similarly, the H3K27ac analysis referred to on line 232. Please comment on whether these effect size differences are significant.

We thank Reviewer 1 for this comment. Since the original submission, we made changes to our categorization method for binding regions and were able to distinguish genetic variations affecting PU.1, CEBP, or other functional motifs identified by our recently developed method MAGGIE (Shen et al., 2020) and those altering spacing without affecting motifs. The statistical significances or effect sizes are reported in the new Figure 3 and new Figure 4.

(ii) Figure 5 does not provide new findings that are sufficient to merit having a Main Figure. Figure 5A. This analysis should be repeated with a more sensitive local aligner, e.g. LASTZ. Panel A simply reflects a technical effect (that of altered gap penalties) rather than any biological phenomenon (similarly Figure Supplements) and panels 5B,C illustrate one example, rather than a generalizable phenomenon.

We accepted reviewer’s suggestion on the original Figure 5 as not being sufficient to merit a main figure and excluded these findings in the new submission. Instead, we leveraged more than 60 million InDels from gnomAD data, which were based on more than 75,000 genomes from unrelated individuals, to investigate the potential for selective pressure against InDels altering transcription factor spacing. We found that the InDels between motifs of TFs with a relaxed spacing relationship are associated with less deleteriousness, whereas InDels that occur between motifs of constrained TFs are enriched with singletons, associated with stronger deleterious effects (New Figure 2).

Reviewer #3:

In this study, Shen et al. perform a meta-analysis of previously published data to unravel the effects of SNPs and InDels in the binding of CEBPB and PU.1 in macrophages. The authors find/confirm that alterations in the TF motif have an important impact on TF binding, while alterations on the spacing between the motifs does not have an effect on their binding. While this is an intriguing question, the study is rather straightforward, mostly confirms previous observations of the impact of PU.1 binding site alterations (although the presented analysis techniques are state-of-the-art and inspiring; and InDels are often not included, while here they are carefully assessed). The study could be enhanced with experimental validation (e.g. enhancer reporter assays); comparison with other computational techniques (e.g., deep learning); inclusion of additional data layers (e.g., gene expression); and expanding it to other TFs and systems to investigate whether these findings are generally true for other cell types. As it currently stands, the conclusions in the paper are a bit too general from the (biased) analyses performed.

We thank Reviewer #3 for these recommendations. To generalize our findings from mouse macrophages, we conducted four additional lines of analyses:

First, we developed a new analytical pipeline for determining whether the spacing relationships between transcription factor pairs are relaxed (relatively independent of spacing) or constrained (presence of a fixed spacing interval) and applied this method to ChIP-seq data sets for 75 transcription factors determined in K562 cells by the ENCODE consortium. Nearly all of the factors examined primarily exhibited relaxed spacing relationships with the majority of co-bound factors, consistent with the major conclusions of our original manuscript. However, about half of these factors also exhibited constrained relationships with specific TF partners. We confirmed previously identified constrained relationships, such as for GATA1 and TAL1 and identified several previously unrecognized constrained relationships (New Figure 1).

Second, as noted in our response to Reviewer #1, having defined the pairwise spacing relationships for these 75 factors, we then leveraged more than 60 million InDels from gnomAD data, which were based on more than 75,000 genomes from unrelated individuals, to investigate the potential for selective pressure against InDels that occur at or between motifs. We found that the InDels between motifs of TFs with a relaxed spacing relationship are associated with less deleteriousness, whereas InDels that occur between motifs of constrained TFs are enriched with singletons, associated with stronger deleterious effects (New Figure 2).

Third, as noted in our response to Reviewer #1, we extended the findings in mouse macrophages to investigate the spacing relationships of ERG and p65 in a previously studied cohort of human endothelial cells, in which it was possible to consider InDels as quantitative trait loci with respect to the binding of ERG and p65. These studies strongly reinforced the findings in macrophages with respect to tolerance of ERG and p65 binding on intervening InDels (New Figure 4).

Lastly, as noted in our response to Reviewer #1, we performed new experiments in which we used CRISPR/Cas9 methodology to introduce a range of InDels into six endogenous genomic loci of PU.1 and C/EBPβ co-binding in macrophages. This mutagenesis approach resulted in populations of cells containing InDels ranging from +5 bp (insertions) to >-30 bp (deletions) between the PU.1 and C/EBP binding sites. We then performed ChIP for C/EBPβ in control cells and in CRISPR/Cas9 mutated cells and deeply sequenced the immunoprecipitated DNA for the targeted locations using paired end sequencing that enabled determination of the InDel size. By comparing the ChIP tag counts in control cells to the ChIP tag counts at each interval of insertion or deletion, we could calculate an odds ratio to quantify the effect of the InDel on C/EBPβ binding. These studies demonstrated remarkable stability of C/EBPβ binding at deletions up to the point of destruction of either the CEBP or PU.1 motif, strongly supporting the original conclusions of our manuscript (New Figure 5).

1. Are LDTFs pioneer TFs and SDTFs non-pioneer TFs? We have only found this terminology in papers from this group.

Our interpretation of the term pioneer factor is that it is a sequence-specific transcription factor that has the ability to interact with its DNA recognition motif in the context of closed chromatin and initiate chromatin remodeling. This is the essential first step required for selection of new enhancers that drive transitions in cell fate. Many LDTFs have this capability, such as PU.1. Importantly, the ability of pioneer factors to interact with DNA recognition motifs in closed chromatin is necessary but not sufficient for enhancer selection and activation. From our studies of macrophages, for example, the binding of pioneer factors such as PU.1 always requires mutually dependent interactions with additional transcription factors, which we define as collaborative binding. We use the term SDTF to refer to a transcription factor that is acutely responding to an internal or external signal. It may or may not have properties of a pioneer factor or LDTF. As these details of the terminology are not central to the main message of the manuscript, we did not provide extensive explanations to their definitions, but can expand if the reviewer recommends that this would be helpful for broader accessibility.

2. The basis of the study are ChIP-seq peaks, but ChIP-seq peaks without the TF motif exist (sometimes a large fraction), can this be discussed? Particularly how false positive (i.e., indirect binding sites, or phantom peaks) impact the results.

We agree with this concern. Based on reviewer’s suggestion, our analyses in the new submission involving ChIP-seq peaks all focused on those with highly confident TF motifs to minimize the impacts of false positives or indirect binding sites on our conclusions. Confident motifs all had a PWM score passing false positive rate (FPR) < 0.1% and a location within 50 bp from the peak centers. We found that these two criteria were important for recovering true spacing relationships from TF binding sites based on our examination of known constrained spacing of GATA1 and TAL1 (new Figure 1-figure supplement 1).

3. Calculating spacing between motifs is a big challenge, because often a CRM contains multiple matches to the same TF. If TF1 has 3 matches, and TF2 has 4 matches, then there are a lot of distances between TF1 and TF2 motifs possible. The authors calculate only the distance between the two best scoring matches. Can this decision be justified by a thorough analysis? (e.g., do other distances between the best match of PU.1 and the 2nd best match of CEBP destroy or maintain the correlations?). In conclusion, a landscape overview of possible distances vis-a-vis genomic variation would be more informative than only using the distance between the two best matches.

We thank reviewer’s comment for inspiring us to improve our computational pipeline of characterizing spacing relationships for TF pairs. Based on our examination of GATA1 and TAL1 (new Figure 1—figure supplement 1), we found that best motifs could generally reflect true spacing relationship, but considering all the high-confidence motifs (FPR < 0.1%) could recover an even stronger signal. In addition, restricting the relative locations of motifs from the ChIP-seq peak centers could substantially reduce false positives. Therefore, in our pipeline of characterizing spacing relationships, we computed all possible pairs of confident TF motifs located within 50 bp from the peak centers. However, considering multiple matches would be trickier for genetic variation analysis. For example, one region can have an InDel altering spacing between two best matches but not altering spacing between the best match of PU.1 and the second best match of C/EBP. As the reviewer pointed out, since it can be a challenge to categorize genetic variation when considering multiple matches at the same time, we categorized genetic variation only based on the best match motifs. To examine the reproducibility of our conclusions, we did four independent comparisons between mouse strains and observed very similar effect size relationships of spacing alterations and motif mutations. These findings were further supported by analyses in human endothelial cells and experimental validations we included in the new submission.

4. Related – in Figure 1c. How is the distance between CEBPB and PU.1 motifs calculated on the singly PU.1 binding sites, do they also have CEBPB binding sites (and vice versa)? If yes, what are the differences between these regions and regions bound by CEBPB (e.g. is it because CEBPB motifs are weaker in these regions compared to regions bound by CEBPB)?

We thank Reviewer #3 for this comment. The spacing plots of singly PU.1 binding sites are not included in the revised manuscript due to the addition of many other experimental results and analyses that provide more substantial evidence for the original conclusions.

5. Also related – how much does spacing change with indels (e.g. from ~12bp, what is the final size distribution with indels)?

We included the size distribution of InDels for every dataset we analyzed in the new submission as shown in new Figure 2A, new Figure 3—figure supplement 1, new Figure 4—figure supplement 3, and new Figure 5B. Overall, the majority of naturally occurring InDels are less than 5 bp in length, and synthetic InDels generated by CRISPR/Cas9 system provides a wider range with deletions as large as >30 bp.

6. Figure Suppl 4 (Figure 2). In regards to the effect of the spacing, it seems that when the initial spacing between the motifs is around 60 bp, effects are weaker compared to when the initial size is between 20-40 and 80-100 bp; what could be the reason?

We thank Reviewer #3 for this comment. After re-doing the analysis with improved computational pipeline and aggregating the results from all the four pairwise comparisons of mouse strains, we no longer saw a weaker effect around 60 bp. This result is presented in new Figure 3-Supplement 3.

7. Figure 1d. Have sites with log reads CEBPB/PU.1 < 4 been filtered out?

Part of original Figure 1D is now shown as Figure 3B in the new submission. The regions used to plot new Figure 3B are the same ones used in new Figure 3A. These regions were not explicitly filtered by tag counts but by confident PU.1 and CEBP motifs. Most of these regions turned out to have more than 16 ChIP-seq tags, showing strong TF binding activity.

8. Line 115 – the negative correlation between PU.1 and CEBP motifs 'implicates' synergistic binding and degenerative motifs, but this is speculation (it is not known which ChIP-seq events reflect true enhancers).

Since this comment and comment #9 both refer to Figure 1D of the original manuscript, comment #8 will be responded together with comment #9 below.

9. If CEBPB ChIP-seq signal is negatively correlated with the PU.1 motif score (Figure 1d), mutation of the PU.1 motif should increase CEBPB binding (Figure 2). Based on panel D, it seems that mutation of the PU.1 motif actually decreases CEBPB binding.

The original Figure 1D is no longer included in the revised manuscript due to the addition of many other experimental results and analyses that strengthen the original conclusions. Although the p value was significant in the original Figure 1D, the Spearman correlation coefficient of -0.1 indicates almost no correlation of C/EBP binding with the PU.1 motif score. Importantly, this is not the same as an effect of a mutation in the PU.1 motif on C/EBP binding. Many independent lines of evidence, including evidence in the present study, indicate that mutation of a PU.1 motif (over a range of motif scores) results in a decrease in collaborative binding of C/EBP.

10. The quality of the SNPs and InDels are not discussed. InDels are more prone to false positives, this was not taken into account?

The InDels from mouse strains and humans in this study have all passed quality filters according to the source papers. For instance, according to the source paper for the genetic variation of inbred mouse strains (Keane et al., 2011), the false positive error per 10 kb for InDels is only 0.09, which is too low to bias our conclusions.

11. The quality of the ChIP-seq peaks is also not taken into account. Can a FRIP analysis be provided, for example plot FRIP versus the number of peaks across the samples.

All ChIP-seq peaks used in the new submission were identified significantly reproducible (IDR < 0.05) based on biological replicates from ENCODE data portal and source papers. The statistics of ChIP-seq data generated through this study are included in Figure 5-table supplement 2. All the ENCODE data meet the quality metrics described in Landt et al., 2012, including FRIP > 1%. All the mouse macrophage ChIPseq data analyzed in new Figure 3 also have FRIP greatly exceeding 1% with the minimum FRIP being 3.65%. Author response image 1 summarizes the FRIP in relationship with peak number across PU.1 and C/EBPβ samples of five mouse strains.

Author response image 1

12. PU.1 and CEBP were somehow chosen. An unbiased analysis starting from a larger motif collection would have been more interesting, to see if the properties of PU.1, CEBP, and their distance, emerges from the background of other motifs. Later on there is a JASPAR analysis – what were other high-scoring hits?

We included an unbiased systematic analysis of a large collection of TFs in the new submission (new Figure 1). We analyzed 75 transcription factors determined in K562 cells by the ENCODE consortium to obtain a global view of the spacing relationships among TFs.

13. For the twelve collaborative factors predicted, some kind of (computational) validation would be useful.

Different from our original way of identifying collaborative factors based on spacing relationships, in the new submission we applied our recently developed method MAGGIE (Shen et al., 2020) to identify a set of functional motifs. This method associates motif changes due to genetic variation and TF binding changes, and the motifs with significant associations are likely to be collaborative factors as validated in our MAGGIE paper.

14. For identifying cofactors exploiting other ChIP-seq data (Figure 3A), which ChIP-seq tracks were used and in which tissue? In general, data/code availability sections are missing.

In the revised manuscript, we transitioned to using our recently developed MAGGIE method to identify cofactors/collaborative factors. Data/code availability sections are now provided via the MAGGIE paper.

[Editors’ note: what follows is the authors’ response to the second round of review.]

Reviewer #1:

Shen and colleagues investigated the role of motif spacing in regulating transcription factor (TF) binding or co-binding, specifically in the context of naturally occurring InDels in both human individuals and mouse strains, using previously published data (Hu et al., 2019; Link, Duttke, et al., 2018; Stolze et al., 2020).

The authors classified 75 TFs in K562 cells into constrained and relaxed spacing relationships with respect to their co-binding TFs, and found most of the TF pairs fall in the relaxed category. To illustrate whether spacing alterations affect TF binding and promoter-enhancer function, the authors analyzed the previously published data in mouse macrophages and human endothelial cells. They find that relaxed TF binding is highly tolerant to naturally occurring spacing alterations, which was further supported by CRISPR/Cas9 induced InDels in mouse macrophages.

The study would benefit from including other types of data available (i.e. gene expression and 3D interactions), to better examine the effects of spacing alterations and whether they are indeed related to promoter-enhancer functions.

We thank Reviewer #1 for these comments and the constructive recommendations. We now clarify in the revised manuscript that the major objectives of these studies are to determine the effects of spacing alterations on the local binding and function of TFs. We agree that placing these studies in the context of promoter-enhancer functions would be of interest. This is a question that we previously investigated in the broader context of SNPs and InDels that alter TF binding by directly altering TF binding sites (Heinz et al., 2013, Link et al., 2018, Hoeksema et al., 2021). From this prior work, we found that establishing direct relationships between TF binding at specific enhancers and target genes is often confounded by enhancer redundancy and the presence of co-occurring genetic variants of uncertain significance. Although many examples can be identified in which a motif mutation impairs TF binding at an enhancer and results in corresponding downregulation of a PLAC-seq connected gene, there are also many examples where this does not occur. At present, the ability to predict whether a variant-induced change in TF binding results in a change in gene expression is limited. Because of this limitation, we focused these studies on local consequences of TF activity. The prior version of the manuscript utilized H3K27ac as a readout based on the principle that TFs act to recruit co-activators with histone acetyltransferase activity. As suggested, we now include additional analyses of altered spacing on nascent transcription provided by GRO-seq data (see response to Comment 5), which more directly assesses transcriptional activity. These new analyses also document that relaxed spacing requirements for DNA binding are tolerated with respect to local transcriptional activity. We think that this result is interesting and significant because it extends the concept of spatial tolerance to the entire ensemble of factors that must be assembled to mediate transcriptional initiation. We thank Reviewer #1 for recommending this additional line of investigation, which is further described in response to Comment 5.

1. The authors defined constrained and relaxed spacing relationships using public dataset including 75 TF ChIP-seq in K562 cells. K562 and HepG2 (analyzed in this study) are cancer cell lines. The karyotype of cancer cell lines, their own genetic variations and specific TF networks need to be carefully examined/ruled out before applying this to other data from healthy individuals.

We thank Reviewer #1 for raising the potential for confounding effects of karyotypic variations of K562 cells and potentially K562-specific TF ChIP-seq peaks when overlaying these peaks with genetic variations in the population. To evaluate the extent to which K562-specific peaks could affect our results, we downloaded SNVs and indels of K562 cells (Zhou et al., 2019) and overlayed them with our TF ChIP-seq peaks called from ENCODE data. We found only ~19% of co-binding peaks contain at least one K562 SNV or InDel (proportions shown in Author response image 2), and we observed similar proportions in relaxed and constrained TF pairs. Importantly, Zhou et al. found 98% SNVs and 79% InDels of K562 cells overlapped with those in the general population based on dbSNP. Therefore, our conclusion of transcription factor binding is largely based on regions with no K562 variants or variants that are common in the general population, and the K562-specific variants only accounted for a very small fraction in our analyses.

Author response image 2

2. Some TFs, especially the ones belong to the same families, share core motif sequences, and usually these TFs can co-localize to regulate gene expression. It is not clear how this case was handled in the pipeline.

We thank Reviewer #1 for raising this significant point. We agree that TFs that belong to a conserved family and recognize the same motif require special consideration. For example, JUN and JUND are both AP-1 factors and recognize very similar motifs. If JUN and JUND ChIP-seq peaks don’t overlap at all, there is no way for their motif spacing to be within the range of +/- 100 bp. Therefore, we only calculated spacing for overlapping peaks (i.e., co-binding peaks). For an overlapping peak (200-300 bp), there can be two scenarios: JUN and JUND binding sites (10bp sequence matching their motifs) overlap, or they don’t. Since we identified the binding sites for each of the two TFs from their corresponding 200-bp peaks by restricting the distance to peak centers, JUN and JUND binding sites aren’t necessarily at the same position. From this analysis, we found that a lot of JUN and JUND binding sites don’t overlap. We computed spacing for those non-overlapping binding sites and plotted the spacing relationship. This analysis indicated that the JUN and JUND with these characteristics have a significant relaxed spacing relationship. To better visualize the spacing relationships between TFs of the same family, we decreased the heights of connections within the same family in revised Figure 1C. We discussed this specific problem in the Results section (line 131 under “No Markup” view) and clarified how we handled TFs with shared core motif sequences in the Methods section under “TF binding site identification”.

3. (Figure 2A and 2B) The number of TF pairs considered/affected, the number of InDels at motifs/between motifs/in backgrounds, and the number of high-frequency/rare variants/singletons, need to be listed, which could further help illustrate the significance of any enrichment.

We included these numbers as new Figure 2-table supplement 1.

4. Line 165 – 168, "Since common variants are associated with less deleteriousness and rare variants with more deleteriousness (Lek et al., 2016), our data suggest that InDels between motifs of TFs with constrained spacing could be just as damaging as those at motifs whereas InDels between motifs of TFs with relaxed spacing might have a much weaker effect". This is speculation, and it would be better supported by at least some examples if not analysis/statistics, to show the deleterious effects. Probably this could be validated using CRISPR experiments as for relaxed ones.

Previous studies (Ng et al., 2014) have validated the deleterious effects of InDels between TF binding sites of constrained TFs, so we included citations of these studies at the end of section “Natural genetic variants altering spacing between relaxed transcription factors are associated with less deleteriousness in human populations” in the Results section.

5. The authors attempted to explore if these InDels eventually affect enhancer-promoter activity/function, but it's not clear whether enhancers and promoters were considered and whether they were considered separately in the analysis. Also, it would be great if the authors could assess whether spacing alterations investigated here in mouse macrophages affect gene expression, since RNA-seq/GRO-seq and PLAC-seq data are available from the published research. This information may help clarify the analysis, strengthen the conclusion and would be useful for readers interested in enhancer regulation.

We thank Reviewer #1 for these suggestions, as noted in our response to the general comments above. Our original analyses considered all TF binding sites at promoters and enhancers. In response to this request, we conducted additional analysis to examine the effects of InDels on promoters and enhancers separately. Most of the informative genetic variants are located at enhancers, and relatively few are within promoters. Despite that, we saw consistent relationships in promoters and enhancers, as exemplified by the C/EBPβ ChIP-seq results, which are now included as Figure 3—figure supplement 4.

Reviewer #2:

This paper attempts to address a long-standing question of how TFs collaborate to instantiate and maintain accessible and functional regulatory DNA. The authors make use of ENCODE data the investigate the extent TF spacing constraints in the genome and then integrate both human genetics data (gnomAD) and a compendium of ChIP-seq experiments performed in diverse mouse genetic backgrounds to test whether motif spacing has a significant effect on TF binding. While I appreciated the authors utilization of many datasets to test for spacing effects (large ENCODE data to identify motif pairs with spacing constraints, human genetic to look for signals of negative selection of natural variation effecting spacing and mouse TF binding data to 'test' for spacing effects), I find that computational analysis within this paper is quite shallow, leading to mostly obvious conclusion that variation within the TF-DNA interaction interface are critical for TF binding. Furthermore, While the editing experiments are a nice addition in the revision, I am not sure that they provide much in the way of validating or generalizing their claims. Finally, the authors should more thoroughly place their work in the context of previous studies.

We thank Dr. Vierstra for highlighting aspects of the manuscript requiring further clarification and suggestions for better placing these studies in the context of prior work.

1. The constrained vs. relaxed spacing analysis has a high likelihood to be confounded by latent genome architecture. Specific class of retrotransposable elements are known to 'template' regulatory DNA (see Bourque, G. et al. 2008 Genome Res. (10.1101/gr.139105.112), Kunarso et al. 2010 Nat. Genetics (10.1038/ng.600), and an analysis of co-binding/spacing constraints from ENCODE data: Wang et al., 2012 Genome Res.(10.1101/gr.139105.112)). The authors should perform an analysis that accounts for repetitive DNA that encode competent cis-regulatory DNA elements that have been templated across the genome. At a minimum these previous works should be cited.

We thank Dr. Vierstra for bringing to our attention the possibility that specific classes of retrotransposable elements could be a confounding factor. To address this concern, we performed additional analyses to account for repetitive DNA regions and cited the suggested previous works. The results have been included in new Figure 1—figure supplement 6, Figure 1figure supplement 7, and Figure 1-table supplement 3. Codes newly generated for this analysis have been uploaded to our Github repository. Specifically, since DNA repetitive regions such as transposable elements are known to harbor TF binding sites and specific TF co-binding (Bourque et al., 2008; Kunarso et al., 2010), we further examined whether the spacing relationships of TFs could be different in repetitive and nonrepetitive regions. To study this, we applied the same pipeline to the subsets of TF ChIP-seq peaks in repetitive and nonrepetitive regions. As a result, most of the relaxed spacing relationships remained regardless of repetitive or nonrepetitive regions (Figure 1—figure supplement 6). Some constrained TF pairs, however, showed constrained spacing only in repetitive regions and not in non-repetitive regions (Figure 1-table supplement 3). For example, EGR1 and JUND exhibited a constrained spacing at 29 bp (Figure 1D), but this relationship is observed specifically in SINEs (Figure 1—figure supplement 7). Such observation is consistent with previous studies that discovered specific motif pairs in repetitive regions (J. Wang et al., 2012).

2. The authors should comment on how rigid DNA-encoded TF spacing is not supported by evolutionary studies, which have shown an excess of TF motif turnover within regulatory DNA (see work from Duncan Odom's group, Vierstra el., 2014 Science for a direct mouse-human comparison).

We appreciate Dr. Vierstra’s insights into the relationship between TF spacing and evolutionary conservation of TF binding sites. We have included discussion about this in the Discussion section. Specifically, the lack of requirement for exact spacing and remarkable tolerance of spacing alterations by TFs with relaxed spacing could potentially associate with the high turnover of TF binding sites found by previous studies (Vierstra et al., 2014), although further investigation would be needed to establish this association.

3. While Leveraging CRISR/Cas9 editing to generate a broad spectrum of 'spacing' alleles is a clever approach to tackle the experimentally test the effect of motif spacing on TF (co-)binding, the experiment is very underpowered to test the generalizability of the authors claims. Did the authors select single cell clones from the editing experiment or just look at bulk edited populations? If the former, it is unclear how any conclusions can be made from a mixture of edited cells that have a spectrum of indels (also likely carrying two different alleles).

We thank Dr. Vierstra for pointing out that the description of the analysis method of InDels was not clear. To perform these experiments, we prepared bulk populations of cells following CRISPR mutagenesis. Importantly, the ChIP-seq libraries were prepared by selective amplification of ChIP tags containing the targeted region of interest. Thus, for each region-specific sequence tag that was immunoprecipitated, we could simultaneously determine whether an InDel had been created and its specific length. Each tag is thus cell- and allele-specific. Since there was a mixture of edited cells with a spectrum of InDels, we first estimated the distribution of InDels based on the tags of input DNAs and then compared it with TF ChIPseq tags to associate TF binding with different sizes of InDels, in which the effect of an InDel is reported as the odds ratio of ChIP tags to the input tags. By focusing on specific loci, extremely deep tag counts were achieved (703686 to 16189879 tags/locus), allowing strong statistical conclusions. We revised the methods diagram in Figure 5A and further clarified our approach in the revised manuscript.

Reviewer #3:

Transcription factors (TFs) bind to the DNA in a sequence-specific manner at TF binding sites (TFBSs) to control gene transcription. Hence, characterizing how TFs interact with DNA is key to uncover how gene regulation occurs and how this process can be disrupted in diseases. While the binding properties of a large portion of human TFs are well characterized, a remaining challenge lies in our knowledge of how TFs interact cooperatively at regulatory elements, either forming dimers or co-binding the same regions. In this manuscript, Shen et al. explored spacing patterns between TFBSs. Relying on ChIP-seq data, they developed a new methodology to predict TF pairs harbouring constrained or relaxed spacing patterns between their TFBSs. The authors made their code available, which allows reproducibility and exploration; this should be a requirement in the field but is not always complied with so we thank the authors for this. When applied to a limited set of TFs with ChIP-seq data in K562 cells, the authors predicted that TF pairs primarily bind to DNA with relaxed spacing between their TFBSs. Nevertheless, they were able to highlight already known as well as novel specific pairs of TF harboring constrained spacing. Next, the authors leveraged naturally occurring small insertions and deletions in the human population and mouse strains to confirm that altering spacing between TFBSs of TF pairs with relaxed spacing patterns has limited effect. This observation was further supported by synthetic spacing alterations induced by CRISPR-Cas9 experiments. The study is overall well designed and addresses an important challenge in our understanding of TF-DNA interactions and TF cooperation.

Nevertheless, we believe that there are some methodological limitations that favor the identification of relaxed spacing patterns, which should be better outlined in the manuscript to allow the reader to fully comprehend the results. From the title and first sections of the manuscript the readers are given the impression that relaxed and constrained spacing instances are about to be described and analysed with an equal importance. However, more focus is given to the relaxed spacing with both the mouse and CRISPR analyses exclusively dedicated to this with no clear explanation why. It would be useful to the readers to have this explicitly outlined by the authors. Finally, the terminology associated with TF-DNA interactions is very often incorrect, which confuses the readers and should be addressed. Please see below for our detailed comments.

We thank Reviewer #3 for the positive comments and constructive suggestions. In response to the comment on the relative consideration of relaxed vs constrained spacing instances, the emphasis on relaxed instances was motivated by extensive prior work on constrained spacing instances. We modified the text so that readers are not given the impression that relaxed and constrained instances will be considered equally.

1. The terminology associated with TF binding events is inappropriate. The authors use "ChIP-seq peaks", "TFBSs", and "motifs" almost interchangeably, which is not correct. The inconsistency in the terminology makes it difficult to fully comprehend what the authors meant/did.

An example is one of the first sentences in the Introduction: "TFs bind to short, degenerate sequences at promoters and enhancers, often referred to as TF binding motifs." The sequences bound by TFs in promoters and enhancers are TFBSs while TF binding motifs are computational representations of TFBS sets, which can be represented in many ways such as consensus motifs, PFMs, etc.

The next sentence claims that "TFs bind in an inter-dependent manner to closely spaced motifs." Motifs cannot be closely spaced but TFBSs are. Another example is the subsection "Motif identification" in the Methods section while the authors describe the prediction of TFBSs (using motifs).

We appreciate reviewer’s insights into our usage of these terminologies. We have made sure in our revision that these terminologies are used in a consistent way. Specifically, “motifs” were used to describe an aggregation of DNA sequences recognized by TFs, which can be represented as PWMs. “TF binding sites” were used for individual sequences matching with motifs, and “ChIP-seq peaks” for enriched regions in ChIP-seq data originally identified using HOMER.

2. More details should be provided to the Methods section. We acknowledge that the authors provide their code for inspection but outlining all methodological details in the manuscript would help in the clear understanding of the methods used. For instance:

a. The authors do not describe how they selected de novo motifs using HOMER (only best motif?, any p-value threshold?, any specific background used?)

We have provided more details under “TF binding site identification” section in Methods.

b. For the TFBS predictions, the authors used a FPR threshold of 0.1%, but which was the specific tool that they used for that? FPR computation depends on background expectation, what was used (e.g. 25% A, C, G, or T or nucleotide composition of the genome)?

We have added the tool and background expectation for FPR computation under “TF binding site identification” section in Methods.

c. P. 23, lines 466-470. The authors described that they conducted a permutation test but then described that the null distribution was obtained using random spacing values between 0 and 100. If the null distribution is obtained by randomly selecting values between 0 and 100, it does not correspond to a permutation. A permutation test would imply permuting the observed spacing values.

We have added the tool and background expectation for FPR computation under “TF binding site identification” section in Methods.

d. P. 25, lines 513-516. It is not clear to us why the authors considered subsets of mutations when overlapping TFBSs predicted for the ChIP'ed TFs (only if 2-bit difference in motif score) but not for mutations overlapping TFBSs predicted by MAGGIE (all mutations). Why not consider all mutations in all cases?

We did require motif score difference to be at least 2-bit to be considered as a motif mutation for both ChIP’ed TFs and MAGGIE-predicted TFs during our analyses. To clarify our procedure, we included more details in Methods under the “Categorization of genetic variation based on impacts on motif or spacing” section. The threshold of 2-bit was set to reduce false positives of motif mutations, which could have a small sequence alteration but likely do not affect TF binding.

3. The methodology used has several limitations that are not described by the authors. We encourage the authors to clearly outline them to the readers. Furthermore, we believe that these limitations favor the identification of relaxed spacing, which should be acknowledged, especially since the majority of the work focuses on alterations of spacing for TF pairs with relaxed spacing patterns.

a. It is well documented that TF binding preferences for TFs binding in close proximity (e.g. as dimers) can be altered. For instance, Jolma et al. (https://www.nature.com/articles/nature15518) used CAP-SELEX to reveal that "Most TF pair sites identified involved a large overlap between individual TF recognition motifs, and resulted in recognition of composite sites that were markedly different from the individual TF's motifs." As the authors relied on TF binding profiles (or motifs) corresponding to the binding preferences of TFs binding as monomers, it is possible that they will miss cooperative binding inducing a change in binding preferences. Furthermore, the authors did not consider overlapping motifs, which is again precluding the identification of constrained spacing.

Our work is distinguished from Jolma et al. by focusing on TFs with independent motifs at nonoverlapping positions, and because of that, we potentially missed TFs that should be regarded as a complex and recognize composite motifs all the time, which took up the majority of the 315 TF-TF interactions discovered by Jolma et al. We discussed this as a limitation in our original submission and have expanded the original discussion to clarify the difference of our work from Jolma et al. in the Discussion section.

b. The authors rely on ChIP-seq data to identify TF cooperation. While this is fine overall, this data does not allow the authors to know whether two identified TFs bind to their TFBSs on the same molecules. Indeed, ChIP-seq being a bulk experiment, it does not allow to discriminate between true co-occupancy on the same molecule or not. This should be discussed to put the work in a larger context to the readers. See https://www.cell.com/molecular-cell/pdf/S1097-2765(20)30793-0.pdf for a reference.

We agree with this comment and have included a discussion referencing Sönmezer et al. in the Discussion section.

c. The de novo motif enrichment does not ensure that the motif found is actually the one bound by the ChIP'ed TF. Indeed, the motifs found for ARID1B and ARID2 correspond to GATA motifs while ARID TFs bind to more general A+T rich motifs. It is unclear whether the signal observed for these TFs is due to their direct interaction with DNA.

The de novo motifs used for identifying TF binding sites were manually selected from the top three significant HOMER motifs that looks like motifs of other TFs within the same TF family available in JASPAR. We added these details of our motif selection under “TF binding site identification” section in Methods. Meanwhile, we appreciate reviewer for pointing out that the motifs we used for ARID1B and ARID2, despite being A+T rich, are very similar to GATA motifs, so we excluded those two TFs in our revision to reduce confusion.

4. It would have been nice to perform similar experiments as the mouse and CRISPR ones but considering TF pairs with fixed spacing of their TFBSs. If the authors decide not to, they should at least clearly state to the readers that they more specifically focused on relaxed spacing and why.

As noted in our general response, we focused on instances of relaxed spacing for CRISPR mutagenesis because instances of fixed spacing have been investigated previously. We added a clear statement saying our focus on relaxed spacing for the mutagenesis analyses of InDels at the end of section “Natural genetic variants altering spacing between relaxed transcription factors are associated with less deleteriousness in human populations” in the Results section. As a summary of our reasons for focusing on relaxed spacing, our work together with several other studies found that relaxed spacing was more commonly seen among collaborative TFs, but very few studies have discussed the effects of altered spacing on relaxed TFs. We note that we extended the analysis presented in the previous submission for CEBPB by also carrying out ChIP-seq for SPI1 or PU.1 in the CRISPR-edited cells. These studies fully reproduced the patterns observed for CEBPB and are provided in Figure 5—figure supplement 2.

5. It would have been interesting to provide information about the prevalence of cobinding events in the different ChIP-seq datasets. For instance, what is the percentage of ChIP-seq peaks that contain a predicted TFBS for each TF? What is the percentage of such peaks that contain cobinding events? Is there a difference in numbers b/w constrained or relaxed spacing. Providing this information would help the readers to put these observations into a more general context of TF binding regions.

The numbers and percentages of ChIP-seq peaks containing predicted TF binding sites have been included in the new Figure 1-table supplement 2. The number of peaks with co-binding events were originally shown in Figure 1B, and we made a new Figure 1B to show percentages in parallel. There were only marginal differences between constrained and relaxed spacing according to a two-sample t-test (p=0.03), so we decided not to include the results in the revision, but reviewer can find the distributions below.

6. It would be nice to have similar plots as in Figure 3B for pairs of TFs with constrained spacing to show how it contrasts.

As reviewer suggested, we have included the plots as new Figure 1—figure supplement 5 and discussed these new results in the section “Transcription factors primarily co-bind with relaxed spacing” of the Results section.

7. In all analyses, it seems that more deletions than insertions were observed (see Figure 2A for instance). It would be interesting to see if the results recapitulate when considering insertions and deletions independently.

We performed separate analyses for insertions and deletions as reviewer suggested. Results were consistent between insertions and deletions and were included as a new Figure 2—figure supplement 2 in our revision.

8. It is unclear to us what is the analysis of MAGGIE-predicted TFBSs adding to the story. Moreover, the authors claim that MAGGIE identifies "functional" TFBSs but the authors do not provide any specific evidence of function (and which function?) in our opinion.

We thank Reviewer 3 for raising this point. MAGGIE is used to identify motifs of transcription factors that serve as collaborative binding partners for the reference transcription factor being analyzed by ChIP. For example, if mutations in the recognition motif for TF-A are significantly associated with reduced nearby binding of TF-B as determined by ChIP, we conclude that TF-A is a collaborative binding partner of TF-B. This relationship then serves as the basis for analysis of spacing relationships between TF-A and TF-B in the absence of direct binding data for TF-A. In this sense, MAGGIE identifies ‘functional’ motifs because they are associated with loss of binding of the reference TF. We clarify this in the revised manuscript.

9. It seems that there is a periodic pattern in Figure 5E with log-odds ratio periodically equal to 0 (at least with indel sizes < -45). Could this correspond to the minor groove width periodicity of ~10bp (see for instance https://www.cell.com/cell/pdf/S0092-8674(18)31312-6.pdf for periodicity of mutations)? Could the authors comment on that?

Those large deletions <-45 bp disrupt motifs directly, so it would be hard to tell if the “periodic pattern” in Figure 5E reflects DNA periodicity. If there is any periodicity, it should appear in those deletions altering spacing as well. Therefore, we think that further investigation would be needed to understand those patterns.

10. In several figures, the authors should provide all points instead of summarizing with boxplots, which are frowned upon as they hide data distribution.

As reviewer suggested, all data points have been added to the boxplots in Figure 3C,E,G, Figure 4C, and Figure 5C.

11. P. 9 line 178: the authors mentioned 50 millions SNPs but we do not find where this data is used as observing SNPs would not alter spacing.

SNPs were used when we looked for motif mutations. We clarified accordingly in the Methods section under “Categorization of genetic variation based on impacts on motif or spacing”.

12. P. 2 line 46: "implicating their effect…" we would rather write "suggesting their effect…".

We changed the texts as reviewer suggested.

13. The authors seem to have ignored homodimers. Maybe their methodology could be extended to consider spacing b/w TFBSs for the same TF.

Many motifs in JASPAR database are already based on homodimers or heterodimers (e.g., p65, C/EBP, AP-1). We therefore considered all such factors as recognizing a single motif. If we interpret the comment regarding spacing between TF binding sites for the same TF, we agree that this would be of interest. However, such sites are difficult to identify as objectively as sites where two different TFs bind, because the ChIP-peaks for the same TF would overlap if collaborative binding interactions occurred within the spacing ranges.

14. TF naming is sometimes inconsistent in the text and figures. For example, SPI1 and PU.1 are both used in Figure 3. Similarly, we recommend revisiting the text for p65 and RELA, as well as C/EBPβ and CEBPB.

To maintain consistency as reviewer suggested, we changed SPI1 to PU.1, CEBPB to C/EBPβ, and RELA to p65 when referring to transcription factors in Figure 3 and supplementary figures.

15. It is not clear where the authors retrieved HepG2 cell line data from (used in the Figure 1 —figure supplement 4). Was this data processed in the same way as K562 cell line data?

We added the source and processing description of HepG2 data in the Results section.

16. P. 8 line 152: we suggest replacing "we overlaid" with "we mapped".

We changed the texts as reviewer suggested.

17. P. 1 line 30: "ChIP-sequencing" should be replaced with "ChIP-seq" for consistency.

We changed the texts as reviewer suggested.

18. Figure 3D,F,H : instead of motif score this should be TFBS score.

We changed “motif score” to “PWM score” in our texts to align with the most commonly used terminology in literature.

https://doi.org/10.7554/eLife.70878.sa2

Article and author information

Author details

  1. Zeyang Shen

    1. Department of Cellular and Molecular Medicine, School of Medicine, University of California San Diego, La Jolla, United States
    2. Department of Bioengineering, Jacobs School of Engineering, University of California San Diego, La Jolla, United States
    Contribution
    Conceptualization, Data curation, Formal analysis, Software, Visualization, Writing – original draft, Writing – review and editing
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0003-3870-1390
  2. Rick Z Li

    Department of Cellular and Molecular Medicine, School of Medicine, University of California San Diego, La Jolla, United States
    Contribution
    Data curation, Formal analysis, Software, Writing – original draft, Writing – review and editing
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0001-8518-5793
  3. Thomas A Prohaska

    Department of Medicine, School of Medicine, University of California San Diego, La Jolla, United States
    Contribution
    Investigation, Writing – original draft, Writing – review and editing
    Competing interests
    No competing interests declared
  4. Marten A Hoeksema

    1. Department of Cellular and Molecular Medicine, School of Medicine, University of California San Diego, La Jolla, United States
    2. Department of Medical Biochemistry, Experimental Vascular Biology, Amsterdam Infection and Immunity, Amsterdam Cardiovascular Sciences, Amsterdam UMC, University of Amsterdam, Amsterdam, Netherlands
    Present address
    Department of Medical Biochemistry, Experimental Vascular Biology, Amsterdam Infection and Immunity, Amsterdam Cardiovascular Sciences, Amsterdam UMC, University of Amsterdam, Amsterdam, Netherlands
    Contribution
    Data curation, Investigation, Writing – original draft, Writing – review and editing
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0001-5981-121X
  5. Nathan J Spann

    Department of Cellular and Molecular Medicine, School of Medicine, University of California San Diego, La Jolla, United States
    Contribution
    Data curation, Investigation
    Competing interests
    No competing interests declared
  6. Jenhan Tao

    Department of Cellular and Molecular Medicine, School of Medicine, University of California San Diego, La Jolla, United States
    Contribution
    Conceptualization, Writing – original draft, Writing – review and editing
    Competing interests
    No competing interests declared
  7. Gregory J Fonseca

    1. Department of Cellular and Molecular Medicine, School of Medicine, University of California San Diego, La Jolla, United States
    2. Department of Medicine, McGill University, Montreal, Canada
    Present address
    Department of Medicine, McGill University, Montreal, Canada
    Contribution
    Conceptualization, Writing – original draft, Writing – review and editing
    Competing interests
    No competing interests declared
  8. Thomas Le

    Division of Biological Sciences, University of California San Diego, La Jolla, United States
    Contribution
    Formal analysis
    Competing interests
    No competing interests declared
  9. Lindsey K Stolze

    Department of Cellular and Molecular Medicine, College of Medicine, University of Arizona, Tucson, United States
    Contribution
    Data curation, Resources
    Competing interests
    No competing interests declared
  10. Mashito Sakai

    1. Department of Cellular and Molecular Medicine, School of Medicine, University of California San Diego, La Jolla, United States
    2. Department of Biochemistry and Molecular Biology, Nippon Medical School, Tokyo, Japan
    Present address
    Department of Biochemistry and Molecular Biology, Nippon Medical School, Tokyo, Japan
    Contribution
    Investigation, Resources, Supervision
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-4908-2727
  11. Casey E Romanoski

    Department of Cellular and Molecular Medicine, College of Medicine, University of Arizona, Tucson, United States
    Contribution
    Resources, Supervision
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-0149-225X
  12. Christopher K Glass

    1. Department of Cellular and Molecular Medicine, School of Medicine, University of California San Diego, La Jolla, United States
    2. Department of Medicine, School of Medicine, University of California San Diego, La Jolla, United States
    Contribution
    Conceptualization, Funding acquisition, Project administration, Supervision, Writing – original draft, Writing – review and editing
    For correspondence
    ckg@ucsd.edu
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0003-4344-3592

Funding

National Institutes of Health (DK091183)

  • Christopher K Glass

National Institutes of Health (HL147835)

  • Christopher K Glass

Leducq Transatlantic Network (16CVD01)

  • Christopher K Glass

National Institutes of Health (T32DK007044)

  • Thomas A Prohaska

American Heart Association (postdoctoral grant)

  • Marten A Hoeksema

Netherlands Organization for Scientific Research (Rubicon grant)

  • Marten A Hoeksema

Amsterdam Cardiovascular Sciences Institute (postdoctoral grant)

  • Marten A Hoeksema

National Institutes of Health (HL123485)

  • Casey E Romanoski

National Institutes of Health (HL147187)

  • Casey E Romanoski

National Institutes of Health (T32HL007249-42)

  • Lindsey K Stolze

American Heart Association (20PRE35200195)

  • Lindsey K Stolze

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

The authors would like to thank J Collier for technical assistance, the IGM core for library sequencing, L Van Ael for assistance with manuscript preparation. These studies were supported by NIH grants DK091183 and HL147835 and a Leducq Transatlantic Network grant 16CVD01 to CKG, NIH grants HL123485 and HL147187 to CER. TAP was supported by NIH grant T32DK007044. MAH was supported by a Rubicon grant from the Netherlands Organization for Scientific Research and postdoctoral grants from the Amsterdam Cardiovascular Sciences Institute and the American Heart Association. LKS was supported by NIH grant T32HL007249-42 and American Heart Association grant 20PRE35200195.

Ethics

Bone marrow cells were isolated from femurs and tibias of Cas9-expressing transgenic mice (Jackson Laboratory, No.028555) housed at the University of California San Diego animal facility on a 12-hour/12-hour light/dark cycle with free access to normal chow food and water. All of the mice were handled according to approved institutional animal care and use committee (IACUC) protocols (S01015) of the University of California San Diego to minimize pain and suffering.

Senior and Reviewing Editor

  1. Jessica K Tyler, Weill Cornell Medicine, United States

Reviewer

  1. Jeff Vierstra

Publication history

  1. Preprint posted: April 3, 2020 (view preprint)
  2. Received: June 1, 2021
  3. Accepted: January 12, 2022
  4. Accepted Manuscript published: January 20, 2022 (version 1)
  5. Version of Record published: February 2, 2022 (version 2)

Copyright

© 2022, Shen et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

Metrics

  • 856
    Page views
  • 119
    Downloads
  • 0
    Citations

Article citation count generated by polling the highest count across the following sources: Crossref, PubMed Central, Scopus.

Download links

A two-part list of links to download the article, or parts of the article, in various formats.

Downloads (link to download the article as PDF)

Open citations (links to open the citations from this article in various online reference manager services)

Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)

  1. Zeyang Shen
  2. Rick Z Li
  3. Thomas A Prohaska
  4. Marten A Hoeksema
  5. Nathan J Spann
  6. Jenhan Tao
  7. Gregory J Fonseca
  8. Thomas Le
  9. Lindsey K Stolze
  10. Mashito Sakai
  11. Casey E Romanoski
  12. Christopher K Glass
(2022)
Systematic analysis of naturally occurring insertions and deletions that alter transcription factor spacing identifies tolerant and sensitive transcription factor pairs
eLife 11:e70878.
https://doi.org/10.7554/eLife.70878

Further reading

    1. Chromosomes and Gene Expression
    2. Microbiology and Infectious Disease
    Bretta Hixson et al.
    Tools and Resources Updated

    Mosquitoes transmit numerous pathogens, but large gaps remain in our understanding of their physiology. To facilitate explorations of mosquito biology, we have created Aegypti-Atlas (http://aegyptiatlas.buchonlab.com/), an online resource hosting RNAseq profiles of Ae. aegypti body parts (head, thorax, abdomen, gut, Malpighian tubules, ovaries), gut regions (crop, proventriculus, anterior and posterior midgut, hindgut), and a gut time course of blood meal digestion. Using Aegypti-Atlas, we provide insights into regionalization of gut function, blood feeding response, and immune defenses. We find that the anterior and posterior midgut possess digestive specializations which are preserved in the blood-fed state. Blood feeding initiates the sequential induction and repression/depletion of multiple cohorts of peptidases. With respect to defense, immune signaling components, but not recognition or effector molecules, show enrichment in ovaries. Basal expression of antimicrobial peptides is dominated by holotricin and gambicin, which are expressed in carcass and digestive tissues, respectively, in a mutually exclusive manner. In the midgut, gambicin and other effectors are almost exclusively expressed in the anterior regions, while the posterior midgut exhibits hallmarks of immune tolerance. Finally, in a cross-species comparison between Ae. aegypti and Anopheles gambiae midguts, we observe that regional digestive and immune specializations are conserved, indicating that our dataset may be broadly relevant to multiple mosquito species. We demonstrate that the expression of orthologous genes is highly correlated, with the exception of a ‘species signature’ comprising a few highly/disparately expressed genes. With this work, we show the potential of Aegypti-Atlas to unlock a more complete understanding of mosquito biology.

    1. Chromosomes and Gene Expression
    2. Genetics and Genomics
    Bethany Sump et al.
    Research Article

    For some inducible genes, the rate and molecular mechanism of transcriptional activation depends on the prior experiences of the cell. This phenomenon, called epigenetic transcriptional memory, accelerates reactivation and requires both changes in chromatin structure and recruitment of poised RNA Polymerase II (RNAPII) to the promoter. Memory of inositol starvation in budding yeast involves a positive feedback loop between transcription factor-dependent interaction with the nuclear pore complex and histone H3 lysine 4 dimethylation (H3K4me2). While H3K4me2 is essential for recruitment of RNAPII and faster reactivation, RNAPII is not required for H3K4me2. Unlike RNAPII-dependent H3K4me2 associated with transcription, RNAPII-independent H3K4me2 requires Nup100, SET3C, the Leo1 subunit of the Paf1 complex and, upon degradation of an essential transcription factor, is inherited through multiple cell cycles. The writer of this mark (COMPASS) physically interacts with the potential reader (SET3C), suggesting a molecular mechanism for the spreading and re-incorporation of H3K4me2 following DNA replication.