Native functions of short tandem repeats

  1. Shannon E Wright
  2. Peter K Todd (corresponding author)
  1. Department of Neurology, University of Michigan–Ann Arbor, United States
  2. Neuroscience Graduate Program, University of Michigan–Ann Arbor, United States
  3. Department of Neuroscience, Picower Institute, United States
  4. VA Ann Arbor Healthcare System, United States

Abstract

Over a third of the human genome is comprised of repetitive sequences, including more than a million short tandem repeats (STRs). While studies of the pathologic consequences of repeat expansions that cause syndromic human diseases are extensive, the potential native functions of STRs are often ignored. Here, we summarize a growing body of research into the normal biological functions for repetitive elements across the genome, with a particular focus on the roles of STRs in regulating gene expression. We propose reconceptualizing the pathogenic consequences of repeat expansions as aberrancies in normal gene regulation. From this altered viewpoint, we predict that future work will reveal broader roles for STRs in neuronal function and as risk alleles for more common human neurological diseases.

Introduction

At least a third of the human genome is comprised of repetitive sequences (de Koning et al., 2011; Gemmell, 2021; Britten and Kohne, 1968). Some of the first genomic repetitive elements were discovered in association with disease. As a result, pathogenic roles of repeats were well studied, while potential native functions of these repeats were largely dismissed. However, the conservation of genomic repeats among different eukaryotic species (Eichler et al., 1995; Sulovari et al., 2019; Liquori et al., 2003) and their high polymorphism rates compared to other types of genetic variations (Willems et al., 2016) suggest that repeats may have important biological functions in addition to the pathogenic ones. A growing body of research has revealed complex biological and evolutionary functions for repeats across the genome. Here, we summarize the important functions of one type of genomic repeat, short (2–6 base pair) tandem repeats (STRs), in DNA, RNA, and as proteins. We then reframe STR toxicity observed in repeat expansion disorders (REDs) as an aberrancy of native STR functions, rather than as solely an emergent property disconnected from the native repeat. Finally, we discuss how this alternative view of STR toxicity can improve our understanding of roles of STRs in neuronal function and human health.

A brief history of repetitive DNA

Repetitive elements in DNA were first discovered by Barbara McClintock, who observed the presence of ‘controlling elements’ randomly dispersed throughout the maize (Zea mays) genome (Comfort, 2001; Ravindran, 2012; McClintock, 1950). These interspersed repeats, which would come to be known as transposable elements (TEs), use flanking repetitive sequences to ‘jump’ around to different locations in the genome, often resulting in duplications of genetic material.

In contrast to interspersed repeats, tandem repeats (TRs) are regions in which repeating units lie in parallel (or in tandem) and are classified by size of the repeating unit as satellites (>60 base pairs), minisatellites (10–60 base pairs), or microsatellites (<10 base pairs). Short (2–6 base pair) tandem repeats (STRs) comprise between 1% and 3% of the human genome (Gymrek et al., 2016; Wyner et al., 2020; Lander et al., 2001). In the early 1990s, a series of STR expansions were causally linked with human diseases, including spinobulbar muscular atrophy (La Spada et al., 1992), Fragile X Syndrome (Fu et al., 1991; Heitz et al., 1991; Oberlé et al., 1991; Verkerk et al., 1991; Yu et al., 1991), Huntington’s disease (MacDonald et al., 1993), and myotonic dystrophy (Brook et al., 1992; Buxton et al., 1992; Harley et al., 1992; Aslanidis et al., 1992; Fu et al., 1992; Mahadevan et al., 1992). As such, much of the research on STRs to date has centered on the mechanisms by which repeat expansions trigger neuronal toxicity. We will use the Fragile X locus as an exemplar of this now extensive body of literature, which is reviewed in more detail elsewhere (Malik et al., 2021a; Hagerman et al., 2017; Hagerman and Hagerman, 2015; Jacquemont et al., 2003; Glineburg et al., 2018), as it helps us understand how STRs might function normally in the absence of expansion.
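To make the definition of an STR concrete, the following is a minimal sketch of how perfect tandem repeats with 2–6 base pair units could be located in a DNA string. The function name `find_strs`, the three-copy minimum, and the example sequence are illustrative assumptions and are not taken from the studies cited above; genome-scale STR callers also have to handle interrupted repeats and sequencing error, which this sketch ignores.

```python
def find_strs(seq, min_unit=2, max_unit=6, min_copies=3):
    """Return (start, unit, copies) for perfect tandem repeats in seq.

    Hypothetical helper: scans left to right and, at each position,
    reports the smallest repeating unit (2-6 bp by default) that occurs
    at least `min_copies` times back to back.
    """
    seq = seq.upper()
    hits = []
    pos = 0
    while pos < len(seq):
        found = None
        for unit_len in range(min_unit, max_unit + 1):
            unit = seq[pos:pos + unit_len]
            # Skip truncated units at the sequence end and homopolymer runs.
            if len(unit) < unit_len or len(set(unit)) == 1:
                continue
            copies = 1
            while seq.startswith(unit, pos + copies * unit_len):
                copies += 1
            if copies >= min_copies:
                found = (pos, unit, copies)
                break  # prefer the smallest unit that explains the tract
        if found:
            hits.append(found)
            pos += len(found[1]) * found[2]  # jump past the whole tract
        else:
            pos += 1
    return hits


# Toy sequence (an assumption) with a CGG tract followed by a CAG tract.
example = "TTACGGCGGCGGCGGCGGATCAGCAGCAGCAGTT"
for start, unit, copies in find_strs(example):
    print(f"{unit} x{copies} at position {start}")
```

Reporting the smallest unit first means that a tract such as CGCGCGCG is described as a CG dinucleotide repeat rather than as two copies of CGCG.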

Fragile X-associated disorders: the discovery of pathogenic short tandem repeats

Fragile X Syndrome (FXS), the most common monogenic form of intellectual disability, was one of the first genetic diseases linked to an STR expansion (Fu et al., 1991; Heitz et al., 1991; Oberlé et al., 1991; Verkerk et al., 1991; Yu et al., 1991). In 1943, Julia Bell and James Purdon Martin described an X-linked intellectual disability primarily affecting people assigned male at birth, that could be inherited from a carrier female parent or affected male parent (Martin and Bell, 1943). Karyotypes of affected individuals show a folate-sensitive fragile site on the X chromosome, which causes the chromosome to bend or break at one arm (Lubs, 1969; Proops and Webb, 1981; Chudley and Hagerman, 1987; Hagerman et al., 1986). The fragile site associated with FXS is located at the Fragile X messenger ribonucleoprotein 1 (FMR1) gene, which contains a large CGG repeat in the 5’ UTR of affected individuals (>200 repeats) (Fu et al., 1991; Heitz et al., 1991; Oberlé et al., 1991; Verkerk et al., 1991; Yu et al., 1991). In addition to intellectual disability, FXS patients commonly present with hyperactivity, anxiety, and seizures (Hagerman et al., 2017; Chudley and Hagerman, 1987; Hagerman and Hagerman, 2002a). Other chromosomal fragile sites also contain STRs, some of which are linked to other diseases (Glover, 2006; Debacker and Kooy, 2007; Schwartz et al., 2006). For example, Fragile XE syndrome (FRAXE), caused by a CGG repeat expansion in the FMR2 gene (Knight et al., 1993; Gu et al., 1996; Gecz et al., 1996), manifests in an X-linked intellectual disability similar to FXS (Mulley et al., 1995; Gecz, 2000).

While studying pedigrees of Fragile X families, Stephanie Sherman and colleagues observed incomplete penetrance of mental impairment, affecting only 79% of males and 35% of females (Sherman et al., 1984; Sherman et al., 1985). This ‘Sherman paradox’ suggested a generational risk factor in Fragile X mental impairment, as unaffected ‘normal transmitting’ males (NTMs) passed on a mutant allele to unaffected female children, with disease manifestation in affected (predominantly) male grandchildren. Subsequent studies of CGG repeat length variation found that individuals from non-Fragile X families have 6–54 CGG repeats, while some unaffected individuals in Fragile X families have 55–200 repeats, a ‘premutation’ associated with increased risk of further repeat expansion during oogenesis (Fu et al., 1991).
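The repeat-length ranges described so far can be summarized in a short sketch. The thresholds are the ones quoted in this review (up to 54 repeats in non-Fragile X families, 55–200 for the premutation, and more than 200 in individuals with FXS); the function name and category labels are illustrative assumptions, and this is not a clinical classification scheme.

```python
def classify_fmr1_cgg(repeats: int) -> str:
    """Map an FMR1 CGG repeat count to the ranges quoted in the text.

    Hypothetical helper for illustration only; the labels are assumptions.
    """
    if repeats < 0:
        raise ValueError("repeat count cannot be negative")
    if repeats > 200:
        return "full mutation"   # >200 repeats, associated with FXS
    if repeats >= 55:
        return "premutation"     # 55-200 repeats
    return "typical"             # at or below 54 repeats


for n in (30, 54, 55, 200, 201):
    print(n, classify_fmr1_cgg(n))
```

The loop deliberately probes the boundary values, since those are the cases where an off-by-one error in the thresholds would show up.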

Subsequent work with Fragile X families revealed that Fragile X premutation expansion carriers often manifest clinically distinct disorders that are caused by the CGG repeat. Fragile X-associated tremor/ataxia syndrome (FXTAS) is an age-linked neurodegenerative disorder characterized by progressive intention tremor and ataxia, parkinsonism, and cognitive decline (Hagerman and Hagerman, 2015; Jacquemont et al., 2003; Hagerman and Hagerman, 2004; Hagerman et al., 2001; Hagerman and Hagerman, 2002b; Leehey et al., 2003; Brunberg et al., 2002). As an X-linked disorder, FXTAS primarily affects people assigned male at birth. People with two X chromosomes may develop FXTAS, but are also at risk for developing Fragile X-associated premature ovarian insufficiency (FXPOI), a disorder characterized by absent or irregular menstrual cycles, early onset of menopause, and fertility issues (Hagerman and Hagerman, 2002a; Allingham-Hawkins et al., 1999; Murray et al., 2000a; Murray et al., 2000b). As ‘premutation disorders’, FXTAS and FXPOI are thought to share similar molecular mechanisms by which the premutation CGG repeat expansion causes cytotoxicity and dysfunction.

More than 50 REDs discovered to date show common mechanisms of molecular pathology (Malik et al., 2021a; Glineburg et al., 2018; Paulson, 2018; Rodriguez and Todd, 2019). FXS, FXTAS, and FXPOI, collectively referred to as Fragile X-associated disorders, are revisited throughout this review to exemplify the mechanisms by which STRs can cause cellular dysfunction and toxicity. However, the often-stereotyped manifestations of REDs, in addition to the abundance of repetitive elements throughout the genome, suggest that STRs could have native functions which become aberrant in the setting of repeat expansions. We will focus most of the rest of the review on this supposition.

Native STR functions

While overshadowed by disease-centered research, scientists have investigated functional consequences of repeat polymorphisms for decades. Studies of individual or small groups of genes showed phenotypic consequences of repeat length variation on flocculation and cell adhesion in yeast (Voynov et al., 2006; Levdansky et al., 2007; Verstrepen et al., 2005), on limb and skull morphology in dogs (Fondon and Garner, 2004), and on behavioral traits in voles (Hammock and Young, 2005). Recent advances in sequencing technology and STR-conscious alignment techniques now permit the detection and characterization of thousands of new STRs and their variation across the human genome, and have enabled genome-wide study of the effect of repeat length polymorphisms on gene expression (Willems et al., 2016; Payseur et al., 2011; McIver et al., 2011; O’Dushlaine and Shields, 2008; McIver et al., 2013; Willems et al., 2014; Mallick et al., 2016). As thousands of single-nucleotide polymorphisms (SNPs) have been linked with disease risk in Genome Wide Association Studies (GWAS; Tam et al., 2019; Uffelmann et al., 2021), ongoing studies of human genomes aim to link variation in STR length to phenotypic outcomes (Gymrek et al., 2016; Fotsing et al., 2019; Mitra et al., 2021). In homage to the expression quantitative trait loci (eQTL) identified in traditional GWAS (Tam et al., 2019; Uffelmann et al., 2021), STRs associated with differences in expression of nearby genes are called eSTRs. In the following sections, we will first discuss evidence for evolutionary constraint on STRs linked to evolution across phylogeny and in humans. We will then showcase the mechanisms by which variation in STR length affects gene expression.
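One simple way to picture an eSTR association is as a regression of a gene's expression on repeat length across individuals. The sketch below is an assumption-laden illustration, not the statistical model used in the cited studies: the function name `fit_estr`, the use of one repeat length per individual (for example, the mean of the two alleles), and the toy numbers are all hypothetical.

```python
from math import sqrt


def fit_estr(repeat_lengths, expression):
    """Least-squares fit of expression on STR length.

    Returns (slope, intercept, pearson_r). Hypothetical helper: real
    eSTR analyses typically also adjust for covariates such as ancestry.
    """
    n = len(repeat_lengths)
    if n != len(expression) or n < 3:
        raise ValueError("need at least three paired observations")
    mean_x = sum(repeat_lengths) / n
    mean_y = sum(expression) / n
    sxx = sum((x - mean_x) ** 2 for x in repeat_lengths)
    syy = sum((y - mean_y) ** 2 for y in expression)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(repeat_lengths, expression))
    if sxx == 0 or syy == 0:
        raise ValueError("no variation in repeat length or expression")
    slope = sxy / sxx                      # expression change per repeat unit
    intercept = mean_y - slope * mean_x
    pearson_r = sxy / sqrt(sxx * syy)      # strength and sign of association
    return slope, intercept, pearson_r


# Toy data (made up for illustration): five individuals.
lengths = [10, 11, 12, 13, 14]          # repeat units at one STR
expr = [2.0, 2.5, 3.0, 3.5, 4.0]        # normalized expression of a nearby gene
slope, intercept, r = fit_estr(lengths, expr)
print(f"slope={slope:.2f} per repeat unit, r={r:.2f}")
```

A positive slope would correspond to an STR whose longer alleles are associated with higher expression, and a negative slope to the opposite; the sign and size of the effect are locus-specific.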

Repetitive DNA regulates transcription

Repetitive elements can impact the transcription of neighboring genes or the genes in which they reside by regulating chromatin structure and epigenetic markers. A role for repetitive DNA in facilitating 3D folding of the genome was first observed with TE-dependent formation of chromatin loops across multiple species, including yeast, Drosophila, and mammals (Figure 1(i); Cournac et al., 2016; Bourque, 2009; Lu et al., 2021). Contact maps generated using chromosome conformation capture (Hi-C) show high co-localization of repetitive elements in nuclear space in humans, mice, and Drosophila, demonstrating a structural function of repetitive DNA (Cournac et al., 2016). The enrichment of transcription factor binding sites in proximity to spatially associated repeats suggests that repeat-mediated 3D DNA packaging may allow for context-dependent co-transcription of linearly remote genes (Cournac et al., 2016).

Figure 1. Native functions of genomic repeats.

Repeats in DNA influence larger 3D chromatin structures and regulate binding of nucleosomes, (de)acetylases, and (de)methylases. They also influence transcription factor binding and polymerase processivity to affect downstream RNA production. Repeats in RNA can affect pre-mRNA processing such as alternative splicing and can affect RNA binding protein function through direct or indirect sequestration. Repeats in 3’ UTRs serve as localization signals, directing mRNA transport. Repeats in 5’ UTRs regulate translational output by impeding ribosome processivity. Repeating units in proteins can provide structural flexibility within a protein or serve as binding sites for the formation of multi-protein complexes.

STRs play critical roles in maintaining chromatin structure (Nikumbh and Pfeifer, 2017; Sun et al., 2018; Volle and Delaney, 2012). For example, short CAG/CTG tracts avidly incorporate nucleosomes (Figure 1 (ii)), which are a basic subunit of chromatin packaging (Volle and Delaney, 2012). Nucleosome position varies with differences in STR length and flanking sequence context (Volle and Delaney, 2012), influencing chromatin structure and transcription of nearby genes. Other STRs, including CGG repeats, have the opposite effect and exclude nucleosomes in their native states, creating more open chromatin states near the transcription start sites of genes that favor local transcription (Wang et al., 1996; Wang, 2007). This feature may underlie the enrichment of CGG repeats in promoters and 5’UTRs (Uesaka et al., 2014).

STRs can also influence chromatin structure by modulating DNA methylation. Some STRs are prone to methylation, which can lead to gene silencing and the absence of transcription (Quilez et al., 2016; Garg et al., 2021; Pappalardo and Barra, 2021). CpG islands, one common example of repeat-mediated gene silencing, are stretches of repeating CpG dinucleotides that range from around 500 to 3000 base pairs in length and are located in ~40% of gene promoters across mammalian genomes (Deaton and Bird, 2011; Janitz and Janitz, 2011; Thomson et al., 2010; Clouaire et al., 2012; Blackledge et al., 2013). STRs are commonly located near CpG islands (Sun et al., 2018), and may influence their methylation states (Figure 1 (iii)). Moreover, other STRs can contain CpGs within their repetitive sequence that can undergo methylation (Bolton et al., 2013).

A genome-wide study in yeast (Vinces et al., 2009) estimated that as many as 25% of promoters contain tandem repeats (TRs). Generally, expression of genes with TRs in their promoters increased with increasing repeat size. TRs in promoters may increase gene expression by increasing transcription factor binding (Figure 1 (iv)), blocking or reducing nucleosome density, or in the case of AT-rich repeats, by facilitating DNA melting (Vinces et al., 2009). However, stable secondary structures formed by TRs can also inhibit transcription by impeding the progression and access of transcriptional machinery (Figure 1 (v); Grabczyk and Fishman, 1995; Belotserkovskii et al., 2010; Usdin and Woodford, 1995). For example, the evolutionarily conserved THO complex is recruited to actively transcribed genes (Kim et al., 2004; Abruzzi et al., 2004; Strässer et al., 2002), and facilitates elongation of RNA polymerase (Fan et al., 1996; Prado et al., 1997; Fan et al., 2001; Jimeno et al., 2002; Chávez et al., 2000) through super-helical structures formed by long GC-rich TRs (Voynov et al., 2006; Chávez et al., 2001). Yeast strains with mutations to THO complex subunits exhibited lower levels of TR-containing FLO11 mRNA. Reduced FLO11 mRNA coincided with an accumulation of RNA polymerase at the beginning of the gene. Removal of the TR, or overexpression of topoisomerase I to enhance unwinding of the structured DNA, rescued the reduction in FLO11 mRNA in THO complex mutants (Voynov et al., 2006).

As in yeast, human STRs can either enhance or inhibit transcription of associated genes depending on their sequence and location, and can also affect gene expression via changes in gene methylation and chromatin structure (Gymrek et al., 2016; Fotsing et al., 2019; Quilez et al., 2016; Garg et al., 2021; Jakubosky et al., 2020). Together, these studies demonstrate numerous mechanisms by which TRs can enhance or inhibit gene expression.

STRs in RNA regulate pre-mRNA processing and RNA localization

Transcribed repetitive elements regulate numerous aspects of RNA biology. STRs in RNAs form complex higher order structures, including G-quadruplexes and hairpins (Krzyzosiak et al., 2012; Sobczak et al., 2003; Malgowska et al., 2014; Sobczak et al., 2010), which are thought to exert broad influence over pre-mRNA splicing (Figure 1 (vi)) (Muro et al., 1999; Tu et al., 2000; Black, 2003; Solnick and Lee, 1987). An analysis of human introns found that sites of alternative splicing are enriched for STRs (Lian and Garner, 2005). STRs can facilitate alternative splicing by complementary pairing of intronic repeats, bringing exonic regions into close proximity (Lian and Garner, 2005). Structure-forming STRs can inhibit or enhance alternative splicing by blocking or facilitating the recruitment of splicing factors, respectively (Lian and Garner, 2005). For example, alternative splicing of the EIIIB exon in the well-conserved fibronectin gene is regulated by an intronic TGCATG repeat (Huh and Hynes, 1994; Lim and Sharp, 1998). Contractions in this STR reduce EIIIB exon inclusion, while overexpression of a specific splicing factor, SRp40, stimulates inclusion. While the TGCATG repeat differs from SRp40’s consensus binding site, it can form a strong hairpin structure, which is a key feature of SRp40 binding site motifs (Tacke et al., 1997). This suggests that the TGCATG repeat may modulate alternative splicing by recruiting the SRp40 splicing factor to the intron/exon boundary (Lim and Sharp, 1998).

Some STRs in RNA regulate splicing in trans, by binding to and sequestering splicing factors and blocking their functions (Figure 1 (vii)). A recent study identified a group of novel long non-coding RNAs (lncRNAs) with multiple predicted RNA binding motifs (Yap et al., 2018), a subset of which contained long stretches of STRs (‘strRNAs’). One strRNA called the pyrimidine-rich noncoding transcript (PNCTR) contains numerous stretches of (TC)n repeats, avidly binds to the polypyrimidine tract-binding protein (PTBP1) in cells, and negatively regulates PTBP1-mediated splicing (Yap et al., 2018). As such, PNCTR overexpression was sufficient to trigger mis-splicing of PTBP1 targets and trigger programmed cell death (Yap et al., 2018). In this way, STRs in RNA can regulate the global availability of other RNA-binding proteins (RBPs) with other functions, exerting profound control over numerous aspects of cell biology.

STRs in 3’ UTRs can also serve as RNA localization signals, and via interactions with RBPs, facilitate the transport of RNAs to specified cellular compartments (Figure 1 (viii)). A program called REPFIND was developed to analyze 3’ UTRs of localized mRNAs in Xenopus oocytes and identified various CAC-containing repeat motifs that serve as localization elements (Betley et al., 2002). Mutating these CAC-containing repeats was sufficient to abolish normal RNA localization. CAC-containing repeats were also found in zebrafish and human 3’ UTRs of transcripts that are known to be specifically localized within cells, suggesting that CAC-containing repeats are conserved localization elements in chordates (Betley et al., 2002). REPFIND was subsequently used to generate a database of repeating motifs in 3’ UTRs of mammalian genes from the Mammalian Gene Collection (MGC) that revealed hundreds of human genes containing short CAC- and CAG-rich repeats in their 3’ UTRs (Andken et al., 2007). Intriguingly, these elements facilitate RNA localization to neurites in rat hippocampal neurons (Andken et al., 2007).

Repetitive RNA regulates translation

STRs located in 5’ UTRs and coding regions impact mRNA translational efficiency. GC-rich STRs form stable RNA structures (Krzyzosiak et al., 2012; Sobczak et al., 2003; Malgowska et al., 2014; Sobczak et al., 2010), which can impede the processivity of scanning translational complexes (Figure 1 (ix)) (Kozak, 1980; Kozak, 1986; Kudla et al., 2009; Tuller et al., 2011; Ding et al., 2012; Bentele et al., 2013; Weinberg et al., 2016). For example, a native GGN repeat in the 5’ UTR of the potassium 2-pore domain leak channel Task3 mRNA forms a G-quadruplex structure in vivo (Maltby et al., 2020). This G-quadruplex is inhibitory to translation of Task3 mRNA, but can be overcome by DHX36 helicase activity to improve ribosome processivity through the stable structure (Maltby et al., 2020).

Indeed, libraries of synthetic (Millette et al., 2022) and naturally occurring (Li et al., 2017; Niederer et al., 2022) hairpin sequences placed within 5’ UTRs can be used to precisely control translational transgene output, with potential implications for gene therapy dosing. These studies show how single unit variations in STRs can precisely modulate protein expression, generally permitting more and faster translation of mRNAs with smaller STRs, and less and slower translation of mRNAs with larger STRs.

Repeats in proteins facilitate multi-protein complex formation and structural flexibility

Eukaryotic proteins are more likely to have repeats than prokaryotic proteins, and proteins containing repeats are often unique to eukaryotes and eukaryotic functions (Marcotte et al., 1999). There are numerous long repeating motifs in proteins (>20 amino acids/repeat) with loose homology between repeats, that form complex tertiary structures (Andrade et al., 2001). These protein repeat domains are characterized by the structures they form, as all-β (e.g. β-propellers, β-trefoils), all-α (e.g. HEAT and tetratricopeptide repeats (TPRs)), or mixed α/β (e.g. leucine-rich repeats, ankyrin repeats; Andrade et al., 2001). Although their specific functions vary, protein repeat domains typically serve as binding sites, and are thought to have evolved in eukaryotes to aid in the formation of multi-protein complexes with advanced cellular functions (Andrade et al., 2001; Kajava, 2012; Sharma and Pandey, 2015).

STRs translated into proteins are thought to have functions similar to those of these larger repeat-based protein domains, serving as sites for protein-protein interactions (Figure 1 (x); Schaefer et al., 2012; Faux, 2012). CAG repeats are enriched in coding regions and are most frequently found in the polyglutamine (polyQ) reading frame, suggesting that polyQ stretches in proteins have a native function (Schaefer et al., 2012). PolyQ stretches are enriched in proteins that are components of multi-protein complexes, and have functions in transcriptional control, phosphatidylinositol (PI) signaling, protein degradation, and chromatin remodeling. Evolutionary sequence comparison reveals that the location of polyQs within a protein is not always conserved (Schaefer et al., 2012). This suggests that polyQ stretches have evolved multiple times, and do not directly confer a protein’s function, but rather modulate the protein-protein interactions necessary for those functions (Schaefer et al., 2012; Orr, 2012). Other CG-containing STRs (i.e. CUGs and CGGs) show similar patterns of overrepresentation in coding regions (Schaefer et al., 2012), and likely serve similar complex-scaffolding functions (Nasrallah et al., 2012).

When translated into proteins, STRs can be critical for proper protein folding. For example, translation of a CAG repeat in the huntingtin gene (HTT) produces a polyQ tract in the HTT protein which serves as a flexible hinge, allowing the neighboring domains to fold into close proximity (Figure 1 (xi)) (Caron et al., 2013). HTT protein structure is altered with repeat expansion, demonstrating the importance of the flexibility conferred by this STR (Caron et al., 2013).

Pathogenic consequences of STRs: A Fragile X case study

In the previous section, we summarized how STRs in DNA, RNA, and when translated into proteins can affect gene expression and protein function. In the following section, we will draw parallels from these native functions of STRs to pathogenic mechanisms in REDs (Figure 2). These parallels demonstrate how STR toxicity can be viewed as aberrancies of native processes, rather than emergent dysfunctions. For this analysis, we will largely use the Fragile X locus discussed earlier as a well-characterized case study, although many of these principles also apply to other REDs and a few specific examples are included here (reviewed in broader detail in Malik et al., 2021a; Glineburg et al., 2018; Paulson, 2018; Rodriguez and Todd, 2019).

Figure 2. STR-associated toxicity in Repeat Expansion Disorders.

Repeat expansions can alter global 3D chromatin structure and influence transcription by blocking or enhancing binding of nucleosomes, (de)acetylases, (de)methylases, and transcription factors. Expanded repeats may also impede polymerase processivity. In some cases, elevated transcription of repeat expansion RNA can lead to depletion of RNA-binding proteins. Depletion of these proteins can impact many processes to which they contribute, including pre-mRNA splicing and processing, and mRNA localization. Expanded repeat RNA and bound RBPs can also aggregate into RNA foci, causing toxicity. Expanded repeat RNA can stall translational complexes, leading to repeat-associated non-AUG (RAN) translation, and contribute to the production of polymeric proteins. Polymeric proteins are aggregation-prone. Longer polymeric stretches in native proteins may also cause dysfunction by preventing proper protein folding or causing the folded protein to mis-localize within the cell.

Epigenetic and transcriptional dysfunction of STRs in DNA

The functional consequences of STRs on genome organization and transcription are evident when dysfunction is observed in REDs (Dion and Wilson, 2009; López-Martínez et al., 2020; Yin et al., 2020; Usdin, 2008). Repeat expansions can alter local genome architecture and expression of neighboring genes. A prime example is observed at a CTG repeat in the 3’UTR of the DMPK gene associated with myotonic dystrophy type 1 (DM1), expansion of which alters local chromatin structure and suppresses transcription of the neighboring gene, Six5 (Winchester et al., 1999; Brouwer et al., 2013; López Castel et al., 2011). Repeat expansions also cause global alterations in chromatin structure. CGG repeat expansions in FXS patients cause severe disruptions in chromatin boundaries (Figure 2 (i); Sun et al., 2018). These disruptions may explain delayed DNA replication (Subramanian et al., 1996), activation of DNA replication stress pathways (Chakraborty et al., 2020) and altered local DNA replication patterns (Gerhardt et al., 2014) observed at CGG repeat expansions and the Fragile X locus in particular.

As genomic repeats influence native DNA methylation, some STRs are aberrantly methylated only upon expansion (Figure 2(ii); Otten and Tapscott, 1995; Steinbach et al., 1998; Herman et al., 2006; Greene et al., 2007; Belzil et al., 2013; Xi et al., 2013). When the CGG repeat in the 5’ UTR of FMR1 expands beyond 200 repeats, it is susceptible to DNA methylation of both the CpG elements within the repeat and at a CpG element within the FMR1 promoter (Oberlé et al., 1991; Sutcliffe et al., 1992; Pieretti et al., 1991; McConkie-Rosell et al., 1993; Hansen et al., 1992; Coffee et al., 2002; Colak et al., 2014; Willemsen et al., 2002). This hypermethylation is associated with FMR1 gene silencing, with a resulting absence of FMR1 mRNA and FMRP, a critical RBP involved in synaptic plasticity and neuronal function (Oberlé et al., 1991; Hagerman et al., 2017; Quartier et al., 2017; Myrick et al., 2014). How exactly repeat expansion triggers methylation, and the relationship between expansion, methylation, and epigenetic silencing, are not fully understood, but the locus remains transcriptionally active and unmethylated in human embryonic stem cells even in the presence of very large repeat expansions, with silencing occurring during differentiation. Some studies suggest that FMR1 silencing requires co-transcriptional binding of CGG repeat mRNA directly to the FMR1 promoter region as an RNA-DNA heteroduplex (Colak et al., 2014; Groh et al., 2014).

STR expansions can enhance or inhibit mRNA production from nearby genes. At FMR1, premutation range CGG repeats which cause FXTAS or FXPOI (and which are unmethylated) result in elevated transcription of FMR1 mRNA (Tassone et al., 2000a; Tassone et al., 2000b; Entezam et al., 2007; Brouwer et al., 2008; Kenneson et al., 2001). This may result from use of additional upstream transcription start sites (Beilina et al., 2004; Tassone et al., 2011), or be associated with enrichment of acetylated histones or other chromatin activating factors at the premutation allele (Todd et al., 2010). It is possible that both hypo-expression and hyperexpression of FMR1 stem from the complex structures formed by these CGG repeats as DNA (Usdin and Woodford, 1995; Fry and Loeb, 1994; Kettani et al., 1995; Patel et al., 2000). As seen in native STRs, different structures formed by expanded STRs could facilitate or block binding of histone-modifying methylases, demethylases, acetylases, deacetylases, and even entire nucleosomes to affect downstream gene expression (Figure 2 (iii-v)) (Wang et al., 1996; Usdin and Kumari, 2015).

Repeat expansions cause defects in pre-mRNA processing and mRNA localization

The native roles of STRs in RNA in regulating splicing mirror splicing dysfunction observed in numerous REDs (Figure 2 (vi)). Splicing of the huntingtin gene (HTT), which contains a CAG repeat, is altered at expanded repeats associated with Huntington’s Disease, resulting in the production of a transcript containing only exon 1 and the production of an exon 1 HTT protein (Gipson et al., 2013; Sathasivam et al., 2013; Neueder et al., 2017; Neueder et al., 2018; Franich et al., 2019). The exon 1 HTT protein is found in patient tissues and is toxic in model systems (Gipson et al., 2013; Sathasivam et al., 2013; Neueder et al., 2017; Neueder et al., 2018; Franich et al., 2019). Incomplete splicing of HTT with the CAG repeat expansion increased with overexpression and decreased with knockdown of splicing factor SRSF6. SRSF6 is predicted to bind to the 5’ end of HTT transcripts via the CAG repeat, suggesting that SRSF6-CAG repeat interactions interfere with spliceosome formation at the nearby splice site (Neueder et al., 2018).

Global splicing defects in REDs result from sequestration of critical splicing factors that bind to STR-containing RNA (López-Martínez et al., 2020; Mykowska et al., 2011; Botta et al., 2008). STRs can be binding sites for RBPs, regulating their availability throughout the cell. RNAs with longer STRs bind more RBPs, which can cause a global cellular depletion of these factors (Figure 2 (vii)) (Malik et al., 2021a; Glineburg et al., 2018; Rodriguez and Todd, 2019). In myotonic dystrophy type 1 (DM1), the expanded CTG repeat in the 3’ UTR of the DMPK gene (López-Martínez et al., 2020; Korade-Mirnics et al., 1998; Udd and Krahe, 2012) binds to muscleblind-like splicing regulator 1 protein (MBNL1) among other RBPs (Paul et al., 2011; Jiang et al., 2004; Fardaei et al., 2001; Mankodi et al., 2001; Miller et al., 2000), resulting in depletion of critical splicing factors (Botta et al., 2008; Jiang et al., 2004; Pascual et al., 2006; Jog et al., 2012) and global splicing defects (López-Martínez et al., 2020; Paul et al., 2011; Jiang et al., 2004; Du et al., 2010; Philips et al., 1998). RBP depletion by the CTG repeat in DM1 can also impact other aspects of pre-mRNA processing, including polyadenylation (Thomas et al., 2017; Goodwin et al., 2015).

In FXTAS, premutation expansion mRNA sequesters and depletes multiple RBPs that bind to the CGG repeat RNA either directly (i.e. DGCR8 [Sellier et al., 2013], Purα [Jin et al., 2007], and hnRNP A2/B1 [Jin et al., 2007; Iwahashi et al., 2006; Muslimov et al., 2011; Sofola et al., 2007]) or indirectly via binding to CGG-bound proteins (i.e. Drosha [Sellier et al., 2013] and Sam68 [Sellier et al., 2010]). These RBPs are involved in a variety of functions that are affected by their sequestration, including miRNA processing (DGCR8, Drosha), mRNA transport (hnRNP A2/B1, Purα) (Figure 2 (viii)), and splicing (hnRNP A2/B1, Sam68). Splicing defects have been observed in CGG premutation expansion models (Sellier et al., 2010; He et al., 2014), but compensatory overexpression of CGG repeat-sequestered RBPs (Jin et al., 2007; Sofola et al., 2007; He et al., 2014; Qurashi et al., 2011) or blocking RBP binding to CGG repeats (Disney et al., 2012; Verma et al., 2019; Verma et al., 2020; Verma et al., 2022) can improve these disease-associated defects. This pathogenic sequestration of RBPs by expanded repeats mirrors the native role for STRs in RNA as RBP reserves, mediating fine-tuned dosing of RBP availability with repeat length.

In addition to the depletion of RBPs and consequent defects in RNA splicing and localization and miRNA processing, expanded STRs in RNAs may also cause toxicity by self-association (gelation) (Figure 2(ix); Glineburg et al., 2018; Sellier et al., 2010; He et al., 2014; Jain and Vale, 2017; Ciesiolka et al., 2017; Fay et al., 2017; Tassone et al., 2004). Yet, these processes also occur on RNAs with shorter STRs that are below the pathological threshold for disease, suggesting that such phase separation properties of specific RNA motifs and their associated RBPs may exist on a spectrum from physiologic to pathologic.

RNAs containing expanded STRs can be mislocalized or retained in the nucleus instead of being transported to their functional locations in the cell (Davis et al., 1997; Mastroyiannopoulos et al., 2010; Sun et al., 2015). This may be mediated by splicing defects (Sun et al., 2015), via export-inhibiting RBP interactions (Smith et al., 2007), or via a larger dysfunction of nucleocytoplasmic transport (Zhang et al., 2015; Zhang et al., 2016; Jovičić et al., 2015; Freibaum et al., 2015; Grima et al., 2017; Gasset-Rosa et al., 2017; Sellier et al., 2017). For example, SRSF proteins bind to CGG and G4C2 repeats and appear critical to their transport out of the nucleus (Malik et al., 2021b; Hautbergue et al., 2017). In this context, lowered expression of SRSF proteins or inhibition of the SRSF protein kinase SRPK1, which regulates SRSF nuclear entry, suppresses CGG repeat exit to the cytoplasm and reduces toxicity in Drosophila and neuronal model systems (Malik et al., 2021b). Together, these studies show that expanded STRs in RNA can induce toxicity via RBP depletion or by direct RNA dysfunction.

Aberrant translation of expanded STRs

Scanning translational complexes are more likely to stall at stable secondary structures formed by expanded STRs (Figure 2 (x)), resulting in aberrant translation initiation upstream of or within the repeat in a process known as repeat-associated non-AUG (RAN) translation (Figure 2 (xi)). RAN translation produces toxic peptides that contribute to expanded STR toxicity and neurodegeneration in numerous REDs (Glineburg et al., 2018; Ciesiolka et al., 2017; Cleary and Ranum, 2017; Kearse and Todd, 2014; Kearse et al., 2016; Todd et al., 2013; Wojciechowska et al., 2014; Mori et al., 2013a; Ash et al., 2013; Mori et al., 2013b; Bañez-Coronel et al., 2015; Zu et al., 2011; Zu et al., 2017; Soragni et al., 2018; Ishiguro et al., 2017).

The mechanisms underlying RAN translation likely vary across different STRs and different genetic contexts (Malik et al., 2021a; Gao et al., 2017). At the CGG repeat of FMR1, RAN translation occurs in all three reading frames to produce polyarginine (FMRpolyR) (+0-frame relative to the AUG of FMR1), polyglycine (FMRpolyG) (+1-frame), and polyalanine (FMRpolyA) (+2-frame) peptides at different efficiencies (Kearse et al., 2016; Todd et al., 2013). STR-induced stalling of translation machinery is also responsible for a reduction in downstream production of the main protein produced by FMR1 translation, FMRP, in CGG premutation carriers (Tassone et al., 2000b; Kenneson et al., 2001).
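The relationship between reading frame and homopolymeric product described above follows directly from the genetic code. As a minimal illustrative sketch (the function name and the three-entry codon table are assumptions introduced here, and frames are counted from the first base of the repeat unit as written, with the correspondence to FMR1 frames taken from the text), a pure CGG repeat can be translated in each frame:

```python
# Minimal codon table covering only the codons that occur in a pure CGG repeat.
CODON_TO_AA = {"CGG": "R", "GGC": "G", "GCG": "A"}


def translate_repeat(unit: str, copies: int, frame: int) -> str:
    """Translate a pure tandem repeat, starting at offset `frame` (0, 1, or 2)."""
    seq = unit * copies
    # Step through complete codons only; trailing partial codons are dropped.
    codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
    return "".join(CODON_TO_AA[codon] for codon in codons)


# Frame 0 reads CGG (Arg), frame 1 reads GGC (Gly), frame 2 reads GCG (Ala).
for frame, product in enumerate(["polyarginine", "polyglycine", "polyalanine"]):
    print(frame, product, translate_repeat("CGG", 5, frame))
```

This toy example ignores flanking sequence, initiation codon choice, and the differing efficiencies of each frame, all of which matter in the studies cited above.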

Repetitive proteins have pathogenic consequences

Repeat-containing peptides, produced via canonical translation of STRs in coding regions or via RAN translation, contribute to toxicity in REDs. At CGG repeats, both FMRpolyG and FMRpolyA are present within intranuclear neuronal inclusions in patient tissues (Sellier et al., 2017; Todd et al., 2013; Buijsen et al., 2014; Krans et al., 2019; Ma et al., 2019), and are toxic in model systems (Sellier et al., 2017; Todd et al., 2013; Derbis et al., 2018; Gohel et al., 2019; Hoem et al., 2019). FMRpolyG, the most abundant CGG RAN product, is necessary for CGG repeat toxicity and inclusion formation in overexpression models (Sellier et al., 2017; Todd et al., 2013; Oh et al., 2015). Numerous RAN or homopolymeric peptides generated in other REDs are essential for their toxicity and formation of proteinaceous inclusions (Figure 2 (xii); Yamamoto et al., 2000; Schilling et al., 1999; Ordway et al., 1997; Bäuerlein et al., 2017; Paulson et al., 1997; Mizielinska et al., 2014; May et al., 2014; Zu et al., 2013). Overall, dysfunctional aggregation of repeat-derived protein products mirrors the native function of STRs in proteins as facilitators of protein-protein interactions.

Translation through large STRs that form stable secondary structures likely induces ribosome stalls and elongation errors. A growing body of work shows that disease-associated STRs undergo stall-induced translational frameshifting to produce novel chimeric polypeptides (Gaspar et al., 2000; Toulouse et al., 2005; Davies and Rubinsztein, 2006; Tabet et al., 2018; McEachin et al., 2020; Wright et al., 2022), and several of these studies have shown that these frameshift products have distinct contributions to neuronal dysfunction in disease (Tabet et al., 2018; McEachin et al., 2020; Wright et al., 2022). While there is evidence that polymeric peptides contribute to toxicity observed in REDs via aggregation, the mechanistic details of homo- and di-polymeric peptide toxicity and chimeric polypeptide toxicity remain under investigation.

Antisense transcripts contribute to REDs via multiple mechanisms

Antisense transcription from the FMR1 locus generates multiple long-noncoding asFMR1 mRNAs, with some including the repeat (Ladd et al., 2007; Khalil et al., 2008; Elizur et al., 2016; Pastori et al., 2014). One antisense transcript, FMR4, is thought to play a critical role in regulating the cell cycle and apoptosis (Khalil et al., 2008). Another antisense transcript, FMR6, is upregulated in premutation women, with increased repeat length correlating to elevated RNA levels and reduced number of oocytes, suggesting a relationship between antisense transcript expression and toxicity (Elizur et al., 2016). FMR1 antisense transcription in general is upregulated in Fragile X premutation disorders and lost in FXS, like the sense FMR1 mRNA (Ladd et al., 2007). Moreover, asFMR1 mRNAs containing the CCG repeats can undergo RAN translation, producing additional homopolymeric proteins with toxic potential (Kearse et al., 2016). STR-containing antisense transcripts likely contribute to toxicity observed in many REDs, but this is best characterized in C9ALS/FTD and SCA8, where antisense transcripts are found in toxic RNA foci and contribute to RAN peptide production (Mori et al., 2013a; Zu et al., 2011; Zu et al., 2013; Moseley et al., 2006; Gendron et al., 2013).

Mechanisms of STR toxicity reveal novel native functions of STRs

Studies over the past three decades have delineated numerous mechanisms by which repeat expansions trigger cellular toxicity. Yet, there are striking parallels between the pathologic drivers of dysfunction elicited by repeat expansions and the native functions of STRs in regulating gene expression. In this section, we provide examples of how mechanisms initially identified as causing STR toxicity directly inform our understanding of native functions of STRs more broadly. We also discuss how emergent pathogenic properties associated with repeat expansions might inform additional native functions of repeats that are not yet well understood.

RAN translation occurs at native repeat lengths and has native functions

While CGG repeats in the FMR1 gene were primarily studied for their disease association, the CGG repeat is present in all humans at nonpathogenic lengths (<55 repeats) and conserved across mammals (Eichler et al., 1995; Sellier et al., 2017). Some studies suggest phenotypes associated with low CGG repeat numbers at this allele in humans, including memory difficulties and language dysfluency (Klusek et al., 2018; Mailick et al., 2014). Our group observed that CGG RAN translation, originally thought to be an aberrant toxic event, occurs in reporters with native repeat lengths (25 repeats) (Kearse et al., 2016), suggesting CGG repeats and/or translation of those repeats may have a native function in addition to the pathogenic one. CGG RAN translation at native and expanded STRs acts as an overlapping upstream open reading frame (uORF), inhibiting translation of the downstream main ORF (mORF) and thereby reducing FMRP synthesis (Rodriguez et al., 2020). Furthermore, this RAN uORF-like regulation of FMRP synthesis was critical for facilitating translational changes associated with stimulation of metabotropic glutamate receptors (mGluRs) in neurons (Rodriguez et al., 2020).

Upstream open-reading frames (uORFs) are well-characterized regulatory elements in eukaryotes that influence expression of protein produced from the main open reading frame (mORF) on the same transcript, and are typically inhibitory to downstream mORF translation (Hinnebusch et al., 2016). In this way, uORFs resulting from RAN translation of STRs may play a global role in regulating mRNA translation, presenting another mechanism by which STRs influence gene expression.

STRs facilitate protein function and localization

Expanded STRs in coding regions can fundamentally change the functions of the proteins within which they reside. In spinocerebellar ataxia type 1 (SCA1), a CAG repeat expansion in the ataxin 1 (ATXN1) gene changes the localization of ATXN1 protein (Irwin et al., 2005). ATXN1 normally shuttles between the nucleus and the cytoplasm, but an expanded polyQ region shifts ATXN1 localization to the nucleus (Figure 2(xiii); Irwin et al., 2005). Aberrant nuclear localization of ATXN1 underlies dysfunction in SCA1 (Lam et al., 2006; Lai et al., 2011; Klement et al., 1998; Emamian et al., 2003; Duvick et al., 2010), as modifications that favor nuclear localization are sufficient to elicit disease relevant phenotypes in the absence of the repeat expansion in mouse models.

PolyQ-associated nuclear translocation is also central to pathology in spinal and bulbar muscular atrophy (SBMA), where ligand binding and translocation to the nucleus of the expanded polyQ-containing androgen receptor is required to elicit disease-associated transcriptional defects and cytotoxicity (Katsuno et al., 2006; Katsuno et al., 2002; Montie et al., 2009; Palazzolo et al., 2007). However, within the normal range of polyQ lengths observed in humans, androgen receptor CAG repeat size inversely correlates with the receptor’s transactivational activity and linearly correlates with infertility and decreased sperm function (Choong and Wilson, 1998; Osadchuk et al., 2022; Pan et al., 2016). These findings suggest that the CAG repeats play a normal role in testosterone-activated gene cascades that becomes aberrant at larger repeat sizes.

STRs facilitate mRNA transport to dendrites

An investigation into dendritic mRNA localization identified a localization pathway dependent on the interaction of a CGG repeat-interacting RBP, hnRNP A2, with a GA dendritic targeting element of an RNA (Muslimov et al., 2011). This GA targeting motif was competed for by CGG repeat-containing RNAs, including FMR1 mRNA. In addition to a native function of CGG repeats as a dendritic localization factor, this study revealed that elevated levels of CGG repeat mRNA caused by the CGG premutation expansion sequester hnRNP A2, resulting in global dysfunction in the transport of hnRNP A2-target mRNAs (Muslimov et al., 2011). Another study seeking to reveal transcriptome-wide impacts of C(C)UG repeat-mediated MBNL depletion on splicing in myotonic dystrophy (DM) also uncovered a global role for MBNL in mRNA localization (Wang et al., 2012).

PolyQ containing proteins regulate autophagy

Numerous REDs are caused by CAG repeat expansions, including those in the huntingtin gene in Huntington’s disease (HD) and in ataxin 3 in spinocerebellar ataxia type 3 (SCA3), with toxicity largely attributed to the aggregation of long polyQ-containing proteins. Autophagy induction results in clearance of these aggregates, attenuating their toxicity (Rubinsztein, 2006; Ravikumar et al., 2004; Menzies et al., 2010). The polyQ tract in ataxin 3, a deubiquitinase, interacts with beclin 1, a key initiator of autophagy (Ashkenazi et al., 2017). Ataxin 3 then deubiquitinates beclin 1, protecting it from degradation and permitting autophagy initiation. Ataxin 3 activity and its interaction with beclin 1 are competitively inhibited by other polyQ tract-containing proteins in a length-dependent manner (Ashkenazi et al., 2017). As such, polyQ tracts may actively engage protein quality control pathways basally, but these interactions become aberrant after STR expansion, in this case inhibiting autophagy and clearance of toxic proteins. Together, these studies suggest that the pathology of disease-associated STR expansions reveals native functions of STRs, just as an improved understanding of the native functions of STRs can inform our understanding of dysfunction in disease.

Tetranucleotide, pentanucleotide, and biallelic repeat expansion disorders

Tetranucleotide and pentanucleotide STRs are rare within coding sequences, presumably because changes in their repeat number would trigger translational frameshifts. However, they are relatively common within introns, where their expansion causes several neurological disorders that likely act through pathogenic mechanisms that are similar to those exhibited by non-coding trinucleotide STRs. For example, myotonic dystrophy type 2 (DM2) results from a dominantly inherited intronic CCTG STR expansion in ZNF9 (Liquori et al., 2001). CCTG STRs form RNA secondary structures that are like those generated by CTG STRs, and like the 3’ UTR CTG repeat in DM1, the DM2 repeat binds to and sequesters the RBP muscleblind (Botta et al., 2008; Paul et al., 2011; Fardaei et al., 2001; Mankodi et al., 2001; Miller et al., 2000; Du et al., 2010; Philips et al., 1998). This shared mechanism explains the significant overlap in their clinical phenotypes. Perhaps more interesting, however, is how subtle differences between these repeats underlie the phenotypic differences between these conditions. In particular, CCTG expansions in DM2 do not trigger genetic anticipation or congenital forms of disease as occurs in DM1, despite the presence of very large CCTG expansions in DM2. These phenotypic differences are thought to occur for two reasons. First, these repeats exhibit differences in how they interact with other RBPs, such as rbFOX, that modulate the effects of muscleblind sequestration (Sellier et al., 2018). Second, differences in the genic positioning (intron versus 3’ UTR) and temporal expression of the two STRs alter their relative abilities to disrupt early developmental processes (Thomas et al., 2017; Cerro-Herreros et al., 2017).
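The frameshift argument made at the start of this section is simple modular arithmetic: gaining or losing n copies of a repeat unit of length k displaces the downstream reading frame by (n × k) mod 3. A minimal sketch, using a hypothetical helper name introduced only for illustration:

```python
def frame_shift(unit_length: int, copy_change: int) -> int:
    """Reading-frame displacement (0, 1, or 2) caused by a change in copy number."""
    return (unit_length * copy_change) % 3


# Tri- and hexanucleotide units never shift the frame, whereas tetra- and
# pentanucleotide units do unless the copy-number change is a multiple of 3.
for unit_length in (3, 4, 5, 6):
    print(unit_length, [frame_shift(unit_length, n) for n in (1, 2, 3)])
```

Under this simplification, most copy-number changes in a coding tetra- or pentanucleotide repeat would alter every downstream codon, which is consistent with their rarity in coding sequence.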

An intriguing feature observed in multiple pentanucleotide repeat expansion disorders, including complex TTTTA and TTTCA repeats that cause benign adult familial myoclonic epilepsy (BAFME) in multiple genes (Ishiura et al., 2018), ATTTC repeats in spinocerebellar ataxia (SCA) type 31 (Sato et al., 2009), and AAGGG repeats that cause cerebellar ataxia, neuropathy, vestibular areflexia syndrome (CANVAS) (Nakamura et al., 2020; Cortese et al., 2019; Tsuchiya et al., 2020), is that the pathogenic alleles represent non-reference STRs. That is, the repeat element is not only expanded in size, but it has a different sequence than the normal allele. For example, in CANVAS, an AAAAG pentanucleotide STR normally resides within the first intron of RFC1. However, the pathological repeat is a qualitatively different and expanded AAGGG repeat. Moreover, CANVAS can also occur with a third pentanucleotide repeat, ACAGG, at this same genomic location. In all these cases, these pentanucleotide repeats occur within the polyA region of an Alu transposable element. Active Alu transposition requires pure polyA elements at their 3’ ends (Deininger, 2011). As such, there is strong evolutionary selection pressure favoring mutation of these regions to pure polyA sequences. This suggests that both the reference and non-reference STRs occurred initially through a protective process that disrupted the polyA element and prevented continued transposase activity. However, stochastic differences in these interrupting mutations created some STRs that were more prone to expansion, resulting in pathogenic alleles that either create toxic STR RNAs or that interfere with local gene expression.
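Because these pathogenic alleles differ from the reference in motif as well as in length, telling them apart requires asking which motif is repeated, not only how long the repeat tract is. The following is a minimal sketch under stated assumptions: the helper name is hypothetical, the two sequences are toy examples invented for illustration (they are not real RFC1 alleles), and real genotyping tools must additionally handle interruptions, motif rotations, and sequencing error.

```python
import re


def longest_tandem_run(sequence: str, motif: str) -> int:
    """Return the largest number of uninterrupted tandem copies of `motif`."""
    runs = re.findall(f"(?:{re.escape(motif)})+", sequence)
    return max((len(run) // len(motif) for run in runs), default=0)


# Toy sequences, invented for illustration only.
reference_like = "GATC" + "AAAAG" * 11 + "CTAG"
expanded_like = "GATC" + "AAGGG" * 40 + "CTAG"

for motif in ("AAAAG", "AAGGG", "ACAGG"):
    print(
        motif,
        longest_tandem_run(reference_like, motif),
        longest_tandem_run(expanded_like, motif),
    )
```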

Conclusions and open questions

Native functions of STRs from an evolutionary perspective

Evolutionary pressures on STR copy number predict that repeat expansions will be either tolerated or selected for until an upper, deleterious limit is reached. If STRs were intrinsically deleterious, then there would be selective pressure towards repeat contractions, leading to global reductions in STR size or their selective elimination. However, several recent studies suggest that there is selective STR expansion across phylogeny, especially in primates. This is particularly true in 5’ UTRs and coding regions, where constraints on repeat expansion and contraction are greatest (Gymrek et al., 2017). At the same time, many intergenic STRs show correlations between their size and the expression of neighboring genes. These eSTRs contribute meaningfully to population variance in gene expression profiles and disease-associated expression quantitative trait loci (eQTLs) in human populations (Fotsing et al., 2019; Gymrek et al., 2017). In general, these eSTRs are largely unconstrained unless neighboring or embedded within a gene already under strong constraint. This mutation-selection balance suggests some inherent native functions of STRs within transcript and protein space while also implying that intrinsic STR instability may allow for more rapid variation and acquisition of traits through local perturbations in gene expression than could be accomplished through single nucleotide mutations (Figure 3A). The highly variable methylation status and mRNA and protein expression patterns elicited by differences in FMR1 CGG repeat lengths typify the potential for repeat variation within populations to influence gene expression (Figure 3B). As even subtle changes in repeat size can tune gene expression and protein function and have downstream impacts on simple and complex phenotypes, they may be an important component of the genetic differences between humans and other species, and among humans themselves.

The effects of STR variation on local gene expressivity.

(A) Bi-allelic variation in a gene through single nucleotide polymorphisms often results in small and discrete differences in gene expression, offering limited phenotypic differences across a population with slow evolutionary timescales. In contrast, STRs in promoters and 5’ UTRs can influence protein expression over a broader dynamic range, with an inverse correlation between repeat length and protein output within transcribed regions and with differential effects on transcription dependent on the repeat and local epigenetic context. Unstable repeats change rapidly from generation to generation (and even within an individual through somatic variation), creating a mechanism by which mRNA or protein expression can vary broadly and subtly across a population, offering greater genetic and phenotypic diversity and a greater propensity for disease-causing aberrancies at the extremes. (B) Predicted effects of CGG repeat length on FMR1 gene expression. CGG repeat length influences FMR1 promoter epigenetic state (more open chromatin with initial expansion, then DNA methylation and closed chromatin at >200 CGG repeats), FMR1 mRNA expression, and FMRP protein expression across the polymorphic range.

Revisiting our understanding of and approach to Repeat Expansion Disorders

Historically, repetitive elements within human genomes have been viewed as mostly unregulated ‘junk DNA’ that is not under selective evolutionary pressure. In this view, expansions of these repetitive elements are unfortunate accidents that become apparent and important only when they elicit highly penetrant and syndromic human diseases. Consistent with this line of reasoning, the field of REDs has largely focused on emergent toxic mechanisms as drivers of disease only in the setting of large STR expansions, rather than considering their pathology as alterations in the native functions played by these repeats in their normal genomic contexts. Here, we propose re-framing the discussion around repetitive elements in general, and STRs in particular, within human genomes. For each STR associated with a human disease, we suggest first considering whether it has any native functions at its ‘normal’ size. If a native function exists, then expansion of the STR can be viewed primarily as an aberrancy of that native function, with predictable dysregulation of gene expression above certain repeat lengths. This reframing aligns with the approach typically taken in studying gain-of-function and loss-of-function mutations in disease-associated single amino acid mutations and better ties the native functions of STRs to their pathology. It also suggests that shared regulatory rules will likely apply across REDs.

This approach to thinking about REDs leads to specific predictions. First, we predict that more REDs will be discovered in the future. For example, multiple recently described REDs are linked to CGG repeats, including neuronal intranuclear inclusion disease (NIID) (Ishiura et al., 2019; Sone et al., 2019), oculopharyngodistal myopathy (OPDM) and leukodystrophy (OPML) (Ishiura et al., 2019; Deng et al., 2020; Ogasawara et al., 2020; Tian et al., 2019), adult-onset leukoencephalopathy (Okubo et al., 2019), and autism/intellectual disability (Annear et al., 2022). Most of these new CGG repeatopathies reside within 5’ UTRs, like the CGG repeat in FMR1, and there is already evidence of convergent disease mechanisms triggered by these new repeats with those already established in Fragile X disorders. In one particularly notable example, a CGG repeat expansion in NOTCH2NLC leads to the creation of an AUG-initiated upstream open reading frame in the 5’ UTR that generates a polyglycine-containing protein akin to FMRpolyG in FMR1 (Liu et al., 2022; Boivin et al., 2021). This polyglycine protein is found within inclusions in patients with NIID and its generation is required to trigger inclusion formation and behavioral phenotypes in a mouse model of NOTCH2NLC-associated NIID. As such, we know that this motif in this location within neuronally expressed genes can elicit dysfunction through predictable mechanisms. This means that we should expect other CGG repeat expansions to emerge that mirror the pathologic processes established for the FMR1 locus and now extended to a large set of loci. Similarly, given evidence that the CGG repeat in the FMR1 5’ UTR can serve as a functional element that regulates transcription, mRNA localization and translation, we predict that native CGG repeat elements in these disease-associated alleles may have normal functions akin to those observed for FMR1, and as such represent a functional motif shared among many genes.

However, these new REDs may not all fit the typical model observed to date, where highly penetrant STR expansions lead to syndromic disorders. Instead, smaller changes in repeat size at multiple loci, impacting expression of the genes in which they reside or neighboring genes, will serve as risk alleles for common conditions. This risk-allele model is already apparent, as intermediate CAG repeat expansions in ATXN1, ATXN2, and HTT are associated with sporadic ALS and some other common neurodegenerative disorders (Elden et al., 2010; Rosas et al., 2020). Indeed, a fair proportion of the unexplained signal within genome-wide association studies (GWAS) can be explained by variations within neighboring STRs (Gymrek et al., 2016; Gymrek, 2017; Hannan, 2018). To date, numerous STR variants have been linked to ASD (Mitra et al., 2021; Trost et al., 2020) and schizophrenia (Mojarad et al., 2022). As PCR-free and long-read whole genome data become more abundant and available (reviewed in Mitsuhashi and Matsumoto, 2020), it will become increasingly easy to detect these dynamic repeat size/disease relationships, creating a whole new class of STR-associated conditions that will likely expand outside of neurological conditions.

Second, we predict that long-read whole genome sequencing datasets will improve our understanding of the native roles of STRs in humans, and reveal a ubiquitous impact of repeat length variation on gene expression. Once we create accurate maps of STR variation across the genome and link this variation to neighboring gene loci expression, we will be able to better discern the mechanisms by which STRs influence gene expression across cell types. We predict that many genes whose expression is affected by neighboring repeat length variation will play critical functions in the nervous system. Most known REDs present with neurological symptoms. If REDs reflect the native functions of STRs, then the overrepresentation of neurological dysfunctions linked to STR expansions suggests that STRs may play roles relevant to neuronal health and function. It is also possible that neurons, as terminally differentiated cells, may be more prone to somatic instability, leading to repeat expansion and the emergence of associated dysfunction with age.

Finally, we predict that the native functions of STRs will inform our understanding of how STR expansions cause disease and vice versa. A deeper understanding of the native functions of both disease-associated STRs and STRs in general will reveal the pathways altered in REDs, and these pathways may be areas for therapeutic intervention that can be applicable across all REDs. By studying the mechanisms by which STRs elicit disease, we can also surmise key elements of how they might function normally within nervous systems (see examples in previous section, “Mechanisms of STR toxicity reveal novel native functions of STRs”). Ultimately, research into native functions of STRs will reveal both mechanisms by which they regulate neuronal function and therapeutic targets by which their toxicity in REDs can be mitigated.

References

Allingham-Hawkins DJ, Babul-Hirji R, Chitayat D, Holden JJ, Yang KT, Lee C, Hudson R, Gorwill H, Nolin SL, Glicksman A, Jenkins EC, Brown WT, Howard-Peebles PN, Becchi C, Cummings E, Fallon L, Seitz S, Black SH, Vianna-Morgante AM, Costa SS, Otto PA, Mingroni-Netto RC, Murray A, Webb J, Vieri F (1999) Fragile X premutation is a significant risk factor for premature ovarian failure: the International collaborative POF in fragile X study -- preliminary data. American Journal of Medical Genetics 83:322–325.

Brunberg JA, Jacquemont S, Hagerman RJ, Berry-Kravis EM, Grigsby J, Leehey MA, Tassone F, Brown WT, Greco CM, Hagerman PJ (2002) Fragile X premutation carriers: characteristic MR imaging findings of adult male patients with progressive cerebellar and cognitive dysfunction. AJNR. American Journal of Neuroradiology 23:1757–1766.

Hagerman PJ, Hagerman RJ (2004) Fragile X-associated Tremor/Ataxia Syndrome (FXTAS). Mental Retardation and Developmental Disabilities Research Reviews 10:25–30. https://doi.org/10.1002/mrdd.20005

Janitz K, Janitz M (2011) Chapter 12 - Assessing epigenetic information. In: Tollefsbol T, editors. Handbook of Epigenetics. San Diego: Academic Press. pp. 173–181. https://doi.org/10.1016/B978-0-12-375709-8.00012-5

Lander ES, Linton LM, Birren B, et al., International Human Genome Sequencing Consortium (2001) Initial sequencing and analysis of the human genome. Nature 409:860–921. https://doi.org/10.1038/35057062

Lubs HA (1969) A marker X chromosome. American Journal of Human Genetics 21:231–244.

McConkie-Rosell A, Lachiewicz AM, Spiridigliozzi GA, Tarleton J, Schoenwald S, Phelan MC, Goonewardena P, Ding X, Brown WT (1993) Evidence that methylation of the FMR-I locus is responsible for variable phenotypic expression of the fragile X syndrome. American Journal of Human Genetics 53:800–809.

Paulson H (2018) Chapter 9 - Repeat expansion diseases. In: Geschwind DH, Paulson HL, Klein C, editors. Handbook of Clinical Neurology, Neurogenetics, Part I. Elsevier. pp. 105–123. https://doi.org/10.1016/B978-0-444-63233-3.00009-9

Subramanian PS, Nelson DL, Chinault AC (1996) Large domains of apparent delayed replication timing associated with triplet repeat expansion at FRAXA and FRAXE. American Journal of Human Genetics 59:407–416.

Article and author information

Author details

  1. Shannon E Wright

    1. Department of Neurology, University of Michigan–Ann Arbor, Ann Arbor, United States
    2. Neuroscience Graduate Program, University of Michigan–Ann Arbor, Ann Arbor, United States
    3. Department of Neuroscience, Picower Institute, Cambridge, United States
    Contribution
    Conceptualization, Investigation, Writing – original draft, Writing – review and editing
    Competing interests
    No competing interests declared
    ORCID: 0000-0002-7388-1617
  2. Peter K Todd

    1. Department of Neurology, University of Michigan–Ann Arbor, Ann Arbor, United States
    2. VA Ann Arbor Healthcare System, Ann Arbor, United States
    Contribution
    Conceptualization, Supervision, Investigation, Writing – review and editing
    For correspondence
    petertod@umich.edu
    Competing interests
    Dr Todd served as a consultant to Denali Therapeutics and holds a shared patent on ASOs developed with Ionis Pharmaceuticals
    ORCID: 0000-0003-4781-6376

Funding

National Institute of Neurological Disorders and Stroke (F31NS113513)

  • Shannon E Wright

Eunice Kennedy Shriver National Institute of Child Health and Human Development (P50HD104463)

  • Peter K Todd

National Institute of Neurological Disorders and Stroke (R01NS099280)

  • Peter K Todd

National Institute of Neurological Disorders and Stroke (R01NS086810)

  • Peter K Todd

Veterans Administration Medical Center (BX004842)

  • Peter K Todd

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

We thank current and former members of the Todd lab for helpful discussions and commentary. This work was funded by grants from the NIH to SEW (T32-NS076401 and NRSA F31NS113513) and PKT (P50HD104463, R01NS099280 and R01NS086810). PKT was also supported by the VA (BLRD BX004842) and by private philanthropy.

Copyright

This is an open-access article, free of all copyright, and may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose. The work is made available under the Creative Commons CC0 public domain dedication.

Cite this article

  1. Shannon E Wright
  2. Peter K Todd
(2023)
Native functions of short tandem repeats
eLife 12:e84043.
https://doi.org/10.7554/eLife.84043
