The naive Tcell receptor repertoire has an extremely broad distribution of clone sizes
Abstract
The clone size distribution of the human naive Tcell receptor (TCR) repertoire is an important determinant of adaptive immunity. We estimated the abundance of TCR sequences in samples of naive T cells from blood using an accurate quantitative sequencing protocol. We observe most TCR sequences only once, consistent with the enormous diversity of the repertoire. However, a substantial number of sequences were observed multiple times. We detect abundant TCR sequences even after exclusion of methodological confounders such as sort contamination, and multiple mRNA sampling from the same cell. By combining experimental data with predictions from models we describe two mechanisms contributing to TCR sequence abundance. TCRα abundant sequences can be primarily attributed to many identical recombination events in different cells, while abundant TCRβ sequences are primarily derived from large clones, which make up a small percentage of the naive repertoire, and could be established early in the development of the Tcell repertoire.
Introduction
The human adaptive immune system employs a vast number (> 10^{11} [Clark et al., 1999]) of T lymphocytes, to detect and control pathogens. Most T cells express a single Tcell receptor (TCR) variant, which binds antigen in the form of a short peptide presented by the Major Histocompatibility Complex (pMHC) (Davis and Bjorkman, 1988). The TCR has to be specific to distinguish between self and nonselfpMHC, but due to the large number of possible foreign antigens (> 20^{9}) a specific TCR is nevertheless expected to bind many different pMHC (i.e., crossreactivity) (Mason, 1998; Sewell, 2012). The actual diversity of the TCR repertoire is unknown, but with improved sequencing techniques, estimates have risen by orders of magnitude from 10^{6} (Arstila et al., 1999), 10^{7} (Robins et al., 2009), to over 10^{8} (Qi et al., 2014).
Generation of αβ TCRs occurs in the thymus, where thymocytes randomly rearrange and imprecisely recombine gene segments to create a complete receptor (NikolichZugich et al., 2004). This heterodimer is generated by random recombination of Variable, Diversity, and Joining (V, D and J) segments for TCRβ, and V and J segments for TCRα sequences (Davis and Bjorkman, 1988). Most variability arises due to random nucleotide insertions and deletions where the segments are joined (Murugan et al., 2012). Recent estimates of the potential number of TCRs produced by this V(D)Jrecombination process range from > 10^{20} (Zarnitsyna et al., 2013) to 10^{61} (Mora and Walczak, 2019), which vastly outnumbers the number of distinct TCRs present in a human body. After generation of the TCR, T cells undergo positive and negative selection, which selects those T cells that have sufficient, but not too high, affinity for any selfpMHC (McDonald et al., 2015). About 3–5% of thymocytes survive selection (Merkenschlager et al., 1997) and enter the periphery as T cells that have not yet encountered foreign cognate antigen, that is as naive T cells.
The thymic output of new T cells decreases because of thymic involution, making peripheral division of existing cells the main source of naive T cells from early adulthood onwards in humans (den Braber et al., 2012; Kumar et al., 2018). In the periphery, naive T cells compete for cytokines, such as IL7, and need to interact with selfpMHC to survive (Tanchot et al., 1997; Takada and Jameson, 2009; Jenkins et al., 2010). Competition between Tcell specificities may reduce repertoire diversity when cells with some TCRs outcompete others (De Boer and Perelson, 1994), resulting in differences in TCR frequencies, and heterogeneous naive Tcell clone sizes. Experimental evidence for large heterogeneity in division and survival rates within the naive Tcell pool has been shown in mice (Hogan et al., 2015; Rane et al., 2018; Reynaldi et al., 2019). Such experiments are not feasible in humans, but mathematical modelling has been used to assess how fitness differences between Tcell clones may affect the frequency of clones in the naive repertoire (Stirk et al., 2008; Stirk et al., 2010; Hapuarachchi et al., 2013; Lythe et al., 2016; Desponds et al., 2016; Desponds et al., 2017; Dowling and Hodgkin, 2009; Johnson et al., 2012).
Measuring the distribution of TCRα and TCRβ sequences in samples of naive T cells can inform us about the clonesize distribution of the naive Tcell repertoire. Previous studies have reported large heterogeneity in the frequency of TCRβ sequences in naive repertoires from mice (Quigley et al., 2010) and humans (Robins et al., 2009; Venturi et al., 2011; Qi et al., 2014; Pogorelyy et al., 2017). One important factor shaping the abundance of TCR sequences is their likelihood to be produced during VDJrecombination. Rearrangements with less Ninsertions, for example, tend to be more commonly observed (Robins et al., 2009; Robins et al., 2010; Venturi et al., 2011; Pogorelyy et al., 2017). To study this in more detail, the Mora and Walczak groups developed probabilistic models that predict the generation probability of any specific TCRα or TCRβ sequence (Murugan et al., 2012; Marcou et al., 2018). They showed that these sequences ($\sigma $) differ by several orders of magnitude in their probability $\mathcal{P}(\sigma )$ of being produced by V(D)J recombination in the thymus. Differential generation probabilities do not only impact the abundance of TCRα and TCRβ sequences within an individual, but also contribute to sharing among individuals (Robins et al., 2010; Quigley et al., 2010; Venturi et al., 2011; Qi et al., 2014; Pogorelyy et al., 2017; Elhanati et al., 2018). Hence, it is essential to take the likelihood of generating a sequence into account when interpreting sequencing data of immune repertoires.
In this study, we characterize the frequency distribution of TCRα and TCRβ sequences in the naive repertoire. We analyze published and new experimental data on both the TCR α and β chain, and combine a quantitative unique molecular identifier (UMI)based TCR sequencing pipeline with mathematical modeling to consider carefully the contributions of different mechanisms that may lead to observed abundant TCRα and TCRβ sequences in the naive repertoire. Such mechanisms include experimental confounders, such as the purity of the cell populations and repeated sampling of mRNA from the same cell, and diverse biological processes including distinguishing carefully between repeat generation of identical sequences in different cells, and large naive Tcell clones. We show that all these processes are likely to contribute to the observed abundance profile of TCR sequences in samples of naive repertoires. In particular, even after all other mechanisms are accounted for, we find evidence for naive Tcell clone size heterogeneity. Specifically, the results are compatible with an underlying powerlaw distribution of naive Tcell clone sizes (Desponds et al., 2016), or more generally by models in which 1–5% of naive T cells represent large clones of 10^{5}  10^{6} cells. Preferential expansion of some clonotypes, perhaps those occurring early in development of adaptive immunity, therefore plays an important role in shaping the naive Tcell repertoire.
Results
We analysed the frequency distribution of TCR sequences in the naive Tcell compartment, using TCRα and TCRβ sequences published in Oakes et al. (2017). In brief, peripheral blood mononuclear cells (PBMCs) from two adult volunteers were FACSsorted into naive (CD27^{+}CD45RA^{high}) and various memory CD4^{+} and CD8^{+} populations. TCRα and TCRβ mRNA was reverse transcribed to cDNA molecules to which unique molecular identifiers (UMIs) were attached, followed by PCRamplification and highthroughput sequencing (HTS) on an Illumina MiSeq platform. We refer to this as experiment 1 below (for further details see Section 'Sequence analysis'). Sequence reads were processed using a customized version of the Decombinator pipeline (Thomas et al., 2013), with an improved error correction on UMIs to more reliably estimate the frequency of nucleotide TCRα and TCRβ sequences in the samples (see Section 'Sequence analysis'). Additionally, we used the RTCR pipeline (Gerritsen et al., 2016) for comparison (Section 'Sequence analysis'). The different memory populations were combined for the purpose of the analysis presented below.
Abundant TCR sequences are frequently shared between naive and memory populations, and are enriched for high V(D)J recombination probabilities
Within the naive Tcell repertoires, the vast majority of TCRα and TCRβ sequences were observed only once, and most frequencies fall within the range from 1 to 5 (Figure 1A). As expected, in the memory repertoires, which contain clonally expanded T cells, much more abundant sequences were present, with a substantial number of α and β chains observed more than 1000 times (Figure 1A). The few sequences observed with a frequency higher than five in the naive samples were shared in most cases (94.6%) with the corresponding memory subset from the same individual. We examined whether this overlap might arise from imperfect sorting of the Tcell populations, despite the tight nonoverlapping sort gates applied (see [Oakes et al., 2017]). A prediction of such sorting contamination is that the abundance of the shared TCR sequences in the naive and memory repertoires should be proportional. Such a linear relationship could be observed clearly for CD8^{+} TCRα and TCRβ sequences (Figure 1A), especially for memory abundances greater than 1000. Correlation measurements suggested that the amount of contamination for CD8^{+} T cells was 0.1  1.5%. As expected, no correlation was observed between the abundance of TCR sequences shared between naive and memory populations of different donors (Figure 1B).
We next examined the relationship between V(D)J recombination probabilities and the overlap between naive and memory repertoires. Using the V(D)Jrecombination model of Marcou et al. (2018), we predicted the generation probabilities $\mathcal{P}(\sigma )$ of all TCRα and TCRβ sequences in our datasets. As expected, we observed a wide range of $\mathcal{P}(\sigma )$ values, which were several orders of magnitude higher for TCRα sequences than TCRβ, due to additional recombination of the D segment. The generation probability distributions of sequences derived from naive and memory T cells were indistinguishable (Figure 1C, blue and red, respectively). Thus, our data provide no evidence that the V(D)Jrecombination process preferentially produces sequences that are more likely to enter the memory pool during an immune response. However, TCRα sequences shared between memory and the corresponding naive samples, were strikingly enriched for high $\mathcal{P}(\sigma )$ (Figure 1C, green). This enrichment is much less evident for TCRβ sequences. The enrichment for sequences with high $\mathcal{P}(\sigma )$ in the population of shared memory/naive TCRα is not compatible with overlap derived from contamination during cell sorting, but rather suggests that the sharing may also arise from T cells which use the same TCRα because of identical VJ recombination events in different T cells. It is important to stress that, since such different T cells are highly unlikely to also share TCRβ sequences, the clonotype, and hence specificity of the T cells in the naive and memory compartments may well be different, despite sharing TCRα sequences.
As a control, we also analyzed overlap between the naive sample from one volunteer and the memory sample from the other. In this case, sort contamination of naive repertoires by memory T cells is excluded and a shared sequence can only result from independent identical recombination events, from distinct Tcell clones. For CD4^{+} cells, we find that the number of TCRα sequences shared between naive and memory is similar between and within volunteers, and that the $\mathcal{P}(\sigma )$ distribution is nearly identical (Figure 1C, purple). For CD8^{+} cells, the number of sequences shared within an individual is somewhat larger than between individuals, compatible with some degree of sort contamination in this population as discussed above. The small number of TCRβ sequences shared between individuals also had a relatively high $\mathcal{P}(\sigma )$, although considerably smaller than for TCRα.
In summary, although contamination with abundant memory T cells may make a small contribution to the TCR sequences which are found in both naive and memory for CD8^{+} cells, multiple identical recombinations arising from high $\mathcal{P}(\sigma )$ values is the dominant mechanism leading to overlap in the TCRα repertoires. Nevertheless, in order to stringently exclude any possible contribution of contamination, we included an analysis which excluded all the shared sequences from the further investigations of the relationship between TCR sequence abundance and $\mathcal{P}(\sigma )$ (Figure 1D).
The abundances of sequences in all naive repertoires were correlated to $\mathcal{P}(\sigma )$ (Figure 1D, blue). The median $\mathcal{P}(\sigma )$ of the α chains that were observed at least three times was about 154fold higher than for those that have only been observed once ($p<{10}^{15}$, Wilcoxon test). The enrichment for high $\mathcal{P}(\sigma )$ in more abundant TCR sequences was weaker for TCRβ (∼2.5fold, $p<0.01$, Wilcoxon test), but still stronger than for memory subsets (1.65 and 1.03fold for TCRα and TCRβ, respectively, $p<{10}^{15}$ and $p=0.27$). In line with this, the number of Nadditions tended to be lower for TCRα and TCRβ sequences abundant in the naive samples (Figure 1—figure supplement 1). These correlations suggest that multiple identical recombination events which occur during formation of the naive Tcell repertoire in the thymus due to high generation probabilities, contribute to the observation of abundant TCR sequences. This is especially evident for TCRα, where the probabilities of producing a given sequence are higher because of the absence of a D region. However, abundant TCR sequences with low $\mathcal{P}(\sigma )$ are also observed, especially for TCRβ, leaving open the possibility of large naive Tcell clones.
Frequently observed TCR sequences cannot be attributed only to multiple RNA molecules per cell
T cells contain on average in the order of 100 molecules of TCRα and 300 molecules of TCRβ mRNA (Oakes et al., 2017). Because the TCR sequencing pipeline is not 100% efficient, only a small proportion of these molecules are actually sequenced, but the possibility remains that TCR sequences observed multiple times may be due to repeat sampling from the same cell. Because the variance of this number remains undetermined, it is difficult to computationally determine the contribution of this multiple sampling to the data. Instead, we performed an additional experiment (referred to as experiment 2) in which we sorted naive T cells from an additional volunteer, and split the naive T cells into three subsamples before mRNA extraction. We then carried out library preparations and sequenced TCRα and TCRβ sequences from each subsample independently. In this experiment, sequences observed in more than one subsample must have been derived from different cells, and cannot be a result of sequencing multiple mRNA molecules from a single cell. Repeated sequences must therefore derive from different cells, and represent abundant sequences.
In total 16913 (3.4%) TCRα sequences, and 5744 (0.61%) TCRβ sequences, were observed in more than one subsample (Figure 2A), confirming the existence of a substantial number of frequent TCR α and β chains in the full naive repertoire. In order to exclude any contribution from sort contamination, we also plot the data after removing all TCR sequences found in both memory and naive repertoires (Figure 2A, grey bars). A substantial number of α and β chains were still found in multiple subsamples. In order to estimate the impact of multiple sampling on the observed abundances we randomly permuted the TCR sequences between subsamples, and reanalyzed the distributions (see detailed explanation in Section 'Subsampling to exclude inflated abundance through multiple RNA contributions by single cells'). We estimated that ∼25% of α and > 75% of β chains with an abundance of greater than 1 in an individual sample may arise from sampling multiple RNA molecules from single cells. The impact is strongest on TCR sequences observed twice (see Figure 2—figure supplement 1). Thus multiple mRNA sampling is an important confounder of estimating TCR sequence abundances in individual repertoires, especially for TCRβ.
Having ruled out the contribution of multiple mRNA sampling experimentally, we examined the relationship between TCR sequence abundance and $\mathcal{P}(\sigma )$ in this new data set. The TCRα chains present in more than one naive subsample are dominated by sequences with high $\mathcal{P}(\sigma )$. The median generation probability of TCRα sequences observed in two and three subsamples was 56 and 165fold higher, respectively, than those observed only once (Figure 2B). The relationship for TCRβ sequences was remarkably different, however. While TCRβ sequences observed in two subsamples are mildly enriched for high generation probabilities, those observed in three subsamples have hardly any enrichment for high $\mathcal{P}(\sigma )$ (Figure 2B). Instead, their generation probabilities tend to be lower than those of the sequences observed in two subsamples, and more similar to the generation probabilities of TCRβ sequences seen in only one subsample. We obtained similar results when measuring the number of Nadditions and VJdeletions in the rearrangements: abundant α chains (with incidence 2 or 3) tend be closer to germline rearrangements, while this was only the case for β chains with incidence 2, and not for the most abundant β chains with incidence 3 (Figure 2C and D). These trends were observed both with and without removing the sequences that were also observed in memory (Figure 2, grey versus colored bars) and when processing the data with RTCR (Figure 2—figure supplement 2).
We further explored whether the more abundant sequences were also more ‘public’ (found in the repertoires of multiple individuals), which would be predicted if they are more likely products of V(D)Jrecombination. We measured the degree of sharing between those TCR sequences observed in 1, 2, or 3 naive subsamples, and the TCRα and TCRβ repertoires of unfractionated blood samples collected from 28 healthy donors (details in Section 'Sharing of TCRα and TCRβ sequences'). Both TCRα and TCRβ sequences observed in two or three subsamples were found to be significantly more often shared with this independent cohort than those observed once (Figure 2—figure supplement 3A). The most frequent TCRα sequences, which were seen in three subsamples, showed the highest sharing degree, consistent with their strongest enrichment for high generation probabilities. The relatively small number of most frequent TCRβ sequences (i.e., those observed in three subsamples), did not show increased interindividual sharing compared to the TCRβ sequences observed in two subsamples. Additional comparison with publicly available TCRβ data from a large cohort (Emerson et al., 2017) (see Section 'Sharing of TCRα and TCRβ sequences') showed that the most frequently observed β chains, which were observed in all three subsets in experiment 2, were less public than sequences observed in two subsamples (Figure 2—figure supplement 3B). The seemingly paradoxical finding that the most abundant TCRβ sequences (observed in all three subsamples) have lower $\mathcal{P}(\sigma )$, and are less public than those found twice, is explored in more detail below.
Computational models of TCR repertoire generation suggest the presence of a small proportion of large Tcell clones in the naive repertoire
In order to more rigorously test our ideas about the frequency distribution of clonotypes in the naive Tcell repertoire, we explored a number of possible computational models of repertoire generation and sampling, and compared model predictions with the experimental data discussed above. The first simplest scenario we considered was a neutral model of repertoire formation, similar to Hubbell’s Neutral Community Model (Hubbell, 2001; Figure 3A, details in Section 'Neutral model for dynamics of naive T cells'). The model assumes that there is no selective advantage of one TCR over another, and therefore the TCR of a naive T cell does not affect its lifespan or division rate. Consider a pool of $N$ naive T cells, from which cells are removed by cell death or by priming with antigen, leading to differentiation into a memory population. A fraction θ of these cells is replaced by thymic production of new clones and the remaining fraction 1  θ gets replaced by division of cells present in the pool. When simulating the naive Tcell pool with this model, the clonesize distribution approaches a ‘steady state’ (not shown). We use this steadystate distribution, for which we have an analytical expression (Section 'Neutral model for dynamics of naive T cells') to predict the size of clones in the naive Tcell pool. As the contribution of thymic output decreases during aging (Steinmann et al., 1985), we evaluated the model for a wide range of values for θ. The clonesize distribution which emerges from the neutral model is approximately geometric for clone sizes larger than the introduction size $c$ (Figure 3B, Section 'Neutral model for dynamics of naive T cells'). We compared this basic model to models in which we impose other distributions on the underlying clonotype abundances (model details in Section 'Clonesize distributions of the naive Tcell pools'). We specifically focused on heavytailed distributions such as lognormal and powerlaw distributions, which have previously been associated with Tcell repertoires (Desponds et al., 2016). The shape of each of these distributions is controlled by a single parameter (as shown in Figure 3B), allowing us to compare distributions with different degrees of heterogeneity. In all cases, we normalized the clonesize distribution such that the total number of cells $N$ is constant. Since we had separate experimental data for CD4^{+} and CD8^{+} cells, we considered CD4^{+} and CD8^{+} cells separately, setting $N(CD4)=7.5\times {10}^{10}$, and $N(CD8)=2.5\times {10}^{10}$.
From all model clonesize distributions we simulate three subsamples, so as to compare with the data from the second experiment described above. Each sampled TCR is assigned a TCRα and a TCRβ sequence that were generated with IGoR (Marcou et al., 2018). Previous studies showed that α and β chains with higher generation probabilities tend to have a higher probability to survive selection (Elhanati et al., 2014). Therefore, we train a simple $\mathcal{P}(\sigma )$dependent selection model on the data from the single naive Tcell samples shown in Figure 1. First, we assume that productively rearranged chains have an overall 1/3 probability to survive thymic selection. Then we bias the probability for bins of sequences based on their $\mathcal{P}(\sigma )$, such that the resulting set of α and β chains has the same generation probability distribution as in the experimental repertoire data (Section 'In silico samples from modelled clonesize distributions'). The models also incorporate the expected number of cells that contribute at least one mRNA molecule. This parameter is also learnt from the data, by setting the number of cells that contributed mRNA such that the predicted diversity of a subsample matches the observed diversity (Section 'In silico samples from modelled clonesize distributions'). Taken together, the subsamples we take from the various model clonesize distributions are such that they match the generation probabilities and diversity of the experimental subsamples as closely as possible. We compare the number and median $\mathcal{P}(\sigma )$ of the TCR sequences that are predicted to occur in only one, in two or in three subsamples, with the equivalent experimental data from experiment two above (Figure 4).
We consider first the neutral model (Figure 4A and C). For the $\alpha $ chain, a wide range of thymic output rates predict the number of chains occurring in 1, 2 and 3 subsamples reasonably well (Figure 4A). The model does not predict the median $\mathcal{P}(\sigma )$ of TCRα found in 2 and 3 subsamples well, although qualitatively the model does predict the increasing $\mathcal{P}(\sigma )$ with increasing abundance (Figure 4C). For the β chains, there is no range of thymic output rates for which the model correctly predicts the number of sequences observed in 2 and 3 subsamples. Moreover, the observation that incidence two chains have higher $\mathcal{P}(\sigma )$ than incidence three chains was not predicted for any value of θ (Figure 4C). Thus, although the neutral model captures some features of the observed TCRα sequence abundances, it cannot account for observed TCRβ distributions. A similarly poor match between observed and predicted data is observed for lognormal clonotype model (Figure 3B) distributions (not shown).
In contrast, there is a much better fit between observed and predicted data is obtained when the model clonotype frequencies are modelled by a powerlaw distribution (Figure 4B and D). Like the distributions discussed above, in the parameter range where clonesize heterogeneity is limited (i.e. a steep slope), a powerlaw distribution predicts both the number of TCRα sequences found in 2 and 3 samples, and their larger median $\mathcal{P}(\sigma )$. The number of TCRβ sequences is also predicted well if the slope is close to 2.3 (Figure 4B). Remarkably, for this slope the median $\mathcal{P}(\sigma )$ of TCRβ sequences found in two samples is higher than the median $\mathcal{P}(\sigma )$ of TCRβ sequences found in three samples (Figure 4D). Intuitively, we can understand this observation as reflecting the properties of powerlaw distributions, combined with the lower generation probabilities of TCRβ. Identical TCRβ recombinations occur frequently enough to make a detectable contribution to the TCR sequences observed in two samples, but not to those detected in three samples. Therefore, a significant proportion of TCRβ sequences observed twice are in fact derived from two or more different naive Tcell clones. In contrast, TCR sequences observed three times (or more) must be derived from large naive Tcell clones. Abundant TCRα sequences arise both from large clones and summation of identical TCRα from multiple smaller clones, but due to their higher generation probabilities, the latter dominates the $\mathcal{P}(\sigma )$ for TCRα sequences found twice and three times. Finally, we note that although TCR sequence abundance in the single samples from experiment one is likely to incorporate multiple mRNA from single cells, the powerlaw distribution also predicts abundances in the single samples of experiment one reasonably well (Figure 4—figure supplement 1).
The vast majority of TCR sequences in samples of naive T cells are observed only once, and hence we cannot infer anything about their frequency in the whole repertoire, except that it is likely to be below a given abundance threshold. Therefore we explored whether a more generalised model, which does not make any assumptions about the distribution of the low abundance T cells, would predict our experimental data as well as the powerlaw model. In this simple mixture model we generate a population in which the majority of cells are present only once, and a minority are present many times. We scanned the parameter space of this model, varying both the proportion of cells in each population, and the size of clones in the larger population. The prediction of the model for each parameter pair was compared to the experimental data from experiment 2, both for the number of TCRs (combining α and β sequences) observed in one, two or three subsamples, and for the median $\mathcal{P}(\sigma )$ of these TCR sequences. The best agreement between model and data was observed when 1–5% of the cells were derived from abundant Tcell clones (between 10^{5} and 10^{6} cells in the whole repertoire) (Figure 4E).
Abundant Tcell sequences are enriched for zero insertions and for antigenassociation
In human prenatal thymocytes, the enzyme terminal deoxynucleotidyl transferase (TdT) is not expressed, leading to the production of TCR sequences with zero insertions of Nnucleotides. Pogorelyy and colleagues showed that enrichment of zero insertion TCR sequences can be used to detect fetal clones even in adults, and that their contribution to the overall repertoire decays slowly with age (Pogorelyy et al., 2017). Interestingly, the proportion of zeroinsertion sequences was strongly enhanced in those sequences observed more than once in the three subsamples examined in experiment 2 (Figure 5A). The interpretation of this finding is not straightforward, since zeroinsertion TCR sequences have higher median generation probabilities, and this is also a property of abundant sequences as discussed above. Nevertheless, the data are compatible with a model in which the large clones observed in the repertoire are generated preferentially during early prenatal development of the naive Tcell repertoire.
We next examined if the abundant sequences in our data showed characteristics of semiinvariant NKT and MAIT cell populations. Classical NKT cells are characterized by an invariant TRAV24TRAJ18 α chain and β chains with TRBV11 (Dellabona et al., 1994). MAIT cells are enriched for TCRα rearrangements of TRAV12 with TRAJ33, TRAJ12 and TRAJ20 (Reantragoon et al., 2013), and TCRβ sequences predominantly using TRBV20 and TRBV6 (Lepore et al., 2014). Since our HTS data does not contain information on αβ pairing, we studied both chains separately. A substantial fraction of the observed TCRβ sequences matches the characteristics of MAIT cells, and to a lesser extent NKT cells (Figure 5B and C). For both cell types, however, this fraction does not show a clear relation to incidence, and does not suggest enrichment for MAIT or NKT cells among abundant sequences. The most abundant TCRα sequences are enriched for NKT sequences, but these still account for only a small fraction of the total (0.3% and 1.7% for CD4^{+} and CD8^{+}, respectively, Figure 5B). Hence, we conclude that only a small fraction of the abundant sequences are derived from clones with a MAIT or NKT cell phenotype.
Finally we analysed whether the abundant TCR sequences in the naive population could be detected in a database of TCR sequences with known antigen specificity (Shugay et al., 2018). Interestingly, there was a striking enrichment of TCR sequences with known antigenspecific annotation within the high abundance TCRα sequences observed in more than one subsample from experiment 2, and to a lesser extent for TCRβ sequences (Figure 5C). Interpretation is again not straightforward, because the high generation probabilities of the abundantly observed chains could lead to these sequences being overrepresented in the database (ascertainment bias). Additionally, the observation may also reflect the fact that the naive Tcell populations we sequenced contained some antigenexperienced T cells with a naive phenotype (Pulko et al., 2016). Finally, the observation is also compatible with the hypothesis that TCR recombination has evolved to preferentially generate TCRs specific to common pathogens like CMV or EBV as discussed in Thomas and Crawford (2019).
Discussion
The diversity and clone size distribution of the naive Tcell repertoire has been the subject of considerable debate, fueled by the difficulty of obtaining more than a very small sample of the total repertoire, and by a variety of other technical considerations which we address in this study. We use a quantitative UMIbased sequencing protocol, and careful error correction to analyse the naive and memory repertoires from three healthy human volunteers. We convincingly demonstrate that a small proportion of the TCR sequences are present more than once in a sample of naive T cells from blood, corresponding to expected frequencies greater than 1 in 10^{5}. This number of abundant TCRα sequences is higher than the number of abundant TCRβ sequences.
We carefully considered different mechanisms that could give rise to these abundant TCR sequences. We examined the contribution of potential contamination of the naive population with abundant T cells from the memory compartment during the sorting process, but the extent of such contamination was small (for CD8^{+} cells) or not detectable (for CD4^{+} cells). Furthermore, exclusion of all TCR sequences which occurred in both memory and naive populations did not alter the subsequent conclusions of the analysis. We also considered the possibility that abundant TCR sequences were observed due to sampling multiple mRNA molecules from the same cell. In order to exclude this possibility, we carried out an experiment where we divided up a sample of sorted naive T cells into three subsamples prior to lysis, and sequencing. In this experimental paradigm, TCR sequences found in more than one subsample must arise from different T cells. We observed that repeat sampling of mRNA from the same T cell did indeed occur, and might account for as much as 75% of the high abundance TCRβ sequences (for which there are more mRNA molecules per cell, [Oakes et al., 2017]), and as much as 25% of the TCRα sequences. However, this effect was mostly restricted to TCR sequences observed twice, and made little contribution to TCR sequences observed three or more times.
Having excluded methodological causes of high abundance TCR sequences, we examined two biological mechanisms which could explain the data. The first mechanism we consider is that abundant sequences derive from identical TCRα and TCRβ rearrangements occurring in multiple clones. In this model, abundance arises not from multiple sampling from the same large clone of T cells, but from summation over many different clones of T cells, each of which share an α or β chain. The second mechanism is that the naive repertoire clone size distribution is not uniform, but contains many small and some large clones. We combine computational models with experimental data to provide evidence that both mechanisms are required to explain the observed data. The first mechanism dominates the repertoire of TCRα, and is likely to contribute to the majority of observed abundant sequences. Interestingly, the model suggests that those TCRα which have the highest probability of generation are produced hundreds of thousands, or even millions of times within an individual, and must therefore be produced extremely frequently in the thymus. In contrast, the first mechanism has a smaller impact on the TCRβ repertoire, and abundant TCRβ sequences are more likely to arise from large clones in the naive repertoire.
The experimental limitations of sampling small volume of blood which contains only a tiny proportion of the total repertoire has dramatic effects on the observed TCR frequency distribution. One can use the analytical solution of the neutral model (Section 'Neutral model for dynamics of naive T cells') with thymic introduction size $c=1$ to illustrate this extreme sampling effect: ${\widehat{F}}_{i}\approx {F}_{i}{(\frac{s}{\theta})}^{i}$, where $\widehat{{F}_{i}}$ and ${F}_{i}$ are the number of clones present with $i$ cells in the sample, and in the pool, respectively, and $s$ is the fraction of the repertoire that was sampled (here s ∼ 10^{6}). Since $s/\theta $ is of order 10^{5} and this is raised to the $i$^{th} power, even very large TCR clones become rare in such a sample. Because of this, it is difficult to be definitive about the exact underlying Tcell clone frequency distribution which gives rise to the abundant TCR sequences we observe. The data are certainly compatible with a powerlaw distribution, as has been suggested previously (Desponds et al., 2016). But many distributions made up of a mixture of rare clones and a small proportion (1–5%) of large clones (10^{5}  10^{6}) are compatible with the data we observe.
The demonstration of large clones in the naive repertoire raises the question of what determines the different sizes of different clonotypes. The neutral model already excludes repeated thymic production as explanation for large clones, because the combined probability of repeated αβclone production is very low (Dupic et al., 2019). We confirmed that the abundant TCR sequences were not strongly enriched for sequences characteristic of iNKT and MAIT cells (Figure 5B&C). An alternative explanation is that the large clones may actually be antigenexperienced, but with a naive phenotype such as memory stem T cells (Gattinoni et al., 2011; Lugli et al., 2013a; Lugli et al., 2013b; Fuertes Marraco et al., 2015; Pulko et al., 2016). However, a number of alternative explanations for this enrichment exist, as discussed above. Furthermore, antigenexperienced T cells should be present in the memory populations, and large clones could still be observed even after removal of all cells which occur in both memory and naive populations. So, although we cannot exclude that some of the frequent TCR sequences may be derived from T cells that are not truly naive, we believe the data argue for the existence of truly naive large clones.
We speculate that the most likely mechanism for large clones is preferential growth/survival of some clones, presumably due to preferential selection on selfpeptide/MHC (Rudd et al., 2011; Lythe et al., 2016). Intriguingly, the abundant TCR sequences we observed were enriched for sequences without Ninsertions, a characteristic of TCRs produced prenatally (Pogorelyy et al., 2017). The large clones may therefore be established very early in the development of Tcell adaptive immunity, before homeostasis of the immune system is achieved and when more rapid division and clonal expansion may be favoured.
In conclusion, our study highlights the huge impact of subsampling on correct interpretation of TCR repertoire data. It provides evidence for two different mechanisms which give rise to abundant TCR sequences in the naive human repertoire. The first mechanism, driven by multiple identical recombination events, is frequently overlooked in the analysis of Tcell repertoires, but has important implications in interpretation of observed sharing between different Tcell subpopulations of an individual, and between individuals (public TCR sequences). The second mechanism suggests that the TCR sequence plays a critical role in naive Tcell homeostasis. Further experiments will be required to fully elucidate the cellular and molecular mechanisms which underlie the heterogeneity of the naive Tcell repertoire.
Materials and methods
Cell sorting and sequencing
Request a detailed protocolSequence reads came from T cells extracted from blood samples of three healthy volunteers, between 30 and 40 years old. Using CD27 and CD45RA markers, FACSsorting was performed, identifying naive (CD27^{+}CD45RA^{+}), CM (central memory, CD27^{+}CD45RA^{}), EM (effector memory, CD27^{}CD45RA^{}) and EMRA (effector memory RA, CD27^{}CD45RA^{+}) cells. Barcoded TCRα and TCRβ cDNA libraries were obtained by reverse transcription of RNA molecules coding for either the α or β chain, respectively, followed by single strand DNA ligation to attach unique molecular identifiers (UMIs) of 12 nucleotides. These were PCRamplified and sequenced using the Illumina MiSeq platform. For full description of the sequencing procedure, we refer to Oakes et al. (2017) and Uddin et al., 2019. The raw sequence files are available on the Sequence Read Archive (https://www.ncbi.nlm.nih.gov/sra, RRID:SCR_004891) as experiment SRP109035.
Sequence analysis
Request a detailed protocolWe used the Decombinator pipeline (Thomas et al., 2013) (Version 3.1, RRID:SCR_006732) to demultiplex, annotate, and errorcorrect the raw sequencing reads. Our reads contain UMIs of 12 base pairs that can be used to identify which TCRα or TCRβ sequences are derived from the same cDNA molecule. Decombinator performs error correction on sequences by collapsing those that are similar and are associated with the same UMI. The pipeline also error corrects UMIs, collapsing those UMIs that are associated with the same TCRα or TCRβ sequence and differ from each other by 2 or fewer sequence edits (i.e. the default barcode threshold). This error correction assumes it is unlikely for any sequence, irrespective of its frequency, to contain two UMIs that are nearly identical, concluding the UMIs are different because of PCR or sequencing errors.
We improved this by setting the barcode threshold to 0 and replacing it by an UMI error correction algorithm that takes the number of UMIs into account. Consider a TCRα or TCRβ sequence supported by $i$ different UMIs, that is with frequency $i$. The Hamming distance, $H$, between two random UMIs of 12 base pairs can be represented by a binomial random variable, $H\sim \mathrm{B}(n,p)$, where $n=12$ and $p=\frac{3}{4}$ (assuming uniform frequencies of the 4 different bases). There are $\left(\begin{array}{l}i\\ 2\end{array}\right)$ distinct comparisons between the $i$ UMIs, and assuming that every comparison is independent, the expected distribution of Hamming distances is ${n}_{i}(h)=\left(\begin{array}{l}i\\ 2\end{array}\right)\mathcal{\wp}(H=h)$. To determine whether two UMIs are unexpectedly similar, we define a threshold distance that depends on the frequency of their TCRα or TCRβ sequence ($i$):
Our algorithm corrects UMIs for a given sequence as follows: From $d=1$ to $d={D}_{\alpha}$, for all UMI pairs with $H\le d$, add the read count of the less frequent UMI to the more frequent UMI and remove the former. We applied this algorithm to every TCRα and TCRβ sequence in our HTS data using α = 0.05. The effects of this correction method are shown in Figure 6. After the improved correction, the distribution of Hamming Distances within and between distinct TCRα and TCRβ sequences is very similar, indicating that most erroneous UMIs have been removed. Our improved correction decreases the estimated frequency of many sequences at low frequencies, which indicates that many TCRα and TCRβ sequences that were observed two or three times, are actually singletons for which the UMI was mutated once or a few times. In the example given in Figure 6, the number of sequences that were observed more than once decreased with 66% by our improved correction (from 11491 to 3855), whereas the default correction estimated 9342 (only 19% reduction) of the sequences to have more than one true UMI.
Because our analysis focuses on the naive Tcell repertoire, we combined the different memory populations by adding the abundance of identical TCR sequences (V and J annotation as well as CDR3 nucleotide sequence) in the corresponding CM, EM and EMRA samples. We included for analysis the sequences that were reported as functional by Decombinator and had nonzero $\mathcal{P}(\sigma )$. We also processed the HTS reads with RTCR (Gerritsen et al., 2016) (Version 0.4.3). This pipeline determines a samplebased error rate and uses this rate to perform clustering on reads. Compared to Decombinator, RTCR estimates our reads to contain more PCR and sequencing errors and therefore tends to be more conservative in terms of reported diversity. Because RTCR reports fewer distinct rearrangements per sample, the overlap between samples (i.e., the number of chains with incidence 2 and 3) is lower than in Decombinator output. For each of the maintext figures, a supplemental RTCRbased version is provided. Although the quantitative results are not identical, the RTCR results qualitatively match those of the Decombinator output, confirming that our results are not algorithmdependent.
Subsampling to exclude inflated abundance through multiple RNA contributions by single cells
Request a detailed protocolAn important step in our analysis is the additional experiment in which the naive cells were split into three parts before mRNA extraction. The probability for a naive cell to be sampled from the pool is very low (< 10^{5}), but once a cell has been sampled it may likely contribute multiple RNA molecules. These would then be sequenced with different UMIs, inflating the abundance we measure in a sample. Hence, we use subsampling to avoid the noise on TCRα and TCRβ abundance introduced by variable TCR expression between cells. To quantify the possible effect of single cells contributing multiple RNA molecules, we performed a permutation test. We computationally joined the sequences observed in the three independently sequenced replicates, adding the abundance (as measured by UMIs) in each of the three subsamples together. We then randomly assigned the UMIs of these sequences to one of three artificial portions and again scored the incidence of all TCRα and TCRβ sequences. In this setting, RNAs contributed by single cells in a single sample, can be distributed over multiple permuted samples. This was done 10 times for each set of sequences and we found that permutation led to a large increase in the number of sequences occurring in multiple samples (Figure 2—figure supplement 1A). We quantified the number of abundant chains, by counting sequences observed in multiple samples (Supplementary file 2).
When multiple RNA molecules from a single cell can contribute a UMI (i.e., in the permuted set and within a single sample), the number of abundant sequences is greatly overestimated. About 25% of abundant α chains in this setting is actually due to inflated counts. For β chains the effect is much larger, with over 75% of abundance due to RNA content. This difference is consistent with our previous finding that T cells contain in the order of 300 TCRB and 100 TCRA RNA molecules per cell (Oakes et al., 2017). Moreover, the lower $\mathcal{P}(\sigma )$ values of β chains readily explains that there are fewer true duplets and triplets than for α chains. Subsampling appears to be very important when obtaining our most surprising result that high $\mathcal{P}(\sigma )$ values are enriched for β chains with incidence 2, but not incidence 3. After permutation, most duplets are due to RNA content (Supplementary file 2) and therefore no longer enriched for high $\mathcal{P}(\sigma )$ (Figure 2—figure supplement 1B). These results highlight the importance of our additional step of taking a single blood sample, dividing it into three portions and then analyzing all three subsamples separately.
Sharing of TCRα and TCRβ sequences
Request a detailed protocolWe sequenced TCRα and TCRβ from whole blood samples taken from 28 healthy volunteers. The study was carried out in accordance with the recommendations of the UK Research Ethics Committee with written informed consent of all subjects. All subjects gave written informed consent in accordance with the Declaration of Helsinki. The protocol was approved by the University College London Hospital Ethics Committee 06/Q0502/92. The raw sequence files are available on the Sequence Read Archive (https://www.ncbi.nlm.nih.gov/sra, RRID:SCR_004891) as experiments SRP045430 and SRP151125. In order to measure how public the individual sets of sequences were, we measured their degree of sharing between our naive samples and these whole blood repertoires.
As shown in Figure 2, we have three sets of sequences, those with incidence 1, 2 and 3. For each set, we measured which fraction is also found in the 28 independent whole blood samples, which delivers 28 estimates of sharing. More precisely, we counted the number of shared TCRα and TCRβ sequences between the sets of sequences observed in two and three naive subsamples, and compared these to sharing with an equal size sample of naive sequences which were only observed in one subsample. Since the number of sequences which occurred more than once was much smaller than the number of sequences which only occurred once, we subsampled the set of unique sequences 10 times. The results are shown as the number of shared TCRα or TCRβ for each whole blood repertoire, as a proportion of their number of sequences in the samples being tested (Figure 2—figure supplement 3A). In order to study the sharing of the β chains in our data with higher resolution, we also analyzed overlap of the sets of sequences with the TCRβ data from a large cohort of 786 people published in Emerson et al. (2017); Figure 2—figure supplement 3B.
Neutral model for dynamics of naive T cells
Request a detailed protocolTo model naive Tcell dynamics in the absence of peripheral selection, we developed a model that is similar to the Neutral Community Model (NCM) of Hubbell (2001). Naive T cells, viewed through an ecological lens, are individuals, and all naive T cells sharing the same TCRα and TCRβ sequence are part of the same species (αβclone). Neutrality, as defined by Hubbell, means that all species have the same per capita probability of birth (peripheral division) and death. When considering the model, we ignore the very small chance that an existing αβclone is produced again by the thymus. Hence, in our simulations we assume that the thymus produces Tcell clones that are unique and novel.
Consider a pool of $N$ naive T cells belonging to clones, each consisting of $i$ cells, which changes by thymic production, cell division and cells leaving the naive pool (as a result of cell death or activation). During each event, one randomly selected cell exits the pool, causing the corresponding clone to decrease in size from $i$ to $i1$ cells. With probability $1\theta $, another randomly selected cell will divide, causing the corresponding clone to increase its size from $i$ to $i+1$ cells. Alternatively, with probability $\theta $, thymic production can occur: every $c$ events in which no peripheral division occurred, the thymus will release $c$ cells of a newly produced clone. So, the pool size $N$ only fluctuates by $c$ cells, and because $N\gg c$, the total number of cells stays almost constant during the entire simulation. The per capita birth rate ($(1\theta )/N$) and death rate ($1/N$) are equal for all Tcell clones, which makes this a neutral model. In this discretetime model, exit and production are coupled, but its dynamics can be approximated by a continuoustime model, in which thymic production, cell division, and deaths are uncoupled Poisson processes. This is illustrated by the Markov chain depicted in Figure 7, in which the states are clone sizes and the rates show the probabilities of clones moving to another state.
This Markov process describes the dynamics of the clonesize distribution $F$, that is the total number of clones ${F}_{i}$ consisting of $i$ cells. After many birth and death events, individual clones still change in clone size over time, but the clonesize distribution approaches equilibrium. At this steady state, the rate at which new clones enter the naive pool, $\theta /c$, equals the rate at which clones leave the pool, that is ${F}_{1}(1/N)$. Hence, in equilibrium, the number of singletons, clones with only one cell, approaches ${F}_{1}=\theta N/c$. The total rate at which the cells of clones with $i$ cells divide and die depends on the total number of cells belonging to ${F}_{i}$ clones: $i{F}_{i}$. For clone sizes up to $c$ cells, the rate at which the cells of the ${F}_{i}$ clones die, ($i{F}_{i}/N$), balances the division the cells of ${F}_{i1}$ clones ($(i1){F}_{i1}(1\theta )/N$) and the rate at which new clones enter the pool ($\theta /c$). The analytical solution to this recurrence relation $i{F}_{i}/N=(i1){F}_{i1}(1\theta )/N+\theta /c$ is:
For states with $i>c$, only birth and death of cells need to balance between states $i1$ and $i$ (as there is no net flux from clones introduced by the thymus): $i{F}_{i}/N=(i1){F}_{i1}(1\theta )/N$. This recurrence relation has the following analytical solution:
When predicting the full clonesize distribution, we use Equations 2 and 3 to calculate the steadystate distribution. The total number of all distinct clones (i.e. the richness) in the steadystate repertoire is simply the sum over all their frequencies ${F}_{i}$, $R={\sum}_{i=1}^{\mathrm{\infty}}{F}_{i}$, which has a simple closedform solution for $c=1$,
The Simpson’s diversity of the steady state repertoire also has a simple form,
which equals ${F}_{1}=\theta N$ for $c=1$, and is a saturated function of $\theta $ if $c>1$.
We consider the sampling process of a small fraction $s$ from a naive Tcell pool of $N$ cells, which clones follow the distribution $F$ in Equation 2 and Equation 3. Assuming the naive pool is large and wellmixed, the number of T cells, $X$, sampled from the $j$ cells belonging to a particular clone, can be approximately represented by a binomial random variable, ${X}_{j}\sim \mathrm{B}(n=j,p=s)$. The expected clonesize distribution of the sample, $\widehat{F}$, is then given by
The strong distortion of sampling from clonesize distributions can be illustrated using the analytical solution of Equation 6 for the neutral model for $c=1$:
Since $s$ is typically very small, this equation can be simplified to ${\widehat{F}}_{i}\approx {F}_{i}{(\frac{s}{\theta})}^{i}$ (as $s\ll \theta $), which clearly shows that even very abundant clones will become rare or absent in a small sample.
Clonesize distributions of the naive Tcell pools
Request a detailed protocolSince our data contains separate data on both CD4^{+} and CD8^{+} T cells, we predicted the clonesize distributions of both subsets separately. To account for the larger CD4^{+} pool (Wertheimer et al., 2014; Westera et al., 2015), we set its pool size $N=7.5\times {10}^{10}$ cells, while we used $N=2.5\times {10}^{10}$ for the naive CD8^{+} pool.
When analyzing the neutral model, we used its steadystate distribution (Equation 2 and Equation 3). Since the β chain rearranges first, followed by a few divisions before rearrangement of the α chain (Gonçalves et al., 2017), we use $c=100$ for TCRβ and $c=10$ for TCRα. We also used various phenomenological clonesize distributions that are not based on a mechanistic model. To allow for exploration of a wide range of distributions, we chose mathematical functions which form can be changed by a single parameter, such as the slope of the powerlaw distribution.
The powerlaw distribution with form ${F}_{i}={F}_{1}\times {i}^{k}$ shows a straight line on a loglog plot. Since all ${F}_{i}$ are written as a function of ${F}_{1}$, the total number of cells $N={F}_{1}(1+2\times {2}^{k}+3\times {3}^{k}+\mathrm{\dots})={F}_{1}{\sum}_{i=1}^{\mathrm{\infty}}{i}^{1k}$. This sum is convergent for $k>2$ and gives
for the powerlaw clonesize distribution, in which $\zeta $ is the Riemann zeta function.
We also studied repertoires with lognormal distributions of clonesizes by drawing from a normal distribution and raising 10 to the power of these numbers for clone sizes. For this we used varying µ and $\sigma =\mu /10$. These distributions yielded results that were qualitatively similar to those from the neutral model (not shown). For the simple mixture model (Figure 4E), we defined two populations of clones: (1) singletons (clones of just one cell that can only contribute to high TCRα or TCRβ abundances by sharing a chain with many other clones) and (2) large clones of equal size. We varied the fractions of both populations as well as the size of the large clones to find which fraction of the cells in the naive repertoire is expected to belong to large clones. A similar analysis, combining the aforementioned distribution following from the neutral model with a lognormal distribution for the population of large clones, produced very similar results (not shown).
In silico samples from modelled clonesize distributions
Request a detailed protocolTo compare the clonesize distributions with the HTS data of the blood samples, we generated TCRα and TCRβ repertoires using IGoR (Marcou et al., 2018). We generated 10^{8} TCRα and TCRβ sequences using IGoR’s default recombination model and parameters. We selected the rearrangements which CDR3 nucleotide sequence consisted of a multiple of 3 nucleotides (in frame) and did not contain inframe stop codons, in line with the inclusion criteria of productive rearrangements in our HTS samples (∼28%). Next, we calculated generation probabilities $\mathcal{P}(\sigma )$ for all these rearrangements. This may seem a detour, but this is needed as many different scenarios can lead to the same TCRα or TCRβ rearrangement.
Only a small percentage of thymocytes that undergo rearrangements in the thymus will eventually be exported as a naive T cell. This is due to outofframe rearrangements, but also as a result of both positive and negative selection. Moreover, the generation probability distributions of pre and postselection TCRα and TCRβ repertoires are markedly different (Elhanati et al., 2014). To account for these observations, we train a $\mathcal{P}(\sigma )$dependent selection model to account for the effects of thymic selection on our IGoRproduced TCRα and TCRβ sequences. Note that this selection method is based on single chains rather than on αβTCRs. This is because recombination of β and α chains occurs at different points in Tcell differentiation. The first step in selection, after formation of the β chain, is based on correct folding and expression, using a preα pseudochain for pairing. If the T cell survives this step, it undergoes multiple rounds of divisions, by which its β chain can pair with many different α chains. The second step is positive and negative selection based on MHCpeptide interactions, which is likely to operate on a joint αβ pair. It is unknown how much each of these two steps contributes to the overall selection process.
For TCRβ selection, we reason that selection on pairing with the invariant preα chain acts exclusively on the level of single β chains, and once the T cell survives this first step, it is expected to survive with at least one of the many α chains it can pair with during the second step. The absence of strong structural constraints on αβ pairing supports this idea (Tanno et al., 2020). Additionally, the large $\mathcal{P}(\sigma )$ shift between pre and postselection TCRβ repertoires is indicative of selection acting on the level of single β chains (i.e., the probability for a β chain to be selected is largely irrespective of the α chain). For α chains this shift is less pronounced, and a newly generated α chain only pairs with a single β chain. We therefore also tested the effect of an alternative selection model in which a given α chain survives selection with a given probability for repeated production events, reflecting different selection outcomes when pairing with different β chains. This approach decreases the average frequency of α chains in the postselection repertoire (since in this case they will on average survive only in a fraction of selection events, instead of our default allornothing model). This did not affect our results in a qualitative manner and we proceeded with selection on the level of single chains for both TCRα and TCRβ.
We use each of the HTS data sets from the single sample experiment (shown in Figure 1) to calculate the relative enrichment or depletion of 100 log10 $\mathcal{P}(\sigma )$ bins (ranging from −50 to 0) compared to 100 equally sized samples of the IGoR output, for TCRα and TCRβ separately. If the HTS data contained few rearrangements for a given bin, we joined adjacent bins in such a way that the binspecific selection factor was always based on at least 1% of the experimental observations (Figure 8). This approach yielded $\mathcal{P}(\sigma )$specific selection factors ${f}_{\mathcal{P}(\sigma )}$ ranging from 0.6 to 1.15 (i.e., our data suggests that sequences with a preferable $\mathcal{P}(\sigma )$ are about 2 times as likely to be selected as those in the least preferable $\mathcal{P}(\sigma )$ domain). We assumed an overall selection factor of 1/3, meaning that one out of 3 productive TCRα or TCRβ rearrangements would survive selection. We then allowed sequences to be part of the postselection repertoire with probability
and stored the outcome to make a consistent decision when multiple copies of the same TCRα or TCRβ sequence were present in the preselection repertoire. This approach yielded postselection repertoires with $\mathcal{P}(\sigma )$ distributions similar to the single sample HTS data. Other values for the overall selection probability, ranging from 1/10 to 1, were also tested, but yielded similar qualitative results (not shown).
We could have assigned all clones in the clonesize distribution an α and β chain with this approach. However, since only a very small part of the repertoire is sampled, we chose to only assign an identity to those clones present in the samples. Hence, we started with predicting the presence of all clones, as a function of their size, in each of the samples. The probability that a clone with $i$ cells is represented by at least one cell in a sample of $n$ cells from a pool of $N$ cells is
Given ${F}_{i}$, which is the number of clones in the pool with clone size $i$, the number of these clones present in the sample of $n$ cells can be approximately represented by a binomial random variable, ${X}_{i}\sim B(n={F}_{i},p={p}_{i})$. We evaluate this for the entire clonesize distribution $F$. $N$ and $F$ are known from the model but one cannot directly determine the number of sampled cells $n$. This is because individual cells may contribute multiple mRNA molecules and many cells may have been present in the FACSsorted sample without contributing mRNA to the eventual sequenced fraction. Therefore we learn the sample size by assigning α or β to sampled clones and choosing $n$ such that the predicted diversity (i.e., number of distinct chains) matches the experimental observations. We took the number of distinct TCRα or TCRβ sequences as lower bound for the sample size, since in this model individual cells are assumed to express one functional α or β chain. The total number of cells reported by the FACSsorter was used as upper bound. We also checked the implications of the observation that some T cells contain two functional α and/or β chains, but this did not qualitatively change our results (not shown).
Thus, we adjusted the generation probability distribution by training a $\mathcal{P}(\sigma )$dependent selection model on independent HTS data and based the sample size on the corresponding subsamples. Hence, the predicted individual subsamples reflect the experimental observations in terms of diversity and generation probabilities. We use the chains occurring in multiple samples (i.e., those with incidence 2 and 3) to assess the agreement between model predictions and the HTS data. We repeated the sampling process and assignment of α and β chains 10 times for each modelparameter combination to account for the stochastic nature of sampling and V(D)J recombination.
Data availability
Sequencing data have been deposited in the Short Read Archive as experiments SRP109035, SRP045430 and SRP151125.

SRP151125ID NCBI Sequence Read Archive. T cell receptor repertoire sequencing of healthy individuals.

NCBI Sequence Read ArchiveID SRP109035. TCR repertoire sequencing of T cell subsets from healthy individuals.

NCBI Sequence Read ArchiveID SRP045430. Tcell receptor repertoires in HIVinfected patients and healthy controls.

immuneACCESSImmunosequencing identifies signatures of cytomegalovirus exposure history and HLAmediated effects on the Tcell repertoire.https://doi.org/10.21417/B7001Z
References

T cell dynamics in HIV1 infectionAdvances in Immunology 73:256–304.https://doi.org/10.1016/s00652776(08)607890

T cell repertoires and competitive exclusionJournal of Theoretical Biology 169:375–390.https://doi.org/10.1006/jtbi.1994.1160

An invariant V alpha 24J alpha Q/V beta 11 T cell receptor is expressed in all individuals by clonally expanded CD48 T cellsThe Journal of Experimental Medicine 180:1171–1176.https://doi.org/10.1084/jem.180.3.1171

Modelling naive Tcell homeostasis: consequences of heritable cellular lifespan during ageingImmunology and Cell Biology 87:445–456.https://doi.org/10.1038/icb.2009.11

Genesis of the αβ Tcell receptorPLOS Computational Biology 15:e1006874.https://doi.org/10.1371/journal.pcbi.1006874

Predicting the spectrum of TCR repertoire sharing with a datadriven model of recombinationImmunological Reviews 284:167–179.https://doi.org/10.1111/imr.12665

Longlasting stem celllike memory CD8+ T cells with a naïvelike profile upon yellow fever vaccinationScience Translational Medicine 7:282ra48.https://doi.org/10.1126/scitranslmed.aaa3700

A human memory T cell subset with stem celllike propertiesNature Medicine 17:1290–1297.https://doi.org/10.1038/nm.2446

BookThe Unified Neutral Theory of Biodiversity and BiogeographyPrinceton University Press.https://doi.org/10.4249/scholarpedia.8822

Superior T memory stem cell persistence supports longlived T cell memoryJournal of Clinical Investigation 8:33.https://doi.org/10.1172/JCI66327

How many TCR clonotypes does a body maintain?Journal of Theoretical Biology 389:214–224.https://doi.org/10.1016/j.jtbi.2015.10.016

Highthroughput immune repertoire analysis with IGoRNature Communications 9:561.https://doi.org/10.1038/s4146701802832w

How many thymocytes audition for selection?The Journal of Experimental Medicine 186:1149–1158.https://doi.org/10.1084/jem.186.7.1149

BookQuantifying lymphocyte receptor diversityIn: Das J, Jayaprakash C, editors. Systems Immunology. CRC Press. pp. 183–198.

The many important facets of Tcell repertoire diversityNature Reviews Immunology 4:123–132.https://doi.org/10.1038/nri1292

Persisting fetal clonotypes influence the structure and overlap of adult human T cell receptor repertoiresPLOS Computational Biology 13:e1005572.https://doi.org/10.1371/journal.pcbi.1005572

Antigenloaded MR1 tetramers define T cell receptor heterogeneity in mucosalassociated invariant T cellsThe Journal of Experimental Medicine 210:2305–2320.https://doi.org/10.1084/jem.20130958

Overlap and effective size of the human CD8+ T cell receptor repertoireScience Translational Medicine 2:47ra64.https://doi.org/10.1126/scitranslmed.3001442

Why must T cells be crossreactive?Nature Reviews Immunology 12:669–677.https://doi.org/10.1038/nri3279

VDJdb: a curated database of Tcell receptor sequences with known antigen specificityNucleic Acids Research 46:D419–D427.https://doi.org/10.1093/nar/gkx760

The involution of the ageing human thymic epithelium is independent of puberty. A morphometric studyScandinavian Journal of Immunology 22:563–575.https://doi.org/10.1111/j.13653083.1985.tb01916.x

Stochastic niche structure and diversity maintenance in the T cell repertoireJournal of Theoretical Biology 255:237–249.https://doi.org/10.1016/j.jtbi.2008.07.017

Stochastic competitive exclusion in the maintenance of the naïve T cell repertoireJournal of Theoretical Biology 265:396–410.https://doi.org/10.1016/j.jtbi.2010.05.004

Naive T cell homeostasis: from awareness of space to a sense of placeNature Reviews Immunology 9:823–832.https://doi.org/10.1038/nri2657

Selected before selection: a case for inherent antigen Bias in the Tcell receptor repertoireCurrent Opinion in Systems Biology 18:36–43.https://doi.org/10.1016/j.coisb.2019.10.007

A mechanism for TCR sharing between T cell subsets and individuals revealed by pyrosequencingThe Journal of Immunology 186:4285–4294.https://doi.org/10.4049/jimmunol.1003898

Aging and Cytomegalovirus infection differentially and jointly affect distinct circulating T cell subsets in humansThe Journal of Immunology 192:2143–2155.https://doi.org/10.4049/jimmunol.1301721
Decision letter

Aleksandra M WalczakReviewing Editor; École Normale Supérieure, France

Satyajit RathSenior Editor; Indian Institute of Science Education and Research (IISER), India

Mikhail V PogorelyyReviewer; ShemyakinOvchinnikov Institute of Bioorganic Chemistry, Russian Federation
In the interests of transparency, eLife publishes the most substantive revision requests and the accompanying author responses.
Acceptance summary:
The paper addresses experimentally and theoretically the question of clone size distributions in Tcell repertoires. While we have known for a while that these distributions have long tails, experimental data has often not been extremely careful in correcting for biases and a careful measurement of different repertoire subsets is useful. Combining this with a model that tries to explain the origin of this distribution is an important element of the paper.
Decision letter after peer review:
Thank you for submitting your article "The naive Tcell receptor repertoire has an extremely broad distribution of clone sizes" for consideration by eLife. Your article has been reviewed by three peer reviewers, and the evaluation has been overseen by a Reviewing Editor and Satyajit Rath as the Senior Editor. The following individuals involved in review of your submission have agreed to reveal their identity: Mikhail V Pogorelyy (Reviewer #1).
The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission.
All reviewers agree that the topic of the paper is interesting and important, however the consensus is that in its current form the paper will not appeal to a broad readership, since it is not very conclusive. At the same time the expert reader would be happier seeing all the biases clearly quantified and error bars clearly presented and discussed. As a result, the reviewers are calling for a really major revision (they provide concrete suggestions) and only if they are convinced the message of the paper is clear (e.g. predict observed bulk naive repertoire clone size distributions from a model, discuss differences subsets) will they recommend publication.
In general, the reviewers (the full reviews are attached for clarity on the details) make comments on presentation (citing previous work, putting this work in context for the general reader), being very quantitative and clear about biases and the experimental approach (see comments by reviewer 1) and (comments by reviewer 1 and 3) about the lack of conclusive modeling. Currently the reviewers feel the conclusion of the paper is "the naive Tcell receptor repertoire has an extremely broad distribution of clone sizes" but "we do not have a model that explains it". The reviewers understand that coming up with a completely conclusive model and ruling out all others is probably impossible. However, a more thorough theoretical discussion is needed, considering other models and arriving at one that fits the data (even if it may not be the only one that fits the data).
Finally, the reviewers make concrete suggestions to focus the manuscript on the most interesting aspects of this study worth developing further: a more detailed quantitative analysis of the CD4/CD8 and TCRα/TCRβ repertoires, the reported differences between TCRα and TCRβ, and the differing role of generation probability.
Reviewer #1:
The manuscript by de Greef et al. presents a computational framework to infer the clone size distribution of naive T cells. The main conclusion from this work is that there are large clones in the naive repertoire, the emergence of which could not be explained by the neutral model.
I think both methodology and data are interesting, but additional analysis is needed to address possible concerns outlined below.
1) The first dataset authors use consists of TCRA and TCRB repertoires of FACSsorted naive and memory cells. As the authors note, FACS is not precise, and a result, naive subpopulation could be contaminated with abundant memory clonotypes. It would be useful to quantify the extent of this contamination. One way to visualize this is to make scatterplots of concentrations of each clonotype (both overlapping and nonoverlapping) in naive vs. memory repertoire. As a control, such plots could also be done for naive and memory subsets of different individuals: there should be little correlation between clone frequencies and small overlap, since there should be no contamination on FACS in this case.
2) The end of the subsection “Abundant TCR sequences are frequently shared between naive and memory populations, and are enriched for high VDJ recombination probabilities”:
"if the overlap were the result of contamination only, the P(σ) of the [overlapping] sequences would be expected to reflect those of the memory subsets. Since the overlap is markedly enriched for high generation probabilities, most of it cannot be caused by contamination"
However, green dashed line (which is P(σ) of overlapping sequences) on Figure 1A for β chains (especially for CD8 in volunteer 2) seems to be very close to red lines (probs for memory clonotypes), suggesting that most overlap observed in β repertoires is caused by the contamination between two cell samples on FACS. I suggest modifying the text to avoid contradiction. Also, I think it would be useful to add theoretical prediction for the distribution of generative probabilities of overlapping sequences between mem and naive. If both memory and naive seqs have the same distribution of probs P(σ), then for overlapping(=recombined twice) sequences it should be P(σ) squared, if there is no contamination on FACS. Another way to obtain a prediction for the probability of overlapping sequences in the absence of contamination during FACS is to plot P(σ) for naive sequences of volunteer 1 overlapping with memory of volunteer 2 (and vice versa).
3) In the Discussion section, authors suggest that abundant clones in naive compartment could correspond to naivelike antigenexperienced subpopulation. It might be interesting to look for TCR amino acid sequences of these clones in existing databases of antigenspecific TCRs, such as VDJdb and McPAS.
4) Authors made additional experiment with splitting naive cells into three parts before the mRNA extraction to avoid the potential noise introduced by variance in TCR expression by different cells. Is it possible to quantify, how much bias is removed by using this design? E.g., what happens if we computationally join these three independently sequenced replicates back together, and then randomly assign each of UMIs to one of 3 portions? Are there many clones with elevated TCR transcription levels and thus inflated counts in bulk repertoires? Is this variance in TCR expression different for α and β chains? I think answers for these questions would be interesting for many groups sequencing TCR repertoires with RNAbased technology.
5) Authors fit parameters for different clone size distributions using this additional experiment with splitting naive repertoire into three parts. It would be interesting to compare resulting best fit prediction for clone size distributions to clone size distributions observed in bulk naive repertories of two volunteers, which are analyzed in the first part of the paper (e.g. rankfrequency scatterplots for model vs. data).
6) Another potential explanation for large naive clonotypes may be in the early repertoire development: it is known that TdT is not working, and thus first Tcells lack Ninsertions. Previously our group have shown (see Figure 3 in Pogorelyy et al., 2017), that most abundant TCRβ from naive and cord blood (but not memory) repertoires are enriched with zero insertion clonotypes. It would be interesting to see, if abundant naive clonotypes observed in this study are also enriched with zeroinsertion clonotypes (e.g. Figure 1B analog with some estimate of total number of Ninsertions in each bin on yaxis).
Reviewer #2:
In this manuscript de Greef et al. study the clone size distribution in human naive Tcell receptor repertoires. They do it by a combination of data analysis of TCRA and TCRB sequences and computational modelling. After using FACS to sort the cells they utilized an UMI approach to sequence α and β chains of TCRs from naive and memory compartments of CD4 and CD8 cells. They first make a series of very interesting observations about the differences and commonalities between the generation probabilities (calculated by IGoR) in each cell type (CD4/CD8; naive/memory).
To explain these observations they adapt an ecological neutral model to T cell repertoires. By using this model they come to a surprising conclusion that the clone size distribution in naive T cells is not consistent with a neutral model and nor with a power law distribution. Instead they find that a subset of very large clones in the naive repertoire is essential to explain the data. They discuss an experimental potential artifact that could cause these observations and reason that this is likely not the case. Hence, their final conclusion and strong result is that naive T cell repertoires include a subpopulation of very large clones. This conclusion opens a door to an undiscovered biology that determines the construction and dynamics of naive T cell repertoires.
I believe that this manuscript is of high interest to the whole immunology field.
Reviewer #3:
This work investigates potential determinants of the distributions of naïve CD4 and CD8 T cell clone sizes using computational analysis and mathematical modelling of highthroughput TCRα and TCRβ sequence data.
There are several major concerns about this manuscript:
1) Although this work improves on previous studies by considering both TCRα and TCRβ sequences in naive CD4 and CD8 T cell populations and using potentially more accurate quantification of clone sizes (using UMIs), the main conclusion of the manuscript, that the naïve TCR repertoire has a broad distribution of clone sizes, is not substantially novel. Heterogeneous clone sizes in naïve T cell repertoires have been reported, and investigated, in many previous studies (e.g. Robins et al., 2009, Quigley et al., 2010, Venturi et al., 2011, Qi et al., 2014, Pogorelyy et al., 2017). Previous studies have also reported substantial overlap of TCR sequences between naïve and memory CD8^{+} T cell compartments that is enriched for high abundance TCR sequences that have limited numbers of nadditions (e.g. Robins et al., 2010, Venturi et al., 2011). Furthermore, many studies have previously investigated the associations between V(D)J recombination / TCR production probability, naïve TCR clone size and interindividual sharing of TCRs (e.g. Robins et al., 2009, Robins et al., 2010, Quigley et al., 2010, Venturi et al., 2011, Li PNAS 2014, Pogorelyy et al., 2017). Importantly, this manuscript has not acknowledged this substantial body of previously published and highly relevant research.
2) While the various models of the naïve T cell clone sizes may be novel, they did not provide sufficient new insights into the primary mechanisms driving the naïve T cell distribution. This was largely due to the fact that no one model considered by the authors could fully explain the observed TCR clone size distributions. One possible reason for this is the growing evidence for developmental and agelinked heterogeneity in naïve T cell populations (e.g. Hogan et al., 2015, Rane et al., 2018, Reynaldi et al., 2019). Although the authors show model results for a range of parameters, this analysis does not account for the potential impact on the adult T cell repertoire of changes in naïve T cell dynamics over the lifespan of an individual. For example, it has been recently suggested that high abundance zero naddition TCRs in the adult naïve repertoire have survived since early development and their high abundance is due to different homeostatic pressures in the peripheral repertoire during early development (Pogorelyy et al., 2017). This previous relevant research has not been considered in this manuscript.
3) A potentially interesting conclusion in this study is the limited association between TCRβ abundance, TCRβ production probability, and TCRβ sharing. This result is not consistent with the findings of many previous studies focused on TCRβ repertoires (listed above for concern #1). However, this discrepancy with previous studies was not discussed in the manuscript. Moreover, the authors have not undertaken further investigation to determine the robustness of this result to various parameters/assumptions in the computational analysis or potential explanations for this discrepancy.
https://doi.org/10.7554/eLife.49900.sa1Author response
The paper addresses experimentally and theoretically the question of clone size distributions in Tcell repertoires. While we have known for a while that these distributions have long tails, experimental data has often not been extremely careful in correcting for biases and a careful measurement of different repertoire subsets is useful. Combining this with a model that tries to explain the origin of this distribution is an important element of the paper.
All reviewers agree that the topic of the paper is interesting and important, however the consensus is that in its current form the paper will not appeal to a broad readership, since it is not very conclusive. At the same time the expert reader would be happier seeing all the biases clearly quantified and error bars clearly presented and discussed. As a result, the reviewers are calling for a really major revision (they provide concrete suggestions) and only if they are convinced the message of the paper is clear (e.g. predict observed bulk naive repertoire clone size distributions from a model, discuss differences subsets) will they recommend publication.
In general, the reviewers (the full reviews are attached for clarity on the details) make comments on presentation (citing previous work, putting this work in context for the general reader), being very quantitative and clear about biases and the experimental approach (see comments by Reviewer 1) and (comments by reviewer 1 and 3) about the lack of conclusive modeling. Currently the reviewers feel the conclusion of the paper is "the naive Tcell receptor repertoire has an extremely broad distribution of clone sizes" but "we do not have a model that explains it". The reviewers understand that coming up with a completely conclusive model and ruling out all others is probably impossible. However, a more thorough theoretical discussion is needed, considering other models and arriving at one that fits the data (even if it may not be the only one that fits the data).
Finally, the reviewers make concrete suggestions to focus the manuscript on the most interesting aspects of this study worth developing further: a more detailed quantitative analysis of the CD4/CD8 and TCRα/TCRβ repertoires, the reported differences between TCRα and TCRβ, and the differing role of generation probability.
We carefully read your input and value your effort to make excellent suggestions for additional analyses and interpretation, and for providing a more complete context of previous studies.
We have carried out a major revision of the text, and have added a number of additional analyses to address the comments of the editors and reviewers. Our pointbypoint response to the reviewers’ comments is detailed below. As you point out, while analysis of large clones has been performed previously, and various hypotheses for their origin have been advanced (we have added a much more complete review of this literature to the Introduction), careful comparison of different possible models has often been lacking, and analysis of the experimental data has sometimes been insufficiently rigorous in considering potential technical confounders.
We believe that our study has reexamined this question, and both by thorough analysis of new sets of experimental data and by comparison to mathematical models, has identified several different mechanisms which may give rise to observed abundant TCR chains in the naïve pool. Critically, we have been able to identify evidence for several different underlying mechanisms which may contribute to the observed data on TCR abundance, and distinguish carefully and quantitatively between experimental bias (e.g. FACS sorter contamination and repeated sampling of RNA from the same cell), and bona fide examples of multiple copies of the same TCR chain from different cells. We have also been able to examine the relative contribution of different generation probabilities, and large clones to the observation of abundant TCR sequences. We clearly show that the former play a major role for abundant α sequences, and the latter plays the major role for the smaller number of abundant β sequences. We show that our data are compatible with an underlying power law distribution of clone sizes, or a more general mixture model of rare and abundant clones. Finally, we provide evidence that abundant TCRs may arise both from early expansion of T cells during the initial formation of the repertoire (indicated by enrichment for TCR sequences without insertions), and by antigen experienced T cells with naïve phenotype (indicated by enrichment in TCR chains with known antigenspecific annotations).
Taken together, we think that this represents the most comprehensive analysis of observed abundant TCR chains in the naïve repertoire of humans, and as such will be of broad interest to many readers of this journal.
Reviewer #1:
The manuscript by de Greef et al. presents a computational framework to infer the clone size distribution of naive T cells. The main conclusion from this work is that there are large clones in the naive repertoire, the emergence of which could not be explained by the neutral model.
I think both methodology and data are interesting, but additional analysis is needed to address possible concerns outlined below.
1) The first dataset authors use consists of TCRA and TCRB repertoires of FACSsorted naive and memory cells. As the authors note, FACS is not precise, and a result, naive subpopulation could be contaminated with abundant memory clonotypes. It would be useful to quantify the extent of this contamination. One way to visualize this is to make scatterplots of concentrations of each clonotype (both overlapping and nonoverlapping) in naive vs. memory repertoire. As a control, such plots could also be done for naive and memory subsets of different individuals: there should be little correlation between clone frequencies and small overlap, since there should be no contamination on FACS in this case.
We have plotted the count in memory versus naïve as the reviewer suggested. The CD4 and CD8 data show clear differences, which reflect the different TCR abundance distributions of the two populations. The CD4 memory contain very few TCR sequences present more than 1000 times and show little evidence of contamination. The CD8 memory contain quite a large number of TCR sequences present more than 1000 times; for these larger clones, there is clear evidence of contamination, as shown by a linear dependence between the count in naïve versus the count in memory. Based on this data, we estimate a rate of contamination of 0.11.5% from sorting for CD8 cells, and <0.05% for CD4 cells. In order to mitigate against a possible confounding due to this sorting, we have repeated all the analysis in the paper, using “cleaned” data in which all TCR sequences found in memory and naïve are removed. This dual analysis is now incorporated in several figures. The additional data processing step did not alter any of the major conclusions of the study. The plots discussed above, and a detailed consideration of potential sort contamination have been added to the first section of the Results (Figure 1A and B).
2) The end of subsection “Abundant TCR sequences are frequently shared between naive and memory populations, and are enriched for high VDJ recombination probabilities”:
"if the overlap were the result of contamination only, the P(σ) of the [overlapping] sequences would be expected to reflect those of the memory subsets. Since the overlap is markedly enriched for high generation probabilities, most of it cannot be caused by contamination"
However, green dashed line (which is P(σ) of overlapping sequences) on Figure 1A for β chains (especially for CD8 in volunteer 2) seems to be very close to red lines (probs for memory clonotypes), suggesting that most overlap observed in β repertoires is caused by the contamination between two cell samples on FACS. I suggest modifying the text to avoid contradiction. Also, I think it would be useful to add theoretical prediction for the distribution of generative probabilities of overlapping sequences between mem and naive. If both memory and naive seqs have the same distribution of probs P(σ), then for overlapping(=recombined twice) sequences it should be P(σ) squared, if there is no contamination on FACS. Another way to obtain a prediction for the probability of overlapping sequences in the absence of contamination during FACS is to plot P(σ) for naive sequences of volunteer 1 overlapping with memory of volunteer 2 (and vice versa).
We agree with the reviewer that the original text was confusing, and have rewritten it to clarify these points. The difference between α and β chains in this respect is one of the key messages of the paper, and needs to be crystal clear. We think the confusion has been caused by not clearly distinguishing between the generative probability of a TCR sequence, and the probability of observing a TCR sequence once, twice or more in a sample. As the reviewer correctly points out, the P(σ) distribution of the overlapping β chains is the same as that of the memory population. This does not, however imply they are derived from each other, but only that the overlapping β chains do not have unusually high P(σ), and thus this factor is unlikely to contribute to overlap. In the case of α chains, the overlapping sequences do have a larger P(σ), and so this factor needs to be considered in deciding on the mechanism of the overlap.
In terms of the theoretical distribution the reviewer discusses, the same distinction applies. The probability that a TCR sequence is actually found in two samples (i.e. overlap) depends on its frequency in the underlying population, which is dependent in turn on P(σ) among other factors. As an example, if a TCR sequence has a P(σ) of 10^6 then there will be on average 1 such TCR sequence in every 10^6 cells (ignoring selection, etc.); the probability of seeing it twice in such a sample is given by the binomial probability, with a chance of success p = P(σ). As an illustration, the probability of seeing a TCR sequence with P(σ) 10^6 in a sample of 10^6 cells at least twice is 0.08; the corresponding probability of seeing a TCR sequence with P(σ) 10^10 is 10^13 (essentially zero). Thus TCR α chains could occasionally theoretically turn up in memory and naïve repertoires independently, simply because of their high P(σ); this is much less likely for β chains.
We have substantially rewritten the manuscript to address this fundamental point, and have included the additional plot that the reviewer suggests of naïve from one individual overlapping with memory of another (Figure 1C).
3) In the Discussion section, authors suggest that abundant clones in naive compartment could correspond to naivelike antigenexperienced subpopulation. It might be interesting to look for TCR amino acid sequences of these clones in existing databases of antigenspecific TCRs, such as VDJdb and McPAS.
We thank the reviewer for this interesting suggestion, and matched the data from the subsampling experiment with the VDJdb database (see Figure 5D). Remarkably, and as implied by the reviewer, we do in fact find a considerable enrichment of annotated TCR sequences in the overlapping population. This is true both in the total repertoires, and in the repertoires “cleaned” of all sequences which are also found in memory. As such an enrichment was to be expected, this only provides some additional evidence that the abundant clones may be enriched for naïvelike antigenexperienced subpopulation, although we discuss alternative interpretations of this result. These results have been added to the final section of the Results section.
4) Authors made additional experiment with splitting naive cells into three parts before the mRNA extraction to avoid the potential noise introduced by variance in TCR expression by different cells. Is it possible to quantify, how much bias is removed by using this design? E.g., what happens if we computationally join these three independently sequenced replicates back together, and then randomly assign each of UMIs to one of 3 portions? Are there many clones with elevated TCR transcription levels and thus inflated counts in bulk repertoires? Is this variance in TCR expression different for α and β chains? I think answers for these questions would be interesting for many groups sequencing TCR repertoires with RNAbased technology.
This is an excellent suggestion, and we performed the requested permutation 10 times. We find that the true number of sequences found in multiple samples is much lower than after redistributing the samples (Figure 2—figure supplement 1). We quantify the difference in estimated number of abundant chains in Author response table 1. Interestingly, we observe that the differences occur for both TCRα and TCRβ data, but the effect is much larger for β chains. We estimate that over 75% of duplets within a sample are likely to be due to sampling multiple TCRβ RNA molecules from one cell, and about 25% for TCRα.
This new finding is consistent with our previous publication where we document that T cells on average contain in the order of 300 β RNA molecules, and 100 α RNA molecules per cell (Oakes et al., 2017). Moreover, since the efficiency of our ligation protocol is in the order of 15%, repeat sampling of RNA from the same cell is quite likely. As the reviewer points out this highlights the importance of our additional step of taking a single blood sample, dividing it into three and then analyzing repertoire on all three subsamples separately. We thank the reviewer for this excellent suggestion and have added the permutation analysis results in the Materials and methods section (subsection “Subsampling to exclude inflated abundance through multiple RNA contributions by single cells” and Figure 2—figure supplement 1) and discussed the issue in detail in the main Results section.
5) Authors fit parameters for different clone size distributions using this additional experiment with splitting naive repertoire into three parts. It would be interesting to compare resulting best fit prediction for clone size distributions to clone size distributions observed in bulk naive repertories of two volunteers, which are analyzed in the first part of the paper (e.g. rankfrequency scatterplots for model vs. data).
We agree with the reviewer that this is an interesting analysis, and we have added it to the manuscript (Figure 4—figure supplement 1). As expected, the predictions underestimate the number of doublets, especially from β chains, because of the impact of sampling multiple RNAs from the same cell as discussed in detail above (point 4). Apart from this, the model fits well to the “uncleaned” data in most cases, but the cleaned data underrepresents abundant clones, suggesting that the clones shared between memory and naïve represent genuine abundant naïve clones. The only exception is for CD4 β, where we observe fewer large clones than the model predicts.
6) Another potential explanation for large naive clonotypes may be in the early repertoire development: it is known that TdT is not working, and thus first Tcells lack Ninsertions. Previously our group have shown (see Figure 3 in Pogorelyy et al., 2017), that most abundant TCRβ from naive and cord blood (but not memory) repertoires are enriched with zero insertion clonotypes. It would be interesting to see, if abundant naive clonotypes observed in this study are also enriched with zeroinsertion clonotypes (e.g. Figure 1B analog with some estimate of total number of Ninsertions in each bin on yaxis).
We thank the reviewer for pointing out this important alternative explanation, performed the analysis, and added the results to the last part of the Results section. Interestingly, the more abundant TCRα sequences are indeed strongly enriched for zero insertion rearrangements, as the reviewer suggested. For TCRβ, this effect is smaller, but we still observe that abundant β chains (measured as incidence > 1) are enriched for zero insertions. So, these observations are consistent with a role for large clones that are created at or just before birth. This new and interesting observation is now included and discussed in the manuscript in Figure 5A, Figure 1—figure supplement 1 and Figure 5—figure supplement 5A.
Reviewer #3:
This work investigates potential determinants of the distributions of naïve CD4 and CD8 T cell clone sizes using computational analysis and mathematical modelling of highthroughput TCRα and TCRβ sequence data.
There are several major concerns about this manuscript:
1) Although this work improves on previous studies by considering both TCRα and TCRβ sequences in naive CD4 and CD8 T cell populations and using potentially more accurate quantification of clone sizes (using UMIs), the main conclusion of the manuscript, that the naïve TCR repertoire has a broad distribution of clone sizes, is not substantially novel. Heterogeneous clone sizes in naïve T cell repertoires have been reported, and investigated, in many previous studies (e.g. Robins et al., 2009, Quigley et al., 2010, Venturi et al., 2011, Qi et al., 2014, Pogorelyy et al., 2017). Previous studies have also reported substantial overlap of TCR sequences between naïve and memory CD8^{+} T cell compartments that is enriched for high abundance TCR sequences that have limited numbers of nadditions (e.g. Robins et al., 2010, Venturi et al., 2011). Furthermore, many studies have previously investigated the associations between V(D)J recombination / TCR production probability, naïve TCR clone size and interindividual sharing of TCRs (e.g. Robins et al., 2009, Robins et al., 2010, Quigley et al., 2010, Venturi et al., 2011, Li PNAS 2014, Pogorelyy et al., 2017). Importantly, this manuscript has not acknowledged this substantial body of previously published and highly relevant research.
We thank the reviewer for pointing out this body of literature and apologize for not referring to these publications. We now include a more thorough review of the previous literature in the Introduction.
2) While the various models of the naïve T cell clone sizes may be novel, they did not provide sufficient new insights into the primary mechanisms driving the naïve T cell distribution. This was largely due to the fact that no one model considered by the authors could fully explain the observed TCR clone size distributions. One possible reason for this is the growing evidence for developmental and agelinked heterogeneity in naïve T cell populations (e.g. Hogan et al., 2015, Rane et al., 2018, Reynaldi et al., 2019). Although the authors show model results for a range of parameters, this analysis does not account for the potential impact on the adult T cell repertoire of changes in naïve T cell dynamics over the lifespan of an individual. For example, it has been recently suggested that high abundance zero naddition TCRs in the adult naïve repertoire have survived since early development and their high abundance is due to different homeostatic pressures in the peripheral repertoire during early development (Pogorelyy et al., 2017). This previous relevant research has not been considered in this manuscript.
We completely agree with the reviewer (and with the very similar point made by reviewer 1). We have now added the analysis of zero nadditions as described above, and acknowledge the previous literature.
3) A potentially interesting conclusion in this study is the limited association between TCRβ abundance, TCRβ production probability, and TCRβ sharing. This result is not consistent with the findings of many previous studies focused on TCRβ repertoires (listed above for concern #1). However, this discrepancy with previous studies was not discussed in the manuscript. Moreover, the authors have not undertaken further investigation to determine the robustness of this result to various parameters/assumptions in the computational analysis or potential explanations for this discrepancy.
Previous studies showed that production probability of TCRα and TCRβ play a role in predicting abundance and sharing. Using IGoR and incidence rather than abundance, we confirm this for the TCRα sequences, and we suggest that this plays a much smaller role for TCRβ. The latter is the discrepancy that the reviewer is correctly pointing out.
We agree with previous results, as in our data, the TCRβ with incidence > 1 are also more public than TCRβ with incidence 1 (Figure 2—figure supplement 2) and the TCRβ with incidence 2 are enriched for high P(σ) (Figure 2B). At the same time, this confirmation shows that our results do not depend on our computational analysis. Moreover, the results are qualitatively similar when using two different pipelines (Decombinator and RTCR) and with and without cleaning of naïve sequences that overlap with memory samples. Our new finding is that in the naïve repertoire, there is a small population of very large clones (as measured with incidence 3), that are not explained by high generation probability. In line with this, these TCRβ sequences tend to be not public. Thus, there is no discrepancy, but only a new insight on very large naïve clones. We discuss this in detail in the manuscript.
https://doi.org/10.7554/eLife.49900.sa2Article and author information
Author details
Funding
Nederlandse Organisatie voor Wetenschappelijk Onderzoek (Graduate Program 22.005.023)
 Peter C de Greef
Netherlands Genomics Initiative (FES0908)
 Bram Gerritsen
Unilever
 Theres Oakes
 Benjamin Chain
National Institute for Health Research UCL Hospitals Biomedical Research
 Benjamin Chain
Medical Research Council (Studentship)
 James M Heather
The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.
Acknowledgements
We thank Laurens Krah for mathematical advice and helpful discussions. This work was supported by The Netherlands Organization for Scientific Research (NWO) Graduate Program 22.005.023 (to PCdG), the VIRGO consortium, which is funded by the Netherlands Genomics Initiative and by the Dutch government (FES0908) (to BG), by a grant to BC from Unilever PLC and supported by the National Institute for Health Research UCL Hospitals Biomedical Research. JH was supported by an MRC studentship.
Ethics
Human subjects: The study was carried out in accordance with the recommendations of the UK Research Ethics Committee with written informed consent of all subjects. All subjects gave written informed consent in accordance with the Declaration of Helsinki. The protocol was approved by the University College London Hospital Ethics Committee 06/Q0502/92.
Senior Editor
 Satyajit Rath, Indian Institute of Science Education and Research (IISER), India
Reviewing Editor
 Aleksandra M Walczak, École Normale Supérieure, France
Reviewer
 Mikhail V Pogorelyy, ShemyakinOvchinnikov Institute of Bioorganic Chemistry, Russian Federation
Publication history
 Received: July 3, 2019
 Accepted: March 3, 2020
 Version of Record published: March 18, 2020 (version 1)
Copyright
© 2020, de Greef et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.
Metrics

 7,529
 Page views

 574
 Downloads

 22
 Citations
Article citation count generated by polling the highest count across the following sources: PubMed Central, Crossref, Scopus.
Download links
Downloads (link to download the article as PDF)
Open citations (links to open the citations from this article in various online reference manager services)
Further reading

 Cancer Biology
 Computational and Systems Biology
Late advances in genome sequencing expanded the space of known cancer driver genes severalfold. However, most of this surge was based on computational analysis of somatic mutation frequencies and/or their impact on the protein function. On the contrary, experimental research necessarily accounted for functional context of mutations interacting with other genes and conferring cancer phenotypes. Eventually, just such results become 'hard currency' of cancer biology. The new method, NEAdriver employs knowledge accumulated thus far in the form of global interaction network and functionally annotated pathways in order to recover known and predict novel driver genes. The driver discovery was individualized by accounting for mutations' cooccurrence in each tumour genome  as an alternative to summarizing information over the whole cancer patient cohorts. For each somatic genome change, probabilistic estimates from two lanes of network analysis were combined into joint likelihoods of being a driver. Thus, ability to detect previously unnoticed candidate driver events emerged from combining individual genomic context with network perspective. The procedure was applied to ten largest cancer cohorts followed by evaluating error rates against previous cancer gene sets. The discovered driver combinations were shown to be informative on cancer outcome. This revealed driver genes with individually sparse mutation patterns that would not be detectable by other computational methods and related to cancer biology domains poorly covered by previous analyses. In particular, recurrent mutations of collagen, laminin, and integrin genes were observed in the adenocarcinoma and glioblastoma cancers. Considering constellation patterns of candidate drivers in individual cancer genomes opens a novel avenue for personalized cancer medicine.

 Computational and Systems Biology
 Evolutionary Biology
Studies of protein fitness landscapes reveal biophysical constraints guiding protein evolution and empower prediction of functional proteins. However, generalisation of these findings is limited due to scarceness of systematic data on fitness landscapes of proteins with a defined evolutionary relationship. We characterized the fitness peaks of four orthologous fluorescent proteins with a broad range of sequence divergence. While two of the four studied fitness peaks were sharp, the other two were considerably flatter, being almost entirely free of epistatic interactions. Mutationally robust proteins, characterized by a flat fitness peak, were not optimal templates for machinelearningdriven protein design – instead, predictions were more accurate for fragile proteins with epistatic landscapes. Our work paves insights for practical application of fitness landscape heterogeneity in protein engineering.