Statistical inference reveals the role of length, GC content, and local sequence in V(D)J nucleotide trimming
Abstract
To appropriately defend against a wide array of pathogens, humans somatically generate highly diverse repertoires of B cell and T cell receptors (BCRs and TCRs) through a random process called V(D)J recombination. Receptor diversity is achieved during this process through both the combinatorial assembly of V(D)Jgenes and the junctional deletion and insertion of nucleotides. While the Artemis protein is often regarded as the main nuclease involved in V(D)J recombination, the exact mechanism of nucleotide trimming is not understood. Using a previously published TCRβ repertoire sequencing data set, we have designed a flexible probabilistic model of nucleotide trimming that allows us to explore various mechanistically interpretable sequencelevel features. We show that local sequence context, length, and GC nucleotide content in both directions of the wider sequence, together, can most accurately predict the trimming probabilities of a given Vgene sequence. Because GC nucleotide content is predictive of sequencebreathing, this model provides quantitative statistical evidence regarding the extent to which doublestranded DNA may need to be able to breathe for trimming to occur. We also see evidence of a sequence motif that appears to get preferentially trimmed, independent of GCcontentrelated effects. Further, we find that the inferred coefficients from this model provide accurate prediction for V and Jgene sequences from other adaptive immune receptor loci. These results refine our understanding of how the Artemis nuclease may function to trim nucleotides during V(D)J recombination and provide another step toward understanding how V(D)J recombination generates diverse receptors and supports a powerful, unique immune response in healthy humans.
Editor's evaluation
Russell et al. study and reveal compelling evidence for potential sequencebased factors that may drive VDJ trimming, a mechanism involved in VDJ recombination that shapes adaptive immune repertoire generation. The work is based on a rigorous statistical comparison of logistic regression models to reveal the role and function of cutting enzymes in shaping T and Bcell receptor diversity which could provide fundamental new insights into these processes.
https://doi.org/10.7554/eLife.85145.sa0Introduction
Cells throughout the body regularly present protein fragments, known as antigens, on cellsurface molecules called major histocompatibility complex (MHC). Receptors on the surface of T cells can bind to these MHCbound antigens, recognize them, and, if necessary, initiate an immune response. For an individual to be capable of defending against a wide array of potential foreign pathogens, they somatically generate a massive diversity of T cell receptors (TCRs) through a random process called V(D)J recombination. After generation, TCRs undergo a selection process to ensure proper expression, MHC recognition, and limited autoreactivity. The collection of TCRs in an individual comprises their TCR repertoire.
The majority of human T cells express αβ receptors that consist of an α and a β protein chain. During the V(D)J recombination process of the β chain, a single V, D, and Jgene are randomly chosen from a pool of Vgene, Dgene, and Jgene segments within the germline TCRβ locus over a series of steps. To begin this process, the recombination activating gene protein complex aligns a randomly chosen D and Jgene, removes the intervening chromosomal DNA between the two genes, and forms a hairpin loop at the end of each gene (Gellert, 1994; Fugmann et al., 2000; Schatz and Swanson, 2011). Each hairpin loop is then nicked open, typically in an asymmetrical fashion, by the Artemis:DNAPKcs protein complex (Ma et al., 2002; Lu et al., 2007). This asymmetrical hairpin opening creates a singlestranded DNA overhang at the end of both genes that, depending on the location of the hairpin nick, may contain Pnucleotides (short palindromes of gene terminal sequence) (Gauss and Lieber, 1996; Nadel and Feeney, 1997; Ma et al., 2002; Jackson et al., 2004; Lu et al., 2007). The most dominant hairpin opening position leads to a singlestranded 3’ overhang that is 4 nucleotides in length (2 nucleotides of which are Pnucleotides) (Lu et al., 2007). From here, nucleotides may be deleted from each gene end through an incompletely understood mechanism suggested to involve Artemis (Feeney et al., 1994; Nadel and Feeney, 1995; Nadel and Feeney, 1997; Jackson et al., 2004; Gu et al., 2010; Chang et al., 2015; Chang and Lieber, 2016; Zhao et al., 2020; Russell et al., 2022b). This nucleotide trimming can remove traces of Pnucleotides (Gauss and Lieber, 1996; Srivastava and Robins, 2012). Nontemplateencoded nucleotides, known as Ninsertions, can also be added to each gene end by the enzyme terminal deoxynucleotidyl transferase (Kallenbach et al., 1992; Gilfillan et al., 1993; Komori et al., 1993). Once the nucleotide addition and deletion steps are completed, the gene segments are paired and ligated together (Zhao et al., 2020). From here, the process is repeated between a random Vgene and this combined DJ junction to complete the TCRβ chain. A similar TCR chain recombination then proceeds, though without a Dgene, to complete the αβ TCR. Other adaptive immune receptor loci, such as TRG, TRD, and all IG loci, also undergo V(D)J recombination during the development of γδ T cells and B cells, respectively.
Junctional diversity created by the deletion and nontemplated insertion of nucleotides during V(D)J recombination contributes substantially to the resulting diversity of the TCR repertoire. Small variations in gene sequence have been shown to lead to large changes in the extent of nucleotide deletion (Nadel and Feeney, 1995; Gauss and Lieber, 1996; Nadel and Feeney, 1997; Jackson et al., 2004). For example, sequences with high AT content suffer greater nucleotide loss than sequences with high GC content (Gauss and Lieber, 1996). These findings are suggestive of a nuclease that either binds an ATrich sequence motif or requires an ATspecific structure (e.g. a sequence that breathes, Tsai et al., 2009), however, further work is required to quantify this mechanistic preference.
The Artemis protein is often regarded as the main nuclease involved in V(D)J recombination (Chang et al., 2015; Chang and Lieber, 2016; Zhao et al., 2020). Artemis is a member of the metallo$\beta $lactamase family of nucleases (Moshous et al., 2001) and is widely regarded as a structurespecific nuclease as opposed to a nuclease that binds specific DNA sequences (Ma et al., 2005; Chang et al., 2015; Chang and Lieber, 2016; Yosaatmadja et al., 2021). Members of this family are characterized by their conserved metalloβlactamase and βCASP domains and their ability to nick DNA or RNA in various configurations (Dominski, 2007; Pettinati et al., 2016). Alone, Artemis possesses intrinsic 5’to3’ exonuclease activity on singlestranded DNA (Li et al., 2014). On doublestranded DNA, Artemis, in complex with DNAPKcs, has endonuclease activity on 5’ and 3’ DNA overhangs and on DNA hairpins (Ma et al., 2002; Lu et al., 2007; Lu et al., 2008). It has been proposed that the Artemis:DNAPKcs complex binds singlestrandedtodoublestranded DNA boundaries prior to nicking (Ma et al., 2002; Ma et al., 2005; Lu et al., 2007; Chang and Lieber, 2016); for blunt DNA ends, previous work has concluded that sequencebreathing is required to achieve this structural configuration prior to Artemis action (Chang et al., 2015). Further, Artemis, in complex with XRCC4DNA ligase IV, has additional endonuclease activity on 3’ DNA overhangs and preferentially nicks one nucleotide at a time from the singlestranded 3’ end (Chang et al., 2016; Gerodimos et al., 2017). Despite these diverse nucleolytic functions, the extent of involvement and exact mechanism of action for the Artemis protein during the nucleotide trimming step of V(D)J recombination, and how it relates to observed sequencedependent changes in trimming (Nadel and Feeney, 1995; Gauss and Lieber, 1996; Nadel and Feeney, 1997; Jackson et al., 2004), has yet to be fully understood.
While molecular experiments using model organisms have been essential for establishing the current mechanistic understanding of the nucleotide trimming process, studies in humans have been limited. Statistical inference on highthroughput repertoire sequencing data sets allows for exploration of the in vivo V(D)J recombination mechanism outside of model organisms. In particular, analysis of trimming in highthroughput data sets should lead to insights about the natural underlying process, in the same way that analysis of large data sets has led to insight into the process of somatic hypermutation. There, researchers have found quite significant connections between local sequence identity and mutation patterns, leading to a rich literature (Rogozin and Kolchanov, 1992; DunnWalters et al., 1998; Cohen et al., 2011; Yaari et al., 2013; Elhanati et al., 2015; Wei et al., 2015; Cui et al., 2016; Feng et al., 2019; Spisak et al., 2020).
In contrast, we are only aware of one statistical analysis connecting sequence identity to trimming lengths (Murugan et al., 2012). This one existing analysis (Murugan et al., 2012) has shown that a simple positionweightmatrix style (PWM) model does a surprisingly good job of predicting the distribution of trimming lengths for a variety of Vgenes. However, while this trimming model has good model fit and predictive accuracy, it is limited by the assumption that the trimming mechanism relies solely on a sequence motif and, as such, is not designed in a way that allows us to explore alternative hypotheses.
In this paper, we explore the sequencelevel determinants of nucleotide trimming during V(D)J recombination using statistical inference on highthroughput TCRβ repertoire sequencing data (Emerson et al., 2017). With the goal of informing our mechanistic understanding in a quantitative way, we have designed a flexible probabilistic model of nucleotide trimming that allows us to explore various sequencelevel features. Our results show that trimming probabilities are highest for DNA positions near the end of the sequence that contain high GC content upstream, quantifying the role of sequencebreathing dynamics in the trimming process. We also see evidence of a sequence motif that appears to get preferentially trimmed, independent of possible sequencebreathing effects. As such, we can predict trimming probabilities most accurately using a model that includes features for local sequence context, length, and GC nucleotide content in both directions of the wider sequence. We show that this model has high predictive accuracy for V and Jgene sequences from an independent TCRβsequencing data set, and also extends well to TCRα, TCRγ, and IGH sequences. Further, we demonstrate that genetic variations within the gene encoding the Artemis protein that were previously identified as being associated with increasing the extent of trimming (Russell et al., 2022b) are also associated with changes in several model coefficients.
Results
Training data description
We worked with TCRβimmunosequencing data representing 666 individuals (Emerson et al., 2017). V(D)J recombination scenarios were assigned to each sequence from each individual using the IGoR software which is designed to learn unbiased V(D)J recombination statistics from immune sequence reads (Marcou et al., 2018). Using these V(D)J recombination statistics, IGoR output a list of potential recombination scenarios with their corresponding likelihoods for each TCRβchain sequence in the training data set. We annotated each sequence with a single V(D)J recombination scenario by sampling from these potential scenarios according to the posterior probability of each scenario (see Materials and methods for further details).
Annotated TCR sequences can be separated into two categories: ‘productive’ rearrangements which code for a complete, fulllength protein and ‘nonproductive’ rearrangements which do not. Nonproductive sequences are generated when the V(D)J recombination process produces a sequence that is either outofframe or contains a stop codon. Each T cell contains two loci which can undergo the V(D)J recombination process. When the first recombination fails to generate a functional receptor (creating a nonproductive sequence), followed by a successful rearrangement on the T cell’s second chromosome (a productive sequence), the nonproductive rearrangement can be sequenced as part of the repertoire. Nonproductive sequences do not generate proteins that undergo functional selection in the thymus, and their recombination statistics should reflect only the V(D)J recombination generation process (Robins et al., 2010; Murugan et al., 2012; Sethna et al., 2019). In contrast, the recombination statistics of productive sequences should reflect both V(D)J recombination generation and functional selection. Because we are interested in nucleotide trimming during the V(D)J recombination generation process, prior to selection, we only include nonproductive sequences in our training data set. Further, because Vgene sequences within the TRB locus contain more sequence variation than D and/or Jgenes, we only include Vgene sequences in our training data set.
Replicating a previous model of nucleotide trimming
The extent of nucleotide trimming varies substantially from gene to gene (Nadel and Feeney, 1995; Nadel and Feeney, 1997; Jackson et al., 2004; Murugan et al., 2012). Previous work has identified an interesting impact of sequence features, such as sequence nucleotide context, on trimming probabilities using a PWM model (Murugan et al., 2012). To our knowledge, this is the only model that takes nucleotide sequence identity into account when predicting trimming probabilities. Specifically, this model leverages a ‘trimming motif’ containing 2 nucleotides 5’ of the trimming site and 4 nucleotides 3’ of the trimming site to predict the probability of trimming at a given site. It was designed and trained using sequencing data from just nine individuals (Murugan et al., 2012), and has surprisingly good model fit and predictive accuracy across many Vgenes despite its simplicity. Using a different, and much larger, repertoire sequencing data set, we have trained this PWM model and replicated previous work (Figure 2—figure supplement 1). We will refer to this model as the 2×4 motif model. It is important to note that this PWM model is not the primary model described in Murugan et al., 2012, but again is the only one that relates nucleotide identity to trimming.
Model setup overview
While the 2×4 motif model has good predictive accuracy and model fit (Murugan et al., 2012), it is limited by its assumption that the trimming mechanism relies solely on a sequence motif. Here, we have generalized this PWM model to a model that allows for arbitrary sequence features, and trained each new model using conditional logistic regression (see Materials and methods). With this setup, we were able to evaluate the relative importance of new mechanistically interpretable features for predicting trimming probabilities. Specifically, we designed features to measure the effects of DNAshape, length, and GC nucleotide content in both directions of the wider sequence on the probability of trimming at a given position in a gene sequence.
We parameterize each of these features as follows. An example of how an arbitrary Vgene sequence is transformed into features for modeling is shown in Figure 1. To parameterize DNAshape, we used previously developed methods (Zhou et al., 2013; Chiu et al., 2016) to estimate various DNAshape values (i.e. roll, twist, electrostatic potential, minor groove width, etc.) for each singlenucleotide position within a sequence window surrounding the trimming site. To parameterize length, we measure the sequenceindependent distance from the end of the gene (i.e. the number of nucleotides from the 3’end of the sequence) as an integervalued variable. We parameterize GC nucleotide content using the raw counts of AT and GC nucleotides on both sides of the trimming site (the twoside basecount). By using raw nucleotide counts, this measure also serves to parameterize length. Because AT nucleotides have a greater potential for sequencebreathing compared to GC nucleotides within a sequence (Jose et al., 2009), these twoside basecount terms may be serving as a proxy for the capacity of a sequence to breathe. As such, because sequencebreathing potential is only relevant for nucleotides that are paired, we do not include the nucleotides within the 3’ singlestrandedoverhang when counting 3’ AT and GC nucleotides (see Appendix 2).
With these features, we designed models containing various feature combinations (Figure 1B). Collectively, these models allow us to explore other possible sequencelevel determinants of nucleotide trimming, in addition to the previously proposed (Murugan et al., 2012) “trimming motif” hypothesis. We trained each model using the Vgene training data set described above (see Materials and methods for further model training details), and evaluated performance using a suite of different heldout data groups (Figure 2). Specifically, to evaluate model fit, we computed the expected persequence conditional log loss of each model using the full Vgene training data set.
To evaluate model generalizability, we computed the expected persequence conditional log loss using the following heldout groups:
many random, heldout subsets of the Vgene training data set;
heldout subsets of the Vgene training data set containing groups of Vgenes defined to be the ‘mostdifferent’ from all other genes using either the terminal sequences (last 25 nucleotides of each sequence) or the full gene sequences;
the full Jgene data set.
For each of these heldout group analyses, each model was retrained using the full Vgene training data set with the heldout groupofinterest removed (see Materials and methods and Appendix 3 for further details) prior to computing the loss. A lower expected persequence conditional log loss indicated better model fit and/or model generalizability. Following this model evaluation, we validated a subset of the models by using the model coefficients from the previous TCRβ Vgene training run and computing the expected persequence conditional log loss of the model using several independent testing data sets (Figure 2).
Local sequence context, length, and GC nucleotide content in both directions of the wider sequence, together, accurately predict the trimming probabilities of a given Vgene sequence
In an effort to capture the complex underlying biochemistry of the deletion process, we trained models containing various combinations of sequencelevel feature types (Figure 1B) and evaluated their ability to accurately predict Vgene trimming probabilities. With this approach, we found that a model containing parameterizations of local sequence context, length, and GC nucleotide content in both directions of the wider sequence (the 1×2 motif + twoside basecount beyond model) had the best model fit and generalizability across different data sets (Figure 3 and Figure 3—figure supplement 1). This model contains a 1×2 motif, including 1nucleotide position 5’ of the trimming site and 2nucleotide positions 3’ of the trimming site within the trimming window, and includes only bases beyond this trimming window in the AT and GC twoside basecount terms (Figure 1). Despite containing fewer total parameters than the original 2×4 motif model (Murugan et al., 2012) (12 parameters compared to 18 parameters), the 1×2 motif + twoside basecount beyond model had better predictive accuracy (Figure 4 and Figure 4—figure supplement 1).
We considered the significance of the inferred model coefficients using a Bonferronicorrected significance threshold of 0.0033 (corrected for the total number of model coefficients). With this threshold, we found that many of the inferred model coefficients were significant and quantified mechanistic patterns. Each coefficient represents the change in log10 odds of trimming at a given site resulting from an increase in the feature value, given that all other features are held constant. Within the nucleotides immediately surrounding the trimming site, bases 5’ of the trimming site have a slightly stronger effect on the trimming probability than bases 3’ of the trimming site (Figure 4B). Specifically, 5’ of the trimming site, C nucleotides have a strong positive effect on the trimming probability (${\mathrm{log}}_{10}\text{coefficient}=0.2388$) whereas A and T nucleotides have a negative effect (${\mathrm{log}}_{10}{\text{coefficient}}_{\mathrm{A}}=0.108$ and ${\mathrm{log}}_{10}{\text{coefficient}}_{\mathrm{T}}=0.137$). In contrast, immediately 3’ of the trimming site, G and T nucleotides have a positive effect on the trimming probability (${\mathrm{log}}_{10}{\text{coefficient}}_{\mathrm{G}}=0.093$ and ${\mathrm{log}}_{10}{\text{coefficient}}_{\mathrm{T}}=0.125$) whereas C nucleotides have a negative effect (${\mathrm{log}}_{10}\text{coefficient}=0.174$). These results suggest a different possible mechanistic pattern than previous motifonly models (Murugan et al., 2012; Figure 2—figure supplement 1B). Further, beyond the 1×2 motif sequence window, the count of GC nucleotides 5’ of the motif (within a 10nucleotide window) has a strong positive effect on the trimming probability (${\mathrm{log}}_{10}\text{coefficient}=0.164$) (Figure 4C). The counts of both AT and GC nucleotides 3’ of the motif have a strong negative effect on the trimming probability (${\mathrm{log}}_{10}{\text{coefficient}}_{\mathrm{A}\mathrm{T}}=0.123$ and ${\mathrm{log}}_{10}{\text{coefficient}}_{\mathrm{G}\mathrm{C}}=0.126$). Interestingly, the magnitude of these negative effects are very similar between AT and GC counts. This suggests that the raw number of nucleotides 3’ of the motif (e.g. the length) is more important for predicting the trimming probability at a given site compared to the identity of the nucleotides. pvalues for each of these coefficients were reported to be smaller than machine tolerance ($2.23\times {10}^{308}$). We noted minimal variation in the magnitude of each inferred coefficient even when changing the number of sequences included in the training data set (Figure 4—figure supplement 7).
Because we were interested in parameterizing sequencebreathing effects using the twoside basecount terms, we only included nucleotides that are considered to be doublestranded after hairpinopening within each count. In our modeling, we assume that the DNA hairpin is opened at the +2 position, leading to a 4nucleotidelong 3’singlestrandedoverhang (the 2 nucleotides furthest 3’ are considered Pnucleotides) (Ma et al., 2002; Lu et al., 2007). As such, the first 2 nucleotides of the gene sequence can be considered singlestranded, and we do not include them in the twoside basecount terms. When we train a model that ignores this distinction, and include all gene sequence nucleotides in the twoside basecount terms, we note very similar inferred coefficients and model fit (Figure 4—figure supplement 2). We acknowledge that other hairpinopening positions may be possible. To explore whether the +2hairpinopeningposition assumption could be affecting our inferences, we trained the 1×2 motif + twoside basecount beyond model with other possible hairpinopeningposition assumptions and noted minimal variation in model fit (Figure 4—figure supplement 3).
We also evaluated the predictive accuracy of motif + twoside basecount beyond models containing different ‘trimming motif’ sizes. We find that models containing a small motif (e.g. a 1×2 motif) achieve similar predictive accuracy and are more generalizable compared to models containing a larger motif (Figure 4—figure supplement 4).
Because the trimming mechanism is thought to be consistent across V, D, and Jgenes from both productive and nonproductive sequences, we were also interested in whether the inferred coefficients for the 1×2 motif + twoside basecount beyond model would be consistent between the model trained using the nonproductive Vgene training data set, a model trained using a nonproductive Jgene data set, and a model trained using a productive Vgene data set. As such, we trained a new 1×2 motif + twoside basecount beyond model using only nonproductive Jgene sequences and a separate, new 1×2 motif + twoside basecount beyond model using only productive Vgene sequences (both sequence sets were from the same cohort of individuals as the Vgene training data set). We found that the inferred coefficients were highly similar between the three models (Figure 4—figure supplement 5 and Figure 4—figure supplement 6).
When evaluating models containing only a single feature type, we find that the twoside basecount model which parameterizes GC nucleotide content on both sides of the trimming site (and, indirectly, length) has the best model fit and generalizability across all heldout groups tested (Figure 3). As such, these GCcontent features, which are likely parameterizing the capacity for the sequence to breathe, are more predictive of Vgene trimming probabilities than local sequence context or DNAshape alone. This finding supports previous observations that Artemis may act as a structurespecific nuclease as opposed to a nuclease that binds specific DNA sequences (Ma et al., 2005; Chang et al., 2015; Chang and Lieber, 2016; Yosaatmadja et al., 2021).
Inferred local sequence context coefficients suggest a biological trimming motif
A persistent concern with the 1×2 motif + twoside basecount beyond model was that the motif coefficients could be driven by certain genes, instead of representing an actual genesegmentwide signal. When comparing the inferred trimming profiles from the twoside basecount model to those from the 1×2 motif + twoside basecount beyond model, we identified a group of Vgenes which had drastically lower prediction error when the 1×2 motif terms were included. These Vgenes had a difference in pergene root mean squared error between the two models that was greater than –0.13 (Figure 5A). The genes included in this group were TRBV53, TRBV73*01, TRBV73*04, TRBV74, TRBV9, TRBV11, and TRBV13. To evaluate whether these genes could be driving the observed motif signal, we explored whether the prediction error for these genes changed when they were removed from the model training data set.
In fact, we found that the inferred trimming profiles for these genes still had very low prediction error despite the genes not being included in the model training data set (Figure 5B and C), showing the generalizability of these features. The inferred model coefficients from this 1×2 motif + twoside basecount beyond model fit using the subsetted training data set were highly similar to those from the original model fit using the full training data set. Because genes which are highly similar sequencewise to the group of heldout genes could still be present in the training data set and be driving these similarities, we defined a new data set that excluded this larger group of genes. When we repeated the same experiment with this new, morerestricted training data set, we observed similar results (Figure 5B and C). As such, both of these experiments provided evidence that the motif signal may actually represent a genesegmentwide sequence motif that appears to get preferentially trimmed, independent of GCcontentrelated effects.
Trimmingassociated variation within the Artemis locus is associated with a change in model coefficients
Using a subset of the Vgene training data set used here, we previously identified a set of single nucleotide polymorphisms (SNPs) within the gene encoding the Artemis protein that are associated with increasing the extent of V and Jgene trimming (Russell et al., 2022b). This result suggested that trimming profiles may subtly vary in the context of these SNPs. As such, we were interested in whether these SNPs could be mediating (or serving as a proxy for) a change in the trimming mechanism. To explore this, we worked with paired SNP array and TCRβimmunosequencing data representing 611 of the original 666 individuals in the Vgene training data set used here. Our previous work Russell et al., 2022b used data from only 398 of these individuals, however, the conclusions of that paper held when using this expanded group of 611 individuals in the analysis. With these data, we asked whether the inferred coefficients from the Vgenespecific 1×2 motif + twoside basecount beyond model varied significantly in the context of the noncoding Artemislocus SNP (rs41298872) that was found to be most strongly associated with increasing the extent of Vgene trimming in our previous work (Russell et al., 2022b). As such, we redefined the model to include an interaction coefficient between the SNP genotype and each model parameter (see Materials and methods). We then used a Bonferronicorrected significance threshold of 0.0033 (corrected for the total number of interaction coefficients) to evaluate the significance of each interaction coefficient. For each significant interaction coefficient, we concluded that the corresponding model coefficient varied significantly in the context of the SNP genotype.
Using these methods, we found that several of the 1×2 motif + twoside basecount beyond model coefficients varied significantly in the context of the Artemislocus SNP rs41298872 (Figure 6). Specifically, we found that 3’ of the trimming site, the negative effect of A nucleotides on the trimming odds varied in the context of the SNP for the position immediately 3’ of the trimming site (${\mathrm{log}}_{10}\text{interaction coefficient}=0.006$, $\mathrm{p}=0.0006$) and one position away (${\mathrm{log}}_{10}\text{interaction coefficient}=0.007$, $\mathrm{p}=0.0006$). Further, we found that the negative effect of the count of AT nucleotides 3’ of the motif varied strongly in the context of the SNP (${\mathrm{log}}_{10}\text{interaction coefficient}=0.010$, $\mathrm{p}=1.47\times {10}^{12}$). No other motif or twoside basecount coefficients were found to significantly vary.
Because the 3’side basecountbeyond terms parameterize both GC nucleotide content and length in their definition, we were interested in whether the significance of the 3’ATnucleotide count SNP variation effect was related to GC nucleotide content, length, or both. To do this, we redefined the 3’side basecountbeyond parameters to be a proportion instead of raw AT/GC nucleotide counts and included an additional length term in the model to remove lengthrelated effects from the inferred 3’side basecountbeyond coefficients. Using this new model, we repeated the analysis and found that the length coefficient varied significantly in the context of the SNP (${\mathrm{log}}_{10}\text{interaction coefficient}=0.005$, $\mathrm{p}=6.24\times {10}^{23}$), but the 3’ATnucleotideproportion term did not (Figure 6—figure supplement 1). This result is fully consistent with the fact that the Artemislocus SNP is known to be associated with increasing the extent of trimming (a proxy for length).
Local sequence context, length, and GC nucleotide content in both directions of the wider sequence can also accurately predict the trimming probabilities of a given sequence from other receptor loci
To validate our previously trained models, we worked with TCRα and TCRβimmunosequencing data representing 150 individuals, TCRγimmunosequencing data representing 23 individuals, and IGHimmunosequencing data representing 9 individuals from three independent validation cohorts. Before analyzing these data, we ‘froze’ our trained model coefficients in git commit 093610a on our repository. In contrast to the training data cohort, these validation cohorts contain different demographics and were each processed using different sequence annotation methods (see Materials and methods). To explore the potential effects of using a different sequence annotation method, we reannotated the TCRβ training data set using the same annotation method as the TCRαβ testing data and found that it had little to no effect on the model fit or performance (Figure 7—figure supplement 1).
To evaluate the performance of the 1×2 motif + twoside basecount beyond model using these testing data, we used the model coefficients from the previous TCRβ Vgene training run and computed the expected persequence conditional log loss of the model using each testing data set (TCRβ Vgene sequences, TCRα Vgene sequences, TCRγ Vgene sequences, IGH Vgene sequences, TCRβ Jgene sequences, etc.). We found that the model has high predictive accuracy (i.e. low expected persequence conditional log loss) for both nonproductive V and Jgene sequences from the TCRβ testing data set (Figure 7). The model also extends well to nonproductive V and Jgene sequences from the TCRα and TCRγ testing data sets and to nonproductive Vgene sequences from the IGH testing data set. The model has relatively poor predictive accuracy for nonproductive IGH Jgene sequences, however. We noted very similar results when validating model performance using productive V and Jgene sequences from each testing data set (Figure 7—figure supplement 2).
We hypothesized that the weight of the 1×2 motif and twoside basecount beyond model terms may vary across each testing data set. To explore this for each data set, we again used the model coefficients from the previous TCRβ Vgene training run and trained a new twoparameter model containing one coefficient scaling the 1×2 motif terms and a second coefficient scaling the twoside basecount beyond terms (see Materials and methods). With this approach, we found that the twoside basecount beyond terms were dominant compared to the 1×2 motif terms for every data set (Figure 7—figure supplement 3A). The scale coefficient for the 1×2 motif terms was very small for several of the data sets, especially the IGH data set, indicating only a weak motifrelated signal. The sequence motifs that lead to a large increase in trimming probabilities in the model appear at relatively low frequencies within the germline IGH genes (Figure 7—figure supplement 4), perhaps explaining the weakness of the motifrelated signal. When evaluating the expected persequence conditional log loss of these partially retrained models, we note a small improvement in model fit for each retrained model compared to the original model (Figure 7—figure supplement 3B).
Discussion
The junctional deletion and insertion steps of the V(D)J recombination process are essential for creating diversity within the TCR repertoire. While the Artemis protein is often regarded as the main nuclease involved in V(D)J recombination, the exact mechanism of nucleotide trimming has yet to be understood in a human system. Using a previously published highthroughput TCRβ sequencing data set, we designed a flexible probabilistic model of nucleotide trimming that allowed us to explore the relative importance of various sequencelevel features. While we recognize that these general model features may not capture the full complexity of the trimming mechanism and establish causation, we were primarily interested in identifying mechanistically interpretable features which could confirm and extend our current understanding of the nucleotide trimming process. With this framework, we have (1) revealed a set of sequencelevel features which can be used to accurately predict trimming probabilities across various adaptive immune receptor loci, (2) shown that length and GC nucleotide content in both directions of the wider sequence are highly predictive of trimming probabilities, quantifying how doublestranded DNA needs to be able to breathe for trimming to occur, (3) identified a sequence motif that appears to get preferentially trimmed, independent of length and GCcontentrelated effects, and (4) demonstrated that a genetic variant within the gene encoding the Artemis protein is associated with changes in several model coefficients.
Specifically, we find that a model containing parameterizations of both local sequence context, length, and GC nucleotide content in both directions of the wider sequence can most accurately predict the trimming probabilities of a given TCRβ gene sequence. In addition to having fewer parameters, this model also had better predictive accuracy than a previously proposed sequence context model (Murugan et al., 2012). Models containing other sequencelevel parameters such as DNAshape and length also had relatively worse predictive accuracy. The TR and IG V(D)J recombination processes, including trimming profiles, have previously been suggested to vary substantially across individuals (Slabodkin et al., 2021; Russell et al., 2022b). Here, our results support a universal, sequencebased trimming mechanism underlying this variation across TR and IG loci in humans. Specifically, in addition to TCRβ sequences, we find that local sequence context, length, and GC nucleotide content in both directions of the wider sequence can be used to accurately predict trimming probabilities across TCRα, TCRγ, and IGH sequences. For all of these loci, we find that length and GC nucleotide content are relatively more important than local sequence context terms for making accurate model predictions.
The Artemis protein, in complex with DNAPKcs, is responsible for opening the DNA hairpin during the early steps of V(D)J recombination to generate a 4nucleotidelong 3’singlestranded overhang at the end of each gene, and has been suggested to continue on to trim nucleotides from this resulting DNA structure (Feeney et al., 1994; Nadel and Feeney, 1995; Nadel and Feeney, 1997; Jackson et al., 2004; Gu et al., 2010; Chang et al., 2015; Chang and Lieber, 2016; Zhao et al., 2020; Russell et al., 2022b). The Artemis protein, with and without DNAPKcs, has been shown to bind singlestrandedtodoublestranded DNA boundaries prior to nicking DNA (Ma et al., 2002; Ma et al., 2005; Chang et al., 2015; Chang and Lieber, 2016). While the singlestranded overhang created during hairpinopening may create a natural singlestrandedtodoublestranded DNA substrate for Artemis binding near the end of the gene sequence, we find that many trimming events occur further into the doublestranded gene sequence. Indeed, previous in vitro DNA nuclease assays involving Artemis have shown that sequencebreathing dynamics are often required to generate a transient singlestrandedtodoublestranded DNA substrate prior to Artemis action (Chang et al., 2015). Using our model of nucleotide trimming, we have shown that trimming probabilities are highest for DNA positions closer to the end of the sequence. Because these DNA positions have fewer doublestranded nucleotides on the 3’side of the trimming site, they may have more capacity for sequencebreathing. On the 5’side of the trimming site, we find that having a larger number of GC nucleotides, and perhaps less sequencebreathing capacity, increases the trimming probability. Perhaps this breathing transition can create a transient singlestrandedtodoublestranded DNA substrate that is suitable for Artemis to bind and trim. As such, this finding quantifies sequencebreathing effects that were previously identified through in vitro DNA nuclease assay studies involving Artemis (Chang et al., 2015).
Independent of GCcontentrelated effects, we have also identified a genesegmentwide sequence motif that appears to get preferentially trimmed. This motif is suggestive of sequencespecific nucleolytic activity, however, Artemis is widely regarded as a structurespecific nuclease as opposed to a nuclease that binds specific DNA sequences (Ma et al., 2005; Chang et al., 2015; Chang and Lieber, 2016; Yosaatmadja et al., 2021). This suggests that either (1) Artemis actually does possess some ability to recognize specific nucleotides, (2) the observed sequence motif is serving as a proxy for DNA structure induced by the motif, or (3) another nuclease, in addition to Artemis, is responsible for the sequencespecific trimming we observe. However, because the strength of this sequence motif signal varied across receptor loci, further work will be required to explore its mechanistic basis and presence.
We found that several model coefficients related to local sequence context, length, and GC nucleotide content in both directions of the wider sequence varied significantly in the context of the noncoding Artemislocus SNP rs41298872. We previously identified this Artemislocus SNP as being associated with increasing the extent of TCRβ V and Jgene trimming (Russell et al., 2022b). While many previous studies have reported a high consistency of TCRβ trimming profiles across individuals (Murugan et al., 2012; Marcou et al., 2018; Sethna et al., 2020), our results begin to explore how the trimming mechanism may vary across individuals in the context of Artemis genetic variation. We reported that trimming probabilities decrease as the number of doublestranded nucleotides 3’ of the trimming site increases. In the context of the SNP rs41298872, we found that as the number of doublestranded AT nucleotides 3’ of the trimming site increases, the trimming probabilities do not decrease as quickly. This suggests that individuals homozygous (or heterozygous) for rs41298872 may be more capable of trimming at positions that have a larger number of doublestranded nucleotides 3’ of the trimming site, especially if the additional nucleotides are AT bases. This may be possible if, for example, rs41298872 increases Artemis expression. If there is more Artemis available, then trimming at less optimal positions (i.e. positions further into the sequence which have less breathing) may be possible. Additional work will be required to define the relationship between rs41298872 genotype and Artemis expression.
We also identified several local sequence context coefficients that varied in the context of rs41298872, however, their mechanistic interpretation remains unclear. Earlier, we noted that A nucleotides 3’ of the trimming site have a negative effect on the trimming probability while T nucleotides have a strong positive effect. In the context of rs41298872, we found that the magnitude of the negative effect of 3’ A nucleotides on the trimming probability was reduced. This may suggest that individuals homozygous (or heterozygous) for rs41298872 may trim in a less motifdependent fashion, and are instead more reliant on sequence openness 3’ of the trimming site. In this way, having A or T nucleotides 3’ of the trimming site would create a more open local sequence for trimming.
There are several key limitations of our approach which are intrinsic to the use of adaptive immune receptor repertoire data. First, we have used trimming statistics from nonproductive rearrangements as a means of studying the nucleotide trimming process in the absence of selection. Nonproductive sequences can be sequenced as part of the repertoire when they are present within a cell expressing a productive rearrangement that survived the selection process. While we are not aware of a mechanism through which nonproductive and productive rearrangements within a single cell could be correlated, we also acknowledge that the repertoire of nonproductive rearrangements may be an imperfect proxy for a preselection repertoire. However, as is common in the literature (Robins et al., 2010; Murugan et al., 2012; Marcou et al., 2018; Sethna et al., 2019; Sethna et al., 2020), we assume that the two recombination events are independent and that the nonproductive rearrangements reflect the statistics of the repertoire prior to selection. Next, because many V(D)J rearrangement scenarios can give rise to the same final nucleotide sequence, possible error related to the annotation of each sequence may have restricted our ability to model the actual trimming distributions of each gene. Although we cannot rule out some effect of incorrect sequence annotation on our model inferences, we found that the exact sequence annotation method used, including sampling from the posterior distribution of rearrangement events, had little to no effect on the model fit or performance.
In summary, we have found that local sequence context, length, and the GC nucleotide content in both directions of the wider sequence can accurately predict the trimming probabilities of TR and IG gene sequences. These results refine our understanding of how nucleotides are trimmed during V(D)J recombination. The sequencelevel features identified here lay the groundwork for further exploration into the trimming mechanism and how it may vary across individuals. Such insights will provide another step toward understanding how V(D)J recombination generates diverse receptors and supports a powerful, unique immune response in humans.
Materials and methods
Training data set
Request a detailed protocolTCRβ repertoire sequence data for 666 healthy bone marrow donor subjects was downloaded from the Adaptive Biotechnologies immuneACCESS database using the link provided in the original publication (Emerson et al., 2017). V(D)J recombination scenarios were assigned to each sequence for each individual using the IGoR software (version 1.4.0) (Marcou et al., 2018) as follows. The IGoR software can learn unbiased V(D)J recombination statistics from immune sequence reads. Using these statistics, IGoR can output a list of potential recombination scenarios with their corresponding likelihoods for each sequence. As such, using the default IGoR V(D)J recombination statistics, the 10 highest probability V(D)J recombination scenarios were inferred for each TCRβchain sequence in the training data set (Marcou et al., 2018). We then annotated each TCRβchain sequence with a single V(D)J recombination scenario by sampling from these 10 scenarios according to the posterior probability of each scenario. We filtered these sequences for rearrangements which contained more than 1 trimmed nucleotide and less than 15 trimmed nucleotides (see the ‘Notation’ section for further details). We further subset the data to include only nonproductive sequences, and used these data for all subsequent model training. After these processing and filtering steps, we used Vgene trimming length distributions from 21,193,153 nonproductive sequences for all model training. To test each trained model, we used Vgene trimming length distributions from the remaining 107,121,841 productive sequences (as described in Appendix 3). From this same data set, we also used Jgene trimming length distributions from 107,255,406 productive sequences and 20,204,801 nonproductive sequences to test each model.
Testing data sets
TCRα and TCRβ testing data sets
Request a detailed protocolAnnotated TCRα and TCRβ repertoire sequence data for 150 healthy subjects was downloaded using the link provided in the original publication (Russell et al., 2022b). In contrast to the training data cohort, this cohort contains different demographics, shallower RNAseqbased TCR sequencing, and was processed using a different sequence annotation methods (i.e. TCRdist [version 0.0.2] [Dash et al., 2017] as described in a previous publication [Russell et al., 2022b]). Sequences were split into nonproductive and productive groups for model validation. From the TCRα data set, we used Vgene trimming length distributions from 123,496 nonproductive sequences and 862,096 productive sequences and Jgene trimming length distributions from 141,451 nonproductive sequences and 1,101,114 productive sequences to test each model. From the TCRβ data set, we used Vgene trimming length distributions from 64,738 nonproductive sequences and 1,435,153 productive sequences and Jgene trimming length distributions from 59,608 nonproductive sequences and 1,496,953 productive sequences to test each model.
TCRγ testing data set
Request a detailed protocolAnnotated TCRγ repertoire sequence data for 23 healthy bone marrow donor subjects was downloaded from the Adaptive Biotechnologies immuneACCESS database (Robins and Pearson, 2015). Sequences were split into nonproductive and productive groups for model validation. We used Vgene trimming length distributions from 2,403,293 nonproductive sequences and 1,002,662 productive sequences and Jgene trimming length distributions from 568,824 nonproductive sequences and 250,493 productive sequences to test each model.
IGH testing data sets
Request a detailed protocolAnnotated IgG class nonproductive IGH repertoire sequence data for nine healthy subjects was obtained from the authors of a previous publication (Spisak et al., 2020). The raw sequence data is available using the link provided in the original publication (Briney et al., 2019). In contrast to the training data cohort, this cohort contains different demographics, shallower RNAseq based IGHsequencing, and was processed using a different sequence annotation method (i.e. a combination of Immcantation [Vander Heiden et al., 2014] and IgBlast [Ye et al., 2013] as described in a previous publication [Spisak et al., 2020]). Further, these data are restricted to rearrangements that lead to a clonal family with at least six members.
Likewise, productive IGH repertoire sequence data for four healthy subjects was downloaded using the link provided in the original publication (Jaffe et al., 2022) and the sequences were annotated using partis (version 0.16.0) (Ralph and Matsen, 2016). Due to the large size of this data set, 100k sequences were randomly sampled from the original data set prior to model validation. For both IGH data sets, only a single sequence from each inferred clonal family was included in each model testing data set. From these data sets, we used Vgene trimming length distributions from 160,714 nonproductive sequences and 32,245 productive sequences and Jgene trimming length distributions from 297,298 nonproductive sequences and 74,884 productive sequences to test each model.
Artemislocus SNP data set
Request a detailed protocolGenomewide SNP array data corresponding to 611 of the training data set individuals was downloaded from The database of Genotypes and Phenotypes (accession number: phs001918). Details of the SNP array data set, genotype imputation, and quality control have been described previously (Martin et al., 2020). We only used SNP data corresponding to the Artemis locus (rs41298872) which we previously found to be strongly associated with increasing the extent of Vgene trimming (Russell et al., 2022b).
Notation
Request a detailed protocolLet $I$ be a set of individuals. For each subject $i\in I$, assume we have a TCR repertoire consisting of sequences indexed by $k$ such that $k=1,\mathrm{\dots},{K}_{i}$. We assume that each sequence can be unambiguously annotated with being from a specific Vgene and Jgene sequence, and having a number of deleted nucleotides from each gene. For modeling purposes, we combine TRB Vgene or Jgene alleles that have identical terminal nucleotide sequences (last 24 nucleotides of each sequence) into TRB Vgene allele groups and TRB Jgene allele groups. As such, each TCR sequence is annotated with being from a Vgene allele group and Jgene allele group. Because we are requiring that each gene allele group originates from the same TRB Vgene or Jgene, there may still be overlap in terms of sequence identity between allele groups. For simplicity, we orient all sequences in the 5’to3’ direction, and use the top strand for Vgene sequences and the bottom strand for Jgene sequences. We will be introducing modeling methods as they relate to Vgenes and Vgene trimming, however, with this sequence orientation, the same methods can be applied to Jgenes and Jgene trimming. We will use $\sigma $ to represent a gene sequence oriented in the 5’to3’ direction and $n$ to represent the number of nucleotides deleted from the 3’ end of this sequence as we describe our modeling.
We are interested in modeling the probability of trimming $n$ nucleotides from a given gene sequence $\sigma $, $P(n\mid \sigma )$. We can define an empirical conditional probability density function to estimate this probability. To start, we can uniformly sample from any given individual’s repertoire. Let $S$ be a random variable that represents the geneallelegroup sequence from such a sample. Let $N$ be a random variable that represents the number of deleted nucleotides, which for notational convenience we assume take on a nonnegative integer value (nonsensical values will have probability zero). Let $0\le {C}^{\left(i\right)}\left(\sigma \right)\le {K}_{i}$ represent the number of TCRs that use gene allele group $\sigma $. Let $0\le {C}^{\left(i\right)}(n,\sigma )\le {K}_{i}$ represent the number of TCRs that have gene allele group $\sigma $ and $n$ gene nucleotides deleted. With these data, we can form the empirical conditional probability density function:
Using these TCRβ repertoire data, we want to model the influence of various sequencelevel parameters on $P(n\mid \sigma )$ . With this assumption, let $L$ and $U$ be lower and upper bounds, respectively, on $n$ such that ${N}^{\prime}=\{L,\mathrm{\dots},U\}$ is the set of all reasonable nucleotide deletion amounts. The precise location of hairpin opening and its relationship to deletion is unclear. Hence, we have chosen to define $L=2$ since smaller trimming amounts may result from an alternative, hairpinopeningpositionrelated (or other) trimming mechanism. Likewise, we have chosen to define $U=14$ since trimming amounts greater than 14 nucleotides are uncommon and could also result from an alternative trimming mechanism. We will subset the training data set, after IGoR annotation (see details in a previous section), such that we will only consider TCRs that have $2\le n\le 14$. Similarly, the one existing analysis (Murugan et al., 2012) exploring the relationship between sequence context and nucleotide trimming only considered TCRs that had $2\le n\le 12$ for their modeling. We summarize all of the notation discussed in this section, as well as in the following sections, in Appendix 1—table 1.
V(D)J recombination modeling assumptions
Request a detailed protocolFor our model, we make the following assumptions about V(D)J recombination biology:
During the V(D)J recombination process, the gene DNA hairpin is nicked open by a singlestranded break (Gauss and Lieber, 1996; Nadel and Feeney, 1997; Ma et al., 2002; Jackson et al., 2004; Lu et al., 2007).
This hairpin nick occurs at the +2 position, leading to a 4nucleotidelong 3’singlestrandedoverhang (the 2 nucleotides furthest 3’ are considered Pnucleotides) (Ma et al., 2002; Lu et al., 2007). We will discuss a sensitivity analysis to this assumption, which showed that the assumed hairpinnick position had little impact on our model fitting, in the appendix.
If any nonzero amount of the original gene sequence is deleted, all Pnucleotides will also be deleted (Gauss and Lieber, 1996; Srivastava and Robins, 2012).
Nucleotide trimming occurs before Ninsertion.
With these assumptions, we can resolve the nucleotide sequence on both sides of the trimming site and define mechanistically interpretable model features using these two sequences. Specifically, we define a ‘trimming motif’ consisting of several nucleotides on either side of the trimming site, the predicted ‘DNAshape’ of the nucleotides and bonds in close proximity to the trimming site, the counts of GC or AT nucleotides on either side of the trimming site beyond the ‘trimming motif’ region (e.g. the ‘twoside basecount beyond’), and the sequenceindependent ‘length’ from the end of the gene to the trimming site (see Appendix 2 for further details). An example of how an arbitrary Vgene sequence is transformed into features for modeling is shown in Figure 1. We will assume that observations can be drawn from a model in which these features vary across trimming lengths $n$ for a given gene allele group $\sigma $. We can then explore the influence of these features on the probability of trimming at a certain site given a gene sequence.
Defining a model covariate function
Request a detailed protocolWith the features summarized above, we can define a model covariate function $f$ than contains any unique combination of parameterspecific covariate functions (Table 1). This function $f$ will be the sum of each of the desired parameterspecific covariate functions. This framework allows us to generalize the existing PWM model (Murugan et al., 2012) to a model that allows for arbitrary sequence features. For example, we replicate this PWM model using the model covariate function, ${f}_{1}(n,\sigma ;{\mathit{\beta}}^{\text{mtif}},a=2,b=4)$, where $n$ represents the number of trimmed nucleotides, $\sigma $ represents the geneallelegroup sequence, ${\mathit{\beta}}^{\text{mtif}}$ represents motifspecific parameter coefficients, and $a$ and $b$ are nonnegative integer values that represent the number of nucleotides 5’ and 3’ of the trimming site, respectively, that are included in the ‘trimming motif’. This function is described further in (Equation 14). To extend this model to a model containing motif parameters and basecountbeyond parameters, the model covariate function will be
where ${f}_{2}$ represents the basecountbeyond model covariate function (Equation 17), ${\mathit{\beta}}^{\text{AT}}$ and ${\mathit{\beta}}^{\text{GC}}$ represent basecountbeyondspecific parameter coefficients, and $c$ represents the number of nucleotides 5’ of the trimming site to be included in the basecount. We will use this motif and basecountbeyond model example to discuss the model formulation in the following sections, however, many other parameter combinations are possible. We will not define a model covariate function that contains two parameters that model the same feature. For example, length and basecountbeyond coefficients will never be included in a model covariate function together (since they both parameterize length). Likewise, motif and DNAshape coefficients will never both be included in a model covariate function.
Predicting trimming probabilities using conditional logistic regression
Request a detailed protocolWe will be using the motif and basecountbeyond parameters given by (Equation 2) as examples for the remainder of this section, however, we could also formulate a model with any other parameter of interest, as described in the previous section (Table 1). As such, we can fit a conditional logit model which posits that
where ${N}^{\prime}$ is the set of all reasonable trimming lengths, $a$ and $b$ represent the number of nucleotides 5’ and 3’ of the trimming site to be included in the ‘trimming motif,’ respectively, $c$ represents the number of nucleotides 5’ of the trimming site to be included in the basecount parameters, and $f(n,\sigma ;{\mathit{\beta}}^{\mathtt{\text{motif}}},{\mathit{\beta}}^{\mathtt{\text{AT}}},{\mathit{\beta}}^{\mathtt{\text{GC}}},a,b,c)$ is the model covariate function for the motif and basecountbeyond model given by (Equation 2). We will let $P(n\mid \sigma ;{\mathit{\beta}}^{\mathtt{\text{motif}}},{\mathit{\beta}}^{\mathtt{\text{AT}}},{\mathit{\beta}}^{\mathtt{\text{GC}}},a,b,c)$ denote the conditional probability that a given gene will be trimmed by $n$ nucleotides.
Let ${y}_{ik\sigma n}$ equal 1 if a gene allele group $\sigma $ is trimmed by $n$ nucleotides for TCR $k$ from subject $i$, and equal 0 otherwise. With this, we can define a likelihood function, $L({\mathit{\beta}}^{\text{mtif}},{\mathit{\beta}}^{\text{AT}},{\mathit{\beta}}^{\text{GC}},a,b,c)$, such that for a random sample of subjects, $L({\mathit{\beta}}^{\text{mtif}},{\mathit{\beta}}^{\text{AT}},{\mathit{\beta}}^{\text{GC}},a,b,c)$, is the likelihood of the model parameters, ${\mathit{\beta}}^{\text{mtif}}$, ${\mathit{\beta}}^{\text{AT}}$, and ${\mathit{\beta}}^{\text{GC}}$, given that we observed a set of trimming amounts for a set of given genes. As such, the loglikelihood function can be written as
where $P(n\mid \sigma ;{\mathit{\beta}}^{\mathtt{\text{motif}}},{\mathit{\beta}}^{\mathtt{\text{AT}}},{\mathit{\beta}}^{\mathtt{\text{GC}}},a,b,c)$ is given by (Equation 3). Instead of maximizing this loglikelihood directly, we may wish to aggregate the data to reduce the number of observations and simplify model fitting. Recall that for subject $i$, ${C}^{\left(i\right)}\left(\sigma \right)$ represents the number of TCRs which use gene allele group $\sigma $ and ${C}^{\left(i\right)}(n,\sigma )$ represents the number of TCRs which have gene allele group $\sigma $ and $n$ gene nucleotides deleted. As such, ${C}^{\left(i\right)}(n,\sigma )$ is the count of observations which will have the same trimming probabilities $P(n\mid \sigma ;{\mathit{\beta}}^{\mathtt{\text{motif}}},{\mathit{\beta}}^{\mathtt{\text{AT}}},{\mathit{\beta}}^{\mathtt{\text{GC}}},a,b,c)$ and will have been trimmed by $n$ for subject $i$ and gene allele group $\sigma $. Thus, using this aggregated data from all subjects $i\in I$, we can rewrite the loglikelihood function equivalently as
As above, for a random sample of subjects, $L({\mathit{\beta}}^{\text{mtif}},{\mathit{\beta}}^{\text{AT}},{\mathit{\beta}}^{\text{GC}},a,b,c)$ is the likelihood of the model parameters, ${\mathit{\beta}}^{\text{mtif}}$, ${\mathit{\beta}}^{\text{AT}}$, and ${\mathit{\beta}}^{\text{GC}}$, given that we observed a set of trimming amounts for a set of given genes.
With this likelihood formulation, all observations in the sample get uniform treatment in the construction of the likelihood. However, subjects may differ in their repertoire size and composition for reasons other than trimming. For example, it is known that gene usage differs across subjects. Thus, to avoid having these differences pollute our $\hat{{\mathit{\beta}}^{\mathtt{\text{motif}}}}$, $\hat{{\mathit{\beta}}^{\mathtt{\text{AT}}}}$, and $\hat{{\mathit{\beta}}^{\mathtt{\text{GC}}}}$ inference, we propose a subject and gene weighting scheme.
As such, we can define the expected likelihood of a process where we first draw a subject $i$ uniformly at random, then we sample TCR sequences from their repertoire according to a given distribution, as follows. For a single TCR sequence from such a sample, let $S$ be a random variable representing the gene of the sequence, and let $N$ be a random variable representing the number of deleted nucleotides. We can sample each TCR sequence with probability ${P}_{\text{samp}}(N=n,S=\sigma )$ which we will specify later. Also, given random $S$ and $N$, the loglikelihood of the model parameters, ${\mathit{\beta}}^{\text{mtif}}$, ${\mathit{\beta}}^{\text{AT}}$, and ${\mathit{\beta}}^{\text{GC}}$, is given by
With this, the expected loglikelihood of the model parameters, $\mathit{\beta}}^{\mathtt{\text{motif}}$, ${\mathit{\beta}}^{\text{AT}}$, and ${\mathit{\beta}}^{\text{GC}}$ given this random sample is given by
We can define a new, weighted loglikelihood function, $\mathrm{log}{L}_{\text{expected}}({\mathit{\beta}}^{\text{mtif}},{\mathit{\beta}}^{\text{AT}},{\mathit{\beta}}^{\text{GC}},a,b,c)$, equivalent to this expected loglikelihood:
For a random sample of subjects, the weighted likelihood, ${L}_{\text{expected}}({\mathit{\beta}}^{\text{mtif}},{\mathit{\beta}}^{\text{AT}},{\mathit{\beta}}^{\text{GC}},a,b,c)$, represents the likelihood of the model parameters, ${\mathit{\beta}}^{\text{mtif}}$, ${\mathit{\beta}}^{\text{AT}}$, and ${\mathit{\beta}}^{\text{GC}}$, given that we observed a set of trimming amounts for a given set of gene allele groups after weighting observations according to the sampling procedure ${P}_{\text{samp}}(N=n,S=\sigma )$. We can use whichever sampling procedure, ${P}_{\text{samp}}(N=n,S=\sigma )$, we want. For example, recall that we originally formed the empirical conditional PDFs in (Equation 1) for each subject $i$ by uniformly sampling from each TCR repertoire to get a total repertoire size of ${K}_{i}$:
and
With this, we can define a sampling procedure equivalent to this empirical joint PDF as follows:
With this sampling procedure,
As such, each subject, instead of each observation, gets uniform treatment in the construction of the weighted likelihood.
While this procedure would correct for individual subjects having different repertoire sizes, it does not account for gene usage differences. To avoid having these differences pollute our $\hat{{\mathit{\beta}}^{\mathtt{\text{motif}}}}$, $\hat{{\mathit{\beta}}^{\mathtt{\text{AT}}}}$, and $\hat{{\mathit{\beta}}^{\mathtt{\text{GC}}}}$ inference, we propose a subjectindependent geneallelegroup sampling scheme. While we could use any distribution on $\sigma $, including a uniform weight by gene allele groups, we have chosen to define:
We can reformulate the sampling procedure which is an empirical average pergeneallelegroup frequency such that:
With this subjectindependent gene sampling procedure, we can define a weighted likelihood ${L}_{W}({\mathit{\beta}}^{\text{mtif}},{\mathit{\beta}}^{\text{AT}},{\mathit{\beta}}^{\text{GC}},a,b,c)$ such that
As such, each gene and each subject get uniform treatment in the construction of the weighted likelihood.
From here, we can maximize this weighted loglikelihood, $\mathrm{log}{L}_{W}({\mathit{\beta}}^{\text{mtif}},{\mathit{\beta}}^{\text{AT}},{\mathit{\beta}}^{\text{GC}},a,b,c)$, to estimate the logprobabilities ${\mathit{\beta}}^{\text{mtif}}$, ${\mathit{\beta}}^{\text{AT}}$, and ${\mathit{\beta}}^{\text{GC}}$, where ${\mathit{\beta}}^{\text{mtif}}$ is equivalent to a (log) positionweightmatrix. To estimate each coefficient, we can solve the weighted maximum likelihood estimation problem:
using the mclogit package in R. We can formulate a weighted maximum likelihood problem in a similar way for any model covariate function $f$ containing a unique combination of parameterspecific covariate functions (Table 1).
We compare our inferred coefficients to the existing PWM model which was designed and trained using least squares (Murugan et al., 2012). When replicating this model using our methods described above (i.e. the 2×4 motif model), we note highly similar results (Figure 2—figure supplement 1).
Evaluating model fit and generalizability across genes
Request a detailed protocolIn order to evaluate the model fit and generalizability of each model, we use a variety of training and testing data sets to train each model and calculate the log loss. We will describe our general model evaluation procedure here. We describe variations of this general model evaluation procedure in Appendix 3. Let $\mathbf{\text{T}}$ represent a training data set and $\mathbf{\text{H}}$ represent a heldout testing data set. With the training set $\mathbf{\text{T}}$, we can train each model of interest as described above in (Equation 10). After this model fitting, we can calculate the expected persequence conditional log loss of the model with given coefficients, $\mathcal{M}$, for a given heldout testing set, $\mathbf{\text{H}}$, such that
where $i$ represents a subject, $n$ represents a trimming length, and $\sigma $ represents a gene allele group. Because we are incorporating the empirically observed frequency of each subject, trimming length, and gene allele group within each ‘heldout testing set,’ ${P}_{{\text{emp}}_{\mathbf{\text{H}}}}(n,\sigma ,i)$, in this formulation, the expected persequence conditional log loss values are guaranteed to be directly comparable between heldout testing sets with varying compositions. Models that have lower expected persequence conditional log loss will indicate that the model has a better fit.
Assessing significance of model coefficients
Request a detailed protocolDuring model fitting, we estimated the model coefficients $\hat{{\mathit{\beta}}^{\mathtt{\text{motif}}}}$, $\hat{{\mathit{\beta}}^{\mathtt{\text{AT}}}}$, and $\hat{{\mathit{\beta}}^{\mathtt{\text{GC}}}}$ by maximizing the weighted likelihood function given by (Equation 9). To measure the significance of each of these model coefficients $\hat{\beta}\in \{\hat{{\mathit{\beta}}^{\mathtt{\text{motif}}}},\hat{{\mathit{\beta}}^{\mathtt{\text{AT}}}},\hat{{\mathit{\beta}}^{\mathtt{\text{GC}}}}\}$ we want to test whether each coefficient $\widehat{\beta}=0$. To do this, we can first estimate the standard error of each inferred coefficient using a clustered bootstrap (with subjectgene pairs as the sampling unit). As such, for each bootstrap iterate, we sampled subjectgene pairs from the full Vgene training data set with replacement. Using this resampled data, we maximized the weighted likelihood function given by (Equation 9) to reestimate each coefficient. We repeated this bootstrap process 1000 times and used the resulting 1000 coefficient estimates to estimate a standard error for each model coefficient. With this estimated standard error of each inferred model coefficient $\hat{\beta}\in \{\hat{{\mathit{\beta}}^{\mathtt{\text{motif}}}},\hat{{\mathit{\beta}}^{\mathtt{\text{AT}}}},\hat{{\mathit{\beta}}^{\mathtt{\text{GC}}}}\}$, we test whether $\hat{\beta}=0$ by calculating the test statistic
and comparing $T(\hat{\beta})$ to a $N(0,1)$ distribution to obtain each pvalue. We consider the significance of each model coefficient using a Bonferronicorrected threshold. To establish the threshold, we corrected for the total number of model coefficients being evaluated in the given model.
Evaluating model coefficient variation in the context of SNPs
Request a detailed protocolWith the motif and basecountbeyond model, we are interested in quantifying variation in model coefficients in the context of genetic variations within the gene encoding the Artemis protein that were previously identified as being associated with increasing the extent of trimming (Russell et al., 2022b). Recall that we trained this model using the model covariate function given by (Equation 2). During model fitting, we estimated the model coefficients $\hat{{\mathit{\beta}}^{\mathtt{\text{motif}}}}$, $\hat{{\mathit{\beta}}^{\mathtt{\text{AT}}}}$, and $\hat{{\mathit{\beta}}^{\mathtt{\text{GC}}}}$ by maximizing the weighted likelihood function given by (Equation 9).
We have previously identified a set $X$ of SNPs within the gene encoding the Artemis protein that are significantly associated with increasing the extent of trimming (Russell et al., 2022b). For each SNP $x\in X$ and individual $i\in \{1,\dots ,I\}$, we measure the number of minor alleles in the genotype, $g}_{ix}\in \{0,1,2\$. We are interested in whether each of the inferred model coefficients $\hat{\beta}\in \{\hat{{\mathit{\beta}}^{\mathtt{\text{motif}}}},\hat{{\mathit{\beta}}^{\mathtt{\text{AT}}}},\hat{{\mathit{\beta}}^{\mathtt{\text{GC}}}}\}$ vary in the context of genotype for each genetic variant $x\in X$. As such, for each SNP of interest, we can adapt the 1×2 motif + twoside basecount beyond model covariate function to allow for genotypespecific variation of each model coefficient by incorporating additional interaction coefficients $\beta}_{x}\in \{{\mathit{\beta}}_{x}^{\mathtt{\text{motif}}},{\mathit{\beta}}_{x}^{\mathtt{\text{AT}}},{\mathit{\beta}}_{x}^{\mathtt{\text{GC}}}\$ to model the relationship between each model parameter and the SNP $x$ genotype. We can then estimate the coefficients of this new model, $\hat{{\mathit{\beta}}^{\mathtt{\text{motif}}}}$, $\hat{{\mathit{\beta}}^{\mathtt{\text{AT}}}}$, $\hat{{\mathit{\beta}}^{\mathtt{\text{GC}}}}$, $\hat{{\mathit{\beta}}_{x}^{\mathtt{\text{motif}}}}$, $\hat{{\mathit{\beta}}_{x}^{\mathtt{\text{AT}}}}$, and $\hat{{\mathit{\beta}}_{x}^{\mathtt{\text{GC}}}}$, as before by maximizing the weighted likelihood given by (Equation 9) using the adapted model covariate function. We can measure the significance of each of the model coefficients using the methods described in the previous section. Ultimately if a SNPcoefficient interaction term $\hat{\beta}}_{x}\in \{\hat{{\mathit{\beta}}_{x}^{\mathtt{\text{motif}}}},\hat{{\mathit{\beta}}_{x}^{\mathtt{\text{AT}}}},\hat{{\mathit{\beta}}_{x}^{\mathtt{\text{GC}}}}\$ is significant, we can conclude that the corresponding model coefficient $\hat{\beta}$ varies significantly in the context of the genotype of SNP $x\in X$. We use this same procedure to evaluate whether each model coefficient varies in the context of each SNP of interest.
Appendix 1
Extended notation
Appendix 2
Extended parameter description
Defining the ‘trimming motif’ and positionweightmatrix weight for a given gene and trimming site
Existing probabilistic models of nucleotide trimming using repertoire sequencing data have shown that the local nucleotide context around the trimming site, which we refer to as the ‘trimming motif,’ do a surprisingly good job of predicting the distribution of trimming lengths for a variety of genes (Murugan et al., 2012). This simple PWM model uses a trimming motif containing 2 nucleotides 5’ of the trimming site and 4 nucleotides 3’ of the trimming site to predict the probability of trimming at that site. In practice, we can define the trimming motif to be any size. Let $a$ and $b$ be nonnegative integer values that represent the number of nucleotides 5’ and 3’ of the trimming site, respectively. Together, these $a+b$ nucleotides will compose the trimming motif. For a geneallelegroup sequence $\sigma $ and a number of deleted nucleotides $n$, let $\sigma \left(n+j\right)$ represent the nucleotide identity at the trimming motif position $j\in \{a,\mathrm{\dots},b1\}$ where positions $j<0$ represent motif positions 5’ of the trimming site and positions $j\ge 0$ represent motif positions 3’ of the trimming site. As such, the trimming motif sequence is given by
Depending on $n$, this trimming motif may or may not include Pnucleotides. For example, for $n\ge b$, the $b$ 3’ trimming motif nucleotides will include the $b$ deleted gene sequence nucleotides 3’ of the trimming site (and no Pnucleotides) (Appendix 2—figure 1A). Since we are assuming that the initial hairpin nick occurs at the +2 position, there will be two Pnucleotides present in the 5’to3’ gene sequence. For $b2\le n<b$, where the 2 represents the total Pnucleotide count in the full sequence, Pnucleotides will be included in the trimming motif sequence. Specifically, the $b$ total 3’ trimming motif nucleotides will include $bn$ Pnucleotides and $n$ deleted gene sequence nucleotides (Appendix 2—figure 1B, C). Likewise, as a result of the +2 hairpin nick position assumption, TCRs that have $n<b2$ will not have a full, $\left(a+b\right)$length nucleotide trimming motif (Appendix 2—figure 1C). For these ‘offtheend’ motif cases, we assign zero influence to the missing nucleotides during model fitting.
With this trimming motif, let ${\beta}_{js}^{\text{mtif}}$ be a (log) positionweightmatrix coefficient for trimming motif position $j\in \{a,\mathrm{\dots},b1\}$ and nucleotide $s\in \{A,T,C,G\}$. We can define an unnormalized positionweightmatrix weight
that will serve as a motifspecific model covariate function in subsequent modeling. As described above, since we are considering ‘offtheend’ motif cases, $\sigma \left(n+j\right)$ represent the nucleotide identity at sequence position $j$ where positions $j<0$ represent sequence positions 5’ of the trimming site and positions $j\ge 0$ represent sequence positions 3’ of the trimming site.
AT and GC basecountbeyond the trimming motif
For an arbitrary sequence $x$, we can count the number of AT and GC nucleotides within the sequence as
and
respectively.
Because the count of AT or GC nucleotides within the sequences 5’ and 3’ of the trimming site may influence the probability of trimming differently, we will calculate the counts separately. We will not include nucleotides that were already included in the motif parameterization. As above, for a geneallelegroup sequence $\sigma $ and a number of deleted nucleotides $n$, let $\sigma \left(n+j\right)$ represent the nucleotide identity at sequence position $j$ where positions $j<0$ represent sequence positions 5’ of the trimming site and positions $j\ge 0$ represent sequence positions 3’ of the trimming site. Let $c$ be a nonnegative integer value that represents the number of nucleotides 5’ of the trimming site that will be included in the 5’nucleotide counts (Appendix 2—figure 2). Recall that $a$ is a nonnegative integer value that represents the number of nucleotides 5’ of the trimming site that are included in the ‘trimming motif’ described in the previous section. As such, the nucleotide sequence 5’ of the trimming site, beyond the ‘trimming motif,’ is given by
Within this sequence ${\text{seq}}_{5}(n,\sigma ,a,c)$, we can count the number of AT and GC nucleotides as
and
respectively
To count the number of AT and GC nucleotides in the sequence 3’ of the trimming site, we will include all nucleotides located 3’ of the trimming site that are beyond the ‘trimming motif.’ However, because we are interested in using GC nucleotide content in both directions of the wider sequence as a proxy for the capacity for sequencebreathing and since sequencebreathing is only relevant for nucleotides that are paired, we will not include the nucleotides within the 3’ singlestrandedoverhang when counting 3’ AT and GC nucleotides (Appendix 2—figure 2). Since we are assuming that the initial hairpin nick occurs at the +2 position leading to a 4nucleotidelong 3’ singlestrandedoverhang, for $n>2$, the nucleotide sequence 3’ of the trimming site, beyond the ‘trimming motif,’ is given by
where $b$ is a nonnegative integer value that represents the number of nucleotides 3’ of the trimming site that are included in the ‘trimming motif‘ described in the previous section. For $(n3)<b$, all nucleotides 3’ of the trimming site are considered singlestranded and, thus, no nucleotides will be included in the sequence used to calculate the AT and GC basecounts (Appendix 2—figure 2C). Within this sequence ${\text{seq}}_{3}(n,\sigma ,b)$, we can count the number of AT and GC nucleotides as
and
respectively. As defined, these GC and AT basecounts for the 3’ sequence are dependent on sequence length and provide a parameterization of both GC nucleotide content in both directions of the wider sequence and length.
With these 5’ and 3’ base counts, we can define ${\beta}_{5}^{\text{AT}}$, ${\beta}_{3}^{\text{AT}}$, ${\beta}_{5}^{\text{GC}}$, and ${\beta}_{3}^{\text{GC}}$ to be basecountbeyond model coefficients for 5’ and 3’ sequence basecounts of AT and GC beyond the ‘trimming motif,’ respectively. With these coefficients, we can define a basecountbeyond covariate function for each trimming site $n$ and gene $\sigma $:
DNAshape around the trimming site
Methods have been previously developed to estimate DNAshape features at a singlenucleotide position using the sequence context of 2 neighboring nucleotides on both sides of the nucleotide of interest (Zhou et al., 2013; Chiu et al., 2016). As such, these methods use a slidingpentamer model, centered at each nucleotide of interest, to derive the structural features of nucleotides within a sequence window of any length. These structural features include estimations of electrostatic potential (E), minor groove width (W), and propeller twist (P) for each nucleotide in the sequence window and estimations of roll (R) and helical twist (H) for each dinucleotide pair in the sequence window. For simplicity, we will use the term ‘DNAshape parameters’ to refer to all five of these structural features.
For our purposes, we can define a ‘trimming sequence window’ of size $a+b$, as introduced in the ‘trimming motif’ section with (Equation 13), where $a$ and $b$ are nonnegative integer values that represent the number of nucleotides 5’ and 3’ of the trimming site, respectively. In order to estimate the DNAshape for all nucleotides within this window, we will expand the ‘trimming sequence window’ by 2 nucleotides on both sides such that there are $a+2$ nucleotides 5’ and $b+2$ nucleotides 3’ of the trimming site included in an ‘expanded trimming sequence window.’ For a geneallelegroup sequence $\sigma $ and a number of deleted nucleotides $n$, let $\sigma \left(n+j\right)$ represent the nucleotide identity at the ‘expanded trimming sequence window’ position $j\in \{\left(a+2\right),\mathrm{\dots},\left(b+2\right)1\}$ where positions $j<0$ represent expanded trimming sequence window positions 5’ of the trimming site and positions $j\ge 0$ represent expanded trimming sequence window positions 3’ of the trimming site. As such, the expanded trimming sequence window is given by
Depending on $n$, this expanded trimming sequence window may or may not include Pnucleotides. For example, for $n\ge \left(b+2\right)$, the $\left(b+2\right)$ 3’ expanded trimming sequence window nucleotides will include the $\left(b+2\right)$ deleted gene sequence nucleotides 3’ of the trimming site (and no Pnucleotides) (Appendix 2—figure 3A). For $b\le n<b+2$, the $\left(b+2\right)$ 3’ expanded trimming sequence window nucleotides will include $\left(b+2\right)n$ Pnucleotides and $n$ deleted gene sequence nucleotides (Appendix 2—figure 3B). Since we are assuming that the initial hairpin nick occurs at the +2 position, TCRs that have $n<b$ will not have a full, $\left(a+b+4\right)$length nucleotide expanded trimming sequence window (Appendix 2—figure 3C). The slidingpentamer model (Zhou et al., 2013; Chiu et al., 2016) requires a full pentamer for estimating the DNAshape of each base of interest, and, thus, for these ‘offtheend’ expanded trimming sequence window cases, we cannot estimate DNAshape parameters for all nucleotides within the trimming sequence window. As such, when estimating DNAshape parameters, we must choose $b$ such that $b\le n$ for all trimming lengths $n$ in the data set.
For each nucleotide position $j\in \{a,\mathrm{\dots},b1\}$ within the expanded trimming sequence window ${\text{seq}}_{\text{expd}}(n,\sigma ,a,b)$, we can estimate the nucleotide electrostatic potential,${\text{shape}}^{\text{E}}(j,{\text{seq}}_{\text{expd}}(n,\sigma ,a,b))$, minor groove width, ${\text{shape}}^{\text{W}}(j,{\text{seq}}_{\text{expd}}(n,\sigma ,a,b))$, and propeller twist, ${\text{shape}}^{\text{P}}(j,{\text{seq}}_{\text{expd}}(n,\sigma ,a,b))$. We then standardize the estimated values for each shape type. We can define ${\beta}_{uj}^{\text{Shape}}$ to be a nucleotide shape model coefficient for nucleotide shape type $u\in \{\text{E},\text{W},\text{P}\}$ and trimming sequence window nucleotide position $j\in \{a,\mathrm{\dots},b1\}$. Let $d\in \{a+1,\mathrm{\dots},b1\}$ be the location of each dinucleotide in the trimming sequence window such that $d=0$ represents the location of the trimming site, $d<0$ represents dinucleotide positions 5’ of the trimming site, and $d>0$ represents dinucleotide positions 3’ of the trimming site. For each dinucleotide $d\in \{a+1,\mathrm{\dots},b1\}$ within the expanded trimming sequence window ${\text{seq}}_{\text{expd}}(n,\sigma ,a,b)$, we can estimate the dinucleotide roll, ${\text{shape}}^{\text{R}}(d,{\text{seq}}_{\text{expd}}(n,\sigma ,a,b))$ and helical twist, ${\text{shape}}^{\text{H}}(d,{\text{seq}}_{\text{expd}}(n,\sigma ,a,b))$. As above, we then standardize the estimated values for each dinucleotide shape type. We can define ${\beta}_{vd}^{\text{Shape}}$ to be a dinucleotide shape model coefficient for dinucleotide shape type $v\in \{\text{R},\text{H}\}$ and trimming sequence window dinucleotide position $d\in \{a+1,\mathrm{\dots},b1\}$. We use the R package DNAshapeR (Chiu et al., 2016) to estimate these DNAshape parameters for each trimming sequence window. With these standardized DNAshape estimates, we can define a DNAshape covariate function for each trimming site $n$ and gene $\sigma $
Length
We can think of the trimming amount $n$ as a measure of the sequenceindependent length from the end of the gene for each gene and trimming site, and define ${\beta}^{\text{ldiSt}}$ to be a length model coefficient. As such, we can define a length covariate function for each trimming site $n$
Appendix 3
Extended model validation methods
Calculating the expected persequence conditional log loss across the full Vgene training data set
With the full Vgene training set, we can train each model of interest as described above in (Equation 10) to obtain a trained model $\mathcal{M}$. After this model fitting, we can calculate the expected persequence conditional log loss of the model, $\mathcal{M}$, for the full Vgene training data set, $\mathbf{\text{V}}$, using the procedure described above in (Equation 11). Here, we use the full Vgene data set as both the training data set and the testing data set. Models that have lower expected persequence conditional log loss on the Vgene training data set will indicate that the model has a better fit. Model evaluation using heldout testing sets, as described below, is required for evaluating model generalizability.
Calculating the expected persequence conditional log loss across heldout samples
Because our goal is to learn a model that is geneagnostic, we will evaluate the performance and generalizability of each model by calculating the expected persequence conditional log loss using many different heldout data sets. A model that is generalizable across many genes will perform well and have a good fit across all heldout samples despite their varying gene compositions. To test this, we will create each random, heldout sample from the original training data set by clustersampling all observations from Vgene allele groups, $\sigma}_{\mathrm{V}$, uniformly at random. We will refer to each random, heldout sample as the ‘heldout testing set.’ Let $G$ be the total number of unique Vgene allele groups in the original data set. Let ${G}_{\text{test}}=\text{Round}\left(0.3\cdot G\right)$ be an integer which represents the number of unique genes included in each ‘heldout testing set.’ As such, we can sample each gene $\sigma}_{\mathrm{V}$ with probability
such that the probability of each ‘heldout testing set’ H is given by
The remaining genes not sampled as part of the ‘heldout testing set’ $\mathbf{\text{H}}$ will compose the ‘training set’ $\mathbf{\text{T}}$. Using this ‘training set,’ we can train each model of interest as described above in (Equation 10). After this model training, we can calculate the expected persequence conditional log loss of the model, $\mathcal{M}$, for the ‘heldout testing set,’ $\mathbf{\text{H}}$, as described above in (Equation 11). To achieve an unbiased estimate of the model performance, we will repeat the above procedure across 20 unique heldout testing sets and calculate the expected persequence conditional log loss across all samples. As such, the expected persequence conditional log loss across these random samples is given by
We use the same, unique heldout testing sets to calculate the expected persequence conditional log loss of each model of interest, and thus, we can compare model fit and generalizability by directly comparing the expected persequence conditional log loss of each model. Models that have lower expected persequence conditional log loss will indicate that the model is a better fit and is more generalizable across genes.
Calculating the expected persequence conditional log loss across heldout samples of the ‘mostdifferent’ Vgenes
While the previously described procedure for evaluating the expected persequence conditional log loss across heldout samples of the Vgene data set provided a metric for evaluating model generalizability across different gene sets, we were interested in evaluating model performance for groups of genes which were considered ‘mostdifferent’ sequencewise. Many of the germline Vgene sequences are quite similar, however, there are subgroups of these sequences which share unique sequence traits. We can characterize these ‘mostdifferent’ Vgenes by either using only the ‘terminal’ Vgene sequences (e.g. that last 24 nucleotides of each sequence which is directly parameterized in the models) or using the entire Vgene sequences.
To define the ‘mostdifferent’ Vgene allele group using the ‘terminal‘ Vgene sequences, we first calculate the pairwise hamming distance between each geneallelegroup pair. We then use hierarchical clustering to cluster Vgene allele groups based on their pairwise hamming distances (Appendix 3—figure 1A). The cluster that has the smallest average pairwise hamming distance within the cluster and the largest average pairwise hamming distance outside of the cluster is defined to be the ‘mostdifferent’ Vgeneallelegroup cluster. To define the ‘mostdifferent’ Vgene allele group using the entire Vgene sequences, we first align all gene sequences using the DECIPHER package in R. Using these aligned sequences, we can then proceed with the same procedure as described for the ‘terminal’ Vgene sequences to define the ‘mostdifferent’ Vgene allele group (Appendix 3—figure 1B).
Once we have defined a cluster of the ‘mostdifferent’ Vgene allele groups, using either the ‘terminal’ Vgene sequences or the full sequences, we can define a heldout testing data set $\mathbf{\text{H}}$ containing all data observations from the Vgene allele groups within this ‘mostdifferent’ Vgeneallelegroup cluster. All data observations from the remaining gene allele groups that were not defined to be part of the ‘mostdifferent’ cluster will compose the ‘training set’ $\mathbf{\text{T}}$. Using this ‘training set,’ we can train each model of interest as described above in (Equation 10). After this model training, we can calculate the expected persequence conditional log loss of the model, $\mathcal{M}$, for the ‘heldout testing set,’ $\mathbf{\text{H}}$, as described above in (Equation 11). Models that have lower expected persequence conditional log loss will indicate better fit and generalizability across even the ‘mostdifferent’ genes. We can repeat this process for other Vgeneallelegroup clusters (e.g. the ‘secondmostdifferent’ Vgeneallelegroup cluster) as desired.
Calculating the expected persequence conditional log loss across the full Jgene data set
With the full Vgene training set, we can train each model of interest as described above in (Equation 10) to obtain a trained model $\mathcal{M}$. After this model fitting, we can calculate the expected persequence conditional log loss of the model, $\mathcal{M}$, for the full Jgene training data set, $\mathbf{\text{J}}$, using the procedure described above in (Equation 11). Here, we use the full Vgene data set as the training data set and the full Jgene data set as the testing data set. Models that have lower expected persequence conditional log loss on the Jgene data set will indicate that the model is a better fit and is more generalizable.
Evaluating TCRβ Vgene trimming models using the expected persequence conditional log loss across testing data sets
To validate the performance of each model, we worked with TCRα and TCRβimmunosequencing data representing 150 individuals, TCRγimmunosequencing data representing 23 individuals, and IGHimmunosequencing data representing 9 individuals from three independent validation cohorts (described above). With these data, we used the model coefficients from the previous TCRβ Vgene training run (‘frozen’ in git commit 093610a on our repository) and then compute the expected persequence conditional log loss of the model using each independent validation data set of interest. Models that have low expected persequence conditional log loss across all testing data sets will indicate that the model is more generalizable and less overfit to the training data. We validated each model using V and Jgene sequences separately.
Appendix 4
Extended experimental analyses
Exploring the gene specificity of the ‘trimming motif’
To evaluate the specificity of the motif coefficients across different genes, we can compare the pergene model predictions for the motif and basecount beyond model to a model that only contains basecount beyond parameters. To do this, we first use the entire Vgene data set $\mathbf{\text{V}}$ to train both the motif and basecount beyond model as before in (Equation 10) and a model that contains only basecount beyond parameters (and no motif parameters). We can then use these models to predict the probability of trimming each possible trimming amount, $2\le n\le 14$, for each geneallelegroup sequence $\sigma $. For each of these models, we can then calculate the pergene root mean squared error, $\mathrm{RMSE}$, for each gene $\sigma $ such that
where $\mathcal{M}$ is a model trained using the Vgene training data set $\mathbf{\text{V}}$, $I$ is the set of all individuals in the data set, $\leftI\right$ is the length of the set of individuals $I$, ${P}_{\text{emp}}\left(n\mid \sigma ,i\right)$ is the empirical conditional PDF given by (Equation 1) for trimming length $n$, gene $\sigma $, and individual $i\in I$, and $P\left(n\mid \sigma ;\mathcal{M}\right)$ is the predicted trimming probability from a specified model $\mathcal{M}$. We can then compare this pergene root mean squared error for the model trained using both motif and basecount beyond parameters with a model trained using just basecount beyond parameters.
Sensitivity analysis for hairpin nick position
For our modeling, we assume that the initial hairpin nick occurs at the +2 position and will create two Pnucleotides at the end of the 5’to3’ gene sequence. Assuming a different hairpin nick position would incorporate a different number of Pnucleotides at the end of the gene sequence (Appendix 4—figure 1). While the hairpins are assumed to be nicked at the +2 position most frequently (Ma et al., 2002; Lu et al., 2007), we wanted to test the sensitivity of our models to this hairpin nick position assumption. To do this, we assumed each of the other possible hairpin opening positions (e.g. −2,–1, 0,+1,+3) oneatatime and appended the appropriate number of associated of Pnucleotides given the assumed hairpin nick position to the 3’end of each Vgeneallelegroup sequence in the data set. With each of these hairpin position data sets, we retrained the motif and basecount beyond model as before in (Equation 10) and calculate the expected persequence conditional log loss of the model using (Equation 11). We can compare these expected persequence conditional log losses to evaluate the sensitivity of the model to the +2 hairpin nick assumption.
Evaluating the weight of the 1×2 motif and twoside basecount beyond model terms across data sets
For each testing data set, we can measure the weight of the 1×2 motif and twoside basecount beyond model terms within the full 1×2 motif + twoside basecount beyond model. Recall that we trained the full 1×2 motif + twoside basecount beyond model using the model covariate function given by (Equation 2)
where $n$ represents the number of trimmed nucleotides, $\sigma}_{\mathrm{V}$ represents the Vgeneallelegroup sequence, ${\mathit{\beta}}^{\text{mtif}}$ represents motifspecific parameter coefficients, ${\mathit{\beta}}^{\text{AT}}$ and ${\mathit{\beta}}^{\text{GC}}$ represent basecountbeyondspecific parameter coefficients, $a$ and $b$ are nonnegative integer values that represent the number of nucleotides 5’ and 3’ of the trimming site, respectively, that are included in the ‘trimming motif, and $c$ represents the number of nucleotides 5’ of the trimming site to be included in the basecount. As such, for each training data set, we can use the inferred coefficients, $\hat{{\mathit{\beta}}^{\mathtt{\text{motif}}}}$, $\hat{{\mathit{\beta}}^{\mathtt{\text{AT}}}}$, and $\hat{{\mathit{\beta}}^{\mathtt{\text{GC}}}}$, from a previous training run and define a new twoparameter model containing a scale coefficient for the 1×2 motif terms and a second scale coefficient for the twoside basecount beyond terms. The covariate function for this new model is given by
where ${\alpha}_{\text{mtif}}$ is the scale coefficient for the 1×2 motif terms and ${\alpha}_{\text{Cunt}}$ is the scale coefficient for the twoside basecount beyond terms. We can then train this new model as described previously for each data set of interest and compare the inferred scale coefficients.
Data availability
The current manuscript is a computational study, so no data have been generated for this manuscript. Code is available on GitHub (copy archived at Russell et al., 2022a). Numerical data used to generate figures is available as source data for Figures 3, 4, 5, 6, and 7.

ImmuneACCESSImmunosequencing identifies signatures of cytomegalovirus exposure history and HLA mediated effects on the T cell repertoire.https://doi.org/10.21417/B7001Z

NCBI BioProjectID PRJNA762269. Combining genotypes and T cell receptor distributions to infer genetic loci determining V(D)J recombination probabilities.

ImmuneACCESSID TCRBTCRGcomparison. Normal Human PBMC Deep Sequencing TCRB versus TCRG comparison.

NCBI BioProjectID PRJNA406949. Commonality despite exceptional diversity in the baseline human antibody repertoire.

FigshareFunctional antibodies exhibit light chain coherence.https://doi.org/10.25452/figshare.plus.20338177

NCBI dbGaPID phs001918. Recipient and donor genetic variants associated with mortality after allogeneic hematopoietic cell transplantation.
References

Unifying the DNA endprocessing roles of the ARTEMIS nucleaseJournal of Biological Chemistry 290:24036–24050.https://doi.org/10.1074/jbc.M115.680900

StructureSpecific nuclease activities of ARTEMIS and the ARTEMIS: DNAPKcs complexNucleic Acids Research 44:4991–4997.https://doi.org/10.1093/nar/gkw456

Different DNA end configurations dictate which NHEJ components are most important for joining efficiencyThe Journal of Biological Chemistry 291:24377–24389.https://doi.org/10.1074/jbc.M116.752329

A model of somatic hypermutation targeting in mice based on highthroughput Ig sequencing dataJournal of Immunology 197:3566–3574.https://doi.org/10.4049/jimmunol.1502263

Nucleases of the metallobetalactamase family and their role in DNA and RNA metabolismCritical Reviews in Biochemistry and Molecular Biology 42:67–93.https://doi.org/10.1080/10409230701279118

BaseSpecific sequences that bias somatic hypermutation deduced by analysis of outofframe human igvh genesJournal of Immunology 160:2360–2364.

Inferring processes underlying Bcell repertoire diversityPhilosophical Transactions of the Royal Society of London. Series B, Biological Sciences 370:20140243.https://doi.org/10.1098/rstb.2014.0243

Survival analysis of DNA mutation motifs with penalized proportional hazardsThe Annals of Applied Statistics 13:1268–1294.https://doi.org/10.1214/18aoas1233

The RAG proteins and V (D) J recombination: complexes, ends, and transpositionAnnual Review of Immunology 18:495–527.https://doi.org/10.1146/annurev.immunol.18.1.495

Mechanistic constraints on diversity in human V (D) J recombinationMolecular and Cellular Biology 16:258–269.https://doi.org/10.1128/MCB.16.1.258

Dna doublestrand breaks and hairpins in V (D) J recombinationSeminars in Immunology 6:125–130.https://doi.org/10.1006/smim.1994.1018

Effects of DNA end configuration on XRCC4DNA ligase IV and its stimulation of Artemis activityThe Journal of Biological Chemistry 292:13914–13924.https://doi.org/10.1074/jbc.M117.798850

Evidence that the DNA endonuclease ARTEMIS also has intrinsic 5′exonuclease activityJournal of Biological Chemistry 289:7825–7834.https://doi.org/10.1074/jbc.M113.544874

HighThroughput immune repertoire analysis with igorNature Communications 9:561.https://doi.org/10.1038/s4146701802832w

Influence of codingend sequence on codingend processing in V (D) J recombinationJournal of Immunology 155:4322–4329.

Nucleotide deletion and P addition in V (D) J recombination: a determinant role of the codingend sequenceMolecular and Cellular Biology 17:3768–3778.https://doi.org/10.1128/MCB.17.7.3768

The chemical biology of human metalloβlactamase fold proteinsTrends in Biochemical Sciences 41:338–355.https://doi.org/10.1016/j.tibs.2015.12.007

Overlap and effective size of the human CD8+ T cell receptor repertoireScience Translational Medicine 2:47.https://doi.org/10.1126/scitranslmed.3001442

Somatic hypermutagenesis in immunoglobulin genes. II. Influence of neighbouring base sequences on mutagenesisBiochimica et Biophysica Acta 1171:11–18.https://doi.org/10.1016/01674781(92)90134l

SoftwareMechanistictrimming, version swh:1:rev:2ba723b3bd4a354fe78f677230b8a6dfb506422dSoftware Heritage.

V (D) J recombination: mechanisms of initiationAnnual Review of Genetics 45:167–202.https://doi.org/10.1146/annurevgenet110410132552

Population variability in the generation and selection of Tcell repertoiresPLOS Computational Biology 16:e1008394.https://doi.org/10.1371/journal.pcbi.1008394

Learning the heterogeneous hypermutation landscape of immunoglobulins from highthroughput repertoire dataNucleic Acids Research 48:10702–10712.https://doi.org/10.1093/nar/gkaa825

Conformational variants of duplex DNA correlated with cytosinerich chromosomal fragile sitesThe Journal of Biological Chemistry 284:7157–7164.https://doi.org/10.1074/jbc.M806866200

IgBLAST: an immunoglobulin variable domain sequence analysis toolNucleic Acids Research 41:W34–W40.https://doi.org/10.1093/nar/gkt382

Structural and mechanistic insights into the ARTEMIS endonuclease and strategies for its inhibitionNucleic Acids Research 49:9310–9326.https://doi.org/10.1093/nar/gkab693

The molecular basis and disease relevance of nonhomologous DNA end joiningNature Reviews. Molecular Cell Biology 21:765–781.https://doi.org/10.1038/s41580020002978
Article and author information
Author details
Funding
National Institutes of Health (R01 AI146028)
 Magdalena L Russell
 Philip Bradley
 Noah Simon
 Frederick A Matsen IV
National Institutes of Health (R01 AI136514)
 Philip Bradley
National Institutes of Health (R35 GM141457)
 Philip Bradley
Howard Hughes Medical Institute (Investigator)
 Frederick A Matsen IV
The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.
Acknowledgements
The authors thank David Schatz and Thayer Fisher for helpful discussions regarding this paper, as well as Duncan Ralph for processing the productive IGH sequence data from Jaffe et al., 2022, and Nathaniel Spisak, Thierry Mora, and Aleksandra Walczak for sharing preprocessed data from Spisak et al., 2020. The authors would also like to thank Fred Hutch scientific computing, supported by the National Institutes of Health award S10OD028685. This work was supported by the National Institutes of Health under awards R01 AI146028, R01 AI136514, and R35 GM141457. Dr. Matsen is an Investigator of the Howard Hughes Medical Institute (HHMI). This article is subject to HHMI’s Open Access to Publications policy. HHMI lab heads have previously granted a nonexclusive CC BY 4.0 license to the public and a sublicensable license to HHMI in their research articles. Pursuant to those licenses, the authoraccepted manuscript of this article can be made freely available under a CC BY 4.0 license immediately upon publication.
Copyright
© 2023, Russell et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.
Metrics

 535
 views

 65
 downloads

 0
 citations
Views, downloads and citations are aggregated across all versions of this paper published by eLife.