General notation | |
| Set of all individuals |
| Index for an individual in the set of all individuals |
| Total number of TCRs in the repertoire of individual |
| Index of a sequence in the TCR repertoire of individual |
| Random variable that represents the gene sequence |
| General notation for a gene-allele-group sequence oriented 5’-to-3’ |
| V-gene-allele-group sequence (‘top’ strand oriented 5’-to-3’) |
| J-gene-allele-group sequence (‘bottom’ strand oriented 5’-to-3’) |
| Random variable that represents the number of deleted nucleotides |
| Number of deleted nucleotides from the 3’-side of a gene sequence |
| Lower bound of ‘reasonable’ trimming amounts, we have defined |
| Upper bound of ‘reasonable’ trimming amounts, we have defined |
| The number of TCRs that use gene allele group in the sampled repertoire of individual |
| The number of TCRs that have gene allele group and nucleotides deleted in the sampled repertoire of individual |
| Set of all ‘reasonable’ trimming amounts; |
| Empirical conditional probability density function (Equation 1) |
Motif parameter-specific notation | |
| Non-negative integer value that represents the number of nucleotides 5’ of the trimming site to be included in the ‘trimming motif’ |
| Non-negative integer value that represents the number of nucleotides 3’ of the trimming site to be included in the ‘trimming motif’ |
| ‘Trimming motif’ sequence (Equation 13) |
| (Log) position weight matrix coefficient for trimming motif position and nucleotide |
| Set of all motif coefficients for all positions and nucleotides |
| Motif-specific covariate function (Equation 14) |
Base-count-beyond parameter-specific notation | |
| Non-negative integer value that represents the number of nucleotides 5’ of the trimming site to be included in the 5’ base-count-beyond the ‘trimming motif’ |
| Count of nucleotides that are A or T in an arbitrary sequence |
| Count of nucleotides that are G or C in an arbitrary sequence |
| The nucleotide sequence 5’ of the trimming site, beyond the ‘trimming motif’ (Equation 15) |
| The nucleotide sequence 3’ of the trimming site, beyond the ‘trimming motif’ (Equation 16) |
and | Base-count-beyond model coefficients for the 5’ and 3’ sequence base-counts of A and T nucleotides beyond the trimming motif |
| Set of AT-base-count-beyond model coefficients (includes and ) |
and | Base-count-beyond model coefficients for the 5’ and 3’ sequence base-counts of G and C nucleotides beyond the trimming motif |
| Set of GC-base-count-beyond model coefficients (includes and ) |
| Base-count-beyond-specific covariate function (Equation 17) |
DNA-shape parameter-specific notation | |
| ‘Expanded trimming sequence window’ (Equation 18); consists of the ‘trimming motif’ sequence extended by 2 nucleotides in both the 5’ and 3’ direction |
E | Nucleotide electrostatic potential |
W | Nucleotide minor groove width |
P | Nucleotide propeller twist |
R | Di-nucleotide roll |
H | Di-nucleotide helical twist |
| Measure of nucleotide shape for the nucleotide at position within the ‘expanded trimming sequence window’ |
| Measure of di-nucleotide shape for the di-nucleotide at position within the ‘expanded trimming sequence window’ |
| DNA-shape coefficients for nucleotide shape type and ‘expanded trimming sequence window’ nucleotide position |
| DNA-shape coefficients for di-nucleotide shape type and ‘expanded trimming sequence window’ di-nucleotide position |
| Set of all nucleotide and di-nucleotide DNA-shape coefficients |
| DNA-shape-specific covariate function (Equation 19) |
Length parameter-specific notation | |
| Length-specific model coefficient |
| Length-specific covariate function |
Modeling notation | |
| Example model covariate function including motif and base-count-beyond model parameters (Equation 2) |
| Conditional logit model formulation using the motif and base-count-beyond model covariate function (Equation 3) |
| Aggregated log-likelihood for the conditional logit model; this likelihood function (Equation 4) is unweighted and treats every observation uniformly |
| Sampling procedure for the construction of the expected likelihood |
| Expected log-likelihood for the conditional logit model; this likelihood function (Equation 5) weights each observation by its sampling probability, |
| Expected log-likelihood for the conditional logit model; this likelihood function (Equation 7) weights each observation by its sampling probability from the empirical joint PDF (Equation 6) |
| Empirical average per-gene-allele-group frequency used in formulating a subject-independent gene sampling procedure (Equation 8) |
| Expected log-likelihood for the conditional logit model; this likelihood function (Equation 9) weights each observation using a subject-independent gene sampling procedure (Equation 8) |
Model evaluation notation | |
| An arbitrary model trained on a specified training data set |
| Full V-gene data set |
| Full J-gene data set |
| Arbitrary held-out data set |
| Probability of the arbitrary held-out data set (Equation 21) |
| Expected per-sequence conditional log loss (Equation 11) of a trained model evaluated on a data set |
| Expected per-sequence conditional log loss across 20 random held-out data sets (Equation 22) |
| Per-gene mean squared error (Equation 23) for a gene, computed with a model trained on the V-gene training data set |
Coefficient evaluation notation | |
| Test statistic (Equation 12) for evaluating the significance of a single inferred coefficient |
| Set of SNPs within the gene encoding the Artemis protein that were previously identified to be associated with increasing the extent of trimming (Russell et al., 2022b) |
| Number of minor alleles in the genotype of an individual for SNP |
| Set of interaction coefficients between each model parameter and the SNP genotype |
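
As a rough illustration of how some of the count-based quantities in the general and base-count-beyond notation above could be computed, the following minimal Python sketch tallies A/T and G/C nucleotides in an arbitrary sequence and forms an empirical conditional distribution of trimming amounts given a gene allele group for one individual. The function names, the input layout (a list of `(gene, n_deleted)` pairs), and the example bounds are hypothetical and do not come from the source. The sketch also assumes that the empirical distribution is normalized over the 'reasonable' trimming amounts only; the source's Equation 1 may define the normalization differently.

```python
from collections import Counter


def base_counts(seq):
    """Return (AT count, GC count) for an arbitrary nucleotide sequence."""
    s = seq.upper()
    at = s.count("A") + s.count("T")
    gc = s.count("G") + s.count("C")
    return at, gc


def empirical_conditional_pdf(observations, lower, upper):
    """Empirical distribution of trimming amounts given a gene allele group.

    observations: iterable of (gene, n_deleted) pairs for one individual
                  (hypothetical input layout).
    lower, upper: inclusive bounds of the 'reasonable' trimming amounts
                  (caller-supplied; the source's values are not assumed here).
    Returns {gene: {n: fraction}} for every n in [lower, upper].
    """
    # Count TCRs per (gene, trimming amount), keeping only 'reasonable' amounts.
    pair_counts = Counter(
        (gene, n) for gene, n in observations if lower <= n <= upper
    )
    # Count TCRs per gene over the same restricted set (assumed normalization).
    gene_counts = Counter()
    for (gene, _n), count in pair_counts.items():
        gene_counts[gene] += count
    return {
        gene: {
            n: pair_counts[(gene, n)] / total
            for n in range(lower, upper + 1)
        }
        for gene, total in gene_counts.items()
    }


# Illustrative data only; gene names and bounds are made up.
obs = [("V1", 0), ("V1", 2), ("V1", 2), ("V2", 1), ("V1", 9)]
pdf = empirical_conditional_pdf(obs, lower=0, upper=3)
```

In this sketch, the observation with 9 deleted nucleotides falls outside the illustrative bounds and is dropped before normalization, so each gene's probabilities sum to one over the retained trimming amounts.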