Method for identification of conditionassociated public antigen receptor sequences
Abstract
Diverse repertoires of hypervariable immunoglobulin receptors (TCR and BCR) recognize antigens in the adaptive immune system. The development of immunoglobulin receptor repertoire sequencing methods makes it possible to perform repertoirewide disease association studies of antigen receptor sequences. We developed a statistical framework for associating receptors to disease from only a small cohort of patients, with no need for a control cohort. Our method successfully identifies previously validated Cytomegalovirus and type one diabetes responsive TCR$\beta $ sequences .
https://doi.org/10.7554/eLife.33050.001Introduction
Tcell receptors (TCR) and Bcell receptors (BCR) are hypervariable immunoglobulins that play a key role in recognizing antigens in the vertebrate immune system. TCR and BCR are formed in the stochastic process of V(D)J recombination, creating a diverse sequence repertoire. These receptors consist of two hypervariable chains, the $\alpha $ and $\beta $ chains in the case of TCR. Progress in high throughput sequencing now allows for deep profiling of TCR$\alpha $ and TCR$\beta $ chain repertoires, by establishing a nearcomplete list of unique receptor chain sequences, or ‘clonotypes’, present in a sample. Most sequencing data available correspond to TCR$\beta $ only, but the same principles discussed below apply to TCR$\alpha $ repertoires, or to paired $\alpha \beta $ repertoires.
Comparison of sequenced repertoires has revealed that in any pair of individuals, large numbers of TCR$\beta $ sequences have the same amino acid sequence (Venturi et al., 2011). Several mechanisms leading to the repertoire overlap have been identified so far. The first mechanism is convergent recombination. Due to biases in V(D)J recombination process, the probability of generation of some TCR$\beta $ sequences is very high, making them appear in almost every individual multiple times and repeatedly sampled in repertoire profiling experiments (Britanova et al., 2014). This sharing does not result from a common specificity or function of Tcells corresponding to the shared TCR$\beta $ clonotypes, and may in fact correspond to cells from the naive compartment in both donors (Quigley et al., 2010), or from functionally distinct subsets such as CD4 and CD8 Tcells. The second possible reason for TCR sequence sharing is specific to identical twins, who may share T cell clones as a consequence of cord blood exchange in utero via a shared placenta (Pogorelyy et al., 2017). Note that in that scenario both the $\beta $ and $\alpha $ chains are shared together. The third and most interesting mechanism for sharing the sequence of either the $\beta $ or $\alpha $ or both chains is convergent selection in response to a common antigen. From functional studies, such as sequencing of MHCmultimer specific Tcells, it is known that the antigenspecific repertoire is often biased, and the same antigenspecific TCR $\beta $ or $\alpha $ chain sequences can be found in different individuals (Miles et al., 2011; Dash et al., 2017; Glanville et al., 2017).
Reproducibility of a portion of the antigenspecific Tcell repertoire in different patients creates an opportunity for disease association studies using TCR$\beta $ repertoire datasets (Faham et al., 2017; Emerson et al., 2017). These studies analyse the TCR$\beta $ sequence overlap in large cohorts of healthy controls and patients to identify shared sequences overrepresented in the patient cohort. Here we propose a novel computational method to identify clonotypes which are likely to be shared because of selection for their response to a common antigen, instead of convergent recombination. Our approach is based on a mechanistic model of TCR recombination and is applicable to small cohorts of patients, without the need for a healthy control cohort.
Results
As a proof of concept, we applied our method to two large publicly available TCR$\beta $ datasets from Cytomegalovirus (CMV)positive (Emerson et al., 2017) and type one diabetes (T1D) (Seay et al., 2016) patients. In both studies the authors found shared public TCR$\beta $ clonotypes that are specific to CMVpeptides or selfpeptides, respectively. Specificity of these clonotypes was defined using MHCmultimers. We show that TCR$\beta $ chain sequences functionally associated with CMV and T1D in these studies are identified as outliers by our method.
The main ingredient of our approach is to estimate the probability of generation of shared clonotypes, and to use this probability to determine the source of sharing (see Figure 1). Due to the limited sampling depth of any TCR sequencing experiment, chances to sample the same TCR$\beta $ clonotype twice are low, unless this clonotype is easy to generate convergently, with many independent generation events with the same TCR$\beta $ amino acid sequence in each individual (convergent recombination), or if corresponding Tcell clone underwent clonal expansion, making its concentration in blood high (convergent selection). Thus, we reasoned that convergently selected clonotypes should have a lower generative probability than typical convergently recombined clonotypes. To test this, we estimated the generative probability of the TCR$\beta $’s Complementarity Determining Region 3 (CDR3) aminoacid sequences that were shared between several patients. Since no algorithm exists that can compute this generative probability directly, our method relies on the random generation and translation of massive numbers of TCR nucleotide sequences using a mechanistic statistical model of V(D)J recombination (Murugan et al., 2012), as can be easily performed for example using the IGoR software (Marcou et al., 2017).
In Figure 2A we plot for each clonotype the number of donors sharing that clonotype against its generation probability. Diseasespecific TCR$\beta $ variants validated by functional tests in source studies are circled in red. Note that validated diseasespecific TCR$\beta $ sequences have a much lower generation probability than the typical sequences shared by the same number of donors. We developed a method of axis transformation (see Materials and methods) to compare the model prediction with data values on the same scale (Figure 2B), so that outliers can be easily identified by their distance to identity line. Our method can be used to narrow down the potential candidates for further experimental validation of responsive receptors. Additional information, like the expansion of the identified TCR$\beta $ clonotype in the inflammation site, the presence of the same clonotype in the repertoire of activated or memory Tcells, or absence in a cohort of healthy controls, could provide additional evidence for functional association of identified candidates with a given condition.
Our method also identifies other significant outliers than reported in the source studies (shown in red, and obtained after multipletest correction – see Materials and methods), which may have three possible origins. First, they may be associated with the condition, but were missed by the source studies. Second, they may be due to other factors shared by the patients, such as features involved in thymic or peripheral selection, or reactivity to other common conditions than CMV (e.g. influenza infection). Third, they can be the result of intersample contamination. Our approach is able to diagnose the last explanation by estimating the likelihood of sharing at the level of nucleotide sequences (i.e. synonymously), as detailed in the Materials and methods section.
Discussion
Antigen receptor sequencing currently has little clinical applications. One of the most important ones is diagnostics and tracking of malignant Tcell and Bcell clones in lymphomas, where it allows for directly measuring the abundances of certain clones at different timepoints. Our method allows for a sequencebased theoretical prediction of Tcell abundances at the population level, and for the identification of Tcell clones associated with infectious and autoimmune conditions. Extensive databases of conditionassociated clones can provide a means of disease diagnostics and extend the clinical utility of antigen receptor repertoire sequencing technologies.
This method may also be useful in the analysis of known antigenspecific TCR clonotypes. The typical source of such TCR sequences are MHCmultimer positive cells isolated from one or a few donors (Shugay et al., 2018; Tickotsky et al., 2017). Some of these antigenspecific clonotypes are private, and are hard to find in other patients, providing limited diagnostic value. Our method is able to distinguish these clones from publicly responding clonotypes that are likely to be shared by many patients using only their CDR3 amino acid sequences.
The cohort size necessary for the identification of antigenspecific clonotypes with our method varies (see ‘Designing the experiment’ subsection in Materials and methods). It depends on the strength and diversity of the response to the given antigen. CMV and other Herpesviridae (EBV, HSV), are able to cause a persistent infection, and a large fraction of the TCR repertoire of CMVpositive donors are believed to be specific to them—on average, up to 10% of CD8 +cells are specific to a single CMV epitope in elderly individuals (Khan et al., 2004). However, it was shown that in a human acute infection model of yellow fever vaccination, virusspecific Tcell clones are one of the most abundant in the TCR repertoire and occupy up to 12% of the CD8 +T cell repertoire. This response is shortlived and contracts significantly a month after immunization (Miller et al., 2008). So the peak of an immune response is the best timepoint to search for antigenspecific TCRs in acute infections using this method. Tcell response to herpesviruses is also not unique in terms of public clonotype involvement—in ankylosing spondylitis (Faham et al., 2017), 30–40% of patients share a certain TCR$\beta $ aminoacid sequence, which is more than the fraction of patients sharing CMVspecific clonotypes that we analysed in this study.
Our approach can be used on other hypervariable receptor chains (TCR$\alpha $, BCR heavy and light chains), as well as other species (mice, fish, etc.). Both $\alpha $ and $\beta $ chains contribute to Tcell receptor specificity. Singlecell or paired sequencing technologies (Zemmour et al., 2018) could identify partner receptor chains for conditionassociated TCR $\alpha $ or $\beta $ chain sequences identified with our approach. Antigenic peptides recognized by complete Tcell receptors could then be recovered in vitro using yeastdisplay libraries of peptideMHC (Gee et al., 2018). As paired sequencing becomes more widespread, our method can be extended to the analysis of full paired TCR by applying the exact same analysis using the joint recombination probability of $\alpha \beta $ clonotypes.
Recent advances in computational methods allow us to extract TCR repertoires from existing RNASeq data (Bolotin et al., 2017; Brown et al., 2015). Huge numbers of available RNASeq datasets from patients with various conditions can be used for analysis and identification of novel virus, cancer, and self reactive TCR variants using our method. The more immunoglobulin receptors with known specificity are found using this type of association mapping, the more clinically relevant information can be extracted from immunoglobulin repertoire data.
Materials and methods
Statistical analysis
Problem formulation
Request a detailed protocolOur framework is applicable to analyze the outcome of a next generation sequencing experiment probing the immune receptor repertoires of $n$ individuals with a given condition, for example CMV or Type one diabetes. We denote by ${M}_{i}$ the number of unique amino acid TCR sequences in patient $i$, $i=1,\mathrm{\dots},N$. For a given TCR amino acid sequence $\sigma $, we set ${x}_{i}=1$ to indicate that $\sigma $ is present in patient $i$’s repertoire, and ${x}_{i}=0$ otherwise. For a given shared sequence $\sigma $, we want to know how likely its sharing pattern is under the null hypothesis of convergent recombination, correcting for the donors’ different sampling depths. In other words, is $\sigma $ overrepresented in the population of interest? If $\sigma $ is significantly overrepresented, we also want to quantify the size of this effect.
Overview
Under the null hypothesis, the presence of $\sigma $ in a certain number of donors is explained by independent convergent V(D)J recombination events in each donor. Given the total number of recombination events that led to the sequenced sample of donor $i$, ${N}_{i}$, the presence of given amino acid sequence $\sigma $ in donor is Bernoulli distributed with probability
where ${P}_{\mathrm{post}}(\sigma )$ is the model probability that a recombined product found in a blood sample has sequence $\sigma $ under the null hypothesis. It is formed by the product of ${P}_{\mathrm{gen}}(\sigma )$, the probability to generate the sequence $\sigma $, estimated using a V(D)J recombination model (see the following subsection), and $Q$, a constant correction factor accounting for thymic selection (see Estimation of the correction factor $Q$ subsection). The number of independent recombination events ${N}_{i}$ leading to the observed unique sequences in a sample $i$ is unknown, because of convergent recombination events within the sample, but it can be estimated from the number of unique sequences ${M}_{i}$, using the model distribution ${P}_{\mathrm{post}}$ (see Estimation of ${N}_{i}$ subsection).
We also calculate the posterior distribution of ${P}_{\mathrm{data}}(\sigma )$, corresponding to the empirical counterpart of ${P}_{\mathrm{post}}(\sigma )$ in the cohort, inferred from the sharing pattern of $\sigma $ across donors. We use information about the presence of $\sigma $ in our donors, ${x}_{1},\mathrm{\dots},{x}_{n}$ and the sequencing depth for each donor, ${N}_{1},\mathrm{\dots},{N}_{n}$ (see Estimation of ${P}_{\mathrm{data}}(\sigma )$ subsubsection), yielding the posterior density: $\rho ({P}_{\mathrm{data}}{x}_{1},\mathrm{\dots},{x}_{N})$.
Finally, we estimate the probability, given the observations, that the true value of ${P}_{\mathrm{data}}$ is smaller than the theoretical value ${P}_{\mathrm{post}}$ predicted using V(D)J recombination model, analogous to a pvalue and used to identify significant effects:
To estimate the effect size $q(\sigma )$ we compare ${P}_{\mathrm{data}}$ to ${P}_{\mathrm{post}}$,
Estimation of ${P}_{\mathrm{gen}}$, the probability of generation of a TCR CDR3 amino acid sequence
Request a detailed protocolTo procedure outlined above requires to calculate ${P}_{\mathrm{gen}}(\sigma )$, the probability to generate a given CDR3 amino acid sequence. Methods exist to calculate the probability of TCR and BCR nucleotide sequences from a given recombination model (Murugan et al., 2012; Marcou et al., 2017), but are impractical to calculate the probability of amino acid sequences, because of the large number of codon combinations that can lead to the same amino acid sequence, ${\prod}_{a=1}^{L}{n}_{\mathrm{codons}}(\sigma (a))$, where $L$ is the sequence length, and ${n}_{\mathrm{codons}}(\tau )$ the number of codons coding for amino acid $\tau $. The number is about $1.4\times {10}^{7}$ for a typical CDR3 length of 15 amino acid.
Instead, we estimated ${P}_{\mathrm{gen}}(\sigma )$ using a simple MonteCarlo approach. We randomly generated a massive number (${N}_{\mathrm{sim}}=2\times {10}^{9}$) of recombination scenarios according to the validated recombination model (Murugan et al., 2012):
The resulting sequences were translated, truncated to only keep the CDR3, and counted. ${P}_{\mathrm{gen}}(\sigma )$ was approximated by the fraction of events thus generated that led to sequence $\sigma $. This approximation becomes more accurate as ${N}_{\mathrm{sim}}$ increases, with an error on ${P}_{\mathrm{gen}}(\sigma )$ scaling as ${({P}_{\mathrm{gen}}(\sigma )/{N}_{\mathrm{sim}})}^{1/2}$.
Estimation of the correction factor $Q$
Request a detailed protocolNot all generated sequences pass selection in the thymus. ${P}_{\mathrm{gen}}$ systematically underestimates the frequency of recombination event that eventually make it into the observed repertoire. To correct for this effect, we estimate a correction factor $Q$, as was suggested in (Elhanati et al., 2014):
Contrary to (Elhanati et al., 2014), which learned a sequencespecific factor for each individual, here we assume that all observed sequences passed thymic selection. $Q$ is a normalization factor accounting for the fact that just a fraction ${Q}^{1}$ of sequences pass thymic selection. This factor is determined for each VJcombination as an offset when plotting $\mathrm{log}{P}_{\mathrm{gen}}$ against $\mathrm{log}{P}_{\mathrm{data}}^{*}$ (see the following subsection for definition of ${P}_{\mathrm{data}}^{*}$), using least squares fitting.
Estimation of ${P}_{\mathrm{data}}(\sigma )$, the probability of sequence occurrence in data
Request a detailed protocolThe variable ${x}_{i}$ indicates the presence or absence of a given TCR amino acid sequence $\sigma $ in the $i$th dataset with ${N}_{i}$ recombination events per donor. We want to estimate ${P}_{\mathrm{data}}(\sigma )$, which is a fraction of recombination events leading to $\sigma $ in the population of interest. According to Bayes’ theorem, for a given $\sigma $, the probability density function of ${P}_{\mathrm{data}}$ reads:
The likelihood is given by a product of Bernouilli probabilities:
and a flat prior ${\rho}_{\mathrm{prior}}({P}_{\mathrm{data}})=\mathrm{const}$ is used.
We estimate ${P}_{\mathrm{data}}^{*}$ (shown in Figure 2B) as the maximum of the posterior distribution:
Estimation of ${N}_{i}$, the number of recombination events
Request a detailed protocolThe total number ${N}_{i}$ of recombination events in $i$th dataset is unknown, but we can count the number of unique CD3 acid sequences ${M}_{i}$ observed in the sequencing experiment. For a typical TRB experiment, convergent recombination is relatively rare and one could use ${N}_{i}\approx {M}_{i}$ as an approximation. However, for less diverse loci (e.g TRA), or for much higher sequencing depths, one should correct for convergent recombination, as the the observed number of unique aminoacid sequences could be much lower than the actual number of corresponding recombination events.
The average number of unique sequences resulting from ${N}_{i}$ recombination events is, in theory:
where $T$ is the set of sequences that can pass thymic selection. To estimate that number, we generate a very large number ${N}_{\mathrm{sim}}$ of recombinations, leading to ${N}_{\mathrm{uni}}$ unique CDR3 amino acid sequences for which ${P}_{\mathrm{gen}}$ is estimated as explained above. We take $T$ to be a random subset of unique sequences, $T\subset \{{\sigma}_{1},\mathrm{\dots},{\sigma}_{{N}_{\mathrm{uni}}}\}$, of size $T={N}_{\mathrm{uni}}/Q$, and we apply Equation 8.
Using this equation we plot the calibration curve for the TRBV51 TRBJ26 VJ datasets in Figure 3. For comparison the case of no thymic selection ($Q=1$) is shown in red. The inversion of this curve yields ${N}_{i}$ as a function of ${M}_{i}$.
Pipeline description
Request a detailed protocolIn this section we describe how to apply our algorithm to real data. All the code and data necessary to reproduce our analysis is available online on github (https://github.com/pogorely/vdjRec/; copy archived at https://github.com/elifesciencespublications/vdjRec/).
We start with annotated TCR datasets (CDR3 amino acid sequence, Vsegment, Jsegment), one per donor. Such datasets are produced by MiXCR (Bolotin et al., 2015), immunoseq (http://www.adaptivebiotech.com/immunoseq) and most other software for NGS repertoire data preprocessing. Data we used was in immunoseq format, publicly available from https://clients.adaptivebiotech.com/immuneaccess database.
We proceed as follows:
Split datasets by VJ combinations. The resulting datasets correspond to lists of unique CDR3 amino acid sequences for each donor and VJ combination. All following steps should be done independently for each VJ combination.
(Optional). Filter out sequences present in only one donor to speed up the downstream analysis.
Generate a large amount of simulated nucleotide TCR sequences for a given VJ combination. Extract and translate their CDR3, and count how many times each sequence appears in the simulated set (restricting to sequences actually observed in donors for better efficiency). The resulting number divided by the total number of simulated sequences is an estimate of ${P}_{\mathrm{gen}}$.
Estimate ${P}_{\mathrm{data}}^{*}$ for each sequence in the dataset, see Estimation of ${P}_{\mathrm{data}}(\sigma )$ subsection.
Using ${P}_{\mathrm{data}}^{*}$ and ${P}_{\mathrm{gen}}$, estimate for each VJ combination the normalization $Q$ by minimizing ${\sum}_{j=1}^{n}{(\mathrm{log}{P}_{\mathrm{data}}^{*}({\sigma}_{j})\mathrm{log}{P}_{\mathrm{gen}}({\sigma}_{j})\mathrm{log}Q)}^{2}$, see Estimation of the correction factor $Q$ subsection, where ${\sigma}_{j}$, $j=1,\mathrm{\dots},n$ are the shared sequences.
Calculate ${P}_{\mathrm{post}}=Q\times {P}_{\mathrm{gen}}$. Calculate the pvalue (Equation 1) and effect size (Equation 2).
Usage example
Data sources
Request a detailed protocolData from (Emerson et al., 2017) and (Seay et al., 2016) is publicly available from the immuneaccess database: https://clients.adaptivebiotech.com/immuneaccess. For our analysis, we only considered VJ combinations for which the authors identified conditionassociated clonotypes with MHCmultimer proved specificity. CDR3 aminoacid sequences and V and J segment of these TCR clonotypes are given in Table 1.
Analysis results
Request a detailed protocolWe applied our pipeline to identify CMVspecific and selfspecific TCR sequences listed in Table 1. For our analysis we used only case cohorts, without controls. For each dataset we followed our pipeline described in Pipeline description subsection. We found that sequences reported in the source studies as being both significantly enriched in the patient cohort, and antigenspecific according to MHCmultimers, were the most significant in 3 out of 4 datasets (See Table 2). In the remaining TRBV12 dataset, the sequence of interest was the top $40$ most significant out of $27,699$ sequences present in at least two CMVpositive donors.
Identifying contaminations
Request a detailed protocolIntersample contamination may complicate highthroughput sequencing data analysis in many ways. It could occur both during library preparation or the sequencing process itself (Sinha et al., 2017). Contaminations have the same nucleotide and amino acid sequence in all datasets, and so our method identifies them as outliers, because their sharing cannot be explained by a high recombination probability.
Our method provides a tool to diagnose contamination. Given an aminoacid sequence present in many donors, we measure its theoretical nucleotide diversity using the same simulation approach we used to calculate the generative probability ${P}_{\mathrm{gen}}$ of the amino acid sequence (see Estimation of ${P}_{\mathrm{gen}}$ subsection). If the diversity of the simulated nucleotide sequences is much larger than observed in the data, it is a sign of contamination.
We applied this approach to the CDR3 sequence CASSLVGGPSSEAFF associated to Type one diabetes, and found 19 recombination events consistent with that amino acid sequence out of our simulated dataset. We found 18 different nucleotide variants out of the 19 total possible. In contrast, in the data this clononotype had the same nucleotide variant in all of the eight donors in which it was present. That variant was absent from the simulated set. A onesided Fisher exact test gives a $p<{10}^{6}$ probability of this happening by chance, indicating contamination as a likely source of sharing.
Designing the experiment
Request a detailed protocolOur approach also allows us to obtain important estimates for experiment design. A number of variables affect detection of an antigenspecific clone using our approach: the abundance of the clone in the general population (represented by ${P}_{\mathrm{data}}$ in our approach), the cohort size, the sequencing depth ${N}_{i}$ in each donor in the cohort, and also the effect size. Fixing any two of these variables results in a constraint between the other two and the affects the probability to detect an antigenspecific clonotype, which translates into the statistical power of the method. As an example of such an analysis, we fix the cohort size at 10, 30 or 100 donors (see Figure 4A–C respectively) and the sequencing depth at ${N}_{i}=1000$ unique clones sequenced per repertoire for a given VJcombination in each donor in the cohort. We ask how frequently a disease specific clone with ${\stackrel{~}{P}}_{\mathrm{data}}$ abundance in the population and effect size $q={\stackrel{~}{P}}_{\mathrm{data}}/{P}_{\mathrm{post}}$ is detected with our method. To address this question for each value ${\stackrel{~}{P}}_{\mathrm{data}}$ we perform a simulation: we simulate ${x}_{1},{x}_{2},\mathrm{\dots},{x}_{n}$ Bernoulli variables, each with a ${p}_{i}=1{e}^{{N}_{i}{\stackrel{~}{P}}_{\mathrm{data}}}$ success probability. For a given value of ${\stackrel{~}{P}}_{\mathrm{data}}$ and $q$ there is a single value of ${P}_{\mathrm{post}}={\stackrel{~}{P}}_{\mathrm{data}}/q$. Then we calculate
where $\rho ({P}_{\mathrm{data}}{x}_{1},\mathrm{\dots},{x}_{n})$ is the posterior density, and check if $\mathbb{P}({P}_{\mathrm{p}\mathrm{o}\mathrm{s}\mathrm{t}}>{P}_{\mathrm{d}\mathrm{a}\mathrm{t}\mathrm{a}})$ is below a significance threshold of 0.0001. Such a low significant threshold in this example is chosen to take into account the multiple testing correction: we assume that about 1000 shared clones would be tested in a such analysis and p¡0.01 after multiple testing is chosen as the significance threshold in this study, which gives p¡0.0001 before the Bonferroni multiple testing correction. Then we plot the number of simulations in which a significant result was obtained for given effect size $q$ and ${\stackrel{~}{P}}_{\mathrm{data}}$ for the clone of interest and the fraction of donors with this sequence in the simulated cohort (see Figure 4E, blue curve). Unsurprisingly, the effect size plays a role in the probability to detect an antigen specific clone, and the detection is not possible at all if the clone is not shared between several donors in the cohort (in our example this happens for ${\stackrel{~}{P}}_{\mathrm{data}}<{10}^{5}$) irrespective to the effect size. Larger cohort sizes can help to resolve clones with lower abundances, but sequencing depth also has a strong effect on the power of the approach. In Figure 4D and E we show simulation results for a fixed $q=10$ and different sequencing depths ${N}_{i}$ of 100, 1000 or 10000 clones per donor in a given VJ combination. Interestingly, a large sequencing depth (black curve) can lead to a situation when an abundant and frequently generated clone will not be detected by the algorithm, because it will be found in all donors in the cohort. An additional test that checks the predictions by lowering the sequencing depth in silico by downsampling can solve this problem.
Another complicated question is how ${P}_{\mathrm{data}}$ is related to the number of clones and the fraction of the repertoire involved in the response to the infection in a given donor. If the same antigenspecific clone is present in every donor, ${P}_{\mathrm{data}}$ is close to the average abundance of this clone in the repertoire. However one can imagine an opposite situation where the response is so diverse and private that different clones respond to a given antigen in each donor. It was previously shown that the diversity and publicness of responding Tcell clonotypes varies a lot across antigens (Dash et al., 2017). Our approach is restricted to the identification of public antigenspecific clonotypes, which may not exist for all antigens.
Data availability

Tissue distribution and clonal diversity of the T and B cell repertoire in type 1 diabetesPublicly available at ImmuneAccess.
References

Antigen receptor repertoire profiling from RNAseq dataNature Biotechnology 35:908–911.https://doi.org/10.1038/nbt.3979

MiXCR: software for comprehensive adaptive immunity profilingNature Methods 12:380–381.https://doi.org/10.1038/nmeth.3364

Agerelated decrease in TCR repertoire diversity measured with deep and normalized sequence profilingThe Journal of Immunology 192:2689–2698.https://doi.org/10.4049/jimmunol.1302064

Restricted autoantigen recognition associated with deletional and adaptive regulatory mechanismsThe Journal of Immunology 183:59–65.https://doi.org/10.4049/jimmunol.0804046

Bias in the αβ Tcell repertoire: implications for disease pathogenesis and vaccinationImmunology and Cell Biology 89:375–387.https://doi.org/10.1038/icb.2010.139

Persisting fetal clonotypes influence the structure and overlap of adult human T cell receptor repertoiresPLoS Computational Biology 13:e1005572.https://doi.org/10.1371/journal.pcbi.1005572

VDJdb: a curated database of Tcell receptor sequences with known antigen specificityNucleic Acids Research 46:D419–D427.https://doi.org/10.1093/nar/gkx760

A mechanism for TCR sharing between T cell subsets and individuals revealed by pyrosequencingThe Journal of Immunology 186:4285–4294.https://doi.org/10.4049/jimmunol.1003898
Article and author information
Author details
Funding
Russian Science Foundation (151500178)
 Dmitriy M Chudakov
 Ilgar Z Mamedov
 Yuri B Lebedev
European Research Council (724208)
 Aleksandra M Walczak
The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.
Acknowledgements
This work was supported by Russian Science Foundation grant 151500178, and partially supported by European Research Council Consolidator Grant 724208.
Version history
 Received: October 24, 2017
 Accepted: March 12, 2018
 Accepted Manuscript published: March 13, 2018 (version 1)
 Version of Record published: March 28, 2018 (version 2)
Copyright
© 2018, Pogorelyy et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.
Metrics

 3,020
 views

 543
 downloads

 52
 citations
Views, downloads and citations are aggregated across all versions of this paper published by eLife.
Download links
Downloads (link to download the article as PDF)
Open citations (links to open the citations from this article in various online reference manager services)
Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)
Further reading

 Computational and Systems Biology
 Evolutionary Biology
A comprehensive census of McrBC systems, among the most common forms of prokaryotic Type IV restriction systems, followed by phylogenetic analysis, reveals their enormous abundance in diverse prokaryotes and a plethora of genomic associations. We focus on a previously uncharacterized branch, which we denote coiledcoil nuclease tandems (CoCoNuTs) for their salient features: the presence of extensive coiledcoil structures and tandem nucleases. The CoCoNuTs alone show extraordinary variety, with three distinct types and multiple subtypes. All CoCoNuTs contain domains predicted to interact with translation system components, such as OBfolds resembling the SmpB protein that binds bacterial transfermessenger RNA (tmRNA), YTHlike domains that might recognize methylated tmRNA, tRNA, or rRNA, and RNAbinding Hsp70 chaperone homologs, along with RNases, such as HEPN domains, all suggesting that the CoCoNuTs target RNA. Many CoCoNuTs might additionally target DNA, via McrC nuclease homologs. Additional restriction systems, such as Type I RM, BREX, and Druantia Type III, are frequently encoded in the same predicted superoperons. In many of these superoperons, CoCoNuTs are likely regulated by cyclic nucleotides, possibly, RNA fragments with cyclic termini, that bind associated CARF (CRISPRAssociated Rossmann Fold) domains. We hypothesize that the CoCoNuTs, together with the ancillary restriction factors, employ an echeloned defense strategy analogous to that of Type III CRISPRCas systems, in which an immune response eliminating virus DNA and/or RNA is launched first, but then, if it fails, an abortive infection response leading to PCD/dormancy via host RNA cleavage takes over.

 Computational and Systems Biology
Imputing data is a critical issue for machine learning practitioners, including in the life sciences domain, where missing clinical data is a typical situation and the reliability of the imputation is of great importance. Currently, there is no canonical approach for imputation of clinical data and widely used algorithms introduce variance in the downstream classification. Here we propose novel imputation methods based on determinantal point processes (DPP) that enhance popular techniques such as the multivariate imputation by chained equations and MissForest. Their advantages are twofold: improving the quality of the imputed data demonstrated by increased accuracy of the downstream classification and providing deterministic and reliable imputations that remove the variance from the classification results. We experimentally demonstrate the advantages of our methods by performing extensive imputations on synthetic and real clinical data. We also perform quantum hardware experiments by applying the quantum circuits for DPP sampling since such quantum algorithms provide a computational advantage with respect to classical ones. We demonstrate competitive results with up to 10 qubits for smallscale imputation tasks on a stateoftheart IBM quantum processor. Our classical and quantum methods improve the effectiveness and robustness of clinical data prediction modeling by providing better and more reliable data imputations. These improvements can add significant value in settings demanding high precision, such as in pharmaceutical drug trials where our approach can provide higher confidence in the predictions made.