1. Genetics and Genomics
  2. Microbiology and Infectious Disease
Download icon

Genome-wide identification of lineage and locus specific variation associated with pneumococcal carriage duration

  1. John A Lees  Is a corresponding author
  2. Nicholas J Croucher
  3. David Goldblatt
  4. François Nosten
  5. Julian Parkhill
  6. Claudia Turner
  7. Paul Turner
  8. Stephen D Bentley  Is a corresponding author
  1. Wellcome Trust Sanger Institute, United Kingdom
  2. St. Mary’s Campus, Imperial College London, United Kingdom
  3. University College London, United Kingdom
  4. Mahidol University, Thailand
  5. University of Oxford, United Kingdom
Research Article
Cite this article as: eLife 2017;6:e26255 doi: 10.7554/eLife.26255
6 figures, 5 tables and 6 data sets


Swabbing and sequencing study design.

We start with serotype swab data on 598 children from two cohorts, taken every month after birth for two years. For all samples we fitted the transition and emission probabilities of a continuous time hidden Markov model for each serotype. Then, for each child, we used these parameters were then used to infer the most likely carriage durations. We matched carriage episodes with resistance and genomic data for 2157 episodes to draw conclusions on the basis of variation in this epidemiological parameter.

Hidden Markov models of swab time series, and their goodness-of-fit.

We fitted three different models to the processed time-series data with states, allowed transitions and emissions as shown. We refitted each model allowing the transitions probabilities to covary with the age of the child and whether the child had carried pneumococcus previously. For the converged model the Akaike information criterion (AIC) is shown for the original fit, and when including these covariates (AICcovar).

Figure 3 with 3 supplements
Distribution of carriage duration, and effect of monotonic transformation.

Panel (a) shows a histogram of the inferred carriage duration, (b) shows this result after the natural logarithm is taken, and (c) after the warping function is applied.

Figure 3—source data 1

Sequenced isolates and their untransformed inferred carriage durations.

Figure 3—source data 2

Sequenced isolates and their warped carriage durations.

Figure 3—figure supplement 1
Regression diagnostics and outlier removal.

Panel (a) shows prior to outlier removal, (b) after outlier removal as produced by plot.lm() in R. Points deviating from normal residuals (top right plot), and at high leverage (bottom right plot) were removed. These observations appeared to be due to swabs not taken at the prescribed monthly intervals.

Figure 3—figure supplement 2
Monotonic warping function from warped-lmm. x-axis shows the centred and normalised input phenotype; y-axis shows corresponding warped value.
Figure 3—figure supplement 3
Normal quantile-quantile plot of carriage length, and effect of monotonic transformation.

Panel (a) the inferred carriage duration, (b) after the natural logarithm is taken, and (c) after the warping function is applied.

Figure 4 with 4 supplements
Mapping of carriage duration onto phylogeny.

Using the carriage duration as a continuous trait, the ancestral state at every node of the rooted phylogeny was reconstructed. Red branches are carriage for a short time, blue for a long time. Clusters identified in previous analysis have been labelled.

Figure 4—source data 1

Phylogenetic tree in Newick format.

Figure 4—figure supplement 1
Mapping of warped carriage duration onto phylogeny.

As Figure 4, but using the warped carriage duration. Blue branches are carriage for a short time, yellow for a long time.

Figure 4—figure supplement 2
Histogram of pairwise patristic distances on the inferred phylogeny.

A cut-off for heritability estimation was chosen at 0.04, under which a clear second maxima corresponds to closely related isolates on the tree.

Figure 4—figure supplement 3
Change in carriage duration associated with capsule switching events.

For each of the three events analysed the subtree containing the switch is shown on the left. For each isolate within the subtree, carriage duration (on a roughly exponential scale), warped carriage duration (on a roughly linear scale) and serotype are shown as coloured bars aligned with the tip.

Figure 4—figure supplement 4
Lasso regression plots for lineage effects.

Panel a) shows the value of each predictor on the y-axis for different values of the 1 penalty λ on the x-axis, which increases from left to right. The labels along the top are the number of predictors remaining in the model for each λ. Panel b) shows the results of leave-one-out cross validation on the mean-squared error, along the same x-scale. The λ at minimum error is shown by the left dashed line, and the λ within one standard error is shown by the right dashed line.

Figure 5 with 5 supplements
Manhattan plot of SNPs associated with carriage duration.

The significance of each SNP’s association with carriage duration against its position in the ATCC 700669 genome is shown. The red line denotes genome-wide significance (α<0.05 Bonferroni corrected with 92487 unique tests), and the blue line suggestive significance (2.3 orders of magnitude below significant, following convention). Loci reaching suggestive significance are labelled with their nearest annotation, as in Table 4.

Figure 5—source data 1

Plot file for Manhattan plot, with coordinates and -log10 transformed p-values of all tested SNPs.

Figure 5—figure supplement 1
Possible SNPs associated with lineage and carriage duration.

The SNPs and p-values as shown in Figure 5, however the x-axis is now ordered by strength of association of lineage (defined by principal component) with carriage duration. The left most lineages are those most associated, those in black/grey were not significantly associated. SNPs are coloured by the lineage they are most associated with.

Figure 5—figure supplement 2
Distribution of lengths of significant k-mers.

The lengths of those k-mers reaching significance in the LMM analysis. Lengths below 20 bases were filtered from downstream analysis, due to having low specificity.

Figure 5—figure supplement 3
Quantile-quantile plot of association p-values.

For fast-lmm results on (a) SNPs passing quality filters and (b) k-mers of all lengths passing quality filters.

Figure 5—figure supplement 4
Manhattan plots of phage-associated SNPs associated with carriage duration.

As in Figure 5, but enlarging the phage region found to be significant. SNPs are coloured by their LD with the lead SNP (the highest P-value in the region plotted), and are crosses if they are predicted to cause a change in coding sequence. Panel (a) shows LD in relation to the lead SNP at position 1516350. Panel (b) plots genes in the region, with the start and end of the phage genes labelled. Panel (c) shows LD in relation to the second SNP signal at position 1517063.

Figure 5—figure supplement 5
Identification of phage in assemblies by blastn hit length.

Histogram of the length of top hits against a database of phage sequence by blastn. Isolates with >5000 bp hits were defined as having phage present.

Predicted mean carriage duration as a function of child age.

Fit is an exponential decay over the first two years of life, using the decay rate inferred from a linear regression of log(carriage duration).



Table 1

Success of culturing unencapsulated S. pneumoniae.

Based on having >1% abundance of 16S reads showing the bacteria as being present, 44/361 true positive swabs were not successfully cultured.

AbundanceCulture positiveNumber
>1%Not cultured44
<1%Not cultured54
Table 2
Coefficients from lasso regression model of carriage duration.

The mean (intercept) corresponds to a sensitive 6A/C carriage episode, and different serotypes and resistances are perturbations about this mean. Positive effects are expected to have a greater magnitude, due to the positive skew of carriage duration. Rows in bold were significant predictors in the covariance test.

FactorEffect on carriage duration (days)
Mean (intercept)59.5
Erythromycin resistance+7.5
Tetracycline resistance+3.0
Trimethoprim resistance+2.9
Clindamycin resistance+1.8
Penicillin intermediate resistance+1.3
Serotype 19F+46.9
Serotype 23F+21.0
Serotype 6B+16.2
Serotype 14+7.2
Serotype 21+1.6
Serotype 19B−0.1
Serotype 18C−1.9
Serotype 29−4.3
Serotype 3−4.5
Serotype 4−7.2
Serotype 24F−8.5
Non-typable (NT)−12.3
Serotype 5−18.6
Table 3
Mean length of carriage, and expected number of carriage episodes within the first two years of life.

Only serotypes with enough data for the HMM fit to converge are shown. Starred observations have a standard error which is larger than the estimated value, indicating low confidence in the estimate.

SerotypeSojourn time (days)Expected number of infections
Table 4
SNP Locus effects at genome-wide and suggestive significance.

Co-ordinates are with respect to the ATCC 700669 reference genome, and are for the lead SNP in each locus after LD-pruning. Effect sizes are for the warped phenotype.

Co-ordinateNearest annotationEffect sizeP-valueSignificance level
303239IS630-Spn1 transposase0.0789.2×10-5Suggestive
971849SPRITE repeat region0.0789.4×10-5Suggestive
1013978IS630-Spn1 transposase0.113.7×10-5Suggestive
1073185FM211187.3435 (pseudogene)0.0863.3×10-5Suggestive
1472933Upstream of fms−0.235.3×10-5Suggestive
1473700putative glutathione S-transferase−0.168.8×10-5Suggestive
1515497hypothetical phage protein−0.0995.2×10-5Suggestive
1516293putative phage Holliday junction resolvase−0.105.1×10-6Suggestive
1516350putative phage Holliday junction resolvase−0.122.1×10-7Genome-wide significant
1517063phage protein−0.113.3×10-7Genome-wide significant
1813192thioesterase superfamily protein−0.124.8×10-6Suggestive
Table 4–source data 1

Association results for SNPs, from fast-lmm.

Table 5
Summary of variance of carriage duration explained by genetic and environmental factors.

H2 encompasses all rows, other than the measured environmental effects. For each variant component the method used to estimate it is reported: CPP - closest phylogenetic pairs; LMM - variance component using a linear mixed model with pathogen genotype as random effects; R2 - linear regression using lasso to select predictors.

SourceOf which isTotal variance explainedProportion of total heritability explained
Total heritability (H2)0.634 (CPP)1.00
Common SNP heritability (hSNP2)0.438 (LMM)0.691
Serotype and resistance0.190 (R2)/0.253 (LMM)0.300 (R2)/0.399 (LMM)
Serotype only0.178 (R2)/0.135 (LMM)0.281 (R2)/0.213 (LMM)
Resistance only0.092 (R2)/0.113 (LMM)0.145 (R2)/0.178 (LMM)
Phage k-mers0.067 (LMM)0.106
Intact comYC0.127 (LMM)0.201
Measured environmental effectsAge and previous carriage0.046 (R2)-

Data availability

The following previously published data sets were used
  1. 1
  2. 2
  3. 3
  4. 4
  5. 5
  6. 6

Download links

A two-part list of links to download the article, or parts of the article, in various formats.

Downloads (link to download the article as PDF)

Download citations (links to download the citations from this article in formats compatible with various reference manager tools)

Open citations (links to open the citations from this article in various online reference manager services)