Microbiology and Infectious Disease

A novel computational pipeline for var gene expression augments the discovery of changes in the Plasmodium falciparum transcriptome during transition from in vivo to short-term in vitro culture

Clare Andradi-Brown
Jan Stephan Wichers-Misterek
Heidrun von Thien
Yannick D. Höppner
Judith A. M. Scholz
Helle Smedegaard Hansson
Emma Filtenborg Hocke
Tim-Wolf Gilberger
Michael F. Duffy
Thomas Lavstsen
Jake Baum
Thomas D. Otto
Aubrey J. Cunnington author has email address
Anna Bachmann author has email address

Section of Paediatric Infectious Disease, Department of Infectious Disease, Imperial College London, UK
Department of Life Sciences, Imperial College London, South Kensington, London, SW7 2AZ, UK
Centre for Paediatrics and Child Health, Imperial College London, UK
Bernhard Nocht Institute for Tropical Medicine, Bernhard-Nocht-Strasse 74, 20359 Hamburg, Germany
Centre for Structural Systems Biology, Hamburg, Germany, Notkestraße 85, 22607 Hamburg, Germany
Biology Department, University of Hamburg, Hamburg, Germany
CMP, University of Copenhagen, 2200 Copenhagen, Denmark
Department of Microbiology and Immunology, University of Melbourne, Melbourne/Parkville VIC 3052, Australia
School of Biomedical Sciences, Faculty of Medicine & Health, UNSW, Kensington, Sydney, 2052, Australia
School of Infection & Immunity, MVLS, University of Glasgow, UK
German Center for Infection Research (DZIF), partner site Hamburg-Borstel-Lübeck-Riems, Germany

https://doi.org/10.7554/eLife.87726.1

Open access
Copyright information

Figures and data

Summary of the var gene, PfEMP1 protein structure, var chromosomal location and PfEMP1 binding phenotypes. a)
Structure of the var gene which encodes the PfEMP1 protein. The var gene is composed of two exons, the first, around 3–9.4 kb, encodes the highly variable extracellular region and the transmembrane region (TM) of PfEMP1. Exon 2 is shorter with about 1.2 kb and encodes a semi-conserved intracellular region (acidic terminal segment, ATS). The PfEMP1 protein is composed of an N-terminal segment (NTS), followed by a variable number of Duffy binding-like (DBL) domains and a cysteine-rich interdomain region (CIDR) (Rask et al., 2010). b) Chromosomal position and transcriptional direction (indicated by arrows) of the different var gene groups, designated by the respective type of upstream sequence (Deitsch & Hviid, 2004). c) Summary of PfEMP1 proteins encoded in the parasite genome, their composition of domain subtypes and associated N-terminal binding phenotype. Group A and some B proteins have an EPCRbinding phenotype; the vast majority of group B and C PfEMP1 proteins bind to CD36. Group A proteins also include those that bind an as yet unknown receptor, as well as VAR1 and VAR3 variants with unknown function and binding phenotype. VAR2CSA (group E) binds placental CSA. Figure modified from Wichers et al., 2021.

Novel computational pipelines for assembling var transcripts.
The original approach (green) used SoapDeNovo-Trans (k=71) to perform whole var transcript assembly. The whole transcript approach (blue) focused on assembling whole var transcripts from the non-core reads using rnaSPAdes (k = 71). Contigs were then joined into longer transcripts using SSPACE. The domain approach (orange) assembled var domains first and then joined the domains into whole transcripts. Domains were assembled separately using three different de novo assemblers (SoapDeNovo2, Oases and rnaSPAdes). Next, a reference of assembled domains was composed and cd-hit (at sequence identity = 99%) was used to remove redundant sequences. Cap3 was used to merge and extend domain assemblies. Finally, SSPACE was used to join domains together. HMM models built on the Rask et al., 2010 dataset were used to annotate the assembled transcripts (Rask et al., 2010). The most significant alignment was taken as the best annotation for each region of the assembled transcript (significance <= 1e-5) (identified using cath-resolve-hits0. Transcripts < 500nt were removed. A var transcript was selected if it contained at least one significantly annotated domain (in exon 1). Var transcripts that encoded only the more conserved exon 2 (ATS domain) were discarded. The three pipelines were run on the 32 malaria patient ex vivo samples from Wichers et al., 2021 (Wichers et al., 2021).

Summary of analyses of var and core gene transcriptome changes in paired ex vivo and short-term in vitro cultured parasites.
From a total of 13 parasite isolates, the ex vivo samples (Wichers et al., 2021) and the corresponding in vitro-cultured parasites of the first (n=13), second (n=10) and third (n=1) replication cycle were analysed by RNA sequencing. The expression of non-polymorphic core genes and polymorphic var genes was determined in different analysis streams: (1) Non-polymorphic core gene reads were mapped to the 3D7 reference genome, expression was quantified using HTSeq and differential expression analysis performed (orange); (2) Polymorphic reads were identified, whole transcripts were assembled with rnaSPAdes, expression of both var transcripts (red) and domains (green) was quantified, and differential expression analysis was performed. All assembled var transcripts from all samples are combined in the ‘all sample analysis’ (pink) and this reference is used to quantify var transcript expression (previously described in Wichers et al., 2021). ‘Per patient analysis’ (yellow) represents combining all assembled var transcripts for samples originating from the same ex vivo sample only. Var type (Group A vs Group B and C) expression (purple) was quantified using the DBLα and NTS assembled sequences and differences across generations were assessed. Total var gene expression (turquoise) was quantified by assembling and quantifying the coverage over the highly conserved LARSFADIG motif, with the performance of assembly using Trinity and rnaSPAdes assessed. Var expression homogeneity (VEH) was analysed at the patient level (α diversity curves) and across all samples (β diversity index). All differential expression analyses were performed using DESeq2.To ensure a fair comparison of samples, which may contain different proportions of life cycle stages, the performance of two different in silico approaches was evaluated by counting Giemsa-stained thin blood smears (blue).

Statistics for the different approaches used to assemble the var transcripts.
Var assembly approaches were applied to malaria patient ex vivo samples (n=32) from (Wichers et al., 2021) and statistics determined. Given are the total number of assembled var transcripts longer than 500 nt containing at least one significantly annotated var domain, the maximum length of the longest assembled var transcript in nucleotides and the N50 value, respectively. The N50 is defined as the sequence length of the shortest var contig, with all var contigs greater than or equal to this length together accounting for 50% of the total length of concatenated var transcript assemblies.

Summary of the dataset used in the analysis.
RNA-sequencing was performed on 32 malaria infected German traveler samples (Wichers et al., 2021). These 32 ex vivo samples were used to compare the performance of the var assembly approaches. Parasites from 13 of these ex vivo samples underwent one cycle of in vitro replication, 10 parasite samples were also subjected to a second cycle of replication in vitro, and a single parasite isolate was also analyzed after a third cycle of replication. For the ex vivo vs short-term in vitro cultivation analysis only paired samples were used. The number of assembled var contigs represents results per sample using the whole transcript approach, and shows either the number of assembled var contigs significantly annotated as var gene and > =500nt in length, or the number of assembled var transcripts identified with a length >= 1500nt and containing at least 3 significantly annotated var domains. *PE; paired-end reads.

Var transcriptome analysis of ex vivo and short-term in vitro cultured parasite samples.
Var transcripts for paired ex vivo (n=13), generation 1 (n=13), generation 2 (n=10) and generation 3 (n=1) were de novo assembled using the whole transcript approach. Var transcripts were filtered for those >= 1500nt in length and containing at least 3 significantly annotated var domains. RNA-sequencing reads were aligned to all filtered var transcripts assembled across all samples and expression was quantified using Salmon. a) PCA plot of log2 Salmon normalized read counts (adjusted for life cycle stage, derived from the mixture model approach). Points are coloured by their generation (ex vivo: purple, generation 1: red, generation 2: green, and generation 3: blue) and labelled by their patient identity. b) PCA plot of log2 Salmon normalized read counts (adjusted for life cycle stage, derived from the mixture model approach)for ex vivo (blue) and paired generation 1 (red) samples. Points are labelled by their patient identity. c) Volcano plot showing extent and significance of up- or down-regulation of var gene expression in ex vivo (n=13) compared with paired generation 1 cultured parasites (n=13) (red and blue: P < 0.05 after Benjamini-Hochberg adjustment for FDR, red and green: absolute log2 fold change log2FC in expression >= 2). Transcripts with a log2FC > = 2 represent those upregulated in generation 1 parasites. Transcripts with a log2FC <= -2 represent those downregulated in generation 1 parasites. d) Volcano plot showing extent and significance of up- or down-regulation of var gene expression in ex vivo (n=10) compared with paired generation 2 cultured parasites (n=10) (green: absolute log2 fold change in expression >= 2). Transcripts with a log2FC >= 2 represent those upregulated in generation 2 parasites. Transcripts with a log2FC <= -2 represent those downregulated in generation 2 parasites. e) var2csa expression through short-term in vitro cultivation. All significantly annotated var2csa transcripts had their log2 Salmon normalised count summed. Significant differences in expression levels between generations were assessed using a paired Wilcoxon test and adjusted using the Benjamini-Hochberg method. Numbers above the boxplots represent significant differences observed (FDR <=0.05). Coloured lines connect paired patient samples through the generations. f) Volcano plot showing extent and significance of up- or down-regulation of var gene expression in generation 1 (n=10) compared with paired generation 2 cultured parasites (n=10) (green: absolute log2 fold change log2FC >=2). Transcripts with a log2FC >= 2 represent those upregulated in generation 2 parasites. Transcripts with a log2FC <= -2 represent those downregulated in generation 2 parasites. g) Var transcript expression differences for the differentially expressed var transcripts in the paired ex vivo (n=13) vs generation 1 (n=13) parasite sample analysis. Box plots show log2-transformed normalized Salmon read counts adjusted for life cycle stage estimates (derived from the mixture model approach) across the generations (blue: ex vivo, orange, generation 1: green, generation 2, and red: generation 3). Normalised counts had 0.5 added prior to log2 transformation.

Var domain transcriptome analysis through short-term in vitro culture. Var transcripts for paired ex vivo (n=13), generation 1 (n=13), generation 2 (n=10) and generation 3 (n=1) were de novo assembled using the whole transcript approach. Var transcripts were filtered for those >= 1500nt in length and containing at least 3 significantly annotated var domains. Transcripts were annotated using HMM models built on the Rask et al., 2010 dataset (Rask et al., 2010). When annotating the whole transcript, the most significant alignment was taken as the best annotation for each region of the assembled transcript (evalue cut off 1e-5). Multiple annotations were allowed on the transcript if they were not overlapping, determined using cath-resolve-hits. Var domain expression was quantified using FeatureCounts and the domain counts aggregated a) PCA plot of log2 normalized read counts (adjusted for life cycle stage, derived from the mixture model approach). Points are coloured by their generation (ex vivo; purple, generation 1; red, generation 2; green and generation 3; blue) and labelled by their patient identity b) Volcano plot showing extent and significance of up- or down-regulation of var domain expression in ex vivo (n=13) compared with paired generation 1 cultured parasites (n=13) (red and blue, P < 0.05 after Benjamini-Hochberg adjustment for FDR; red and green, absolute log2 fold change log2FC in expression >= 2). Domains with a log2FC > = 2 represent those upregulated in generation 1 parasites. Domains with a log2FC <= -2 represent those downregulated in generation 1 parasites c) Volcano plot showing extent and significance of up- or down-regulation of var domain expression in ex vivo (n=10) compared with paired generation 2 cultured parasites (n=10) (green, absolute log2 fold change log2FC in expression >= 2). Domains with a log2FC >= 2 represent those upregulated in generation 2 parasites. Domains with a log2FC <= -2 represent those downregulated in generation 2 parasites. Differential expression analysis was performed using DESeq2 (adjusted for life cycle stage, derived from the mixture model approach).

Var group expression analysis through short-term in vitro culture. The DBLα domain sequence for each transcript was determined and for each patient a reference of all assembled DBLα domains combined. Group A var genes encode DBLα1, group B encode DBLα2 and groups B and C encode DBLα0. Domains were grouped by type and their expression summed. The relevant sample’s non-core reads were mapped to this using Salmon and DBLα expression quantified. DESeq2 normalisation was performed, with patient identity and life cycle stage proportions included as covariates. A similar approach was repeated for NTS domains. Group A var genes encode NTSA compared to group B and C var genes which encode NTSB. Boxplots show log2 normalised Salmon read counts for a) group A var gene expression through cultured generations assessed using the DBLα domain sequences b) group B and C var gene expression through cultured generations assessed using the DBLα domain sequences c) group A var gene expression through cultured generations assessed using the NTS domain sequences d) group B and C var gene expression through cultured generations assessed using the NTS domain sequences. Different coloured lines connect paired patient samples through the generations (ex vivo (n=13) and generation 1 (n=13) samples vs ex vivo (n=9) and generation 1 (n=9)).

Core gene transcriptome analysis of ex vivo and short-term in vitro cultured samples.
Core gene expression was assessed for paired ex vivo (n=13), generation 1 (n=13), generation 2 (n=10) and generation 3 (n=1) parasite samples. Subread align was used, as in the original analysis, to align the reads to the human genome and P. falciparum 3D7 genome, with var, rif, stevor, surf and rRNA genes removed. HTSeq count was used to quantify gene counts. a) PCA plot of log2 normalized read counts. Points are coloured by their generation (ex vivo: purple, generation 1: red, generation 2: green, and generation 3: blue) and labelled by their patient identity. b) PCA plot of log2 normalized read counts for ex vivo (purple) and paired generation 1 (red) samples. Points are labelled by their patient identity. c) Volcano plot showing extent and significance of up- or down-regulation of core gene expression in ex vivo (n=13) compared with paired generation 1 cultured parasites (n=13) (red and blue: P < 0.05 after Benjamini-Hochberg adjustment for FDR, red and green: absolute log2 fold change log2FC in expression >= 2). Genes with a log2FC > = 2 represent those upregulated in generation 1 parasites. Genes with a log2FC <= -2 represent those downregulated in generation 1 parasites. d) Volcano plot showing extent and significance of up- or down-regulation of core gene expression in ex vivo (n=10) compared with paired generation 2 cultured parasites (n=10) (red: P < 0.05 after Benjamini-Hochberg adjustment for FDR, red and green: absolute log2 fold change log2FC in expression >= 2). Genes with a log2FC >= 2 represent those upregulated in generation 2 parasites. Genes with a log2FC <= -2 represent those downregulated in generation 2 parasites. Normalized read counts of the core gene analysis were adjusted for life cycle stage, derived from the mixture model approach.

Sign up for email alerts