Research Article

Genetics and Genomics

Accurate predictions of SARS-CoV-2 infectivity from comprehensive analysis

Department of Medical Informatics, College of Medicine, The Catholic University of Korea, Republic of Korea
Graduate School of Medical Science and Engineering, Korea Advanced Institute of Science and Technology, Republic of Korea
Department of Systems Biology, College of Life Science and Biotechnology, Yonsei University, Republic of Korea
Department of Microbiology and Immunology, Seoul National University College of Medicine, Republic of Korea
Department of Biomedical Sciences, Seoul National University College of Medicine, Republic of Korea
School of Chemical and Biological Engineering, Seoul National University, Republic of Korea
Department of Medical Life Sciences, College of Medicine, The Catholic University of Korea, Republic of Korea
Seoul National University Bundang Hospital, Republic of Korea
Precision Medicine Research Center, College of Medicine, The Catholic University of Korea, Republic of Korea
Cancer Evolution Research Center, College of Medicine, The Catholic University of Korea, Republic of Korea
CMC Institute for Basic Medical Science, the Catholic Medical Center of The Catholic University of Korea, Republic of Korea
INNOONE, Republic of Korea

Dec 24, 2024

https://doi.org/10.7554/eLife.99833.3

Open access

eLife Assessment

The study provides valuable insight into the biological significance of SARS-CoV-2 by using a series of computational analyses of viral proteins. While the evidence is solid, the reviewers noted a lack of clarity about the objectives of the analyses. While impactful for the field, the manuscript would benefit from improved presentation.

https://doi.org/10.7554/eLife.99833.3.sa0

Significance of the findings:

Valuable: Findings that have theoretical or practical implications for a subfield


Strength of evidence:

Solid: Methods, data and analyses broadly support the claims with only minor weaknesses


During the peer-review process the editor and reviewers write an eLife Assessment that summarises the significance of the findings reported in the article (on a scale ranging from landmark to useful) and the strength of the evidence (on a scale ranging from exceptional to inadequate).

Abstract

An unprecedented amount of SARS-CoV-2 data has been accumulated compared with previous infectious diseases, enabling insights into its evolutionary process and more thorough analyses. This study investigates the features of SARS-CoV-2 as it evolved in order to evaluate its infectivity. We examined viral sequences and identified the polarity of amino acids in the receptor binding motif (RBM) region. We detected an increased frequency of amino acid substitutions to lysine (K) and arginine (R) in variants of concern (VOCs). As the virus evolved to Omicron, commonly occurring mutations became fixed components of the new viral sequence. Furthermore, at specific positions in VOCs, only one type of amino acid substitution was detected, and mutations at D467 were notably absent. We found that the binding affinity of SARS-CoV-2 lineages to the ACE2 receptor was impacted by amino acid substitutions. Based on our discoveries, we developed APESS, a model that evaluates infectivity from biochemical and mutational properties. In silico evaluation using real-world sequences and in vitro viral entry assays validated the accuracy of APESS and our discoveries. Using machine learning, we predicted mutations that had the potential to become more prominent. We created AIVE, a web-based system accessible at https://ai-ve.org, to provide infectivity measurements of mutations entered by users. Ultimately, we established a clear link between specific viral properties and increased infectivity, enhancing our understanding of SARS-CoV-2 and enabling more accurate predictions of the virus.

Introduction

The growing importance of big data science has led to the accumulation of huge amounts of data. Recently, generative AI models built on large language models (LLMs) and trained on various types of big data have been emerging across various fields, including healthcare, omics research, and industry (Thirunavukarasu et al., 2023; Yang et al., 2022). In response to the continuous emergence of new SARS-CoV-2 variants, our study introduces an innovative model and platform that integrates big data analysis, protein structure prediction using AlphaFold, and AI learning to predict highly transmissible novel SARS-CoV-2 variants. Because validating AI-based predictions is crucial, we support our predictions of highly infectious variants with in silico and in vitro experiments.

Deaths related to the COVID-19 pandemic have decreased, but the infectivity of the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) persists. This may be because of neutralizing antibodies from previous infections or numerous mutations in the virus, from the Alpha variant to current Omicron sublineages (Tsai et al., 2021; Cao et al., 2022).

Genomic databases such as Nextstrain and the Global Initiative on Sharing All Influenza Data (GISAID), along with Our World in Data (OWID), which contains epidemiological data, have been instrumental in collecting valuable data during the COVID-19 pandemic (Shu and McCauley, 2017; Mathieu et al., 2021; Hadfield et al., 2018). Reports on current global genomic surveillance and sequencing efforts show varied levels of data accumulation (Chen et al., 2022). Nonetheless, the increasing amount of available SARS-CoV-2 data has made various analytical approaches possible.

SARS-CoV-2 mutations can alter gene function. For example, the SARS-CoV-2 furin gene plays an important role in the cleavage of the spike protein, and mutations in the gene significantly affect SARS-CoV-2 fusion with the host cell membrane (Peacock et al., 2021; Johnson et al., 2021). Mutations in viral genes directly affect protein structure, influencing the pathway used by the virus to infect host cells (Bouhaddou et al., 2023). Mutations in the spike protein lead to differences in the protein structure and in binding affinity to ACE2, the receptor through which the virus penetrates the host cell (Ali et al., 2021; Seyran et al., 2021).

Deep learning methods predict mutations and protein structures, and their application in research has led to enhanced genomic analysis with large datasets (Senior et al., 2020; Theodosiou and Read, 2023). This approach has proved valuable in predicting disease-related mutations and identifying genes of interest (Yang et al., 2021). Recently, deep learning has been utilized to analyze mutations in SARS-CoV-2 (Berman et al., 2023; Zhou et al., 2023). Prediction models, such as AlphaFold2, have been used to assess protein 3D structures (Jumper et al., 2021). This has led to the detection of structural changes caused by mutations in Delta (B.1.617.2) and Omicron (B.1.1.529) (Bhowmick et al., 2022). AlphaFold2 has also been used to analyze protein genotypes and phenotypes in the RBD region (Kilim et al., 2023). Furthermore, hydrophobic properties in the amino acid sequence affect protein folding (Lins and Brasseur, 1995). Coronavirus hydrophobicity has significant effects on amino acid properties and protein folding (Shekhawat and Roy Chowdhury Chakravarty, 2022).

A full understanding of these prior approaches to virus analysis and prediction requires expertise in the relevant fields. Moreover, structure-based predictions of mutations cannot fully account for all aspects of SARS-CoV-2 infection. In this study, we analyzed SARS-CoV-2 mutations using artificial intelligence (AI) methods and leveraged large datasets to elucidate properties of the virus and make various predictions. We discovered amino acid polarity changes and substitutions and then evaluated the infectivity from significant mutations in the RBM region. We specifically examined how changes from hydrophobic to hydrophilic and alkaline properties affect protein structure.

Our evaluation involved a comprehensive analysis of epidemiological and genomic data, capitalizing on the availability of large SARS-CoV-2 datasets. We extracted properties from the VOCs and each sublineage at the nucleotide and amino acid levels. We developed an evaluation model called amino acid property eigen selection score (APESS) to analyze SARS-CoV-2 sequences. Various methods were utilized to validate our findings, and we used machine learning to make further predictions.

Finally, we present our findings and evaluation models through the Artificial Intelligence Analytics Toolkit for predicting virus mutations in protEin (AIVE), a web-based platform that integrates SARS-CoV-2 data, visualizes APESS score distribution, offers 3D protein predictions, and increases accessibility for researchers. Overall, our research provides a model for pandemic preparedness and the study of infectious diseases.

Results

Discovery of significant properties in the amino acid sequence

We examined the amino acid sequences of SARS-CoV-2 to identify biochemical properties. We identified consecutive hydrophilic amino acids within the SARS-CoV-2 spike protein in the following lineages: Wuhan-Hu-1, Alpha, Beta, and Omicron (BA.1, BA.2, BA.2.75, BA.4/BA.5, XBB, BQ.1). A series of hydrophilic amino acid regions were observed in the RBM region (amino acid sequence 437–508). In specific SARS-CoV-2 lineages, amino acid substitutions were observed in these regions (Figure 1A, Figure 1—figure supplement 1). For example, the Delta and Omicron lineages contained the T478K substitution, which introduces an alkaline amino acid.

Figure 1 with 4 supplements

Analysis of protein properties discovered in the SARS-CoV-2 amino acid sequence.

(A) The SARS-CoV-2 amino acid sequence between positions 437 and 508 in the receptor binding motif (RBM) is displayed with the corresponding amino acids in the original positions. Amino acid substitutions are shown for Alpha, Beta, Delta, Omicron BA.1, Omicron BA.2, Omicron BA.4/BA.5, Omicron BQ.1, and Omicron XBB. Hydrophilic (polar) amino acids are displayed in red, hydrophobic (non-polar) in blue, acidic in green, and alkaline (positively charged) in yellow. (B) The number of polarity changes [N: hydrophobic (nonpolar), P: hydrophilic (polar), A: acidic, and B: alkaline (basic)] in the receptor binding domain (RBD) region is displayed. Wuhan-Hu-1, Alpha, Beta, Delta, Omicron (BA.1), Omicron (BA.2), Omicron (BA.2.75), Omicron (BA.4/BA.5), Omicron (XBB), and Omicron (BQ.1) are presented on the graph with each lineage color-coded. The polarity change counts for NN*, NP*, PN*, and PP* are shown in more detail: PN* (Wuhan-Hu-1: 39, Alpha: 39, Beta: 40, Delta: 37, Omicron BA.1: 36, Omicron BA.2: 35, Omicron BA.2.75: 35, Omicron BA.4/BA.5: 34, Omicron XBB: 36, Omicron BQ.1: 34); PP* (Wuhan-Hu-1: 31, Alpha: 31, Beta: 31, Delta: 30, Omicron BA.1: 27, Omicron BA.2: 26, Omicron BA.2.75: 26, Omicron BA.4/BA.5: 27, Omicron XBB: 29, Omicron BQ.1: 27). Overall, polarity decreased from PN* to PP* across all SARS-CoV-2 lineages. (C) The amino acid substitutions in the RBM region from the reference to VOCs are displayed. The seventeen amino acids in the reference list are tyrosine (Y), threonine (T), serine (S), glutamine (Q), asparagine (N), cysteine (C), valine (V), proline (P), leucine (L), isoleucine (I), glycine (G), phenylalanine (F), alanine (A), arginine (R), lysine (K), glutamic acid (E), and aspartic acid (D). For variants of concern (VOCs), the amino acid substitutions are indicated by gray lines. There was a more than twofold increase in lysine (K) and arginine (R) in VOCs compared with the reference.

Each position showed differences in the number of polarity changes; specifically, changes from hydrophilic (polar, P) to hydrophobic (non-polar, N) or positively charged amino acids. For most positions, we did not detect significant differences in polarity between the lineages investigated (Supplementary file 1a–c).

As the virus progressed from Wuhan-Hu-1 (40 in NN* and 31 in PP*) to Omicron (average 45.5 in NN* and average 27 in PP*), the number of polarity changes increased in NN* and decreased in PP*. For these positions, polarity changes occurred in the lineages as they evolved to the Omicron variant (Figure 1B). More polarity changes occurred in Delta, XBB, Omicron BQ.1, and Omicron BA.4/BA.5 than in the other VOCs.

We investigated the amino acid substitutions in the VOCs and compared them with the Wuhan-Hu-1 sequence as the reference. We discovered a twofold increase in amino acid substitutions to lysine (K) and arginine (R) in later SARS-CoV-2 lineages. Meanwhile, glutamine (Q), phenylalanine (F), and glutamic acid (E) levels decreased by half in the VOCs. For phenylalanine (F), hydrophobic residue mutations occurred in areas irrelevant to the regions of consecutive hydrophilic amino acids (Figure 1C and Supplementary file 1d and Figure 1—figure supplement 2).

Various coronaviruses, including MERS-CoV, SARS-CoV-1, and SARS-CoV-2, did not show significant differences in polarity across positions (Figure 1—figure supplement 3A).

We investigated the polarity (hydrophilic, hydrophobic, alkaline, and acidic) of amino acids in Wuhan-Hu-1, Alpha, Beta, and Omicron (BA.1, BA.2, BA.2.75, BA.4/BA.5, XBB, BQ.1), and the number of amino acids of each type in the RBM sequence. Most amino acids were either hydrophilic or hydrophobic across lineages, with slightly higher alkaline amino acid levels in the Omicron variants than in the others (Figure 1—figure supplement 3B).
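
The per-class residue counts described above can be illustrated with a minimal Python sketch. The class assignments follow the grouping of the seventeen amino acids in the Figure 1 legend; the placement of W, M, and H (which the legend does not list) is an assumption, and the Wuhan-Hu-1 RBM string (spike residues 437–508) was transcribed by hand and should be checked against the reference sequence before use.

```python
from collections import Counter

# Residue classes following the grouping in the Figure 1 legend.
# W, M and H are not listed there; their placement here is an assumption.
POLARITY_CLASS = {
    **dict.fromkeys("YTSQNC", "hydrophilic"),
    **dict.fromkeys("VPLIGFAWM", "hydrophobic"),
    **dict.fromkeys("RKH", "alkaline"),
    **dict.fromkeys("ED", "acidic"),
}

# Wuhan-Hu-1 spike residues 437-508 (72 residues), transcribed by hand;
# verify against the reference sequence before relying on it.
RBM_WUHAN = (
    "NSNNLDSKVGGNYNYLYRLFRKSNLKPFERDISTEIYQAG"
    "STPCNGVEGFNCYFPLQSYGFQPTNGVGYQPY"
)

def polarity_counts(sequence):
    """Count residues per polarity class in an amino acid sequence."""
    return Counter(POLARITY_CLASS[aa] for aa in sequence)

wuhan_counts = polarity_counts(RBM_WUHAN)

# T478K replaces a hydrophilic residue (index 478 - 437 = 41) with an alkaline one.
t478k_counts = polarity_counts(RBM_WUHAN[:41] + "K" + RBM_WUHAN[42:])
```

Comparing `wuhan_counts` with `t478k_counts` shows how a single substitution to lysine shifts one residue from the hydrophilic class to the alkaline class.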

We investigated the nucleotide sequences of the SARS-CoV-2 spike protein gene. We analyzed the transitions and transversions in 7,335,614 viral sequences from GISAID. The rate of change from G to U was higher than that of U to G, and the rate from C to U was higher than that of U to C. Both C to G and G to C rates were extremely low (Figure 1—figure supplement 3C and Supplementary file 1e; Panchin and Panchin, 2020; Yi et al., 2021).
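
As an illustration of how directional substitution counts of this kind can be tallied, the following Python sketch compares an aligned reference and sample sequence and classifies each change as a transition or a transversion. The function names and the toy sequences are hypothetical, and the sketch assumes the sequences are already aligned and written in the RNA alphabet; it is not the pipeline used in this study.

```python
from collections import Counter

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "U"}

def classify(ref, alt):
    """Transition: purine<->purine or pyrimidine<->pyrimidine; otherwise transversion."""
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"

def substitution_counts(reference, sample):
    """Count directional substitutions (e.g. 'C>U') between two aligned RNA sequences."""
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    counts = Counter()
    for ref, alt in zip(reference, sample):
        # Skip matches and ambiguous symbols such as N or gaps.
        if ref != alt and ref in "ACGU" and alt in "ACGU":
            counts[f"{ref}>{alt}"] += 1
    return counts

# Toy aligned pair (hypothetical, not GISAID data).
counts = substitution_counts("ACGUCG", "AUGUUU")

by_type = Counter()
for change, n in counts.items():
    by_type[classify(change[0], change[2])] += n
```

Keeping the direction in the key (`C>U` separately from `U>C`) is what allows asymmetric rates, such as those reported above, to be compared.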

Identification of properties in amino acid substitutions

We studied each mutation in the RBM region of lineages and sublineages using verified mutation data: Alpha, Beta, Delta, Omicron (BA.2.75), Omicron (BA.4/BA.5), Omicron (XBB), and Omicron (BQ.1).

From the viral sequences submitted to GISAID, we extracted data (n=7,335,614) for the sublineages defined on the outbreak.info platform. The most noteworthy mutated positions were N440, L452, S477, T478, E484, F486, N501, and Y505 (Figure 2A). Only seven of the 7,335,614 individuals investigated possessed mutations at D467 (Figure 2—figure supplement 1A).
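
A per-position substitution rate of the kind summarized here can be sketched as follows. This is a minimal Python example under stated assumptions: the input is a list of RBM sequences already trimmed to spike residues 437–508 with no insertions or deletions, the reference string was transcribed by hand, and the function name and toy inputs are hypothetical.

```python
from collections import Counter, defaultdict

RBM_START = 437
# Wuhan-Hu-1 spike residues 437-508, transcribed by hand; verify before use.
RBM_WUHAN = (
    "NSNNLDSKVGGNYNYLYRLFRKSNLKPFERDISTEIYQAG"
    "STPCNGVEGFNCYFPLQSYGFQPTNGVGYQPY"
)

def substitution_rates(sequences, reference=RBM_WUHAN, start=RBM_START):
    """Return {position: {substitution label: percentage of input sequences}}."""
    if not sequences:
        raise ValueError("at least one sequence is required")
    tallies = defaultdict(Counter)
    for seq in sequences:
        for offset, (ref_aa, aa) in enumerate(zip(reference, seq)):
            if aa != ref_aa:
                position = start + offset
                tallies[position][f"{ref_aa}{position}{aa}"] += 1
    total = len(sequences)
    return {
        position: {label: 100.0 * n / total for label, n in counter.items()}
        for position, counter in tallies.items()
    }

# Toy input (hypothetical): three of four sequences carry T478K, one also carries L452R.
t478k = RBM_WUHAN[:41] + "K" + RBM_WUHAN[42:]
both = t478k[:15] + "R" + t478k[16:]
rates = substitution_rates([RBM_WUHAN, t478k, t478k, both])
```

Positions that never differ from the reference, such as D467 in this toy input, simply do not appear in the result.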

Figure 2 with 3 supplements

Evaluation and validation of amino acid substitutions in the SARS-CoV-2 receptor binding motif (RBM) region.

(A) Mutations, their occurrence rates as percentages, and the original amino acid at the position are shown. Alpha, Beta, Delta, Omicron (BA.2.75), Omicron (BA.4/BA.5), Omicron (XBB), and Omicron (BQ.1) lineages are displayed with corresponding colors. N440, L452, S477, T478, E484, F486, N501, and Y505 are indicated by yellow triangles while D467 is indicated by a red triangle. (B) For positions N440, L452, S477, T478, E484, F486, N501, and Y505, lineages and amino acid substitutions are displayed. Arrows indicate the mutation rate, where width corresponds to the percentage and the colors indicate the lineages Alpha, Beta, Delta, Omicron (BA.2.75), Omicron (BA.4/BA.5), Omicron (XBB), and Omicron (BQ.1). The colors in the pie chart indicate amino acids. The mutation rate of Alpha is lower than 20%. At the 501st position in the RBM region, amino acid substitutions from asparagine (N) to tyrosine (Y) (n=282) occurred along with other substitutions. For the Beta variant, the 484th position showed a mutation rate over 60% with E484K. The Delta variant showed a mutation rate of 60% at the 452nd position. L452R and T478K amino acid substitutions along with various mutations were observed. In Omicron, the mutation rate for S477, T478, E484, N501, and Y505 was over 40%. The amino acid substitutions were S477K, T478K, E484A, N501Y, and Y505H. We calculated the mutation rates for the following positions: T478K (99.95%), Q498R (94.14%), N501Y (99.52%), and Y505H (97.66%). (C) The effect of the D467 amino acid substitution on viral infection was evaluated in vitro via luciferase and viral entry assays. Mutagenesis at D467 to hydrophobic amino acids proline (P) and isoleucine (I) was performed. There was a significant decrease in the RLU and viral entry percentages for both D467P and D467I (p<0.0001).

Importantly, the experimental results corroborated the amino acid substitutions observed in the viral sequences. Almost no mutations were found at the D467 position of the RBM sequence. In vitro experiments using a luciferase assay and viral entry experiments demonstrated that infectivity associated with D467 mutagenesis (D467P and D467I) was lower than the wild-type (spike protein D614G) (Figure 2C, Figure 2—figure supplement 2).

We investigated the mutation rates and amino acid substitutions in VOCs and variants under monitoring (VUMs). In the earlier Alpha, Beta, and Delta lineages, various amino acid changes occurred at multiple locations. In later lineages, we observed a reduction in the diversity of amino acid changes due to substitutions (Figure 2B). L452R and T478K were the most numerous non-synonymous mutations while more synonymous mutations developed.

Differences in molecular cell levels that affect infectivity and severity in the evolution of SARS-CoV-2

We investigated the molecular cell-level differences that affect the severity and infectivity of SARS-CoV-2 during its evolution. Using data from OWID, we examined the number of infections and deaths from January 23, 2020 to December 31, 2022 and compared the periods in which lineages occurred. Based on epidemiological data, we established three main periods: the major outbreak periods of Delta, Omicron BA.5, and Omicron BQ. During the Delta period, the ratio of deaths to infections increased; during the Omicron BA.5 period, both infections and deaths increased; and during the Omicron BQ period, both decreased (Figure 3A). We attributed these trends to mutations in the major lineages, and further investigation of sublineages through viral sequence information revealed stabilization per lineage (Supplementary file 1f and g).

Figure 3 with 12 supplements

Association between SARS-CoV-2 mutations and epidemiological data.

(A) For each SARS-CoV-2 lineage, the position and distribution of amino acid substitutions were analyzed alongside epidemiological data. The first period, from November 2020 to December 2021, was characterized by low infections and high deaths. Delta was the most prominent during this period, coinciding with worldwide vaccinations. The amino acid substitutions L452R and T478K were observed at the highest frequencies, both together and independently. The second period, from January 2022 to April 2022, saw an increase in new cases and deaths. BA.5 was prominent during this period, with various amino acid substitutions observed. The third period, from May 2022 to November 2022, showed a significant decrease in infections and deaths. Omicron, specifically BQ.1, was prominent during this period, and worldwide vaccination rates decreased. (B) From the viral sequences of the patients, the association between the primary mutations of Delta, BA.5, and BQ.1 and epidemiological data (symptoms and severity) was analyzed. Odds ratios are displayed for N440K, K444T, L452R, N460K, S477N, T478K, E484A, F486V, Q498R, N501Y, and Y505H. L452R was an indicator of symptomaticity in Delta and BA.5. K444T was associated with symptomaticity in BQ.1. All mutations were associated with mildness. The 95% confidence intervals are shown for Delta, BA.5, and BQ.1. (C) We analyzed the expression of Delta compared to BA.4/BA.5 using the GSE235262 dataset. In Delta, the mTOR pathway was observed to regulate ribosome biogenesis, autophagy, and lipid biosynthesis while also playing a role in viral infection pathways. The expression values were calculated as Delta [log2(TPM +1)] / Omicron BA.4+BA.5 [log2(TPM +1)], with ENST Ensembl Transcript IDs, and * indicating a significance level of p<0.05. (D) The folding structures and pDockQ scores (0.506, 0.569, 0.577, 0.560, 0.564, and 0.575 for Wuhan-Hu-1, Beta, Delta, BA.4/BA.5, BQ.1, and XBB, respectively) are shown.

We investigated the epidemiological relationship between symptoms and disease severity in patients by examining mutations in the RBM region for major SARS-CoV-2 lineages and their sublineages. We found that these mutations became fixed over time (Figure 3B and Supplementary file 1h and i). In the Delta variant, a significant increase in symptomatic infections occurred for mutations L452R (odds ratio 4.346, 95% CI 2.378–8.191) and T478K (odds ratio 3.116, 95% CI 1.595–6.292). For most patients with Omicron BA.5 and BQ.1, mutations were asymptomatic. However, among patients with Omicron BQ.1 mutations, only those with K444T were symptomatic. The Delta, Omicron BA.5, and Omicron BA.1 variants more frequently produced milder outcomes with more mutations. Consequently, we showed that as SARS-CoV-2 evolved to Omicron, the number of asymptomatic patients with mild outcomes increased, with infection symptoms becoming less severe.
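
For readers unfamiliar with how an odds ratio and its 95% confidence interval relate to a 2×2 table, the following Python sketch uses the standard Wald interval on the log odds ratio. It is an illustration only: the estimation method actually used in this study is not stated in this section, and the counts in the example are hypothetical.

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio with a Wald confidence interval from a 2x2 table.

    a: mutation present, symptomatic     b: mutation present, asymptomatic
    c: mutation absent, symptomatic      d: mutation absent, asymptomatic
    z: normal quantile (1.96 corresponds to a 95% interval)
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells must be positive for the Wald interval")
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR).
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se)
    upper = math.exp(math.log(odds_ratio) + z * se)
    return odds_ratio, lower, upper

# Hypothetical counts, not data from this study.
odds_ratio, lower, upper = odds_ratio_ci(20, 10, 10, 20)
```

An interval that lies entirely above 1, as in the L452R and T478K results reported above, indicates that the mutation is associated with higher odds of symptomatic infection.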

Since we cannot determine the severity and presence of symptoms in SARS-CoV-2 from mutations alone, we analyzed gene expression within host cells of infected patients. We utilized GSE235262 from the NCBI GEO database to analyze the gene expression of controls (uninfected), Alpha (B.1.1.7), Delta (B.1.617.2), and Omicron (B.1.1.529, BA.2, BA.4, and BA.5) variants during waves of COVID-19. In particular, we focused on the expression differences between Alpha (B.1.1.7) and Omicron (BA.4/BA.5), and between Delta (B.1.617.2) and Omicron (BA.4/BA.5), conducting enrichment analyses on genes that showed significant differences (Supplementary file 1j). Compared to Omicron (BA.4/BA.5), most genes in Delta (B.1.617.2) were upregulated in molecular and virus infection pathways (p<0.05). Notably, in the mTOR pathway, there was a correlation with receptor genes leading to the PI3K-AKT pathway, lysosome, ribosome biogenesis, autophagy, and lipid biosynthesis. In the virus infection pathway, endocytosis and metabolic pathways were related (Figure 3C). Compared to Omicron (BA.4/BA.5), Delta is more associated with various pathways within the host cell, leading to inflammation, replication, and severity.

We believe that Omicron’s characteristics are not solely due to its molecular cell-level features within the host cell but also stem from its evolution, affecting its binding affinity to the host’s ACE2. Therefore, we utilized pDockQ and HADDOCK to predict the binding affinity between SARS-CoV-2 and the ACE2 receptor affected by mutations (van Zundert et al., 2016; Bryant et al., 2022). Based on the pDockQ results, with Wuhan-Hu-1 as the standard, the SARS-CoV-2 variants showed strong affinity in descending order of Delta (0.577), Omicron XBB (0.575), Beta (0.569), Omicron BQ.1 (0.564), and Omicron BA.4/BA.5 (0.560) (Figure 3D, Figure 3—figure supplement 1). Compared to the binding affinity of Wuhan-Hu-1 (0.506), the lineages with higher binding affinity were associated with infectivity (Supplementary file 1k–1m and Figure 3—figure supplement 2).
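
The ordering above can be reproduced with a few lines of Python that rank the reported pDockQ scores and express each as a difference from the Wuhan-Hu-1 baseline. The scores are the values quoted in the text; the variable names are illustrative only.

```python
# pDockQ scores as reported in the text and in Figure 3D.
pdockq = {
    "Wuhan-Hu-1": 0.506,
    "Beta": 0.569,
    "Delta": 0.577,
    "BA.4/BA.5": 0.560,
    "BQ.1": 0.564,
    "XBB": 0.575,
}

baseline = pdockq["Wuhan-Hu-1"]

# Rank the variants by score, highest first, and report the gain over the baseline.
ranked = sorted(
    (
        (name, score, round(score - baseline, 3))
        for name, score in pdockq.items()
        if name != "Wuhan-Hu-1"
    ),
    key=lambda row: row[1],
    reverse=True,
)

for name, score, gain in ranked:
    print(f"{name:10s} pDockQ={score:.3f} (+{gain:.3f} vs Wuhan-Hu-1)")
```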

The results showed the following descending order of binding affinities for the ACE2 receptor: Delta, XBB, Beta, BQ.1, BA.4/BA.5, BA.1/BA.2, Alpha, and Wuhan-Hu-1. While Beta had a high-affinity score, it has had a low occurrence rate since its emergence, and it did not receive a high score from our evaluation model.

To check for interspecies infection in humans, bats, and pangolins, we measured the binding affinity between the virus and the host. We also measured the binding affinity of SARS-CoV-1. The binding strength of SARS-CoV-2 to the Homo sapiens ACE2 receptor is weaker than that of SARS-CoV-1 (Cao et al., 2021). Although the binding strength with the same host is higher, it weakens with a different host. This suggests that viruses that evolve within a human host are likely to be the most infectious to humans. As SARS-CoV-2 evolved to Omicron, the binding affinity increased for the virus.

Overall, the Delta variant was found to be more associated with the host molecular pathway and severity. Meanwhile, the Omicron variant showed a higher interaction between the ACE2 and RBM region. The Omicron variant is more infectious.

Development of APESS, an evaluation scoring model, and the evaluation of lineages

We developed APESS, an evaluation model to analyze viral sequences based on the nucleotide, amino acid, and protein structure properties. APESS was calculated from four component scores: sub-clustering of protein structure (SCPS), polarity change score (PCS), mutation rate (MR), and biochemical properties eigen score (BPES) (Figure 4A and Supplementary file 1n–q and Figure 4—figure supplement 1). The detailed calculations and components of APESS are described in the Materials and Methods section.

Figure 4 with 2 supplements

APESS: a comprehensive evaluation model of SARS-CoV-2 mutations.

(A) Amino acid property eigen selection score (APESS), an evaluation model based on the properties discovered in the receptor binding motif (RBM) and the infectivity of SARS-CoV-2, was developed. A 72-amino acid-long RBM sequence of SARS-CoV-2 was used to comprehensively evaluate the sub-clustering of protein structure (SCPS), polarity change score (PCS), mutation rate (MR), and biochemical properties eigen score (BPES). Through comprehensive analysis of each position, the infectivity of the input sequence could be evaluated against preexisting lineages. (B) The APESS scores were calculated for SARS-CoV-2 lineages Alpha, Beta, Delta, and Omicron (BA.2.75, BA.5, XBB, BQ.1), and the data were obtained for sublineages from viral sequences. The original lineages are displayed with a gray triangle and their APESS scores, whereas the sublineages are color-coded differently. The S477K substitution resulted in the highest APESS score.

We calculated the APESS scores of 7,335,614 viral sequences encompassing VOCs and their sublineages (Figure 4B). The Alpha and Beta variants received lower APESS scores, whereas the Delta and Omicron variants (BA.5, BQ.1, and XBB) received higher scores.

We generated an APESS distribution graph for the VOC sublineages. Significant variants, including Delta and Omicron, showed APESS values higher than 1.62, indicating high infectivity (Figure 4A, Figure 4—figure supplement 2). The APESS scores of the sublineages were similar to those of the VOCs. Some sublineages had lower APESS scores because of the wide variety of amino acid substitutions in the RBM region (Figure 4B). The sublineages with the highest APESS scores possessed amino acid substitutions to lysine (K) and arginine (R) at S477.

APESS evaluates the infectivity of a SARS-CoV-2 lineage. Lower APESS scores indicate that the lineage is less infectious, whereas higher APESS scores indicate that it may be more infectious.

Based on cumulative prevalence results from investigations on March 21, 2023 (outbreak.info), 2% cumulative prevalence of the S477K mutation was confirmed in the Omicron variants BA.5.1 and BA.5.2. Severe infections were associated with amino acid substitutions to lysine (K) and arginine (R) at the S477 position.

Utilizing a Gaussian mixture model (GMM), we divided the APESS scores of 30,000 randomly selected sublineages into four components. Four groups were identified: G1 (Alpha and Beta; centroid at 0.0289), G2 (BA.1, BA.2.75, and XBB; centroid at 1.9474), G3 (Delta; centroid at 1.6383), and G4 (BA.2, BA.4/5, and BQ.1; centroid at 2.0802). We also examined the distribution for all 7,335,614 sequences and found that the results were the same as for the 30,000 sampled sublineages. We observed that most of Alpha and Beta were included in the component with a centroid of 0.0289, while the component with a centroid of 2.0627 included various Omicron variants.
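
To make the mixture-model step concrete, the following is a minimal, dependency-free Python sketch of fitting a four-component one-dimensional Gaussian mixture by expectation-maximization. It is not the implementation used in this study (a library such as scikit-learn's `GaussianMixture` would normally be used), the quantile-based initialization is an assumption chosen for reproducibility, and the scores are synthetic values with arbitrary, well-separated centres rather than APESS data.

```python
import math
import random

def fit_gmm_1d(values, k, iterations=100):
    """Fit a k-component 1D Gaussian mixture by EM; return (weights, means, variances)."""
    data = sorted(values)
    n = len(data)
    # Deterministic start: means at evenly spaced quantiles, equal weights,
    # and a shared variance derived from the overall spread.
    means = [data[int((j + 0.5) * n / k)] for j in range(k)]
    overall_mean = sum(data) / n
    overall_var = sum((x - overall_mean) ** 2 for x in data) / n
    variances = [max(overall_var / k, 1e-6)] * k
    weights = [1.0 / k] * k

    for _ in range(iterations):
        # E-step: responsibility of each component for each point.
        responsibilities = []
        for x in data:
            dens = [
                w * math.exp(-((x - m) ** 2) / (2 * v)) / math.sqrt(2 * math.pi * v)
                for w, m, v in zip(weights, means, variances)
            ]
            total = sum(dens) or 1e-300  # guard against numerical underflow
            responsibilities.append([d / total for d in dens])
        # M-step: re-estimate weights, means and variances.
        for j in range(k):
            nj = max(sum(r[j] for r in responsibilities), 1e-12)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(responsibilities, data)) / nj
            variances[j] = max(
                sum(r[j] * (x - means[j]) ** 2 for r, x in zip(responsibilities, data)) / nj,
                1e-6,  # variance floor keeps a component from collapsing
            )
    return weights, means, variances

# Synthetic scores (hypothetical centres, not APESS values).
rng = random.Random(0)
true_centres = [0.0, 0.8, 1.6, 2.4]
scores = [rng.gauss(c, 0.05) for c in true_centres for _ in range(200)]
weights, means, variances = fit_gmm_1d(scores, k=4)
```

The fitted `means` play the role of the component centroids reported above, and each score can be assigned to the component with the highest responsibility.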

Creation of candidate lineages and protein structure predictions

We introduced amino acid substitutions at specific locations in the SARS-CoV-2 backbone for the wild-type and VOCs. The amino acid substitutions were lysine (K), arginine (R), asparagine (N), serine (S), tyrosine (Y), and glycine (G). We then evaluated the infectivity of these candidate lineages with our evaluation model APESS. APESS scores of candidate lineages with amino acid substitutions to positively charged residues K and R were higher than those of the SARS-CoV-2 variants. In particular, S477K and S477R amino acid substitutions were linked to increased infectivity. In contrast, amino acid substitutions of N, S, Y, and G resulted in APESS scores similar to or lower than those of the SARS-CoV-2 variants (Figure 5A).
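
The construction of candidate sequences can be sketched in Python as follows. This is an illustrative example only: the helper names, the choice of positions, and the hand-transcribed Wuhan-Hu-1 RBM string (spike residues 437–508) are assumptions, and the sketch does not reproduce the exact set of positions or the scoring used in this study.

```python
RBM_START = 437
# Wuhan-Hu-1 spike residues 437-508, transcribed by hand; verify before use.
RBM_WUHAN = (
    "NSNNLDSKVGGNYNYLYRLFRKSNLKPFERDISTEIYQAG"
    "STPCNGVEGFNCYFPLQSYGFQPTNGVGYQPY"
)

def apply_substitution(sequence, label, start=RBM_START):
    """Apply a substitution such as 'S477K' to a sequence whose first residue is `start`."""
    ref_aa, position, new_aa = label[0], int(label[1:-1]), label[-1]
    index = position - start
    if not 0 <= index < len(sequence):
        raise ValueError(f"position {position} is outside the sequence")
    if sequence[index] != ref_aa:
        raise ValueError(f"expected {ref_aa} at {position}, found {sequence[index]}")
    return sequence[:index] + new_aa + sequence[index + 1:]

def enumerate_candidates(sequence, positions, targets="KR", start=RBM_START):
    """Build {label: mutated sequence} for every position/target residue pair."""
    candidates = {}
    for position in positions:
        ref_aa = sequence[position - start]
        for new_aa in targets:
            if new_aa != ref_aa:
                label = f"{ref_aa}{position}{new_aa}"
                candidates[label] = apply_substitution(sequence, label, start)
    return candidates

# Hypothetical choice of positions on the Wuhan-Hu-1 backbone.
candidates = enumerate_candidates(RBM_WUHAN, positions=[437, 460, 477])
```

Checking the reference residue inside `apply_substitution` catches labels that do not match the chosen backbone, for example applying `N501R` to a backbone that already carries N501Y.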

Figure 5 with 4 supplements

Multifaceted evaluation of SARS-CoV-2: evaluation model, machine learning, and in vitro assay.

(A) Mutagenesis sequences containing consecutive hydrophilic amino acids were evaluated with amino acid property eigen selection score (APESS). They were based on Wuhan-Hu-1, Alpha, Beta, Delta, BA.1, BA.2, BA.2.75, BA.4/BA.5, XBB, and BQ.1, as indicated by colors and gray triangles. APESS values for the K, R, N, S, and Y mutated sequences of the lineages are displayed. Mutagenesis to lysine (K) and arginine (R) in Omicron sublineages resulted in increased APESS scores, whereas mutagenesis to asparagine (N), serine (S), and tyrosine (Y) resulted in decreased APESS scores. Specific regions for K and R are magnified to show the distribution of the APESS scores of the mutagenesis sequences in more detail. (B) To predict mutations with high infectivity using APESS, mutagenesis was performed using Wuhan-Hu-1, Alpha, Beta, Delta, BA.1, BA.2, BA.2.75, BA.4/BA.5, XBB, and BQ.1 as the backbone. The presence of these amino acid substitutions was verified using the viral sequence data from GISAID. For each lineage, the amino acid substitutions resulted in 280 mutagenic sequences. Thirty sequences with the highest APESS and pDockQ scores are displayed. N460R and S469R have not been observed naturally, whereas N439R, S459R, N437R, Y501R, S438R, and S494R have been observed in ten or fewer people. (C) For mutations occurring in lineages and mutations evaluated through APESS, AI learning models (Random Forest, LightGBM, XGBoost, Ensemble, and deep learning) were used to investigate the probability. For N460K, there was a ninefold increase in the probability in XBB compared to prior Omicron lineages. Q493R is not found in XBB but still has a high probability of occurrence. (D) The effects of N437 and N460 amino acid substitutions on viral infection were evaluated in vitro using luciferase and viral entry assays. There was a significant increase in the relative light units (RLU) and viral entry percentage for N437R, whereas both decreased for N460K.

Of the mutated sequences, amino acid substitutions of K and R showed the highest APESS and pDockQ scores (Figure 5B).

We used deep learning and machine learning methods to determine the probability of amino acid substitutions at specific locations in the lineages before and after the emergence of the Omicron variant. Models such as random forest, Light Gradient Boosting Machine (LightGBM), eXtreme Gradient Boosting (XGBoost), and ensemble methods were used, and consistent results were obtained (Figure 5—figure supplement 1 and Supplementary file 1r–1v). Specifically, for N460K, we found that the probability increased as the lineage progressed to Omicron. For Q493R, the probability was relatively constant for Omicron lineages (Figure 5C and Supplementary file 1w). We used accuracy, precision, recall, and F1 score to evaluate performance. All models scored above 0.95 in precision, recall, and F1 score. For accuracy, XGBoost scored above 0.89, exhibiting relatively high performance, while LightGBM scored above 0.78.
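
For reference, the four performance measures named above can be computed from binary labels as in the following Python sketch. The function name and the example labels are hypothetical and are not taken from the models or data in this study.

```python
def binary_metrics(y_true, y_pred):
    """Return accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical labels: 1 = substitution observed, 0 = not observed.
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1, 0, 1]
metrics = binary_metrics(y_true, y_pred)
```

Because accuracy also counts true negatives while precision, recall, and F1 do not, a model can score higher on the latter three than on accuracy, which is the pattern reported above.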

To validate the binding kinetics of Q493R and N460K variant RBDs to human ACE2 at the macromolecular level, we performed surface plasmon resonance (SPR). In line with our computational prediction, both variants showed approximately a threefold increase in binding affinity compared to the wild-type RBD (Figure 5—figure supplements 2 and 3).

We further verified the infectivity of the predicted mutations using luciferase and viral entry assays (Figure 5D). The N437R mutation led to a twofold increase in luciferase activity and viral entry compared to the wild-type (spike protein D614G). In contrast, the N460K and Q493R mutations led to a decrease in luciferase activity and viral entry compared to the wild-type (spike protein D614G) (Figure 5D, Figure 5—figure supplement 3). However, the viral entry of N460K was 10-fold higher than that of Q493R. N460K mutations occurred at a low rate but had a higher probability of infection compared to other mutations.

Therefore, our results imply that the maintenance of the proliferative rate of the virus was due to new mutations, high docking scores, and an increased presence of Omicron variants (Figure 3—figure supplements 3–12).

Based on our findings, we developed AIVE (https://ai-ve.org/), a web-based platform enabling rapid and precise prediction of protein structure and calculation of SARS-CoV-2 infectivity. AIVE facilitates the analysis of virus sequences entered by users and provides visualization and analysis reports, allowing it to be used in environments without GPU installation (Figure 6). For example, we used a customized sequence wherein N460K substitution was introduced into the Wuhan-Hu-1 sequence (Wuhan-Hu-1+N460K) as an input. The output generated the following four results:

Figure 6 with 2 supplements

Prediction of potential SARS-CoV-2 mutations through integrated evaluation and prediction.

This figure consists of three steps: ‘Input,’ ‘Processing,’ and ‘Output.’ (Input) Users can select a custom sequence from the entire SARS-CoV-2 sequence, choose variants of concern (VOCs), or create customized sequences for analysis. Depending on the user’s system environment, analysis can be done through local prediction (server) or Google Colab. (Processing) Three types of analyses are performed. First, the protein 3D structure is predicted. This includes the protein 3D structure, predicted local distance difference test (pLDDT), and predicted aligned error (PAE). Second, the infectivity is evaluated using APESS (2.12). For each position, the structural difference graph for biochemical properties eigen score (BPES), mutation rate (MR), polarity change score (PCS), and sub-clustering of protein structure (SCPS) is visualized. The APESS distribution is visualized for known VOCs and created variants. Third, polarity changes are visualized in sequences. (Output) Four results comparing Wuhan-Hu-1+N460K and XBB (with N460K) are output and visualized. First, through protein structure prediction, secondary structures can be confirmed in XBB (with N460K) compared to Wuhan-Hu-1+N460K (yellow arrow). Second, the polarity changes caused by the mutation in Wuhan-Hu-1+N460K are compared (red dotted line). Third, for XBB (with N460K), which has more mutations than Wuhan-Hu-1+N460K, the difference in the SCPS, PCS, MR, and BPES values at each position in the protein sequence is displayed (red dotted line). Fourth, the distribution of APESS, which represents the comprehensive value of SCPS, PCS, MR, and BPES, is shown. ‘apess’ indicates the score for each position in the customized protein sequence. For XBB (with N460K), which has more mutations than Wuhan-Hu-1+N460K, the apess score distribution ranges from –0.079 to 1.385 (yellow arrow). The X-axis shows the position in the RBM (72 aa) and the Y-axis shows the APESS score (④–1). APESS is the sum of the values at each position and can be used to evaluate infectivity. A region including XBB (with N460K) shows infectivity due to many mutations, and also shows an increase in APESS score due to the N460K mutation. The X-axis shows the APESS score and the Y-axis shows the density (④–2). AIVE comprehensively evaluates viral infectivity, protein structure, amino acid substitutions, and polarity changes in preexisting and potential SARS-CoV-2 sequences.

First, protein structure prediction results showed that Wuhan-Hu-1+N460K had increased folding compared to Wuhan-Hu-1. Meanwhile, XBB, which contains the N460K substitution, demonstrated a more stabilized protein structure compared to Wuhan-Hu-1+N460K due to the alpha-helix in the 493–495 region and the beta-sheet in the 465–467 and 505–507 regions (Panel ① of Figure 6). Second, for polarity changes, there was a decrease in consecutive hydrophilic amino acids in Wuhan-Hu-1+N460K due to the mutation (Panel ② of Figure 6). Third, calculations of the APESS sub-scores showed a difference in scores for XBB compared to Wuhan-Hu-1+N460K owing to mutations (Panel ③ of Figure 6). Fourth, the APESS distribution was displayed as ‘apess,’ which reflects the score at each position in the amino acid sequence. The apess score at N460K for both Wuhan-Hu-1+N460K and XBB (with N460K) was 0.112 (Panel ④–1 of Figure 6). Furthermore, the distribution of the total ‘APESS’ score, which comprehensively evaluates each position of the sequencing results, was used to determine the infectivity sections. Usage guidelines for AIVE can be found in the supplementary information (Figure 6—figure supplement 1).

An apess score at N460K for XBB (with N460K) was 1.967. The XBB (with N460K) mutation was located in the infectivity region with a statistically significant association (>95%, pink color), whereas the APESS score increased for Wuhan-Hu-1+N460K compared to Wuhan-Hu-1 but was not included in the high infectivity section containing XBB (Panel ④–2 of Figure 6). Therefore, our research findings enable rapid protein structure prediction and APESS calculation via meticulous and systematic structuring of data. Visualization of these analyses and the evaluation of infectivity are available on AIVE.
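A customized input such as Wuhan-Hu-1+N460K is a backbone sequence with one residue replaced. The sketch below shows one way such a sequence could be built from a substitution written in the usual reference-position-alternate notation. The function name `apply_substitution`, the `offset` parameter, and the toy sequence are hypothetical and only illustrate the idea; this is not AIVE's own input-handling code.

```python
import re


def apply_substitution(sequence, mutation, offset=1):
    """Apply a substitution such as 'N460K' to a protein sequence.

    `offset` is the residue number of the first character of `sequence`:
    1 for a full-length protein, or e.g. 437 when only the RBM
    (residues 437-508) is supplied.
    """
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", mutation)
    if match is None:
        raise ValueError(f"Unrecognized mutation format: {mutation!r}")
    ref, pos, alt = match.group(1), int(match.group(2)), match.group(3)
    index = pos - offset
    if not 0 <= index < len(sequence):
        raise ValueError(f"Position {pos} lies outside the sequence")
    # Guard against applying the mutation to the wrong backbone or numbering.
    if sequence[index] != ref:
        raise ValueError(
            f"Expected {ref} at position {pos}, found {sequence[index]}"
        )
    return sequence[:index] + alt + sequence[index + 1:]


# Toy six-residue fragment (NOT a real spike sequence), numbered from 457,
# so that residue 460 is the asparagine at index 3.
print(apply_substitution("ACDNEF", "N460K", offset=457))  # ACDKEF
```

Checking the reference residue before substituting catches numbering mistakes early, which matters when the same mutation label is applied to several lineage backbones.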

Discussion

In this study, we aimed to comprehensively analyze the viral sequence and protein structure of SARS-CoV-2 and to investigate their association with epidemiological data. Specifically, we analyzed the impact of these factors on infection and predicted the occurrence of new mutations. Our approach involved a multistep analysis. First, we identified specific amino acid substitutions within the viral genome, focusing on their potential impact on SARS-CoV-2 protein structure. Second, we explored the impact of these structural changes on the interaction between SARS-CoV-2 and the host. Third, we assessed the potential effects of these changes on virus infectivity. This systematic analysis allowed us to gain useful insights into the behavior and evolution of the virus.

Hydrophobic protein structure plays an important role in protein stability and folding (Pace et al., 2011; Islam et al., 2019). This affects structural changes from hydrophilic to hydrophobic or positively charged residues.

Among the various viruses we investigated, flaviviruses, including Zika virus, Japanese encephalitis virus, and Dengue virus, are enveloped viruses that use envelope proteins to infiltrate the host. Infection is facilitated by E-protein folding, which causes the virus to fuse with its host (Hu et al., 2021). Recently, SARS-CoV-2 treatments targeted the folding of the nonstructural protein NSP3 of the virus (Bergasa-Caceres and Rabitz, 2020). Research on avian coronaviruses has shown that the introduction of a hydrophobic domain into the infectious bronchitis virus E protein affects the infectivity of the virus (Ruch and Machamer, 2011). Amino acid substitutions such as L411F, F472S, D510S, and I529T have been reported in MERS-CoV (Wong et al., 2021; Kleine-Weber et al., 2019). In addition to D510S, consecutive hydrophilic amino acids were observed in MERS-CoV, which we believe contributed to its low infectivity (Figure 1—figure supplement 4).

To discover mutations in VOCs and VUMs, our analysis primarily focused on the RBM (amino acids 437–508) region of the SARS-CoV-2 spike protein, which directly interacts with the ACE2 receptor and allows the virus to infiltrate host cells. In the Alpha, Beta, and Delta lineages, specific amino acid substitutions occurred in various locations. For Omicron, less diversity was observed for amino acid substitutions (Figure 2B).

Amino acid substitutions became fixed as SARS-CoV-2 lineages progressed. Examination of epidemiological data revealed four distinct periods. During the first and third periods, there was less diversity in amino acid substitutions, leading to a decrease in both infections and deaths (Figure 3A). In contrast, during the second period, there was more diversity in amino acid substitutions for the sublineages, leading to an increased number of cases and deaths. Thus, we established an association between the fixation of SARS-CoV-2 amino acid substitutions and infectivity.

We also made observations on amino acid substitutions at specific positions. The N437R mutation led to increased viral infectivity and received a high APESS score but was barely observed in patients. We detected only three instances of the N437R amino acid substitution, all arising from the AAT codon, in which two or more codon positions must change. Such changes were seen only in V445P, a mutation found in XBB and its sublineages, but it occurred at a minuscule rate of 0.14%. Although not yet prominent, N437R is expected to be associated with high infectivity and remains a primary candidate for close monitoring in the future.
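The codon argument can be checked against the standard genetic code. The sketch below counts the minimum number of nucleotide changes needed to turn a given codon into any codon for a target amino acid. The helper name `min_nucleotide_changes` is hypothetical, and the snippet is an illustration of the reasoning rather than the analysis pipeline used in the study.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order
# (TTT, TTC, TTA, TTG, TCT, ...); '*' marks stop codons.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def min_nucleotide_changes(codon, target_aa):
    """Minimum number of single-base changes from `codon` to any codon
    that encodes `target_aa` (one-letter code)."""
    codon = codon.upper().replace("U", "T")
    if codon not in CODON_TABLE:
        raise ValueError(f"Not a valid codon: {codon!r}")
    distances = [
        sum(a != b for a, b in zip(codon, candidate))
        for candidate, aa in CODON_TABLE.items()
        if aa == target_aa
    ]
    if not distances:
        raise ValueError(f"Unknown amino acid: {target_aa!r}")
    return min(distances)


print(min_nucleotide_changes("AAT", "R"))  # 2: every arginine codon differs at two or more positions
print(min_nucleotide_changes("AAT", "K"))  # 1: AAT -> AAA or AAG
```

From an AAT codon, asparagine to arginine therefore needs at least two nucleotide changes, whereas asparagine to lysine needs only one, which is consistent with a substitution to arginine being rarely observed.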

The frequency of N460K predicted through machine learning showed a ninefold increase compared to BA.1. However, the mutation showed decreased viral entry in vitro and low binding affinity in silico. The N460K mutation affects docking, but in combination with other mutations it could increase infectivity. Ito et al. confirmed that the N460K mutation forms a hydrogen bond with the N-linked glycan on N90 of human ACE2 (Ito et al., 2023).

The D467 position plays a key role in a salt-bridge interaction within the Delta variant (Baral et al., 2021). This, together with the extreme rarity of mutations at the position, suggests that D467 contributes to the structural stability of the virus. Specifically, amino acid substitutions D467P and D467I disrupt the salt bridge formed by R457 and R454 (Figure 2—figure supplement 1). Salt bridges play an important role in thermostabilization (Ban et al., 2019) and represent electrostatic interactions between positively charged and negatively charged residues. Alterations in positively charged residues can disrupt these salt-bridge interactions with negatively charged residues of SARS-CoV-2. Overall, mutations at this site impair the ability of SARS-CoV-2 to effectively infect the host. In silico results showed that amino acid substitutions at D467 caused the alpha helix present in the wild-type to change into a linear structure (Figure 2—figure supplement 3).

XBB predicted by ML exhibited a decrease in frequency, displaying a low slope and signifying slow disappearance (Figure 5C). Protein measurement proved to be difficult in vitro. Notably, SPR analysis showed an approximately twofold increase in binding affinity relative to the wild-type. The Q493R mutation by itself showed low docking, but in combination with other mutations it showed increased docking. We believe that the mutation itself negatively affects the spike protein structure but is important overall for binding to the host. Recent structural research that identified RBD-ACE2 complex structures of BQ.1.1 (N460K) and B.1.1.529 (Q493R) supports our results (Han et al., 2022).

Our research shows that consecutive hydrophobic amino acids in the RBM region and specific amino acid substitutions affect not only infection but also protein structure. We assessed the infectivity of SARS-CoV-2 lineages using protein structure prediction. We observed that the emergence of various mutations corresponded to changes in binding affinity. As SARS-CoV-2 lineages progressed to Omicron, their binding affinity became stronger, resulting in increased infectivity, notably in the Delta and Omicron variants. Conversely, the Beta variant showed high binding affinity but not high infectivity. Our assumption is that for consecutive amino acids and amino acid substitutions, lysine (K) and arginine (R) have less weight; therefore, Beta is unlikely to develop into a significant variant. We speculate that the Beta variant emerged early in the pandemic, potentially limiting its ability to infect a large population, while widespread vaccination efforts may have contributed to decreased infectivity.

During the evolutionary process of SARS-CoV-2, the increase in infectivity and severity of variants from Alpha to Delta was attributed to abnormalities in the molecular signaling systems of the host. In contrast, Omicron showed lower severity but increased infectivity. We believe this is because of structural changes that allow the variant to bind well with the ACE2 receptor (Figure 3, Figure 3—figure supplement 2). In addition, epidemiological data reveal a decrease in deaths while the reproduction rate is maintained, and the high docking of Omicron confirms its high infectivity. We created mutagenesis sequences from the backbones of Wuhan-Hu-1 and VOCs with amino acid substitutions to lysine (K) and arginine (R). Sequences from the backbone of Omicron XBB were predicted to have high infectivity and are still liable to occur. We can therefore expect consistent infections caused by SARS-CoV-2 going forward.

We comprehensively evaluated genomic and epidemiological data. This multifaceted approach, along with the usage of APESS, machine learning, in vitro assays, and AIVE ensured a more comprehensive understanding of SARS-CoV-2 behavior and evolution. In addition, by applying GMM to the APESS score of a new sample, we can predict its membership within a specific component (Figure 4A). This prediction allows us to make informed conjectures about its potential strains and infectivity.
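Predicting the membership of a new sample in a Gaussian mixture amounts to computing the posterior probability of each component given the sample's score. The following is a minimal one-dimensional sketch of that calculation. The function names and the two-component parameters are placeholders chosen for illustration; they are not the components fitted in the study, and a library implementation could be used instead.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and std at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def gmm_membership(x, components):
    """Posterior probability of each mixture component for a score x.

    `components` is a list of (weight, mean, std) tuples describing an
    already-fitted one-dimensional Gaussian mixture.
    """
    weighted = [w * gaussian_pdf(x, m, s) for w, m, s in components]
    total = sum(weighted)
    if total == 0.0:
        raise ValueError("Score is too far from every component to classify")
    return [value / total for value in weighted]


# Hypothetical mixture: placeholder values, NOT parameters from the study.
components = [(0.5, 0.0, 1.0), (0.5, 4.0, 1.0)]
new_score = 4.0  # stands in for the APESS score of a new sample
posterior = gmm_membership(new_score, components)
best = max(range(len(posterior)), key=posterior.__getitem__)
print(best, posterior)
```

The component with the largest posterior is the predicted membership, and the posterior values themselves indicate how confidently the sample is assigned.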

Based on our findings, we predicted approximately 440 mutations with a probability of occurring in the near future. One year after the predicted datasets were generated, BA.2.86, EG.5.1, and HK.3, which were included in the datasets, were reported at the highest frequencies (43%, 28%, and 12%, respectively) on Nextstrain (https://nextstrain.org/ncov/gisaid/global/6m; December 21, 2023). Their APESS values were also high, at 2.051, 1.982, and 1.986, respectively. These mutations are provided in AIVE and can be considered important targets for preventing a new wave of infections.

We created AIVE to feature our findings and analyses on an online platform. AIVE can be easily utilized without expert knowledge of its algorithm, and its analysis features can be used without GPU setup or software library installation. Furthermore, AIVE offers a user-friendly interface with the flexibility to input and append experiment data for analysis. Additionally, ongoing and past analyses can be accessed after processing, providing many options for management. The time required for analysis on the local server or Google Colab can be verified. The local server, which runs on an RTX8000, takes 50 min, offers a large database, and is free to use (Figure 6). After completing the job, the user can examine and manage the analysis results permanently with their account generated on AIVE.

Prior research of SARS-CoV-2 included analysis of epidemiological data, molecular work, and AI applications. We adopted a comprehensive approach, utilizing real-world data and multi-faceted validation. Our research reveals that SARS-CoV-2 increased in infectivity over time, illuminating significant trends in viral infections. Our discoveries, evaluation model, and the AIVE platform will serve as the foundation for preparedness against further developments in future pandemics.

Author details

Jongkeun Park (contributed equally)

WonJong Choi (contributed equally)

Do Young Seong (contributed equally)

Seungpil Jeong

Ju Young Lee

Hyo Jeong Park

Dae Sun Chung

Kijong Yi

Uijin Kim

Ga-Yeon Yoon

Hyeran Kim

Taehoon Kim

Sooyeon Ko

Eun Jeong Min

Hyun-Soo Cho

Nam-Hyeok Cho

Dongwan Hong (for correspondence)
