Analysis of protein properties discovered in the SARS-CoV-2 amino acid sequence

(A) The SARS-CoV-2 amino acid sequence between positions 437 and 508 in the Receptor Binding Motif (RBM) is displayed with the corresponding amino acids in the original positions. Amino acid substitutions are shown for Alpha, Beta, Delta, Omicron BA.1, Omicron BA.2, Omicron BA.4/BA.5, Omicron BQ.1, and Omicron XBB. Hydrophilic (polar) amino acids are displayed in red, hydrophobic (non-polar) in blue, acidic in green, and alkaline (positively charged) in yellow.

(B) The number of polarity changes [N: hydrophobic (nonpolar), P: hydrophilic (polar), A: acidic, and B: alkaline (basic)] in the Receptor Binding Domain (RBD) region is displayed. Wuhan-Hu-1, Alpha, Beta, Delta, Omicron (BA.1), Omicron (BA.2), Omicron (BA.2.75), Omicron (BA.4/BA.5), Omicron (XBB), and Omicron (BQ.1) are presented on the graph, with each lineage color-coded. The polarity change counts for NN*, NP*, PN*, and PP* are shown in more detail: PN* (Wuhan-Hu-1: 39, Alpha: 39, Beta: 40, Delta: 37, Omicron BA.1: 36, Omicron BA.2: 35, Omicron BA.2.75: 35, Omicron BA.4/BA.5: 34, Omicron XBB: 36, Omicron BQ.1: 34) and PP* (Wuhan-Hu-1: 31, Alpha: 31, Beta: 31, Delta: 30, Omicron BA.1: 27, Omicron BA.2: 26, Omicron BA.2.75: 26, Omicron BA.4/BA.5: 27, Omicron XBB: 29, Omicron BQ.1: 27). Overall, the polarity change count was lower for PP* than for PN* in every SARS-CoV-2 lineage.
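As a rough illustration of how counts such as PN and PP could be tallied, the sketch below maps each residue to one of the four property classes and counts class pairs over adjacent residues. It is not the authors' procedure: the adjacent-pair definition, the handling of the asterisk in NN*, NP*, PN*, and PP*, and the class assigned to H, M, and W (which the legend does not list) are all assumptions, and the names `PROPERTY_CLASS`, `classify`, and `adjacent_pair_counts` are hypothetical.

```python
from collections import Counter

# Property classes: N = hydrophobic (nonpolar), P = hydrophilic (polar),
# A = acidic, B = basic. The grouping of the 17 residues follows the order
# given in panel (C); the classes for H, M, and W are an assumption.
PROPERTY_CLASS = {
    **dict.fromkeys("YTSQNC", "P"),
    **dict.fromkeys("VPLIGFAMW", "N"),
    **dict.fromkeys("RKH", "B"),
    **dict.fromkeys("ED", "A"),
}


def classify(sequence: str) -> str:
    """Translate an amino acid sequence into its N/P/A/B class string."""
    return "".join(PROPERTY_CLASS[residue] for residue in sequence.upper())


def adjacent_pair_counts(sequence: str) -> Counter:
    """Count class pairs (for example 'PN' or 'PP') over adjacent residues."""
    classes = classify(sequence)
    return Counter(a + b for a, b in zip(classes, classes[1:]))


# Short illustrative fragment, not a full RBD sequence.
toy_fragment = "NSNNLDSK"
print(classify(toy_fragment))
print(adjacent_pair_counts(toy_fragment))
```

Running the same function over each lineage's RBD sequence would give one counter per lineage, which could then be compared pair type by pair type.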

(C) The amino acid substitutions in the RBM region from the reference to VOCs are displayed. The seventeen amino acids in the reference list are tyrosine (Y), threonine (T), serine (S), glutamine (Q), asparagine (N), cysteine (C), valine (V), proline (P), leucine (L), isoleucine (I), glycine (G), phenylalanine (F), alanine (A), arginine (R), lysine (K), glutamic acid (E), and aspartic acid (D). For VOCs, the amino acid substitutions are indicated by gray lines. There was a more than two-fold increase in lysine (K) and arginine (R) in VOCs compared with the reference.

Evaluation and validation of amino acid substitutions in the SARS-CoV-2 RBM region

(A) Mutations, their occurrence rates as percentages, and the original amino acid at the position are shown. Alpha, Beta, Delta, Omicron (BA.2.75), Omicron (BA.4/BA.5), Omicron (XBB), and Omicron (BQ.1) lineages are displayed with corresponding colors. N440, L452, S477, T478, E484, F486, N501, and Y505 are indicated by yellow triangles while D467 is indicated by a red triangle.

(B) For positions N440, L452, S477, T478, E484, F486, N501, and Y505, lineages and amino acid substitutions are displayed. Arrows indicate the mutation rate: the arrow width corresponds to the percentage, and the colors indicate the lineages Alpha, Beta, Delta, Omicron (BA.2.75), Omicron (BA.4/BA.5), Omicron (XBB), and Omicron (BQ.1). The colors in the pie chart indicate amino acids. The mutation rate of Alpha is lower than 20%. At position 501 in the RBM region, amino acid substitutions from asparagine (N) to tyrosine (Y) (n = 282) occurred along with other substitutions. For the Beta variant, position 484 showed a mutation rate over 60% with E484K. The Delta variant showed a mutation rate of 60% at position 452. L452R and T478K amino acid substitutions along with various mutations were observed. In Omicron, the mutation rate for S477, T478, E484, N501, and Y505 was over 40%. The amino acid substitutions were S477K, T478K, E484A, N501Y, and Y505H. We calculated the mutation rates for the following substitutions: T478K (99.95%), Q498R (94.14%), N501Y (99.52%), and Y505H (97.66%).
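A mutation rate of this kind can be read as the share of sequences in a lineage that carry a given substitution. The sketch below computes that percentage for every substitution seen in a set of aligned sequences. The function name `substitution_rates`, the toy reference fragment, and the assumption that all sequences are already aligned to the reference without gaps are illustrative and not taken from the study.

```python
from collections import Counter


def substitution_rates(sequences, reference, start=437):
    """Return {substitution label: percent of sequences carrying it}.

    `sequences` must be aligned to `reference` and have the same length.
    `start` is the 1-based spike position of the first residue
    (437 for the RBM described in this legend).
    """
    if not sequences:
        return {}
    counts = Counter()
    for seq in sequences:
        if len(seq) != len(reference):
            raise ValueError("sequences must be aligned to the reference")
        for offset, (ref_aa, aa) in enumerate(zip(reference, seq)):
            if aa != ref_aa:
                counts[f"{ref_aa}{start + offset}{aa}"] += 1
    total = len(sequences)
    return {label: 100.0 * n / total for label, n in counts.items()}


# Toy data: a six-residue fragment numbered from position 458.
reference = "KSNLKP"
observed = ["KSNLKP", "KSKLKP", "KSKLKP", "KSRLKP"]
print(substitution_rates(observed, reference, start=458))
# {'N460K': 50.0, 'N460R': 25.0}
```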

(C) The effect of the D467 amino acid substitution on viral infection was evaluated in vitro via luciferase and viral entry assays. Mutagenesis at D467 to the hydrophobic amino acids proline (P) and isoleucine (I) was performed. There was a significant decrease in the relative light units (RLU) and viral entry percentages for both D467P and D467I (p < 0.0001).

Association between SARS-CoV-2 mutations and epidemiological data

(A) For each SARS-CoV-2 lineage, the position and distribution of amino acid substitutions were analyzed alongside epidemiological data. The first period, from November 2020 to December 2021, was characterized by low infections and high deaths. Delta was the most prominent lineage during this period, which coincided with worldwide vaccinations. The amino acid substitutions L452R and T478K were observed at the highest frequencies, both together and independently. The second period, from January 2022 to April 2022, saw an increase in new cases and deaths. BA.5 was prominent during this period, with various amino acid substitutions observed. The third period, from May 2022 to November 2022, showed a significant decrease in infections and deaths. Omicron, specifically BQ.1, was prominent during this period, and worldwide vaccination rates decreased.

(B) From the viral sequences of the patients, the association between the primary mutations of Delta, BA.5, and BQ.1 and epidemiological data (symptoms and severity) was analyzed. Odds ratios are displayed for N440K, K444T, L452R, N460K, S477N, T478K, E484A, F486V, Q498R, N501Y, and Y505H. L452R was an indicator of symptomaticity in Delta and BA.5. K444T was associated with symptomaticity in BQ.1. All mutations were associated with mildness. The 95% confidence intervals are shown for Delta, BA.5, and BQ.1.
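For readers unfamiliar with how an odds ratio and its 95% confidence interval are obtained, the sketch below uses the standard 2x2-table odds ratio with a Wald interval on the log scale. The legend does not state which method the authors used, so this is only one common choice. The function name `odds_ratio_ci` and the example counts are hypothetical.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and Wald confidence interval for a 2x2 table.

    a: mutation present, outcome present
    b: mutation present, outcome absent
    c: mutation absent,  outcome present
    d: mutation absent,  outcome absent
    z = 1.96 gives an approximate 95% interval.
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells must be positive for this approximation")
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) under the usual large-sample approximation.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical counts, for illustration only.
print(odds_ratio_ci(20, 10, 10, 20))
```

An interval that excludes 1 suggests an association between the mutation and the outcome. With small or zero cell counts this approximation is unreliable, and an exact method would be the safer choice.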

(C) We analyzed the expression of Delta compared to BA.4/BA.5 using the GSE235262 dataset. In Delta, the mTOR pathway was observed to regulate ribosome biogenesis, autophagy, and lipid biosynthesis while also playing a role in viral infection pathways. The expression values were calculated as Delta [log2(TPM+1)] / Omicron BA.4+BA.5 [log2(TPM+1)], with ENST Ensembl Transcript IDs, and * indicating a significance level of p<0.05.
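The expression value defined above is a ratio of two log-transformed abundances. A minimal sketch of that arithmetic for a single transcript follows. The function names are hypothetical, and how replicate samples were aggregated before taking the ratio is not stated in the legend, so the inputs here are assumed to be one summary TPM value per lineage.

```python
import math


def log2_tpm(tpm: float) -> float:
    """Return log2(TPM + 1); the pseudocount keeps TPM = 0 defined."""
    if tpm < 0:
        raise ValueError("TPM cannot be negative")
    return math.log2(tpm + 1.0)


def expression_ratio(delta_tpm: float, omicron_tpm: float):
    """Delta log2(TPM+1) divided by Omicron BA.4/BA.5 log2(TPM+1).

    Returns None when the Omicron value is 0, because the ratio is
    undefined in that case.
    """
    denominator = log2_tpm(omicron_tpm)
    if denominator == 0:
        return None
    return log2_tpm(delta_tpm) / denominator


# Illustrative values: log2(8) / log2(4) = 3 / 2
print(expression_ratio(7, 3))  # 1.5
```

Values above 1 indicate higher expression in Delta than in BA.4/BA.5 on this scale, and values below 1 indicate the reverse.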

(D) The folding structures and pDockQ scores (0.506, 0.569, 0.577, 0.560, 0.564, and 0.575 for Wuhan-Hu-1, Beta, Delta, BA.4/BA.5, BQ.1, and XBB, respectively) are shown.

APESS: a comprehensive evaluation model of SARS-CoV-2 mutations

(A) Amino acid Property Eigen Selection Score (APESS), an evaluation model based on the properties discovered in the RBM and the infectivity of SARS-CoV-2, was developed. A 72-amino acid-long RBM sequence of SARS-CoV-2 was used to comprehensively evaluate the sub-clustering of protein structure (SCPS), polarity change score (PCS), mutation rate (MR), and biochemical properties eigen score (BPES). Through comprehensive analysis of each position, the infectivity of the input sequence could be evaluated against preexisting lineages.
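The 72-residue length follows from the RBM boundaries given earlier (positions 437 to 508, inclusive). The sketch below shows the index arithmetic for slicing that window from a full-length spike sequence and for mapping a spike position to an index within the RBM. The helper names are hypothetical, and the code assumes the input is numbered like the Wuhan-Hu-1 reference, which would not hold for an unaligned variant sequence that carries insertions or deletions upstream of the RBM.

```python
RBM_START, RBM_END = 437, 508  # 1-based, inclusive spike positions


def rbm_length() -> int:
    """Number of residues in the RBM window."""
    return RBM_END - RBM_START + 1


def extract_rbm(spike: str) -> str:
    """Slice the RBM from a spike sequence in reference numbering."""
    if len(spike) < RBM_END:
        raise ValueError("sequence is shorter than the RBM end position")
    return spike[RBM_START - 1:RBM_END]


def to_rbm_index(spike_position: int) -> int:
    """Map a 1-based spike position to a 0-based index within the RBM."""
    if not RBM_START <= spike_position <= RBM_END:
        raise ValueError("position lies outside the RBM")
    return spike_position - RBM_START


print(rbm_length())       # 72
print(to_rbm_index(460))  # 23
```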

(B) The APESS scores were calculated for SARS-CoV-2 lineages Alpha, Beta, Delta, and Omicron (BA.2.75, BA.5, XBB, BQ.1), and the data were obtained for sublineages from viral sequences. The original lineages are displayed with a gray triangle and their APESS scores, whereas the sublineages are color-coded differently. The S477K substitution resulted in the highest APESS score.

Multifaceted evaluation of SARS-CoV-2: evaluation model, machine learning, and in vitro assay

(A) Mutagenesis sequences containing consecutive hydrophilic amino acids were evaluated with APESS. They were based on Wuhan-Hu-1, Alpha, Beta, Delta, BA.1, BA.2, BA.2.75, BA.4/BA.5, XBB, and BQ.1, as indicated by colors and gray triangles. APESS values for the K, R, N, S, and Y mutated sequences of the lineages are displayed. Mutagenesis of lysine (K) and arginine (R) in Omicron sublineages resulted in increased APESS scores, whereas mutagenesis of asparagine (N), serine (S), and tyrosine (Y) resulted in decreased APESS scores. Specific regions for K and R are magnified to show the distribution of the APESS scores of the mutagenesis sequences in more detail.

(B) To predict mutations with high infectivity using APESS, mutagenesis was performed using Wuhan-Hu-1, Alpha, Beta, Delta, BA.1, BA.2, BA.2.75, BA.4/BA.5, XBB, and BQ.1 as the backbone. The presence of these amino acid substitutions was verified using the viral sequence data from GISAID. For each lineage, the amino acid substitutions resulted in 280 mutagenic sequences. Thirty sequences with the highest APESS and pDockQ scores are displayed. N460R and S469R have not been observed naturally, whereas N439R, S459R, N437R, Y501R, S438R, and S494R have been observed in ten or fewer people.
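In silico mutagenesis of this kind amounts to enumerating substituted copies of a backbone sequence and labeling each one in the usual original-position-new notation (for example, N460R). The sketch below generates every single-residue substitution to a chosen set of target amino acids. It does not attempt to reproduce the 280 sequences per lineage, since the legend does not state which positions and residues were selected. The function name `single_substitutions` and the toy backbone are hypothetical.

```python
def single_substitutions(backbone, targets="KR", start=437):
    """Yield (label, mutant) for each single substitution to a target residue.

    `start` is the 1-based spike position of the first backbone residue.
    Positions that already hold the target residue are skipped.
    """
    for offset, original in enumerate(backbone):
        for target in targets:
            if target == original:
                continue
            mutant = backbone[:offset] + target + backbone[offset + 1:]
            yield f"{original}{start + offset}{target}", mutant


# Toy backbone: a six-residue fragment numbered from position 458.
mutants = dict(single_substitutions("KSNLKP", targets="KR", start=458))
print(len(mutants))      # 10
print(mutants["N460R"])  # KSRLKP
```

Each generated sequence could then be passed to a scoring function and ranked, and the labels checked against observed sequence data to see which substitutions already occur naturally.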

(C) For mutations occurring in lineages and mutations evaluated through APESS, AI learning models (Random Forest, LightGBM, XGBoost, Ensemble, and deep learning) were used to estimate the probability of occurrence. For N460K, there was a 9-fold increase in the probability in XBB compared to prior Omicron lineages. Q493R is not found in XBB but still has a high probability of occurrence.

(D) The effects of N437 and Q493 amino acid substitutions on viral infection were evaluated in vitro using luciferase and viral entry assays. There was a significant increase in the relative light units (RLU) and viral entry percentage for N437R, and a significant decrease in both for Q493R.

Prediction of potential SARS-CoV-2 mutations through integrated evaluation and prediction

(Input) This figure consists of three steps: ‘Input’, ‘Processing’, and ‘Output’. Users can select a custom sequence from the entire SARS-CoV-2 sequence, choose VOCs, or create customized sequences for analysis. Depending on the user’s system environment, analysis can be done through local prediction (server) or Google Colab.

(Processing) Three types of analyses are performed. First, the protein 3D structure is predicted; the output includes the protein 3D structure, pLDDT, and PAE. Second, the infectivity is evaluated using APESS (2.12). For each position, the structural difference graph for BPES, MR, PCS, and SCPS is visualized. The APESS distribution is visualized for known VOCs and created variants. Third, polarity changes in the sequences are visualized.

(Output) Four results comparing Wuhan-Hu-1+N460K and XBB (with N460K) are output and visualized. First, through protein structure prediction, secondary structures can be confirmed in XBB (with N460K) compared to Wuhan-Hu-1+N460K (yellow arrow). Second, the polarity changes caused by the mutation in Wuhan-Hu-1+N460K are compared (red dotted line). Third, for XBB (with N460K), which has more mutations than Wuhan-Hu-1+N460K, the differences in the SCPS, PCS, MR, and BPES values at each position in the protein sequence are displayed (red dotted line). Fourth, the distribution of APESS, which represents the comprehensive value of SCPS, PCS, MR, and BPES, is shown. “apess” indicates the score for each position in the customized protein sequence. For XBB (with N460K), which has more mutations than Wuhan-Hu-1+N460K, the apess scores range from -0.079 to 1.385 (yellow arrow). The X-axis shows the position in the RBM (72 aa) and the Y-axis shows the APESS score (μ-1). APESS is the sum of the values at each position and can be used to evaluate infectivity. The region including XBB (with N460K) shows infectivity due to its many mutations and also shows an increase in APESS score due to the N460K mutation. The X-axis shows the APESS score and the Y-axis shows the density (μ-2).

AIVE comprehensively evaluates viral infectivity, protein structure, amino acid substitutions, and polarity changes in preexisting and potential SARS-CoV-2 sequences.