Introduction

Taxonomic classification is necessary to unravel evolutionary relationships, while also enabling the establishment of nomenclatures that facilitate effective communication of where an organism falls on a phylogeny [1]. This is particularly important for lineages within bacterial species, which are often distinct in clinically relevant phenotypes including antimicrobial resistance (AMR) and virulence [1]. The ability to identify these lineages consistently via a stable nomenclature system is essential to epidemiological surveillance [1].

In N. gonorrhoeae, the identification of AMR associated lineages is particularly relevant. The gonococcus is multi-drug resistant pathogen included in the WHO priority list for AMR [2, 3]. Globally, gonorrhoea causes a high burden of disease with an estimated 86.9 million new cases annually, which, if not successfully treated, can cause sequelae such as infertility and pelvic inflammatory disease [3, 4]. Facing the dual challenges of AMR surveillance and gonococcal vaccine development [5], accurate characterisation of gonococcal population structure and consistent identification of key lineages is of upmost importance [6, 7].

A variety of approaches have been applied to N. gonorrhoeae lineage taxonomy, which can cause confusion and miscommunication. Underlying these disparate nomenclatures is the complex population structure of the gonococcus; it is a highly recombinogenic organism, carrying out extensive horizontal gene transfer (HGT) between gonococci, and more rarely with other species [811]. Polymorphisms are continually reassorted through this process, leading to low levels of linkage disequilibrium and disrupting clonal population structure [8, 12]. Consequently N. gonorrhoeae has been described as a “sexual clone”, reflecting the strong impact of HGT and the therefore weak clonal signal present in gonococcal genomes [7, 13]. This low signal means that many molecular typing methods that are effective at capturing lineages in relatives such as N. meningitidis are not as reliable in N. gonorrhoeae [7]. The result has been a proliferation of differing approaches and nomenclatures seeking to capture N. gonorrhoeae taxonomy, particularly since 2010 [1416].

Most of these methods are Multi-Locus Sequence Typing (MLST) based. MLST is a widely used molecular approach for characterising bacterial isolates through the combination of alleles across a set of genes (an ‘allelic profile’) [17]. Each unique combination of alleles across the profile represents a sequence type (ST), which can be used, where they correspond with clonal inheritance, to characterise lineages [1]. Isolates can be further grouped based on these STs, using methods such as goeBURST [18] or single linkage clustering [19].

MLST is a powerful approach that has had widespread applications in the analysis of bacterial lineages [1]; however, conventional MLST approaches such as 7-locus MLST use only a small number of loci, which can make this system more vulnerable to disruptive HGT. In bacteria such as N. gonorrhoeae, HGT is extensive enough that it affects the housekeeping genes used in 7-locus MLST [7]. This makes 7-locus MLST a sub-optimal tool when applied in N. gonorrhoeae typing, as two isolates with the same ST may not actually be closely related, having come by the same ST by HGT rather than clonal inheritance [7]. This process affects several gonococcal typing systems, including NG-MAST [7, 15] and NG-STAR [7, 16].

Typing systems using larger numbers of loci can address this problem. For example, core gene MLST (cgMLST) is an extension of the MLST approach that uses a profile of hundreds of core genes to gain higher resolution, while also diluting the taxonomy-jumbling effects of HGT of some of those loci [1, 7, 20, 21]. In N. gonorrhoeae cgMLST v1, genes were defined as core if they were present in >95% of genomes, resulting in a core gene list (scheme) of >1600 core genes [7]. If two isolates were identical across this scheme, allowing for up to 50 loci to be unannotated, they would belong to the same core genome sequence type (cgST) [7]. Single-linkage clustering was then applied to define core genome ‘groups’ based on various threshold levels of allelic differences within and between groups [7]. This clustering was necessary as the large number of loci used in cgMLST results in large numbers of unique cgSTs.

This method provided a high-resolution classification of N. gonorrhoeae lineage taxonomy, while being less affected by HGT than 7-locus MLST, NG-STAR or NG-MAST [7]. However, single-linkage clustering is susceptible to group fusion when intermediate isolates are found that bridge the identity thresholds between linkage groups [21, 22]. HGT can exacerbate this issue, homogenising sequences across lineages by spreading polymorphisms throughout the population. As a result, the Ng cgMLST v1 core genome group nomenclature was not stable, and changed over time as more N. gonorrhoeae isolates were sampled, decreasing the resolution provided and potentially leading to confusion as the nomenclature shifts. Therefore, there is still a need for a N. gonorrhoeae lineage nomenclature that provides similarly effective discrimination between gonococcal lineages, while also remaining consistent over time.

Here, a definitive nomenclature for N. gonorrhoeae is proposed, using LIN codes. This isolate barcoding system has previously been applied to Klebsiella pneumoniae [22] and Streptococcus pneumoniae [23], where LIN codes were shown to provide a high resolution nomenclature, enabling accurate insight into the population structure of these organisms. LIN code combines the HGT-diluting effects of a cgMLST scheme with the stability of a numeric barcode [21, 22]. This stability is derived from the fixed nature of these barcodes for each isolate, along with the flexibility of a multi-position numeric code which can increase infinitely as new variants are sequenced [21, 22, 24]. The N. gonorrhoeae LIN code developed here was implemented in PubMLST, a freely accessible bacterial genomics database [25]. LIN codes are automatically assigned to all WGS uploaded to PubMLST, making this taxonomy easily applicable to old and new datasets, and encouraging the widespread use of this reliable nomenclature.

Methodology

Representative Isolate Collections

Data were extracted from the PubMLST database (pubmlst.org/organisms/neisseria-spp), a publicly accessible bacterial genomics database [25]. At the time of writing, sequence records from over 28,000 gonococci were available in the database. To facilitate efficient analyses representative sub-datasets were generated.

Dataset 1 consisted of a representative collection of 896 isolates belonging to a range of core genome groups at the 300 allelic mismatch threshold [Figure 1]. This was assembled by including all isolates belonging to core genome groups represented by 6 or fewer isolates, in combination with a random selection of 6 isolates from each of the core genome groups represented by more than 6 isolates (Supplementary table 1). Randomisation was achieved in R, using the sample function [26]. This smaller dataset was used for preliminary analysis, as it required less computational power.

Characteristics of isolates within representative development datasets 1 and 2.

This includes geographical distribution (top panels), frequency of isolates sampled over time (middle panels), and genome quality statistics (lower panels) including i) contig number, ii) total genome length and iii) % GC content.

Dataset 2 consisted of 3935 isolates [Figure 1], assembled by including all isolates belonging to core genome groups represented by 20 or fewer isolates (again at allelic mismatch threshold 300), and a random sample of 30% of all core genome groups represented by more than 20 isolates, capped at 350 isolates per core genome group (Supplementary table 2). This larger dataset was used for final analysis once techniques had been explored using dataset 1.

Development of Ng cgMLST v2

Although a 1649 locus Ng cgMLST v1 scheme for N. gonorrhoeae has been published [7], an updated, more readily auto-annotated cgMLST scheme was needed to ensure any WGS deposited in the PubMLST database can be assigned a LIN code with little to no manual curation [22, 23]. This was a consequence of Ng cgMLST v1 including a number of genes with alternate start codons or length variation, complicating automatic allele annotation by PubMLST.

WGS data from dataset 1 were downloaded from PubMLST and annotated with Prokka using default parameters [27]. The resulting .gff output was further analysed with PIRATE for pangenome analysis including the identification of core genes [28]. PIRATE was run using default parameters with the following options: -a (align all genes), -r (plot summaries using r) and –t 24 (24 threads). A preliminary threshold of 95% presence was chosen as ‘core’ rather than 100% to ensure a large enough gene list was identified. This also allowed for the possibility that some genes may be absent due to factors such as mis-assembly or mutation while still being ‘core’ in nature. Genes with duplication events as detected by PIRATE were excluded.

The new core list was assessed in dataset 2 to identify possible curation issues across a wider range of isolates. Following this analysis, all loci that did not meet a threshold of 98% presence in dataset 2 were excluded. This reduced Ng cgMLST v2 prioritised loci that annotate with very little human involvement via manual curation.

The PubMLST Grapetree plugin was used to generate minimum spanning trees based on the Ng cgMLST v2 loci, annotated with core genome groups (based on Ng cgMLST v1), to ensure resolution was high enough with this new reduced scheme to explore established gonococcal lineages [29]. Grapetree compares isolates across a user-defined allelic profile, clustering those that share the most alleles [29].

Analysis of Population Structure and Threshold Selection

Thresholds for LIN code were chosen based on an analysis of natural discontinuities in the gonococcal population structure, assessed by examining the distribution of pairwise allelic mismatches amongst isolates from dataset 2, combined with testing in a local installation of LIN code [30] and the application of clustering statistics including Rand index and silhouette score.

In order to extract pairwise allelic distances for these analyses, a distance matrix was generated comparing dataset 2 isolates across the loci included in Ng cgMLST v2. This was accomplished using the genome comparator plugin in PubMLST. This tool provides gene-by-gene pairwise analysis of isolates across a list of loci, generating a variety of outputs including a distance matrix, core gene list, and alignments [25]. Here, genome comparator was run with default settings. The distances in this matrix were stacked to form a dataframe with three columns (isolate A, isolate B, distance), allowing the distances to be plotted in a histogram using the ggplot2 package in R [31].

Ridgeline plots were created based on these data to explore the number of allelic mismatches associated with clusters identified by pre-existing typing systems such as NG-STAR clonal complexes and Ng cgMLST v1 core genome groups. This was done by extracting the pairwise allelic distances associated with each instance of matching NG-STAR CC, rMLST-ST, or matching Ng cgMLST v1 core genome group, from the distance matrix. Ridgeline plots were constructed in R using the packages ggplot2 and ggridges [31, 32]. Based on the observed breaks in population structure, various bin thresholds were tested using a local version of LIN code downloaded from https://gitlab.pasteur.fr/BEBP/LINcoding applied to dataset 2 [30]. The 3935 isolates yielded 3877 unique cgSTs for testing.

Clustering output from the local LIN code testing using various allelic mismatch thresholds was assessed using statistics, specifically the silhouette score and Rand index. The silhouette score indicates the cohesion of clustering at a particular threshold, varying from −1 to 1 with more positive values indicating higher cluster cohesiveness [33]. The average silhouette score was calculated using MSTClust (v0.21b) (downloaded from https://gitlab.pasteur.fr/GIPhy/MSTclust) for pairwise distance thresholds from 0.01 – 0.7 [34]. Results were plotted using the R package ggplot2 [31].

The Rand index is a statistical measure that indicates the level of similarity between two different partitions of the same dataset, with values approaching 1 indicating higher similarity [35]. The Rand index was applied here to compare local LIN code clustering at various thresholds against Ng cgMLST v1 core genome group clustering at threshold 300 and 400. The index was calculated in R using the package fossil [36].

Finally, LIN codes generated using these thresholds on the representative dataset were visualised on a minimum spanning tree to explore what length of code prefix would be most suited to exploring population structure at various resolutions.

Implementation and Analyses

Once the refined Ng cgMLST v2 scheme was defined in PubMLST and isolates annotated with sequence types, the chosen thresholds were implemented within the database to generate LIN codes. A maximum of 25 missing loci in the Ng cgMLST v2 scheme was tolerated. If this quality threshold was not met, no cgST would be assigned to the isolate, and hence no LIN code. LIN codes were designated in PubMLST ordered based on a minimum spanning tree, using a default batch size of 10,000 isolates [21]. Multilevel single-linkage clustering was used to classify isolates at each threshold within the code. The combination of these clusters with a fixed bar code facilitates LIN code’s stability [22, 23]. Subsequently, isolate records that have a cgST but no LIN code assigned will be scanned once a week, and LIN codes automatically designated.

Once LIN codes were designated within the database, PubMLST was further applied to investigate the distribution of LIN codes and to examine their association with other isolate typing systems such as 7-locus MLST. This was undertaken using the ‘Field breakdown’ and ‘Combinations’ analysis tools, alongside exporting datasets for manipulation.

A published dataset of 171 ceftriaxone resistant gonococci from diverse phylogroups [37] was selected for analysis, to validate LIN code’s ability to reproduce complex phylogenies. Isolates from this dataset not already present in PubMLST were downloaded from the sequence read archive, and where necessary were assembled using SPAdes/3.15.4-GCC-12.3.0 [38]. These isolates were then uploaded to PubMLST and associated under the appropriate publication record (Fifer et al., 2024). One isolate (H18-368) out of 171 no longer had sequence data available, and so could not be included in this analysis.

Phylogenetic Trees

Minimum spanning trees were drawn using the Grapetree plugin in PubMLST [29]. Neighbour joining trees were constructed using the ITOL plugin within the PubMLST interface.

Nucleotide alignments for other trees were generated via genome comparator in PubMLST [25], using the MUSCLE algorithm, with settings to align all loci, not only variable loci. Maximum likelihood trees were constructed using RaxML [39], and approximate maximum likelihood trees using FastTree [40]. Trees were corrected for recombination using ClonalFrameML [41] and edited using ITOL [42].

Results

Core genome MLST Version 2

The final Ng cgMLST v2 included 1430 loci. Of these, 362 were hypothetical genes of unknown function, 96 encoded transferases, 56 synthases, 45 transporters, 34 50S ribosomal proteins, 29 transcriptional regulators, and 21 30S ribosomal proteins (Supplementary table 3). This represented a decrease of 219 loci compared to Ng cgMLST v1, with the largest categories of excluded loci being hypothetical (83 loci) or phage associated (18 loci) (Supplementary table 4). Notable exclusions include tbpA and tbpB, essential iron acquisition proteins that are hypervariable and so not readily auto-annotated. In total 1418 loci were shared between the two schemes, with 12 new to Ng cgMLST v2 (Supplementary table 4).

Of the 1430 loci included, 99% (1418/1430) had under 1000 alleles, and an average length of less than 3000 base pairs [Figure 2]. Exceptionally variable genes included NEIS2020 porB (5544 alleles), NEIS2644 an unnamed phage tail protein (3475 alleles), NEIS0829 pilW (1474 alleles) and NEIS2113 tamB (1122 alleles) [Figure 2c].

Allele number vs allele length for the 1430 genes in Ng cgMLST v2.

Figure 2a) All 1430 genes included in Ng cgMLST v2 are plotted. 2b) The four genes with the highest average length, and the four with the highest number of alleles, were excluded as outliers. The figure was compiled excluding these genes in order to allow a closer examination of the distribution of allele length vs number. One gene, NEIS2113, appeared in both lists. 2c) Table summarising the average length in base pairs, number of alleles, and gene name/function of the seven outlier loci.

It should be noted that while the new scheme is fundamentally still a core genome MLST scheme, it does not include all core genes, and should not be used as a definitive core gene list. However, the strict inclusion criteria for Ng cgMLST v2 facilitated automated assignment of a cgST to a higher proportion of WGS data than its predecessor, while still providing a similar degree of resolution.

Analysis of Population Structure and Allelic Mismatch Thresholds

The number of allelic mismatches present in a pairwise comparison across the 3935 isolates in dataset 2 was assessed across the 1430 core genes in Ng cgMLST v2. When visualised as a histogram [Figure 3a], an uneven distribution of allelic mismatches was observed, exhibiting a highly varied incidence. The allelic mismatch modes indicate clusters of isolates that share the same proportion of alleles in common across the core genome scheme, reflecting natural breaks in the gonococcal population structure [22].

Plots used in the selection of allelic mismatch thresholds for the LIN code.

Figure 3a) Histogram showing the frequency of pairwise allelic mismatches within dataset 2. A subset of the allelic mismatch thresholds applied in the gonococcal LIN code are shown (blue dashed lines) at 25 mismatches (1.75%), 125 (8.74%), 300 (20.98%), 550 (38.46%), and 650 (45.45%). 3b) Ridgeline plots depicting the frequency of allelic mismatches amongst pairs of isolates that belong to the same category of different metrics from dataset 2. From top to bottom: Ribosomal MLST (RST), NG-STAR Clonal Complex (NgSTARCC), Country, Ng cgMLST v1 core genome group at threshold 400 (cgc_400), threshold 300 (cgc_300), and threshold 200 (cgc_200)). 3c) Plot of adjusted Rand index comparing LIN code clustering at various allelic mismatch thresholds to Ng cgMLST v1 core genome groups at threshold 300 (black dots) and NG-STAR CC (blue squares). Clustering was compared using dataset 2. 3d) Plot of silhouette index (score) at various cutoff values, based on MSTclust analysis of 1430 core loci across 3935 representative N. gonorrhoeae isolates (dataset 2). Silhouette score peaked at 0.64 at a cutoff value of 0.225.

The most frequently identified number of allelic mismatches between isolates was 840/1430 (59%). This modal mismatch is low compared to that seen in K. pneumoniae (within species mode 520/629 mismatches, 83%) [22]. There were no instances of pairwise allelic mismatches exceeding 970/1430 loci (68%), meaning that all isolates shared a minimum of 460 alleles with one other isolate in the dataset. This means that compared to K. pneumoniae, N. gonorrhoeae isolates shared more of the same alleles across their core genome. This may be due to the homogenising effect of widespread HGT.

When pairwise allelic mismatches were analysed in isolates belonging to the same grouping, such as Ng cgMLST v1 core genome groups or NG-STAR CCs, discontinuities in the population structure associated with these metrics became visible [Figure 3b]. For isolates belonging to the same core genome group (at threshold 300), NG-STAR CC, or ribosomal ST, most pairs had under 125/1430 allelic mismatches between them [Figure 3b]. This indicates that shared identity in approximately 1305/1430 (91%) Ng cgMLST v2 loci is required to belong to the same core genome group (at threshold 300), NG-STAR CC or ribosomal ST. However, core genome groups at threshold 400 displayed a second peak at 510/1430 (36%) mismatches. This is a result of group fusion due to intermediate genotypes, which has led to divergent gonococci being included in the same group at this threshold. Isolates from the same country displayed a comparatively high average level of allelic mismatch, with a mode of 855/1430 (60%), indicative of limited geographical association of gonococcal lineages.

Rand index values when comparing LIN code bin thresholds to Ng cgMLST v1 core genome groups (at threshold 300) peaked at 1 for a 300-locus mismatch threshold (21%) [Figure 3c]. Similarly, the silhouette score demonstrated a peak of 0.64 at cut-off 0.225 [Figure 3d].

Based on this evidence, allelic mismatch thresholds were chosen for each bin of the LIN code. Thresholds at 125 loci (8.74% mismatch) and 300 (20.98% mismatch) were chosen to correspond to discontinuities in population structure [Figure 3a], alongside a Rand index close to 1 demonstrating their relevance to core genome groups, and silhouette score indicating high cluster cohesiveness at these thresholds [Figure 3c & d]. Higher thresholds at 550 loci (38.46% mismatch) and 650 (45.45%) captured superlineage divisions.

To provide the higher resolution necessary to identify transmission events or outbreak-related gonococcal variants, thresholds at lower levels of allelic mismatch were included. Together, this resulted in a set of 11 bins, corresponding to mismatch thresholds: 650, 550, 300, 125, 25, 10, 7, 5, 3, 1, and 0 [Figure 4]. Minimum spanning tree analysis indicated that a prefix using the first three thresholds was ideal for exploring lineages [Figure 5] corresponding to Ng cgMLST v1 core genome groups at threshold 300.

Illustration of the gonococcal LIN code nomenclature.

4a) Each successive allelic mismatch threshold dictates clustering at a specific position within the code. This clustering is hierarchical, such that isolates sharing a larger proportion the code (from the left across) are of higher genetic similarity. For example, Isolate A and B share a complete LIN code, meaning they have 0 allelic mismatches in their Ng cgMLST v2 loci. Isolate B and C share the first three digits of their LIN code; they belong to the same clusters at these thresholds, and therefore differ in less than 300 alleles out of the 1430 core genes in Ng cgMLST v2 i.e., they belong to the same LIN code “lineage”. 4b) Rooted Maximum likelihood tree demonstrating how LIN codes reflect phylogenetic relationships. The first tree shows a subset of LIN code lineages within superlineage 0_2, with lineage 1_1_2 as the outgroup. Moving to the right, the figure focuses in on lineage 0_2_1, showing the higher resolution provided by LIN sublineages, and then groups. (Figure inspired by Figure 3 in Van Rensburg, Berger et al., 2024 [26])

Minimum spanning tree showing clustering of 3935 isolates from dataset 2 based on Ng cgMLST v2.

LIN code lineages (5a) and 7-locus MLST (5b) demonstrate similar levels of resolution for characterising clustering. LIN codes sublineages (5c) provide higher resolution, similar to that provided by NG-STAR clonal complexes (5d). Only categories including 20 or more isolates are coloured.

Implementation of LIN code

Across the >28,000 publicly available N. gonorrhoeae genomes stored in PubMLST, LIN codes were assigned to 25,912 isolates. At the superlineage level (2 threshold prefix: 0_0) there were 118 clusters, at lineage (3 prefix: 0_0_0) 532 clusters, and at sublineage (4 prefix: 0_0_0_0) 1712 clusters. The 3-threshold prefix lineage definitions directly correlated with Ng cgMLST v1 core genome groups at threshold 300 (adjusted [adj.] Rand Index = 1). This was visualised using phylogenetic trees [Figure 6] and minimum spanning trees [Figure 5] which showed good congruence between LIN codes, SNP-based phylogeny, and other clustering methods. The lineages identified by LIN code demonstrated persistence over time, with lineage 0_2_1 isolates spanning 39 years from 1985 to 2024 (Supplementary Figure 1) N. gonorrhoeae LIN codes were accessed via an isolate’s information page, from the ‘allele designations/scheme fields’ dropdown box within the pubmlst.org/organisms/neisseria-spp isolate database search form, or through the ‘export dataset’ link at the bottom of the page following an isolate search. They could also be exported as metadata in Grapetree analyses.

FastTree of 1000 randomly selected N. gonorrhoeae isolates.

Constructed using 1430 loci from Ng cgMLST v2. Branches are coloured by superlineage (0_2 = red, 0_7 = orange, 0_0 = green, 0_18 = brown, 1_1 = blue, 5_2 = yellow, 19_01 = pink, and 1_8 = purple.) The 21 highest frequency LIN lineages are represented by the inner bar, in colour ranges corresponding to their superlineage colour. LIN codes form monophyletic groupings, indicating that there is a good degree of congruence between the allelic profile clustering method used in cgMLST LIN code and nucleotide sequence alignment-based phylogeny. Core genome groups at threshold 300 (cgc_300) are represented by the outer coloured bar, and show good concordance with LIN lineages, although fewer isolates were able to be annotated with core genome groups than LIN codes due to use of the larger and more poorly auto-annotated cgMLST v1.

Congruence with Previous Typing Schemes

7-locus MLST-STs showed good association with LIN code lineages (0_0_0) (adj. Rand index = 0.92) [Figure 5], although many STs included multiple lineages. For example, MLST-ST 1901 corresponded with lineages 0_2_0 (1932/2376, 81%) and 0_0_12 (443/2376, 19%), while MLST-ST 7363 was predominantly captured by 0_0_52 (796/1322, 60%) and 0_2_17 (355/1322, 27%) (Supplementary table 5) The fact these LIN codes are from different superlineages (0_0 & 0_2) illustrates how these MLST-STs form polyphyletic groupings, combining isolates that are not closely related under the same ST.

NG-STAR clonal complexes (CCs) showed a stronger association with LIN code sublineages (0_0_0_0) (adj. Rand index 0.96) [Figure 5]. NG-STAR CC 90 belonged predominantly to sublineages 0_2_0_1 (1499/1562, 96%) and to a much lesser extent 0_0_12_2 (25/1562, 2%). NG-STAR CC 63 was represented mainly by 0_2_1_10 (1369/1437, 95%), followed by 1_1_22_9 (21/1437, 1%) and others (Supplementary table 5). Again, the diversity of these LIN codes at the superlineage level, and comparison with their location on the phylogeny [Figure 5 & 6], demonstrates how genetically divergent isolates can belong to the same CC.

NG-MAST-STs showed some correlation with LIN code groups (0_0_0_0_0) (adj. Rand index 0.50). For example, NG-MAST-ST 2992 belonged mostly to group 0_2_1_0_73 (409/653, 63%) and to others such as 0_2_1_1_4 (72/653, 11%). NG-MAST-ST 1407 predominantly belonged to group 0_2_0_1_0 (498/787, 63%) and to 0_2_0_1_27 (77/787, 10%).

Resistance associated lineage A, as described using hierarchical Bayesian analyses (BAPs) [43], was previously shown to associate well with Ng cgMLST v1 core genome groups [7]. As LIN code lineages correspond to core genome groups at threshold 300 [Figure 6], the same is true of this nomenclature. Lineage 0_2_0, 0_2_1, and 0_2_17 correspond to core genome group 3, 16 and 8, which in turn correlate with BAPS 8, 7, and 9 respectively, all within the resistance associated lineage A [7, 43].

Mosaic penA type 34 alleles, an important antimicrobial resistance determinant, correlated with LIN code lineage; 89.43% of these alleles were found in isolates belonging to lineage 0_2_0 (Supplementary Table 6).

Example Analysis

The gonococcal LIN code was applied in a reproduction of an analysis of 170 ceftriaxone resistant gonococci initially performed by Fifer et al., 2024 [37] (Supplementary Table 7). A maximum likelihood tree of these isolates was reconstructed and labelled using LIN codes [Figure 7], and demonstrated that LIN code lineages recaptured the diverse clades identified by phylogenetic methods, while also classifying several isolates that were not assigned a phylogroup in the original publication. Furthermore, 15 pairs or triplets of isolates sharing a full-length LIN code were identified within this dataset. Comparison with a random selection of isolates from across the PubMLST database [Figure 6] suggested an overrepresentation of 0_14_0 isolates in the ceftriaxone-resistant dataset. This dataset can be accessed by filtering by publication “Fifer et al., 2024” on the PubMLST Neisseria isolate search page.

Maximum Likelihood tree of 170 Ceftriaxone-resistant isolates previously analysed in Fifer et al., (2024).

Constructed using 1430 loci from Ng cgMLST v2. LIN code lineages were able reproduce the clades identified in Fifer et al., while being readily accessible and providing additional detail about each clade in the form of superlineage & sublineage divisions. Groups of isolates that share the same full length LIN code are highlighted as coloured stars.

Discussion

Accurate identification of bacterial lineages is necessary to examine links between bacterial population structure and characteristics such as AMR, and to enable surveillance of emergent or outbreak-associated variants [1]. However, in N. gonorrhoeae, widespread HGT reassorts DNA among isolates, disrupting clonal inheritance, and consequently distorting phylogenetic trees [7, 44]. This makes it difficult to accurately characterise gonococcal lineages, particularly when using conventional typing systems based on small numbers of genes, such as 7-locus MLST [7, 12]. Previously, the most reliable approach applied core genome MLST (cgMLST) to define core genome groups, using over 1000 genes to dilute the effects of HGT and facilitate high resolution analyses [1, 7, 45]. Unfortunately, this method suffers from cluster instability [22, 44]. LIN code provides an alternative means to leverage the benefits of cgMLST within a stable classification system, which can accommodate new genotypes without the risk of disrupting established clustering [21, 22]. Here, we developed a N. gonorrhoeae LIN code: an effective nomenclature for categorising, exploring, and discussing gonococcal population structure.

Evidence for the existence of distinct, persistent gonococcal lineages pre-dates whole genome sequencing, in the characterisation of auxotypes: related isolates that exhibit similar patterns of growth in media containing different combinations of key nutrients [46]. This may be due to metabolic competition amongst gonococcal strains, causing the population structure to segregate into distinct metabolic types [47]. Previous research has posited that selection can preserve such lineages even in the face of extensive HGT, as has been the case in immune selection for differing antigenic types in other bacterial species [4749]. Sequence-based approaches including 7-locus MLST, NG-STAR and cgMLST have confirmed that the N. gonorrhoeae population contains identifiable groups of related isolates, some of which are associated with clinically relevant phenotypes such as AMR [7, 16, 50, 51], with outbreak events [52], or with specific at-risk groups such as men who have sex with men (MSM) [43, 53]. However, some of these typing systems are prone to providing misleading classifications, where isolates that are not closely related appear to be, due to the effects of HGT within the small numbers of loci used to classify them. Consequently, while equivalent systems have proved highly effective at categorising the closely related N. meningitidis [54], the extensive HGT observed across the gonococcal genome necessitates a whole genome approach to reliably identify related groups of N. gonorrhoeae isolates [7]. LIN code achieves this by applying cgMLST, using hundreds of loci belonging to a wide range of functional categories, including genes involved in metabolic pathways, genetic processing, and AMR. This allows the LIN code to capture gonococcal population structure with increased consistency, and higher resolution, than many previous approaches, providing fresh insight into the biology that underlies this species’ genetic composition.

The LIN nomenclature conveys lineage information in the form of hierarchical clustering at sequential thresholds of tolerated allelic mismatch within a cgMLST scheme (i.e. the number of alleles differing out of the total list of loci in the scheme) [22, 23]. Here, 11 thresholds were used, resulting in an 11-bin barcode. The leftmost numbers of the barcode represent clustering at the highest thresholds of allelic mismatch, here corresponding with LIN superlineage (up to 550 mismatches), LIN lineage (300 mismatches), and LIN sublineage (125 mismatches). Progressing across the barcode to the right, each number indicates clustering at an increasing level of allelic similarity, ultimately resulting in the delineation of highly related isolates that differ in a very small number of core loci [Figure 4] [2123]. Subsets of thresholds, for example a prefix of the first three or four, may be used to define clustering down to a particular level of taxonomic similarity [22, 23]. If two isolates are highly related and therefore identical across their core gene profile, they will share the same LIN code.

The use of multiple thresholds enables the LIN code to relate back to several familiar nomenclatures, such as core genome groups and 7-locus MLST, which are both broadly equivalent to LIN lineages, and NG-STAR CCs, approximately equivalent to LIN sublineages. The difference lies in the accuracy and stability of the LIN code versus its predecessors. Visualisation via phylogenetic trees demonstrated that the LIN codes related strongly to phylogenetic structure, capturing distinct clades without overlap. This is in contrast to previous nomenclatures such as 7-locus MLST, which can form polyphyletic groups due to the misleading effects of HGT [7].

Furthermore, the multi-resolution, hierarchical nature of LIN codes also allows this nomenclature to encode more information about the relationship between isolates than conventional typing systems [21]. For example, if a pair of isolates belong to MLST 1 and 2, it can only be deduced that they differ in between 1 and 7 housekeeping genes. Meanwhile, if the same pair of isolates belong to LIN lineages 0_2_1 and 0_2_2, this communicates that these isolates differ by no more than 550/1430 core gene alleles, but by more than 300/1430, quickly defining the degree of their relatedness [Figure 4]. This is useful when comparing isolates, for example those associated with AMR. Previous approaches used to track AMR strains include both 7-locus MLST and NG-STAR, which types isolates based on their alleles across seven AMR genes (penA, mtrR, porB, ponA, gyrA, parC, and 23S rRNA) [55]. This typing system has been successfully used in many analyses of AMR associated gonococci, and NG-STAR CCs have been recommended as a tool for characterising gonococcal lineages [16, 50]. However, due to HGT specific NG-STAR types or clonal complexes do not always represent closely related isolates. LIN sublineages were able to discern isolates with a similar level of resolution to NG-STAR CCs, but higher reliability. This will facilitate more accurate tracking of AMR associated isolates, such as those belonging to LIN sublineage 0_2_0_1, when compared to the equivalent NG-STAR CC 90, which also includes unrelated isolates belonging to LIN superlineages 0_0 and 1_1. While NG-STAR effectively characterises an isolate’s AMR genotype [55], it cannot consistently distinguish related isolates. LIN code fulfils this role, while also providing increased information about the degree of relatedness of the isolates in question. Accurately defining this lineage structure will be important to elucidate how AMR emerges in distinct gonococcal lineages.

As an allele-based method, LIN codes enable analysis of a large number of isolates rapidly and reproducibly [21, 22, 24]. This is an improvement on phylogenetic tools used to identify gonococcal lineages which rely on nucleotide sequence alignments, requiring significant time and expertise to compute [56]. LIN codes can replicate similar clustering, identifying the same lineages while being fully automated [21]. For example, hierarchical Bayesian analyses (BAPs) have previously been used to delineate gonococcal population structure into two lineages: lineage A, associated with high-risk sexual networks and AMR, and lineage B, associated with lower risk networks and susceptibility [43]. In our LIN code analysis, we observed that lineage A (BAPS 8, 7 and 9) corresponds to LIN lineages 0_2_0, 0_2_1, and 0_2_17, reaffirming that cgMLST is able to provide the same resolution as SNP-based techniques [7]. While BAPs analysis can be complex and time intensive to run, LIN codes can be used to rapidly identify lineage A gonococci simply by uploading WGS data to PubMLST. Within 24 hours of upload, any sequence data in which 1405/1430 core loci can be annotated will be assigned a Ng cgMLST v2 cgST; if this is a previously identified cgST a LIN code will be assigned immediately, if a new cgST the new LIN code will be assigned within 7 days. Furthermore, these isolates can then be analysed in the context of the wider PubMLST gonococcal genome collection, which currently includes >28,000 public records and integrates accessible plugins such as GrapeTree [29] and Genome Comparator [25] for additional investigation.

To further illustrate the efficacy of LIN code we reproduced an analysis of 170 global ceftriaxone resistant isolates [37]. The original article’s methodology involved a WGS alignment and generation of a maximum likelihood tree in order to characterise eight major phylogroups [37]. The gonococcal LIN code instantly reproduced these clusters, while providing additional detail about each clade in the form of superlineage/sublineage divisions and simultaneously contextualising the isolates amongst the wider PubMLST database. For example, lineage 0_14_0 was over-represented in this ceftriaxone-resistant dataset when compared against 1000 randomly selected N. gonorrhoeae isolates from PubMLST. Also, LIN code was able to classify several divergent isolates that were not assigned to any phylogroup in the original publication, and detected 15 instances of matching full-length LIN codes, meaning these isolates were identical across their core genome and could therefore represent isolates associated with transmission events [7]. The PubMLST interface facilitates in-depth analysis of these instances of shared LIN codes, allowing users to explore related isolates at various thresholds of allelic dissimilarity by viewing a table of similar isolates (as defined by LIN code) on an isolate’s information page [Figure 8]. This enables quick investigation of any related isolates, including their location, year, allele at a particular locus, or classification by other typing schemes such as NG-STAR.

Using PubMLST to explore related isolates by LIN code.

Within an isolate’s information page, here using isolate SC18-25 (PubMLST id: 165303) (https://pubmlst.org/bigsdb?page=info&db=pubmlst_neisseria_isolates&id=165303), it is possible to view a breakdown table of similar isolates by LIN code. This isolate shares a complete LIN code with 2 other isolates, meaning they are identical in their core genome. (8a). Clicking on the “matching isolates” number at a certain LIN code threshold then takes the user to the dataset of matching isolates for further analysis (8b). This feature can be applied in the investigation of transmission chains, outbreak events and the dissemination of AMR through clonal expansion.

The genetic mix-and-matching performed by N. gonorrhoeae can make characterisation of its population structure difficult [7]. However, the LIN code nomenclature proposed here provides clarity, consistency, and stability in its description of N. gonorrhoeae lineages. The multi-resolution clustering intrinsic to LIN code facilitates a common language around lineage nomenclature at different epidemiological levels, from high divisions such as superlineage down to unique clones [21]. In conclusion, N. gonorrhoeae LIN codes represent a portable, publicly available taxonomic nomenclature that has the potential to enhance surveillance of N. gonorrhoeae in order to benefit public health.

Data availability

Isolate IDs and metadata of relevant datasets are included in the Supplementary documents. Ng cgMLST v2 and the gonococcal LIN code is publicly accessible via PubMLST (https://pubmlst.org/organisms/neisseria-spp).

Acknowledgements

The authors would like to thank Nazreen Hadjirin and Iman Yassine for their valuable advice. Computational aspects of this work were enabled by the Oxford University Biomedical Research Computing (BMRC) facility.

Additional information

Author contributions

A.U created the updated Ng cgMLST v2 scheme, performed the allelic mismatch and statistical analyses, selected the LIN code thresholds, wrote the manuscript, and created all figures. O.B.H. and M.C.J.M conceptualised and supervised this work, and edited the manuscript. M.A.K contributed python and linux scripts and provided guidance on pangenome analysis. K.A.J implemented the cgMLST scheme and LIN code within the PubMLST database. K.M.P provided assistance with recombination correction.

Funding information

A.U was funded by the BBSRC Interdisciplinary Biosciences DTP (Grant no. BB/M011224/1), and subsequently an ECR Fellowship through the Nuffield Department of Population Health (NDPH), University of Oxford. M.C.J.M, K.A.J and PubMLST are funded by the Wellcome trust (Grant no. 218205/Z/19/Z). O.B.H was funded by the Wellcome trust (Grant no. 214374/Z/18/Z), and subsequently though a Senior Research Fellowship at NDPH, University of Oxford. M.A.K’s studentship was funded by the Ministry of Education, Indonesia, in collaboration with the Medical Science Division, University of Oxford. K.M.P was funded by the Wellcome trust (Grant no. 214374/Z/18/Z).

Funding

BBSRC (BB/M011224/1)

University of Oxford

Wellcome Trust

https://doi.org/10.35802/218205

Wellcome Trust

https://doi.org/10.35802/214374

Additional files

Supplementary tables and figures