Neisseria gonorrhoeae LIN codes: a Robust, Multi-Resolution Lineage Nomenclature

Anastasia Unitt; Made Krisna; Kasia M Parfitt; Keith A Jolley; Martin CJ Maiden; Odile B Harrison

doi:10.7554/eLife.107758.2

eLife Assessment

This important study introduces the Life Identification Number (LIN) coding system as a powerful and versatile approach for classifying Neisseria gonorrhoeae lineages. The authors show that LIN codes capture both previously defined lineages and their relationships in a way that aligns with the species' phylogenetic structure. The compelling evidence presented, together with its integration into the PubMLST platform, underscores its strong potential to enhance epidemiological surveillance and advance our understanding of gonococcal population biology.

https://doi.org/10.7554/eLife.107758.2.sa2

Significance of findings

important: Findings that have theoretical or practical implications beyond a single subfield

landmark
fundamental
important
valuable
useful

Strength of evidence

compelling: Evidence that features methods, data and analyses more rigorous than the current state-of-the-art

exceptional
compelling
convincing
solid
incomplete
inadequate

During the peer-review process the editor and reviewers write an eLife assessment that summarises the significance of the findings reported in the article (on a scale ranging from landmark to useful) and the strength of the evidence (on a scale ranging from exceptional to inadequate). Learn more about eLife assessments

Abstract

Investigation of the bacterial pathogen Neisseria gonorrhoeae is complicated by extensive horizontal gene transfer: a process which disrupts phylogenetic signals and impedes our understanding of population structure. The ability to consistently identify N. gonorrhoeae lineages is important for surveillance of this increasingly antimicrobial resistant organism, facilitating efficient communication regarding its epidemiology; however, conventional typing systems fail to reflect N. gonorrhoeae strain taxonomy in a reliable and stable manner. Here, a N. gonorrhoeae genomic lineage nomenclature, based on the barcoding system of Life Identification Number (LIN) codes, was developed using a refined 1430 core gene MLST (cgMLST). This hierarchical LIN code nomenclature conveys lineage information at multiple levels of resolution within one code, enabling it to provide immediate context to an isolate’s ancestry, and to relate to familiar, previously used typing schemes such as Ng cgMLST v1, 7-locus MLST, or NG-STAR clonal complex (CC). Clustering with LIN codes accurately reflects gonococcal diversity and population structure, providing insight into associations between genotype and phenotype for traits such as antibiotic resistance. These codes are automatically assigned and publicly accessible via the pubmlst.org/organisms/neisseria-spp database.

Introduction

Taxonomic classification is necessary to unravel evolutionary relationships, while also enabling the establishment of nomenclatures that facilitate effective communication of where an organism falls on a phylogeny [1]. This is particularly important for lineages within bacterial species, which are often distinct in clinically relevant phenotypes including antimicrobial resistance (AMR) and virulence [1]. The ability to identify these lineages consistently via a stable nomenclature system is essential to epidemiological surveillance [1].

In N. gonorrhoeae, the identification of AMR associated lineages is particularly relevant. The gonococcus is a multi-drug resistant pathogen included in the WHO priority list for AMR [2, 3]. Globally, gonorrhoea causes a high burden of disease with an estimated 86.9 million new cases annually, which, if not successfully treated, can cause sequelae such as infertility and pelvic inflammatory disease [3, 4]. Facing the dual challenges of AMR surveillance and gonococcal vaccine development [5], accurate characterisation of gonococcal population structure and consistent identification of key lineages is of upmost importance [6, 7].

A variety of approaches have been applied to N. gonorrhoeae lineage taxonomy, which can cause confusion and miscommunication. Underlying these disparate nomenclatures is the complex population structure of the gonococcus; it is a highly recombinogenic organism, carrying out extensive horizontal gene transfer (HGT) between gonococci, and more rarely with other species [8–11]. Polymorphisms are continually reassorted through this process, leading to low levels of linkage disequilibrium and disrupting clonal population structure [8, 12]. Consequently N. gonorrhoeae has been described as a “sexual clone”, reflecting the strong impact of HGT and the therefore weak clonal signal present in gonococcal genomes [7, 13]. This low signal means that many molecular typing methods that are effective at capturing lineages in relatives such as N. meningitidis are not as reliable in N. gonorrhoeae [7]. The result has been a proliferation of differing approaches and nomenclatures seeking to capture N. gonorrhoeae taxonomy, particularly since 2010 [14–16].

Most of these methods are Multi-Locus Sequence Typing (MLST) based. MLST is a widely used molecular approach for characterising bacterial isolates through the combination of alleles across a set of genes (an ‘allelic profile’) [17]. Each unique combination of alleles across the profile represents a sequence type (ST), which can be used, where they correspond with clonal inheritance, to characterise lineages [1]. Isolates can be further grouped based on these STs, using methods such as goeBURST [18] or single linkage clustering [19].

MLST is a powerful approach that has had widespread applications in the analysis of bacterial lineages [1]; however, conventional MLST approaches such as 7-locus MLST use only a small number of loci, which can make this system more vulnerable to disruptive HGT. In bacteria such as N. gonorrhoeae, HGT is extensive enough that it affects the housekeeping genes used in 7-locus MLST [7]. This makes 7-locus MLST a sub-optimal tool when applied in N. gonorrhoeae typing, as two isolates with the same ST may not actually be closely related, having come by the same ST by HGT rather than clonal inheritance [7]. This process affects several gonococcal typing systems, including NG-MAST [7, 15] and NG-STAR [7, 16].

Typing systems using larger numbers of loci can address this problem. For example, core gene MLST (cgMLST) is an extension of the MLST approach that uses a profile of hundreds of core genes to gain higher resolution, while also diluting the taxonomy-jumbling effects of HGT of some of those loci [1, 7, 20, 21]. In N. gonorrhoeae cgMLST v1, genes were defined as core if they were present in >95% of genomes, resulting in a core gene list (scheme) of >1600 core genes [7]. If two isolates were identical across this scheme, allowing for up to 50 loci to be unannotated, they would belong to the same core genome sequence type (cgST) [7]. Single-linkage clustering was then applied to define core genome ‘groups’ based on various threshold levels of allelic differences within and between groups [7]. This clustering was necessary as the large number of loci used in cgMLST results in large numbers of unique cgSTs.

This method provided a high-resolution classification of N. gonorrhoeae lineage taxonomy, while being less affected by HGT than 7-locus MLST, NG-STAR or NG-MAST [7]. However, single-linkage clustering is susceptible to group fusion when intermediate isolates are found that bridge the identity thresholds between linkage groups [21, 22]. HGT can exacerbate this issue, homogenising sequences across lineages by spreading polymorphisms throughout the population. As a result, the Ng cgMLST v1 core genome group nomenclature was not stable, and changed over time as more N. gonorrhoeae isolates were sampled, decreasing the resolution provided and potentially leading to confusion as the nomenclature shifts. Therefore, there is still a need for a N. gonorrhoeae lineage nomenclature that provides similarly effective discrimination between gonococcal lineages, while also remaining consistent over time.

Here, a definitive nomenclature for N. gonorrhoeae is proposed, using LIN codes. This isolate barcoding system has previously been applied to Klebsiella pneumoniae [22] and Streptococcus pneumoniae [23], where LIN codes were shown to provide a high resolution nomenclature, enabling accurate insight into the population structure of these organisms. LIN code combines the HGT-diluting effects of a cgMLST scheme with the stability of a numeric barcode [21, 22]. This stability is derived from the fixed nature of these barcodes for each isolate, along with the flexibility of a multi-position numeric code which can increase infinitely as new variants are sequenced [21, 22, 24]. The N. gonorrhoeae LIN code developed here was implemented in PubMLST, a freely accessible bacterial genomics database (RRID:SCR_012955) [25]. LIN codes are automatically assigned to all WGS uploaded to PubMLST, making this taxonomy easily applicable to old and new datasets, and encouraging the widespread use of this reliable nomenclature.

Methodology

Representative Isolate Collections

Data were extracted from the PubMLST database (pubmlst.org/organisms/neisseria-spp), a publicly accessible bacterial genomics database [25]. At the time of writing, sequence records from over 28,000 gonococci were available in the database. To facilitate efficient analyses representative sub-datasets were generated.

Dataset 1 consisted of a representative collection of 896 isolates belonging to a range of core genome groups at the 300 allelic mismatch threshold [Figure 1]. This was assembled by including all isolates belonging to core genome groups represented by 6 or fewer isolates, in combination with a random selection of 6 isolates from each of the core genome groups represented by more than 6 isolates (Supplementary table 1). Randomisation was achieved in R, using the sample function [26]. This smaller dataset was used for preliminary analysis, as it required less computational power.

Characteristics of isolates within representative development datasets 1 and 2.
This includes geographical distribution (top panels), frequency of isolates sampled over time (middle panels), and genome quality statistics (lower panels) including i) contig number, ii) total genome length and iii) % GC content.

Dataset 2 consisted of 3935 isolates [Figure 1], assembled by including all isolates belonging to core genome groups represented by 20 or fewer isolates (again at allelic mismatch threshold 300), and a random sample of 30% of all core genome groups represented by more than 20 isolates, capped at 350 isolates per core genome group (Supplementary table 2). This larger dataset was used for final analysis once techniques had been explored using dataset 1.

Development of Ng cgMLST v2

Although a 1649 locus Ng cgMLST v1 scheme for N. gonorrhoeae has been published [7], an updated, more readily auto-annotated cgMLST scheme was needed to ensure any WGS deposited in the PubMLST database can be assigned a LIN code with little to no manual curation [22, 23]. This was a consequence of Ng cgMLST v1 including a number of genes with alternate start codons or length variation (e.g. due to internal stop codons or frameshift mutations), complicating automatic allele annotation by PubMLST.

WGS data from dataset 1 were downloaded from PubMLST and annotated with Prokka using default parameters (RRID:SCR_014732) [27]. The resulting .gff output was further analysed with PIRATE (version 1.0.5) for pangenome analysis including the identification of core genes (RRID:SCR_017265) [28]. PIRATE was run using default parameters with the following options: -a (align all genes), -r (plot summaries using r) and –t 24 (24 threads). A preliminary threshold of 95% presence was chosen as ‘core’ rather than 100% to ensure a large enough gene list was identified. This also allowed for the possibility that some genes may be absent due to factors such as mis-assembly or mutation while still being ‘core’ in nature. Genes with duplication events as detected by PIRATE were excluded.

The new core list was assessed in dataset 2 to identify possible curation issues across a wider range of isolates. Following this analysis, all loci that did not meet a threshold of 98% presence in dataset 2 were excluded. This reduced Ng cgMLST v2 prioritised loci that annotate with very little human involvement via manual curation.

The PubMLST GrapeTree plugin was used to generate minimum spanning trees based on the Ng cgMLST v2 loci, annotated with core genome groups (based on Ng cgMLST v1), to ensure resolution was high enough with this new reduced scheme to explore established gonococcal lineages [29]. GrapeTree compares isolates across a user-defined allelic profile, clustering those that share the most alleles [29].

Analysis of Population Structure and Threshold Selection

Thresholds for LIN code were chosen based on an analysis of natural discontinuities in the gonococcal population structure, assessed by examining the distribution of pairwise allelic mismatches amongst isolates from dataset 2, combined with testing in a local installation of LIN code [30] and the application of clustering statistics including Rand index and silhouette score.

In order to extract pairwise allelic distances for these analyses, a distance matrix was generated comparing dataset 2 isolates across the loci included in Ng cgMLST v2. This was accomplished using the genome comparator plugin in PubMLST. This tool provides gene-by-gene pairwise analysis of isolates across a list of loci, generating a variety of outputs including a distance matrix, core gene list, and alignments [25]. Here, genome comparator was run with default settings. The distances in this matrix were stacked to form a dataframe with three columns (isolate A, isolate B, distance), allowing the distances to be plotted in a histogram using the ggplot2 package in R (RRID:SCR_014601) [31].

Ridgeline plots were created based on these data to explore the number of allelic mismatches associated with clusters identified by pre-existing typing systems such as NG-STAR clonal complexes and Ng cgMLST v1 core genome groups. This was done by extracting the pairwise allelic distances associated with each instance of matching NG-STAR CC, rMLST-ST, or matching Ng cgMLST v1 core genome group, from the distance matrix. Ridgeline plots were constructed in R using the packages ggplot2 and ggridges (RRID:SCR_024511) [31, 32].

Based on the observed breaks in population structure, various bin thresholds were tested using a local version of LIN code downloaded from https://gitlab.pasteur.fr/BEBP/LINcoding applied to dataset 2 [30]. The 3935 isolates yielded 3877 unique cgSTs for testing.

Clustering output from the local LIN code testing using various allelic mismatch thresholds was assessed using statistics, specifically the silhouette score and Rand index. The silhouette score indicates the cohesion of clustering at a particular threshold, varying from -1 to 1 with more positive values indicating higher cluster cohesiveness [33]. The average silhouette score was calculated using MSTClust (v0.21b) (downloaded from https://gitlab.pasteur.fr/GIPhy/MSTclust) for pairwise distance thresholds from 0.01 – 0.7 [34]. Results were plotted using the R package ggplot2 [31].

The Rand index is a statistical measure that indicates the level of similarity between two different partitions of the same dataset, with values approaching 1 indicating higher similarity [35]. The Rand index was applied here to compare local LIN code clustering at various thresholds against Ng cgMLST v1 core genome group clustering at threshold 300 and 400. The index was calculated in R using the package fossil [36].

Finally, LIN codes generated using these thresholds on the representative dataset were visualised on a minimum spanning tree to explore what length of code prefix would be most suited to exploring population structure at various resolutions.

Implementation and Analyses

Once the refined Ng cgMLST v2 scheme was defined in PubMLST and isolates annotated with sequence types, the chosen thresholds were implemented within the database to generate LIN codes. A maximum of 25 missing loci in the Ng cgMLST v2 scheme was tolerated. If this quality threshold was not met, no cgST would be assigned to the isolate, and hence no LIN code. LIN codes were designated in PubMLST ordered based on a minimum spanning tree, using a default batch size of 10,000 isolates [21]. Multilevel single-linkage clustering was used to classify isolates at each threshold within the code. The combination of these clusters with a fixed bar code facilitates LIN code’s stability [22, 23]. Subsequently, isolate records that have a cgST but no LIN code assigned will be scanned once a week, and LIN codes automatically designated in batches, starting at 23:00 (UK local time) each Sunday. This includes records kept in private projects or embargoed, as well as public datasets.

Once LIN codes were designated within the database, PubMLST was further applied to investigate the distribution of LIN codes and to examine their association with other isolate typing systems such as 7-locus MLST. This was undertaken using the ‘Field breakdown’ and ‘Combinations’ analysis tools, alongside exporting datasets for manipulation.

A published dataset of 171 ceftriaxone resistant gonococci from diverse phylogroups [37] was selected for analysis, to validate LIN code’s ability to reproduce complex phylogenies. Isolates from this dataset not already present in PubMLST were downloaded from the sequence read archive, and where necessary were assembled using SPAdes (version 3.15.4-GCC-12.3.0, RRID:SCR_000131) [38]. These isolates were then uploaded to PubMLST and associated under the appropriate publication record (Fifer et al., 2024). One isolate (H18-368) out of 171 no longer had sequence data available, and so could not be included in this analysis.

Phylogenetic Trees

Minimum spanning trees were drawn using the GrapeTree plugin in PubMLST [29]. Neighbour joining trees were constructed using the ITOL plugin within the PubMLST interface.

Nucleotide alignments for other trees were generated via genome comparator in PubMLST [25], using the MUSCLE algorithm (RRID:SCR_011812) , with settings to align all loci, not only variable loci. Maximum likelihood trees were constructed using RaxML (version 8.2.12-hybrid-avx2 gompi/2021b, RRID:SCR_006086) [39], and approximate maximum likelihood trees using FastTree (version 2.1.1, RRID:SCR_015501) [40]. Trees were corrected for recombination using ClonalFrameML (version 1.13, RRID:SCR_016060) [41] and edited using ITOL [42].

Specific parameters and scripts for all programs have been documented at www.protocols.io/ DOI: https://doi.org/10.17504/protocols.io.4r3l21beqg1y/v1.

Results

Core genome MLST Version 2

The final Ng cgMLST v2 included 1430 loci. Of these, 362 were hypothetical genes of unknown function, 96 encoded transferases, 56 synthases, 45 transporters, 34 50S ribosomal proteins, 29 transcriptional regulators, and 21 30S ribosomal proteins (Supplementary table 3). This represented a decrease of 219 loci compared to Ng cgMLST v1, with the largest categories of excluded loci being hypothetical (83 loci) or phage associated (18 loci) (Supplementary table 4). Notable exclusions include tbpA and tbpB, essential iron acquisition proteins that are hypervariable and so not readily auto-annotated. In total 1418 loci were shared between the two schemes, with 12 new to Ng cgMLST v2 (Supplementary table 4).

Of the 1430 loci included, 99% (1418/1430) had under 1000 alleles, and an average length of less than 3000 base pairs [Figure 2]. Exceptionally variable genes included NEIS2020 porB (5544 alleles), NEIS2644 an unnamed phage tail protein (3475 alleles), NEIS0829 pilW (1474 alleles) and NEIS2113 tamB (1122 alleles) [Figure 2c].

Allele number vs allele length for the 1430 genes in Ng cgMLST v2.
Figure 2a) All 1430 genes included in Ng cgMLST v2 are plotted. 2b) The four genes with the highest average length, and the four with the highest number of alleles, were excluded as outliers. The figure was compiled excluding these genes in order to allow a closer examination of the distribution of allele length vs number. One gene, NEIS2113, appeared in both lists. **2c)** Table summarising the average length in base pairs, number of alleles, and gene name/function of the seven outlier loci.

It should be noted that while the new scheme is fundamentally still a core genome MLST scheme, it does not include all core genes, and should not be used as a definitive core gene list. However, the strict inclusion criteria for Ng cgMLST v2 facilitated automated assignment of a cgST to a higher proportion of WGS data than its predecessor, while still providing a similar degree of resolution.

Analysis of Population Structure and Allelic Mismatch Thresholds

The number of allelic mismatches present in a pairwise comparison across the 3935 isolates in dataset 2 was assessed across the 1430 core genes in Ng cgMLST v2. When visualised as a histogram [Figure 3a], an uneven distribution of allelic mismatches was observed, exhibiting a highly varied incidence. The allelic mismatch modes indicate clusters of isolates that share the same proportion of alleles in common across the core genome scheme, reflecting natural breaks in the gonococcal population structure [22].

Plots used in the selection of allelic mismatch thresholds for the LIN code.
Figure 3a) Histogram showing the frequency of pairwise allelic mismatches within dataset 2. A subset of the allelic mismatch thresholds applied in the gonococcal LIN code are shown (blue dashed lines) at 25 mismatches (1.75%), 125 (8.74%), 300 (20.98%), 550 (38.46%), and 650 (45.45%). **3b)** Ridgeline plots depicting the frequency of allelic mismatches amongst pairs of isolates that belong to the same category of different metrics from dataset 2. From top to bottom: Ribosomal MLST (RST), NG-STAR Clonal Complex (NgSTARCC), Country, Ng cgMLST v1 core genome group at threshold 400 (cgc_400), threshold 300 (cgc_300), and threshold 200 (cgc_200)). **3c)** Plot of adjusted Rand index comparing LIN code clustering at various allelic mismatch thresholds to Ng cgMLST v1 core genome groups at threshold 300 (black dots) and NG-STAR CC (blue squares). Clustering was compared using dataset 2. **3d)** Plot of silhouette index (score) at various cutoff values, based on MSTclust analysis of 1430 core loci across 3935 representative N. gonorrhoeae isolates (dataset 2). Silhouette score peaked at 0.64 at a cutoff value of 0.225.

The most frequently identified number of allelic mismatches between isolates was 840/1430 (59%). This modal mismatch is low compared to that seen in K. pneumoniae (within species mode 520/629 mismatches, 83%) [22]. There were no instances of pairwise allelic mismatches exceeding 970/1430 loci (68%), meaning that all isolates shared a minimum of 460 alleles with one other isolate in the dataset. This means that compared to K. pneumoniae, N. gonorrhoeae isolates shared more of the same alleles across their core genome, and were therefore less diverse. This may be due to the homogenising effect of widespread HGT.

When pairwise allelic mismatches were analysed in isolates belonging to the same grouping, such as Ng cgMLST v1 core genome groups or NG-STAR CCs, discontinuities in the population structure associated with these metrics became visible [Figure 3b]. For isolates belonging to the same core genome group (at threshold 300), NG-STAR CC, or ribosomal ST, most pairs had under 125/1430 allelic mismatches between them [Figure 3b]. This indicates that shared identity in approximately 1305/1430 (91%) Ng cgMLST v2 loci is required to belong to the same core genome group (at threshold 300), NG-STAR CC or ribosomal ST. However, core genome groups at threshold 400 displayed a second peak at 510/1430 (36%) mismatches. This is a result of group fusion due to intermediate genotypes, which has led to divergent gonococci being included in the same group at this threshold. Isolates from the same country displayed a comparatively high average level of allelic mismatch, with a mode of 855/1430 (60%), indicative of limited geographical association of gonococcal lineages.

Rand index values when comparing LIN code bin thresholds to Ng cgMLST v1 core genome groups (at threshold 300) peaked at 1 for a 300-locus mismatch threshold (21%) [Figure 3c]. Similarly, the silhouette score demonstrated a peak of 0.64 at cut-off 0.225 [Figure 3d].

Based on this evidence, allelic mismatch thresholds were chosen for each bin of the LIN code. Thresholds at 125 loci (8.74% mismatch) and 300 (20.98% mismatch) were chosen to correspond to discontinuities in population structure [Figure 3a], alongside a Rand index close to 1 demonstrating their relevance to core genome groups, and silhouette score indicating high cluster cohesiveness at these thresholds [Figure 3c & d]. Higher thresholds at 550 loci (38.46% mismatch) and 650 (45.45%) captured superlineage divisions.

To provide the higher resolution necessary to identify transmission events or outbreak-related gonococcal variants, thresholds at lower levels of allelic mismatch were included. Together, this resulted in a set of 11 bins, corresponding to mismatch thresholds: 650, 550, 300, 125, 25, 10, 7, 5, 3, 1, and 0 [Figure 4]. Minimum spanning tree analysis indicated that a prefix using the first three thresholds was ideal for exploring lineages [Figure 5] corresponding to Ng cgMLST v1 core genome groups at threshold 300.

Illustration of the gonococcal LIN code nomenclature.
**4a)** Each successive allelic mismatch threshold dictates clustering at a specific position within the code. This clustering is hierarchical, such that isolates sharing a larger proportion the code (from the left across) are of higher genetic similarity. For example, Isolate A and B share a complete LIN code, meaning they have 0 allelic mismatches in their Ng cgMLST v2 loci. Isolate B and C share the first three digits of their LIN code; they belong to the same clusters at these thresholds, and therefore differ in less than 300 alleles out of the 1430 core genes in Ng cgMLST v2 i.e., they belong to the same LIN code “lineage”. **4b)** Rooted Maximum likelihood tree demonstrating how LIN codes reflect phylogenetic relationships. The first tree shows a subset of LIN code lineages within superlineage 0_2, with lineage 1_1_2 as the outgroup. Moving to the right, the figure focuses in on lineage 0_2_1, showing the higher resolution provided by LIN sublineages, and then groups. (Figure inspired by Figure 3 in Van Rensburg, Berger et al., 2024 [26])

Minimum spanning tree showing clustering of 3935 isolates from dataset 2 based on Ng cgMLST v2.
LIN code lineages **(5a)** and 7-locus MLST **(5b**) demonstrate similar levels of resolution for characterising clustering. LIN codes sublineages **(5c)** provide higher resolution, similar to that provided by NG-STAR clonal complexes **(5d).** Only categories including 20 or more isolates are coloured.

Implementation of LIN code

Across the >28,000 publicly available N. gonorrhoeae genomes stored in PubMLST, 25,912 had the minimum 1405/1430 core genes annotated, and therefore were assigned a LIN code. At the superlineage level (2 threshold prefix: 0_0) there were 118 clusters, at lineage (3 prefix: 0_0_0) 532 clusters, and at sublineage (4 prefix: 0_0_0_0) 1712 clusters. The 3-threshold prefix lineage definitions directly correlated with Ng cgMLST v1 core genome groups at threshold 300 (adjusted [adj.] Rand Index = 1). This was visualised using phylogenetic trees [Figure 6] and minimum spanning trees [Figure 5] which showed good congruence between LIN codes, SNP-based phylogeny, and other clustering methods. The lineages identified by LIN code demonstrated persistence over time, with lineage 0_2_1 isolates spanning 39 years from 1985 to 2024 (Supplementary Figure 1) N. gonorrhoeae LIN codes were accessed via an isolate’s information page, from the ‘allele designations/scheme fields’ dropdown box within the pubmlst.org/organisms/neisseria-spp isolate database search form, or through the ‘export dataset’ link at the bottom of the page following an isolate search. They could also be exported as metadata in GrapeTree analyses.

Unrooted FastTree of 1000 randomly chosen *N. gonorrhoeae* isolates with LIN codes assigned.
Constructed using 1430 loci from Ng cgMLST v2. Branches are coloured by the 8 most frequent superlineages (0_2 = red, 0_7 = orange, 0_0 = green, 0_18 = brown, 1_1 = blue, 5_2 = yellow, 19_01 = pink, and 1_8 = purple.) The 21 highest frequency LIN lineages are labelled, represented by the inner bar, in colour ranges corresponding to their superlineage colour. LIN codes form monophyletic groupings, indicating that there is a good degree of congruence between the allelic profile clustering method used in cgMLST LIN code and nucleotide sequence alignment-based phylogeny. Core genome groups at threshold 300 (cgc_300) are represented by the outer coloured bar, and show good concordance with LIN lineages, although fewer isolates were able to be annotated with core genome groups than LIN codes due to use of the larger and more poorly auto-annotated cgMLST v1. All isolates used in this tree had a LIN code assigned.

Congruence with Previous Typing Schemes

7-locus MLST-STs showed good association with LIN code lineages (0_0_0) (adj. Rand index = 0.92) [Figure 5], although many STs included multiple lineages. For example, MLST-ST 1901 corresponded with lineages 0_2_0 (1932/2376, 81%) and 0_0_12 (443/2376, 19%), while MLST-ST 7363 was predominantly captured by 0_0_52 (796/1322, 60%) and 0_2_17 (355/1322, 27%) (Supplementary table 5) The fact these LIN codes are from different superlineages (0_0 & 0_2) illustrates how these MLST-STs form polyphyletic groupings, combining isolates that are not closely related under the same ST.

NG-STAR clonal complexes (CCs) showed a stronger association with LIN code sublineages (0_0_0_0) (adj. Rand index 0.96) [Figure 5]. NG-STAR CC 90 belonged predominantly to sublineages 0_2_0_1 (1499/1562, 96%) and to a much lesser extent 0_0_12_2 (25/1562, 2%). NG-STAR CC 63 was represented mainly by 0_2_1_10 (1369/1437, 95%), followed by 1_1_22_9 (21/1437, 1%) and others (Supplementary table 5). Again, the diversity of these LIN codes at the superlineage level, and comparison with their location on the phylogeny [Figure 5 & 6], demonstrates how genetically divergent isolates can belong to the same CC.

NG-MAST-STs showed some correlation with LIN code groups (0_0_0_0_0) (adj. Rand index 0.50). For example, NG-MAST-ST 2992 belonged mostly to group 0_2_1_0_73 (409/653, 63%) and to others such as 0_2_1_1_4 (72/653, 11%). NG-MAST-ST 1407 predominantly belonged to group 0_2_0_1_0 (498/787, 63%) and to 0_2_0_1_27 (77/787, 10%).

Resistance associated lineage A, as described using hierarchical Bayesian analyses (BAPs) [43], was previously shown to associate well with Ng cgMLST v1 core genome groups [7]. As LIN code lineages correspond to core genome groups at threshold 300 [Figure 6], the same is true of this nomenclature. Lineage 0_2_0, 0_2_1, and 0_2_17 correspond to core genome group 3, 16 and 8, which in turn correlate with BAPS 8, 7, and 9 respectively, all within the resistance associated lineage A [7, 43].

Mosaic penA type 34 alleles, an important antimicrobial resistance determinant, correlated with LIN code lineage; 89.43% of these alleles were found in isolates belonging to lineage 0_2_0 (Supplementary Table 6).

Example Analysis

The gonococcal LIN code was applied in a reproduction of an analysis of 170 ceftriaxone resistant gonococci initially performed by Fifer et al., 2024 [37] (Supplementary Table 7). A maximum likelihood tree of these isolates was reconstructed and labelled using LIN codes [Figure 7], and demonstrated that LIN code lineages recaptured the diverse clades identified by phylogenetic methods, while also classifying several isolates that were not assigned a phylogroup in the original publication. Furthermore, 15 pairs or triplets of isolates sharing a full-length LIN code were identified within this dataset. Comparison with a random selection of isolates from across the PubMLST database [Figure 6] suggested an overrepresentation of 0_14_0 isolates in the ceftriaxone-resistant dataset. This dataset can be accessed by filtering by publication “Fifer et al., 2024” on the PubMLST Neisseria isolate search page.

Maximum Likelihood tree of 170 Ceftriaxone-resistant isolates previously analysed in Fifer et al., (2024).
Constructed using 1430 loci from Ng cgMLST v2. LIN code lineages were able reproduce the clades identified in Fifer et al., while being readily accessible and providing additional detail about each clade in the form of superlineage & sublineage divisions. Groups of isolates that share the same full length LIN code are highlighted as coloured stars.

Discussion

Accurate identification of bacterial lineages is necessary to examine links between bacterial population structure and characteristics such as AMR, and to enable surveillance of emergent or outbreak-associated variants [1]. However, in N. gonorrhoeae, widespread HGT reassorts DNA among isolates, disrupting clonal inheritance, and consequently distorting phylogenetic trees [7, 44]. This makes it difficult to accurately characterise gonococcal lineages, particularly when using conventional typing systems based on small numbers of genes, such as 7-locus MLST [7, 12]. Previously, the most reliable approach applied core genome MLST (cgMLST) to define core genome groups, using over 1000 genes to dilute the effects of HGT and facilitate high resolution analyses [1, 7, 45]. Unfortunately, this method suffers from cluster instability [22, 44]. LIN code provides an alternative means to leverage the benefits of cgMLST within a stable classification system, which can accommodate new genotypes without the risk of disrupting established clustering [21, 22]. Here, we developed a N. gonorrhoeae LIN code: an effective nomenclature for categorising, exploring, and discussing gonococcal population structure.

Evidence for the existence of distinct, persistent gonococcal lineages pre-dates whole genome sequencing, in the characterisation of auxotypes: related isolates that exhibit similar patterns of growth in media containing different combinations of key nutrients [46]. This may be due to metabolic competition amongst gonococcal strains, causing the population structure to segregate into distinct metabolic types [47]. Previous research has posited that selection can preserve such lineages even in the face of extensive HGT, as has been the case in immune selection for differing antigenic types in other bacterial species [47–49]. Sequence-based approaches including 7-locus MLST, NG-STAR and cgMLST have confirmed that the N. gonorrhoeae population contains identifiable groups of related isolates, some of which are associated with clinically relevant phenotypes such as AMR [7, 16, 50, 51], with outbreak events [52], or with specific at-risk groups such as men who have sex with men (MSM) [43, 53]. However, some of these typing systems are prone to providing misleading classifications, where isolates that are not closely related appear to be, due to the effects of HGT within the small numbers of loci used to classify them. Consequently, while equivalent systems have proved highly effective at categorising the closely related N. meningitidis [54], the extensive HGT observed across the gonococcal genome necessitates a whole genome approach to reliably identify related groups of N. gonorrhoeae isolates [7]. LIN code achieves this by applying cgMLST, using hundreds of loci belonging to a wide range of functional categories, including genes involved in metabolic pathways, genetic processing, and AMR. This allows the LIN code to capture gonococcal population structure with increased consistency, and higher resolution, than many previous approaches, providing fresh insight into the biology that underlies this species’ genetic composition.

The LIN nomenclature conveys lineage information in the form of hierarchical clustering at sequential thresholds of tolerated allelic mismatch within a cgMLST scheme (i.e. the number of alleles differing out of the total list of loci in the scheme) [22, 23]. Here, 11 thresholds were used, resulting in an 11-bin barcode. The leftmost numbers of the barcode represent clustering at the highest thresholds of allelic mismatch, here corresponding with LIN superlineage (up to 550 mismatches), LIN lineage (300 mismatches), and LIN sublineage (125 mismatches). Progressing across the barcode to the right, each number indicates clustering at an increasing level of allelic similarity, ultimately resulting in the delineation of highly related isolates that differ in a very small number of core loci [Figure 4] [21–23]. Subsets of thresholds, for example a prefix of the first three or four, may be used to define clustering down to a particular level of taxonomic similarity [22, 23]. If two isolates are highly related and therefore identical across their core gene profile, they will share the same LIN code.

The use of multiple thresholds enables the LIN code to relate back to several familiar nomenclatures, such as core genome groups and 7-locus MLST, which are both broadly equivalent to LIN lineages, and NG-STAR CCs, approximately equivalent to LIN sublineages. The difference lies in the accuracy and stability of the LIN code versus its predecessors. Visualisation via phylogenetic trees demonstrated that the LIN codes related strongly to phylogenetic structure, capturing distinct clades without overlap. This is in contrast to previous nomenclatures such as 7-locus MLST, which can form polyphyletic groups due to the misleading effects of HGT [7].

Furthermore, the multi-resolution, hierarchical nature of LIN codes also allows this nomenclature to encode more information about the relationship between isolates than conventional typing systems [21]. For example, if a pair of isolates belong to MLST 1 and 2, it can only be deduced that they differ in between 1 and 7 housekeeping genes. Meanwhile, if the same pair of isolates belong to LIN lineages 0_2_1 and 0_2_2, this communicates that these isolates differ by no more than 550/1430 core gene alleles, but by more than 300/1430, quickly defining the degree of their relatedness [Figure 4]. This is useful when comparing isolates, for example those associated with AMR. Previous approaches used to track AMR strains include both 7-locus MLST and NG-STAR, which types isolates based on their alleles across seven AMR genes (penA, mtrR, porB, ponA, gyrA, parC, and 23S rRNA) [55]. This typing system has been successfully used in many analyses of AMR associated gonococci, and NG-STAR CCs have been recommended as a tool for characterising gonococcal lineages [16, 50]. However, due to HGT, specific NG-STAR types or clonal complexes do not always represent closely related isolates. LIN sublineages were able to discern isolates with a similar level of resolution to NG-STAR CCs, but higher reliability. This will facilitate more accurate tracking of AMR associated isolates, such as those belonging to LIN sublineage 0_2_0_1, when compared to the equivalent NG-STAR CC 90, which also includes unrelated isolates belonging to LIN superlineages 0_0 and 1_1. While NG-STAR effectively characterises an isolate’s AMR genotype [55], it cannot consistently distinguish related isolates. LIN code fulfils this role, while also providing increased information about the degree of relatedness of the isolates in question. Accurately defining this lineage structure will be important to elucidate how AMR emerges in distinct gonococcal lineages.

As an allele-based method, LIN codes enable analysis of a large number of isolates rapidly and reproducibly [21, 22, 24]. This is an improvement on phylogenetic tools used to identify gonococcal lineages which rely on nucleotide sequence alignments, requiring significant time and expertise to compute [56]. LIN codes can replicate similar clustering, identifying the same lineages while being fully automated [21]. For example, hierarchical Bayesian analyses (BAPs) have previously been used to delineate gonococcal population structure into two lineages: lineage A, associated with high-risk sexual networks and AMR, and lineage B, associated with lower risk networks and susceptibility [43]. In our LIN code analysis, we observed that lineage A (BAPS 8, 7 and 9) corresponds to LIN lineages 0_2_0, 0_2_1, and 0_2_17, reaffirming that cgMLST is able to provide the same resolution as SNP-based techniques [7]. While BAPs analysis can be complex and time intensive to run, LIN codes can be used to rapidly identify lineage A gonococci simply by uploading WGS data to PubMLST. Within 24 hours of upload, any sequence data in which 1405/1430 core loci can be annotated (including private isolate records) will be assigned a Ng cgMLST v2 cgST; if this is a previously identified cgST a LIN code will be assigned immediately, if a new cgST the new LIN code will be assigned by the following Monday. Furthermore, these isolates can then be analysed in the context of the wider PubMLST gonococcal genome collection, which currently includes >28,000 public records and integrates accessible plugins such as GrapeTree [29] and Genome Comparator [25]. This facilitates further investigation, for example in identifying an isolates’ relationship to known AMR strains, or in the analysis of suspected transmission linked isolates from within the same community.

To further illustrate the efficacy of LIN code we reproduced an analysis of 170 global ceftriaxone resistant isolates [37]. The original article’s methodology involved a WGS alignment and generation of a maximum likelihood tree in order to characterise eight major phylogroups [37]. The gonococcal LIN code instantly reproduced these clusters, while providing additional detail about each clade in the form of superlineage/sublineage divisions and simultaneously contextualising the isolates amongst the wider PubMLST database. For example, lineage 0_14_0 was over-represented in this ceftriaxone-resistant dataset when compared against 1000 randomly selected N. gonorrhoeae isolates from PubMLST. Also, LIN code was able to classify several divergent isolates that were not assigned to any phylogroup in the original publication, and detected 15 instances of matching full-length LIN codes, meaning these isolates were identical across their core genome and could therefore represent isolates associated with transmission events [7]. The PubMLST interface facilitates in-depth analysis of these instances of shared LIN codes, allowing users to explore related isolates at various thresholds of allelic dissimilarity by viewing a table of similar isolates (as defined by LIN code) on an isolate’s information page [Figure 8]. This enables quick investigation of any related isolates, including their location, year, allele at a particular locus, or classification by other typing schemes such as NG-STAR.

Using PubMLST to explore related isolates by LIN code.
Within an isolate’s information page, here using isolate SC18-25 (PubMLST id: 165303) (https://pubmlst.org/bigsdb?page=info&db=pubmlst_neisseria_isolates&id=165303), it is possible to view a breakdown table of similar isolates by LIN code. This isolate shares a complete LIN code with 2 other isolates, meaning they are identical in their core genome. **(8a).** Clicking on the “matching isolates” number at a certain LIN code threshold then takes the user to the dataset of matching isolates for further analysis **(8b).** This feature can be applied in the investigation of transmission chains, outbreak events and the dissemination of AMR through clonal expansion.

The genetic mix-and-matching performed by N. gonorrhoeae can make characterisation of its population structure difficult [7]. However, the LIN code nomenclature proposed here provides clarity, consistency, and stability in its description of N. gonorrhoeae lineages. The multi-resolution clustering intrinsic to LIN code facilitates a common language around lineage nomenclature at different epidemiological levels, from high divisions such as superlineage down to unique clones [21]. In conclusion, N. gonorrhoeae LIN codes represent a portable, publicly available taxonomic nomenclature that has the potential to enhance surveillance of N. gonorrhoeae in order to benefit public health.

Data availability

All whole genome sequence data used in this publication is publicly available in the pubmlst.org/neisseria database. The specific datasets used are described in the supplementary files. Computational scripts are documented at https://doi.org/10.17504/protocols.io.4r3l21beqg1y/v1.

Acknowledgements

The authors would like to thank Nazreen Hadjirin and Iman Yassine for their valuable advice. Computational aspects of this work were enabled by the Oxford University Biomedical Research Computing (BMRC) facility.

Additional information

Author contributions

A.U created the updated Ng cgMLST v2 scheme, performed the allelic mismatch and statistical analyses, selected the LIN code thresholds, wrote the manuscript, and created all figures. O.B.H. and M.C.J.M conceptualised and supervised this work, and edited the manuscript. M.A.K contributed python and linux scripts and provided guidance on pangenome analysis. K.A.J implemented the cgMLST scheme and LIN code within the PubMLST database. K.M.P provided assistance with recombination correction.

Funding information

A.U was funded by the BBSRC Interdisciplinary Biosciences DTP (Grant no. BB/M011224/1), and subsequently an ECR Fellowship through the Nuffield Department of Population Health (NDPH), University of Oxford. M.C.J.M, K.A.J and PubMLST are funded by the Wellcome trust (Grant no. 218205/Z/19/Z). O.B.H was funded by the Wellcome trust (Grant no. 214374/Z/18/Z), and subsequently though a Senior Research Fellowship at NDPH, University of Oxford. M.A.K’s studentship was funded by the Ministry of Education, Indonesia, in collaboration with the Medical Science Division, University of Oxford. K.M.P was funded by the Wellcome trust (Grant no. 214374/Z/18/Z).

Funding

UKRI | Biotechnology and Biological Sciences Research Council (BBSRC) (BB/M011224/1)

Anastasia Unitt

Wellcome Trust

https://doi.org/10.35802/218205

Keith A Jolley
Martin CJ Maiden

Wellcome Trust

https://doi.org/10.35802/214374

Odile B Harrison
Kasia M Parfitt

Ministry of Education, Indonesia

Made A Krisna

Nuffield Department of Population Health, University of Oxford (NDPH)

Anastasia Unitt

Nuffield Department of Population Health, University of Oxford (NDPH)

Odile B Harrison

Additional files

Supplementary Tables and Figures

References

1.
1. Maiden M.C.J.
2. Van Rensburg M.J.J.
3. Bray J.E.
4. Earle S.G.
5. Ford S.A.
6. Jolley K.A.
7. McCarthy N.D.
2013MLST revisited: the gene-by-gene approach to bacterial genomicsNature Reviews Microbiology 11:728–736https://doi.org/10.1038/nrmicro3093 Google Scholar
2.
1. WHO
2018Report on global sexually transmitted infection surveillance 2018https://www.who.int/reproductivehealth/publications/stis-surveillance-2018/en/
3.
1. Unemo M.
2. Nicholas R.A.
2012Emergence of multidrug-resistant, extensively drug-resistant and untreatable gonorrheaFuture Microbiology 7:1401–1422https://doi.org/10.2217/fmb.12.117 Google Scholar
4.
1. Rowley J.
2. Vander Hoorn S.
3. Korenromp E.
4. Low N.
5. Unemo M.
6. Abu-Raddad L.J.
7. Chico R.M.
8. Smolak A.
9. Newman L.
10. Gottlieb S.
11. Thwin S.S.
12. Broutet N.
13. Taylor M.M.
2019Chlamydia, gonorrhoea, trichomoniasis and syphilis: global prevalence and incidence estimates, 2016Bulletin of the World Health Organization 97:548–562https://doi.org/10.2471/blt.18.228486 Google Scholar
5.
1. Lyu Y.
2. Choong A.
3. Chow E.P.F.
4. Seib K.L.
5. Marshall H.S.
6. Unemo M.
7. de Voux A.
8. Wang B.
9. Miranda A.E.
10. Gottlieb S.L.
11. Mello M.B.
12. Wi T.
13. Baggaley R.
14. Marshall C.
15. Abu-Raddad L.J.
16. Abara W.E.
17. Chen X.-S.
18. Ong J.J.
2023Vaccine value profile for Neisseria gonorrhoeaeVaccine https://doi.org/10.1016/j.vaccine.2023.01.053 Google Scholar
6.
1. Gottlieb S.L.
2. Jerse A.E.
3. Delany-Moretlwe S.
4. Deal C.
5. Giersing B.K.
2019Advancing vaccine development for gonorrhoea and the Global STI Vaccine RoadmapSexual Health 16:426https://doi.org/10.1071/sh19060 Google Scholar
7.
1. Harrison O.B.
2. Cehovin A.
3. Skett J.
4. Jolley K.A.
5. Massari P.
6. Genco C.A.
7. Tang C.M.
8. Maiden M.C.J.
2020Neisseria gonorrhoeae Population Genomics: Use of the Gonococcal Core Genome to Improve Surveillance of Antimicrobial ResistanceThe Journal of Infectious Diseases https://doi.org/10.1093/infdis/jiaa002 Google Scholar
8.
1. O’Rourke M.
2. Stevens E.
1993Genetic structure of Neisseria gonorrhoeae populations: a non-clonal pathogenJ Gen Microbiol 139:2603–11https://doi.org/10.1099/00221287-139-11-2603 Google Scholar
9.
1. Unitt A.
2. Maiden M.
3. Harrison O.
2024Characterizing the diversity and commensal origins of penA mosaicism in the genus NeisseriaMicrob Genom 10https://doi.org/10.1099/mgen.0.001209 Google Scholar
10.
1. Manoharan-Basil S.S.
2. Gonzalez Natalia
3. Laumen Jolein
4. Kenyon Chris
2022Horizontal Gene Transfer of Fluoroquinolone Resistance-Conferring Genes From Commensal Neisseria to Neisseria gonorrhoeae: A Global Phylogenetic Analysis of 20,047 IsolatesFrontiers in Microbiology https://doi.org/10.3389/fmicb.2022.793612 Google Scholar
11.
1. Bennett J.S.
2. Bratcher H.B.
3. Brehony C.
4. Harrison O.B.
5. Maiden M.C.J.
2014The Genus Neisseria
In:
1. Rosenberg E.
2. et al.
, editors. The Prokaryotes: Alphaproteobacteria and Betaproteobacteria Berlin, Heidelberg: Springer Berlin Heidelberg pp. 881–900
Google Scholar
12.
1. Hanage W.P
2016Not So Simple After All: Bacteria, Their Population Genetics, and RecombinationCold Spring Harbor Perspectives in Biology 8:a018069https://doi.org/10.1101/cshperspect.a018069 Google Scholar
13.
1. Harrison O
2021Recent advances in understanding and combatting Neisseria gonorrhoeae: a genomic perspectiveFaculty Reviews 10https://doi.org/10.12703/r/10-65 Google Scholar
14.
1. Harrison O.B.
2. Clemence M.
3. Dillard J.P.
4. Tang C.M.
5. Trees D.
6. Grad Y.H.
7. Maiden M.C.J.
2016Genomic analyses of Neisseria gonorrhoeae reveal an association of the gonococcal genetic island with antimicrobial resistanceJournal of Infection 73:578–587https://doi.org/10.1016/j.jinf.2016.08.010 Google Scholar
15.
1. Kwong J.C.
2. Gonçalves Da Silva A.
3. Dyet K.
4. Williamson D.A.
5. Stinear T.P.
6. Howden B.P.
7. Seemann T.
2016NGMASTER: in silico multi-antigen sequence typing for Neisseria gonorrhoeaeMicrobial Genomics 2https://doi.org/10.1099/mgen.0.000076 Google Scholar
16.
1. Golparian D.
2. Sánchez-Busó L.
3. Cole M.
4. Unemo M.
2021Neisseria gonorrhoeae Sequence Typing for Antimicrobial Resistance (NG-STAR) clonal complexes are consistent with genomic phylogeny and provide simple nomenclature, rapid visualization and antimicrobial resistance (AMR) lineage predictionsJournal of Antimicrobial Chemotherapy 76:940–944https://doi.org/10.1093/jac/dkaa552 Google Scholar
17.
1. Maiden M.C.J.
2. Bygraves J.A.
3. Feil E.
4. Morelli G.
5. Russell J.E.
6. Urwin R.
7. Zhang Q.
8. Zhou J.
9. Zurth K.
10. Caugant D.A.
11. Feavers I.M.
12. Achtman M.
13. Spratt B.G.
1998Multilocus sequence typing: A portable approach to the identification of clones within populations of pathogenic microorganismsProceedings of the National Academy of Sciences 95:3140–3145https://doi.org/10.1073/pnas.95.6.3140 Google Scholar
18.
1. Francisco A.P.
2. Bugalho M.
3. Ramirez M.
4. Carriço J.A.
2009Global optimal eBURST analysis of multilocus typing data using a graphic matroid approachBMC Bioinformatics 10:152https://doi.org/10.1186/1471-2105-10-152 Google Scholar
19.
1. Everitt B.S.
2. Landau S.
3. Leese M.
4. Stahl D.
2011An Introduction to Classification and Clustering
In:
1. Shewhart W.A.
2. Wilks S.S.
, editors. Cluster Analysis pp. 1–13
https://doi.org/10.1002/9780470977811.ch1 Google Scholar
20.
1. Maiden M.C.J.
2. Harrison O.B.
2016Population and Functional Genomics of Neisseria Revealed with Gene-by-Gene ApproachesJournal of Clinical Microbiology 54:1949–1955https://doi.org/10.1128/jcm.00301-16 Google Scholar
21.
1. Palma F.
2. Hennart M.
3. Jolley K.A.
4. Crestani C.
5. Wyres K.L.
6. Bridel S.
7. Yeats C.A.
8. Brancotte B.
9. Raffestin B.
10. David S.
11. Lam M.M.C.
12. Izdebski R.
13. Passet V.
14. Rodrigues C.
15. Rethoret-Pasty M.
16. Maiden M.C.J.
17. Aanensen D.M.
18. Holt K.E.
19. Criscuolo A.
20. Brisse S.
2024Bacterial strain nomenclature in the genomic era: Life Identification Numbers using a gene-by-gene approachCold Spring Harbor Laboratory https://doi.org/10.1101/2024.03.11.584534 Google Scholar
22.
1. Hennart M.
2. Guglielmini J.
3. Maiden M.C.J.
4. Jolley K.A.
5. Criscuolo A.
6. Brisse S.
2022A dual barcoding approach to bacterial strain nomenclature: Genomic taxonomy of Klebsiella pneumoniae strainsMol Biol Evol 39:2https://doi.org/10.1093/molbev/msac135 Google Scholar
23.
1. Rensburg M.
2. Berger D.
3. Yassine I.
4. Shaw D.
5. Fohrmann A.
6. Bray J.
7. Jolley K.
8. Maiden M.
9. Brueggemann A.
2024Development of the Pneumococcal Genome Library, a core genome multilocus sequence typing scheme, and a taxonomic life identification number barcoding system to investigate and define pneumococcal population structureMicrobial genomics 10https://doi.org/10.1099/mgen.0.001280 Google Scholar
24.
1. Vinatzer B.A.
2. Tian L.
3. Heath L.S.
2017A proposal for a portal to make earth’s microbial diversity easily accessible and searchableAntonie van Leeuwenhoek 110:1271–1279https://doi.org/10.1007/s10482-017-0849-z Google Scholar
25.
1. Jolley K.A.
2. Bray J.E.
3. Maiden M.C.J.
2018Open-access bacterial population genomics: BIGSdb software, the PubMLST.org website and their applications. Wellcome Open Research 3:124https://doi.org/10.12688/wellcomeopenres.14826.1 Google Scholar
26.
1. R Core Team
2018R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna4https://www.R-project.org/
27.
1. Seemann T
2014Prokka: rapid prokaryotic genome annotationBioinformatics 30:2068–2069https://doi.org/10.1093/bioinformatics/btu153 Google Scholar
28.
1. Bayliss S.C.
2. Thorpe H.A.
3. Coyle N.M.
4. Sheppard S.K.
5. Feil E.J.
2019PIRATE: A fast and scalable pangenomics toolbox for clustering diverged orthologues in bacteriaGigascience 8https://doi.org/10.1093/gigascience/giz119 Google Scholar
29.
1. Zhou Z.
2. Alikhan N.-F.
3. Sergeant M.J.
4. Luhmann N.
5. Vaz C.
6. Francisco A.P.
7. Carriço J.A.
8. Achtman M.
2018GrapeTree: visualization of core genomic relationships among 100,000 bacterial pathogensGenome Research 28:1395–1404https://doi.org/10.1101/gr.232397.117 Google Scholar
30.
1. Hennart M.
2019LINcodingGitLab https://gitlab.pasteur.fr/BEBP/LINcoding
31.
1. Wickham H
2016ggplot2: Elegant Graphics for Data AnalysisSpringer International Publishing https://doi.org/10.1007/978-0-387-98141-3 Google Scholar
32.
1. Wilke C.
2024ggridges: Ridgeline Plots in ’ggplot2’https://wilkelab.org/ggridges/
33.
1. Rousseeuw P.J
1987Silhouettes: A graphical aid to the interpretation and validation of cluster analysisJournal of Computational and Applied Mathematics 20:53–65https://doi.org/10.1016/0377-0427(87)90125-7 Google Scholar
34.
1. Criscuolo A.
2021MSTclustGitLab https://gitlab.pasteur.fr/GIPhy/MSTclust
35.
1. Rand W.M
1971Objective Criteria for the Evaluation of Clustering MethodsJournal of the American Statistical Association 66:846–850https://doi.org/10.1080/01621459.1971.10482356 Google Scholar
36.
1. Vavrek M.
2011fossil: palaeoecological and palaeogeographical analysis toolsPalaeontologia Electronica https://doi.org/10.32614/CRAN.package.fossil Google Scholar
37.
1. Fifer H.
2. Doumith M.
3. Rubinstein L.
4. Mitchell L.
5. Wallis M.
6. Singh S.
7. Singh G.J.
8. Rayment M.
9. Evans-Jones J.
10. Blume A.
11. Dosekun O.
12. Poon K.
13. Nori A.
14. Day M.
15. Pitt R.
16. Sun S.
17. Narayanan P.
18. Callan E.
19. Vickers A.
20. Minshull J.
21. Bennet K.
22. Johnson J.E.C.
23. Saunders J.
24. Alexander S.
25. Mohammed H.
26. Woodford N.
27. Sinka K.
28. Cole M.
2024Ceftriaxone-resistant Neisseria gonorrhoeae detected in England, 2015 to 2024; an observational studymedRxiv :2024.08.12.24311674https://doi.org/10.1101/2024.08.12.24311674 Google Scholar
38.
1. Prjibelski A.
2. Antipov D.
3. Meleshko D.
4. Lapidus A.
5. Korobeynikov A.
2020Using SPAdes De Novo AssemblerCurrent Protocols in Bioinformatics 70:e102https://doi.org/10.1002/cpbi.102 Google Scholar
39.
1. Stamatakis A
2014RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogeniesBioinformatics 30:1312–1313https://doi.org/10.1093/bioinformatics/btu033 Google Scholar
40.
1. Price M.N.
2. Dehal P.S.
3. Arkin A.P.
2010FastTree 2 – Approximately Maximum-Likelihood Trees for Large AlignmentsPLoS ONE 5:e9490https://doi.org/10.1371/journal.pone.0009490 Google Scholar
41.
1. Didelot X.
2. Wilson D.J.
2015ClonalFrameML: Efficient Inference of Recombination in Whole Bacterial GenomesPLOS Computational Biology 11:e1004041https://doi.org/10.1371/journal.pcbi.1004041 Google Scholar
42.
1. Letunic I.
2. Bork P.
2021Interactive Tree Of Life (iTOL) v5: an online tool for phylogenetic tree display and annotationNucleic Acids Research 49:W293–W296https://doi.org/10.1093/nar/gkab301 Google Scholar
43.
1. Sánchez-Busó L.
2. Golparian D.
3. Corander J.
4. Grad Y.H.
5. Ohnishi M.
6. Flemming R.
7. Parkhill J.
8. Bentley S.D.
9. Unemo M.
10. Harris S.R.
2019The impact of antimicrobials on gonococcal evolutionNature Microbiology 4:1941–1950https://doi.org/10.1038/s41564-019-0501-y Google Scholar
44.
1. Turner K.M.E.
2. Feil E.J.
2007The secret life of the multilocus sequence typeInternational Journal of Antimicrobial Agents 29:129–135https://doi.org/10.1016/j.ijantimicag.2006.11.002 Google Scholar
45.
1. de Korne-Elenbaas J.
2. Bruistena S.M.
3. van Dama A.P.
4. Maiden M.C.J.
5. Harrison O.B.
2022The Neisseria gonorrhoeae Accessory Genome and Its Association with the Core Genome and Antimicrobial ResistanceMicrobiology Spectrum https://doi.org/10.1128/spectrum.02654-21 Google Scholar
46.
1. Catlin B.W.
1978Chapter XIII Characteristics and Auxotyping of Neisseria gonorrhoeae
In:
1. Bergan T.
2. Norris J.R.
, editors. Methods in Microbiology Academic Press pp. 345–380
Google Scholar
47.
1. Gupta S
2024Darwin review: the evolution of virulence in human pathogensProceedings of the Royal Society 291https://doi.org/10.1098/rspb.2023.2043 Google Scholar
48.
1. Watkins E.R.
2. Maiden M. C.
3. Gupta S.
2016Metabolic Competition as a Driver of Bacterial Population StructureFuture Microbiology 11:1339–1357https://doi.org/10.2217/fmb-2016-0079 Google Scholar
49.
1. Buckee C.O.
2. Recker M.
3. Watkins E.R.
4. Gupta S.
2011Role of stochastic processes in maintaining discrete strain structure in antigenically diverse pathogen populationsProc Natl Acad Sci U S A 108:15504–9https://doi.org/10.1073/pnas.1102445108 Google Scholar
50.
1. Golparian D.
2. Cole M.J.
3. Sánchez-Busó L.
4. Day M.
5. Jacobsson S.
6. Uthayakumaran T.
7. Abad R.
8. Bercot B.
9. Caugant D.A.
10. Heuer D.
11. Jansen K.
12. Pleininger S.
13. Stefanelli P.
14. Aanensen D.M.
15. Bluemel B.
16. Unemo M.
2024Antimicrobial-resistant Neisseria gonorrhoeae in Europe in 2020 compared with in 2013 and 2018: a retrospective genomic surveillance studyLancet Microbe 5:e478–e488https://doi.org/10.1016/s2666-5247(23)00370-1 Google Scholar
51.
1. Sánchez-Busó L.
2. Cole M.J.
3. Spiteri G.
4. Day M.
5. Jacobsson S.
6. Golparian D.
7. Sajedi N.
8. Yeats C.A.
9. Abudahab K.
10. Underwood A.
11. Bluemel B.
12. Aanensen D.M.
13. Unemo M.
14. Pleininger S.
15. Indra A.
16. De Baetselier I.
17. Vanden Berghe W.
18. Hunjak B.
19. Blažić T.N.
20. Maikanti-Charalambous P.
21. Pieridou D.
22. Zákoucká H.
23. Žemličková H.
24. Hoffmann S.
25. Cowan S.
26. Schwartz L.J.
27. Peetso R.
28. Epstein J.
29. Viktorova J.
30. Ndeikoundam N.
31. Bercot B.
32. Bébéar C.
33. Lot F.
34. Buder S.
35. Jansen K.
36. Miriagou V.
37. Rigakos G.
38. Raftopoulos V.
39. Balla E.
40. Dudás M.
41. Ásmundsdóttir L.R.
42. Sigmundsdóttir G.
43. Hauksdóttir G.S.
44. Gudnason T.
45. Colgan A.
46. Crowley B.
47. Saab S.
48. Stefanelli P.
49. Carannante A.
50. Parodi P.
51. Pakarna G.
52. Nikiforova R.
53. Bormane A.
54. Dimina E.
55. Perrin M.
56. Abdelrahman T.
57. Mossong J.
58. Schmit J.-C.
59. Mühlschlegel F.
60. Barbara C.
61. Mifsud F.
62. Van Dam A.
63. Van Benthem B.
64. Visser M.
65. Linde I.
66. Kløvstad H.
67. Caugant D.
68. Młynarczyk-Bonikowska B.
69. Azevedo J.
70. Borrego M.-J.
71. Nascimento M.L.R.
72. Pavlik P.
73. Klavs I.
74. Murnik A.
75. Jeverica S.
76. Kustec T.
77. Vázquez Moreno J.
78. Diaz A.
79. Abad R.
80. Velicko I.
81. Unemo M.
82. Fifer H.
83. Shepherd J.
84. Patterson L.
2022Europe-wide expansion and eradication of multidrug-resistant Neisseria gonorrhoeae lineages: a genomic surveillance studyThe Lancet Microbe 3:e452–e463https://doi.org/10.1016/s2666-5247(22)00044-1 Google Scholar
52.
1. Smolarchuk C.
2. Wensley A.
3. Padfield S.
4. Fifer H.
5. Lee A.
6. Hughes G.
2018Persistence of an outbreak of gonorrhoea with high-level resistance to azithromycin in England, November 2014‒May 2018Eurosurveillance 23https://doi.org/10.2807/1560-7917.es.2018.23.23.1800287 Google Scholar
53.
1. Williamson D.A.
2. Chow E.P.F.
3. Gorrie C.L.
4. Seemann T.
5. Ingle D.J.
6. Higgins N.
7. Easton M.
8. Taiaroa G.
9. Grad Y.H.
10. Kwong J.C.
11. Fairley C.K.
12. Chen M.Y.
13. Howden B.P.
2019Bridging of Neisseria gonorrhoeae lineages across sexual networks in the HIV pre-exposure prophylaxis eraNat Commun 10:3988https://doi.org/10.1038/s41467-019-12053-4 Google Scholar
54.
1. Maiden M.C
2008Population genomics: diversity and virulence in the NeisseriaCurr Opin Microbiol 11:467–71https://doi.org/10.1016/j.mib.2008.09.002 Google Scholar
55.
1. Demczuk W.
2. Sidhu S.
3. Unemo M.
4. Whiley D.M.
5. Allen V.G.
6. Dillon J.R.
7. Cole M.
8. Seah C.
9. Trembizki E.
10. Trees D.L.
11. Kersh E.N.
12. Abrams A.J.
13. De Vries H.J.C.
14. Van Dam A.P.
15. Medina I.
16. Bharat A.
17. Mulvey M.R.
18. Van Domselaar G.
19. Martin I.
2017Neisseria gonorrhoeae Sequence Typing for Antimicrobial Resistance, a Novel Antimicrobial Resistance Multilocus Typing Scheme for Tracking Global Dissemination of N. gonorrhoeae StrainsJournal of Clinical Microbiology 55:1454–1468https://doi.org/10.1128/jcm.00100-17 Google Scholar
56.
1. Tonkin-Hill G.
2. Lees J.A.
3. Bentley S.D.
4. Frost S.D.W.
5. Corander J.
2019Fast hierarchical Bayesian analysis of population structureNucleic Acids Research 47:5539–5549https://doi.org/10.1093/nar/gkz361 Google Scholar

Article and author information

Author information

Anastasia Unitt
Nuffield Department of Population Health, University of Oxford, Oxford, United Kingdom
ORCID iD: 0000-0003-1392-9786
Made Krisna
Department of Biology, University of Oxford, Oxford, United Kingdom
ORCID iD: 0000-0002-7681-7135
Kasia M Parfitt
Department of Biology, University of Oxford, Oxford, United Kingdom
ORCID iD: 0000-0001-7511-1459
Keith A Jolley
Department of Biology, University of Oxford, Oxford, United Kingdom
ORCID iD: 0000-0002-0751-0287
Martin CJ Maiden
Department of Biology, University of Oxford, Oxford, United Kingdom
ORCID iD: 0000-0001-6321-5138
Odile B Harrison
Nuffield Department of Population Health, University of Oxford, Oxford, United Kingdom
ORCID iD: 0000-0002-1623-0295
- For correspondence: odile.harrison@ndph.ox.ac.uk

Author Notes

Competing interests: No competing interests declared

Version history

Preprint posted: May 20, 2025
Sent for peer review: June 6, 2025
Reviewed Preprint version 1: August 27, 2025
Reviewed Preprint version 2: November 28, 2025
Version of Record published: December 31, 2025

Cite all versions

You can cite all versions using the DOI https://doi.org/10.7554/eLife.107758. This DOI represents all versions, and will always resolve to the latest one.

Copyright

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

Metrics

views: 584
downloads: 40
citations: 2

Views, downloads and citations are aggregated across all versions of this paper published by eLife.

Significance of findings

Strength of evidence

Abstract

Introduction

Methodology

Representative Isolate Collections

Characteristics of isolates within representative development datasets 1 and 2.

Development of Ng cgMLST v2

Analysis of Population Structure and Threshold Selection

Implementation and Analyses

Phylogenetic Trees

Results

Core genome MLST Version 2

Allele number vs allele length for the 1430 genes in Ng cgMLST v2.

Analysis of Population Structure and Allelic Mismatch Thresholds

Plots used in the selection of allelic mismatch thresholds for the LIN code.

Illustration of the gonococcal LIN code nomenclature.

Minimum spanning tree showing clustering of 3935 isolates from dataset 2 based on Ng cgMLST v2.

Implementation of LIN code

Unrooted FastTree of 1000 randomly chosen N. gonorrhoeae isolates with LIN codes assigned.

Congruence with Previous Typing Schemes

Example Analysis

Maximum Likelihood tree of 170 Ceftriaxone-resistant isolates previously analysed in Fifer et al., (2024).

Discussion

Using PubMLST to explore related isolates by LIN code.