Martinize2 and Vermouth provide a unified framework for molecular topology generation

  1. Peter C Kroon
  2. Fabian Grünewald (corresponding author)
  3. Jonathan Barnoud
  4. Marco van Tilburg
  5. Chris Brasnett
  6. Paulo CT Souza
  7. Tsjerk A Wassenaar
  8. Siewert J Marrink (corresponding author)
  1. Groningen Biomolecular Sciences and Biotechnology Institute, University of Groningen, Netherlands
  2. Heidelberg Institute for Theoretical Studies (HITS), Germany
  3. Interdisciplinary Center for Scientific Computing, Heidelberg University, Germany
  4. CiTIUS Intelligent Technologies Research Centre, Spain
  5. Laboratoire de Biologie et Modélisation de la Cellule, CNRS, France
  6. Centre Blaise Pascal de Simulation et de Modélisation Numérique, Ecole Normale Supérieure de Lyon, France
11 figures, 3 tables and 1 additional file

Figures

Fundamental stages in topology generation from atomistic structures.

First, the provided input is parsed (step 1). Second, for every parsed residue, its atoms are identified and, if needed, atom names are corrected and missing atoms are added (step 2). Third, mappings are taken from the library and a resolution transformation to the required output resolution is performed (step 3). Fourth, intra-residue interactions are added from blocks taken from the library, and inter-residue interactions are added from links taken from the library (step 4). Fifth, optional post-processing is performed to add, for example, an elastic network (EN) (step 5). Finally, the produced topology is written to output files (step 6).

Organization of the Vermouth library.

The vermouth library defines 5 types of data structures (blue) to store molecular information and force field information. For convenience, it also defines two collection classes (orange) composed of several data-structure instances. Data structures are initialized by, or receive input from, parsers, which read 6 types of data files (see Appendix 1—table 1 for more details on file types). The central data structures are Molecule and System. These are changed, updated, or transformed by so-called Processor classes, which take force field data as input. Parsers, data structures, and Processors depend on only three Python libraries, as shown. At the moment, vermouth also exposes four types of writers (not shown here) to go along with the parsers (see Appendix 1—table 2).
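The relationship between the central data structures and the Processor classes can be illustrated with a minimal sketch. All class and method names below are hypothetical stand-ins chosen for illustration; they are not taken from the vermouth API, and the two toy processors do not correspond to any actual pipeline stage.

```python
from dataclasses import dataclass, field


@dataclass
class Molecule:
    # Hypothetical container: atom index -> dictionary of atom attributes.
    atoms: dict = field(default_factory=dict)


@dataclass
class System:
    # Hypothetical collection class holding several molecules.
    molecules: list = field(default_factory=list)


class Processor:
    """Base class: transforms every molecule in a system, one at a time."""

    def run_molecule(self, molecule):
        return molecule

    def run_system(self, system):
        system.molecules = [self.run_molecule(m) for m in system.molecules]
        return system


class UppercaseNames(Processor):
    """Toy processor that normalizes atom names to upper case."""

    def run_molecule(self, molecule):
        for attrs in molecule.atoms.values():
            attrs["atomname"] = attrs["atomname"].upper()
        return molecule


class DropHydrogens(Processor):
    """Toy processor that removes atoms whose name starts with 'H'."""

    def run_molecule(self, molecule):
        molecule.atoms = {
            idx: attrs
            for idx, attrs in molecule.atoms.items()
            if not attrs["atomname"].startswith("H")
        }
        return molecule


system = System([Molecule({0: {"atomname": "ca"},
                           1: {"atomname": "ha"},
                           2: {"atomname": "cb"}})])

# A pipeline is an ordered sequence of processors applied to the same system.
for processor in (UppercaseNames(), DropHydrogens()):
    system = processor.run_system(system)

remaining = [attrs["atomname"] for attrs in system.molecules[0].atoms.values()]
```

Under these assumptions, a complete workflow is an ordered list of small, independent transformations, so stages can be added, removed, or reordered without touching the data structures.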

Illustration of atom recognition, mapping, and linking in topology generation.

(a) To identify all atoms in the input molecule (black and orange) every residue in the molecule is overlaid with its canonical reference (blue and green). Atoms are recognized when they overlap with atoms in the reference (green). Atoms not present in the molecule are also identified (blue) and will be added later. Finally, atoms in the molecule not described by the canonical references are also labeled (orange) so that they may be identified later. (b) Identifying the terminal atoms that are not part of the canonical residues. The modification templates are depicted in blue and the atoms they match in orange. The cysteine does not participate since it does not carry any unexpected atoms and is depicted in gray for clarity. (c) Mappings (blue, orange, and green) describe a molecular fragment at two different resolutions and a correspondence between their particles. The correspondence is depicted approximately here. The mappings are applied to the molecule (black). (d) Example of applying a Link. The link depicted (dark blue) adds an angle potential over CG backbone beads.
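The three atom categories in panel (a) can be sketched as simple set operations. This is a deliberately simplified illustration that compares atom names only, whereas the overlay described above is structural; the function name and the serine reference list are assumptions made for the example.

```python
def classify_atoms(residue_atoms, reference_atoms):
    """Split atoms into recognized, missing, and unexpected groups.

    Simplification: atoms are compared by name rather than by overlaying
    the residue on its canonical reference.
    """
    residue = set(residue_atoms)
    reference = set(reference_atoms)
    return {
        "recognized": sorted(residue & reference),   # green in panel (a)
        "missing": sorted(reference - residue),      # blue: to be added later
        "unexpected": sorted(residue - reference),   # orange: labeled for later
    }


# Hypothetical heavy-atom reference for a serine residue.
SER_REFERENCE = ["N", "CA", "C", "O", "CB", "OG"]

# Input residue that lacks OG and carries an extra terminal oxygen.
observed = ["N", "CA", "C", "O", "CB", "OXT"]

result = classify_atoms(observed, SER_REFERENCE)
```

In this toy case the unexpected atom is the kind of terminal atom that panel (b) would later attribute to a modification template.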

Workflows for identifying protonation states or PTMs exemplified on protonated histidine.

In route (a), the residue name of the protonated histidine extracted from the atomistic coordinates matches the residue name in the library and matches the fragment. Hence, the protonation state is correctly picked up. In route (b), the residue name matches that of neutral histidine in the library. A mismatch of the fragments is recognized, and the extra hydrogen is labeled. Subsequently, by matching the extra hydrogen to a modification of the histidine block, the protonated histidine is recognized as neutral histidine plus a protonation modification, and the correct parameters for protonated histidine are generated.
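Route (b) can be sketched as a two-step lookup: first collect the atoms that the neutral block does not account for, then try to explain them with modification templates. The atom names, the template dictionary, and the function name are hypothetical and chosen only for illustration; matching here is by atom name, which is a simplification.

```python
def identify_modifications(residue_atoms, block_atoms, templates):
    """Explain atoms not covered by the block using modification templates.

    Returns the names of the matched modifications and any atoms that
    remain unexplained.
    """
    extra = set(residue_atoms) - set(block_atoms)
    found = []
    for name, template_atoms in templates.items():
        if template_atoms <= extra:
            found.append(name)
            extra -= template_atoms
    return found, extra


# Hypothetical atom list for a neutral histidine block (one ring hydrogen).
NEUTRAL_HIS = ["N", "CA", "C", "O", "CB", "CG", "ND1", "CD2", "CE1", "NE2", "HD1"]

# Hypothetical modification template: one additional ring hydrogen.
TEMPLATES = {"protonation": {"HE2"}}

# Input residue named as neutral histidine but carrying the extra hydrogen.
observed = NEUTRAL_HIS + ["HE2"]

modifications, unexplained = identify_modifications(observed, NEUTRAL_HIS, TEMPLATES)
```

An empty set of unexplained atoms indicates that the residue is fully described as the neutral block plus the listed modifications.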

Example of automated identification of PTMs.

CG Martini model of the phosphorylated tyrosine found in the EGFR kinase activation loop. The mapped structure of the phosphorylated residue is shown as beads overlying the atomistic structure.

Fine-tuning options for the elastic network.

(a) ENs and backbone bonds within the human insulin dimer when generated with the molecule or all option. The dimer consists of two chains colored in red and orange, which are connected by two disulfide bridges shown in purple. EN bonds are generated between the two chains and within the chains. (b) EN and backbone bonds within the insulin dimer when generated with the chain option. In this case, no elastic bonds are generated between the two chains. They are only connected by the disulfide bridges and non-bonded interactions. (c) EN within the FtsZ protein when generated for both the intrinsically disordered tail domains (orange) and the structural domain (red). (d) EN within the FtsZ protein when the EN is only generated within the structural domain by defining the EN unit as going from residues 12–320.
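The effect of the three options can be sketched with a distance-based bond generator. This is a minimal illustration under stated assumptions: the function name, its parameters, the bead representation, and the 0.9 nm cutoff are all chosen for the example and do not reproduce Martinize2's implementation or command-line options.

```python
from itertools import combinations
from math import dist


def elastic_bonds(beads, cutoff=0.9, unit="molecule", resid_range=None):
    """Return index pairs of beads to be connected by elastic bonds.

    beads: list of (chain, resid, (x, y, z)) tuples, coordinates in nm.
    unit: "molecule" allows bonds between chains, "chain" forbids them.
    resid_range: optional (first, last) residue numbers; both beads must
    fall inside the range for a bond to be generated.
    """
    bonds = []
    for (i, a), (j, b) in combinations(enumerate(beads), 2):
        if unit == "chain" and a[0] != b[0]:
            continue
        if resid_range is not None:
            first, last = resid_range
            if not (first <= a[1] <= last and first <= b[1] <= last):
                continue
        if dist(a[2], b[2]) <= cutoff:
            bonds.append((i, j))
    return bonds


# Toy system: two chains with two backbone beads each.
beads = [
    ("A", 1, (0.0, 0.0, 0.0)),
    ("A", 2, (0.5, 0.0, 0.0)),
    ("B", 1, (0.0, 0.5, 0.0)),
    ("B", 2, (2.0, 0.0, 0.0)),
]

across_chains = elastic_bonds(beads, unit="molecule")
within_chains = elastic_bonds(beads, unit="chain")
first_residues = elastic_bonds(beads, resid_range=(1, 1))
```

Restricting the unit to a chain removes the inter-chain bonds of panel (a), as in panel (b), while a residue range confines the network to a selected domain, as in panel (d).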

Ligands, cofactors, and polymers transformed to CG Martini level.

(a) Flavin Reductase with two FMNs and one NDP cofactor bound in the reference AA state and mapped to Martini CG as indicated by the spheres. The inset shows a zoom onto FMN; (b) Lysozyme with benzene ligand bound in the reference AA structure and mapped to Martini CG resolution; (c) Crown ether with Martini beads shown on top of the AA structure; (d) Branched polyethylene at AA resolution (left) and Martini resolution (right) with the linear chain part shown in gray and the branches in yellow.

Summary of the successes and failures of the high-throughput pipeline.

We ran the pipeline on the 87,084 structures from the template library used by the I-TASSER (Yang et al., 2015) protein prediction software, of which 73% could be converted with Martinize2. The other 26.4% failed mostly due to missing coordinates and unrecognized residues. For 100% of the converted structures, a GROMACS run input file (i.e., a tpr file) could be generated, and for all but 13 of the converted structures, an energy minimization could be performed.

Two examples of problematic atomistic protein structures flagged by Martinize2.

(a) The cysteine residue with too small O-O and O-C distances leads to superfluous bonds being recognized. (b) The incorrect interatomic distances in the histidine ring lead to missing bonds (transparent) and an erroneous O-N bond connecting the histidine to a neighboring asparagine. Additionally, a nitrogen atom is switched for an oxygen atom in asparagine.

Appendix 3—figure 1
Comparison of processing speeds between Martinize and Martinize2.

Eleven protein structures of increasing size, taken from a dataset of 200,000 structures, were processed with Martinize and Martinize2 to record the computation time. The protein structures were spaced evenly by residue count.

Appendix 4—figure 1
Example of finding all LCISs between graphs X and Y.

Grayed out nodes are not used (they are excluded from the comparison by the shrinking step), but are depicted for clarity. Since nodes A1 and A2 in Y are symmetry equivalent, not all subgraphs are taken into account. Those that are excluded due to symmetry reasons are depicted in the box Symmetry pruned. Iteration 1: We try to find a subgraph isomorphism between X and Y. None is found. Iteration 2: Y is shrunk to produce the graphs depicted. We try to find subgraph isomorphisms between these and X. None are found. Iteration 3: all graphs from iteration 2 are shrunk further. Since a subgraph isomorphism can be found between at least one of these ({A1, A2, B}) and X, the algorithm terminates afterwards. To highlight how often the algorithm discovers that {A1, B} is subgraph isomorphic to X, it is shown in bold.
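The shrinking strategy can be sketched with a brute-force search in pure Python. This is an illustrative sketch, not the published algorithm: it omits the symmetry pruning described above, tests every node subset by exhaustive permutation (practical only for very small graphs), and uses a hypothetical graph representation and a toy pair of graphs that differ from those in the figure.

```python
from itertools import combinations, permutations


def make_graph(labels, edges):
    """Hypothetical minimal graph: node labels plus a set of undirected edges."""
    return {"labels": dict(labels), "edges": {frozenset(e) for e in edges}}


def induced_matches(nodes, small, big):
    """Yield mappings under which small[nodes] is an induced subgraph of big."""
    nodes = list(nodes)
    for image in permutations(big["labels"], len(nodes)):
        mapping = dict(zip(nodes, image))
        labels_agree = all(
            small["labels"][n] == big["labels"][m] for n, m in mapping.items()
        )
        # Induced: every pair must be bonded in both graphs or in neither.
        edges_agree = all(
            (frozenset((a, b)) in small["edges"])
            == (frozenset((mapping[a], mapping[b])) in big["edges"])
            for a, b in combinations(nodes, 2)
        )
        if labels_agree and edges_agree:
            yield mapping


def largest_common_induced_subgraphs(x, y):
    """Shrink y one node at a time until some subgraph of it fits into x."""
    for size in range(len(y["labels"]), 0, -1):
        found = []
        for subset in combinations(y["labels"], size):
            found.extend(induced_matches(subset, y, x))
        if found:  # terminate at the first size that yields any match
            return found
    return []


# Toy graphs: X is a single A-B bond, Y has two equivalent A nodes bound to B.
X = make_graph({1: "A", 2: "B"}, [(1, 2)])
Y = make_graph({"a1": "A", "a2": "A", "b": "B"}, [("a1", "b"), ("a2", "b")])

matches = largest_common_induced_subgraphs(X, Y)
```

In this toy case the search returns two mappings that differ only in which of the equivalent A nodes is used, which is the kind of redundancy that symmetry pruning is meant to avoid.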

Tables

Appendix 1—table 1
Data parsers and the object returned, as well as format definition and extension.
| Extension | Data class | Parser name | Input format |
| --- | --- | --- | --- |
| .ff | Links, Block, Modifications | read_ff | in-house force-field format |
| .itp | Block | read_itp | GROMACS topology file; all [molecule] directive content |
| .map | Mapping | read_mapping | mapping file as defined using backwards style |
| .pdb | System | read_pdb | canonical PDB format |
| .gro | Molecule | read_gro | GROMACS .gro file |
Appendix 1—table 2
Data Writers and the object returned as well as format definition and extension.
| Extension | Data class | Writer name | Output format |
| --- | --- | --- | --- |
| .gro | System | write_gro | G96 gro file |
| .pdb | System | write_pdb | PDB file |
| .top | System | write_top | Pseudo topology file |
| .itp | System | write_itp | GROMACS topology file; all [molecule] directive content |
Appendix 2—table 1
Limited overview of selected competing tools capable of generating MD topologies.

‘Force field’ lists the force fields for which this tool can generate topologies without changing the source code. ‘Type of system’ describes the type of system this tool can generate topologies for. ‘External data files’ indicates whether the force field parameters used are included in separate data files, making it possible to easily change them. ‘Notes’ lists additional remarks and comments; there, ‘builds coordinates’ means the tool is capable of constructing coordinates for complete systems, rather than only for, e.g., missing side chains.

| Name | Force field | Type of system | External data files | Notes |
| --- | --- | --- | --- | --- |
| pdb2gmx (Abraham et al., 2015; Páll et al., 2015) | Any AA/UA | Linear polymers | Yes | |
| LEaP (Case et al., 2005) | Any AA/UA | Linear polymers | Yes | |
| CHARMM (Brooks et al., 2009) | Any AA/UA | Linear polymers | Yes | |
| Psfgen (Phillips et al., 2005) | Any AA/UA | Linear polymers | Yes | |
| Martinize 1 (de Jong et al., 2013; Uusitalo et al., 2015) | Martini | Proteins, DNA | No | |
| Sirah Tools (Machado and Pantano, 2016) | Any CG | Linear polymers | Yes | Performs mapping only |
| DoGlycans (Danne et al., 2017) | AMBER, OPLS | Sugars | Yes | Builds coordinates |
| HOOBAS (Girard et al., 2019) | Multiple | Multiple | Yes | No user interface, builds coordinates |
| CHARMM-GUI (Jo et al., 2017; Qi et al., 2015; Jo et al., 2008) | Multiple | Multiple | No | Web server, builds coordinates |
| VerMoUTH/Martinize2 | Multiple | Multiple | Yes | This work |
| ATB (Malde et al., 2011; Canzar et al., 2013) | GROMOS54a7, GROMOS54a8 | Small molecules | N/A | Automatic de novo parametrization |
| LigParGen (Jorgensen and Tirado-Rives, 2005; Dodda et al., 2017b; Dodda et al., 2017a) | OPLS-AA | Small molecules | N/A | Automatic de novo parametrization |
| CGenFF (Vanommeslaeghe and MacKerell, 2012) | CHARMM General Force Field | Small molecules | N/A | Automatic de novo parametrization |


Peter C Kroon, Fabian Grünewald, Jonathan Barnoud, Marco van Tilburg, Chris Brasnett, Paulo CT Souza, Tsjerk A Wassenaar, Siewert J Marrink (2025)
Martinize2 and Vermouth provide a unified framework for molecular topology generation
eLife 12:RP90627.
https://doi.org/10.7554/eLife.90627.4