Fundamental stages in topology generation from atomistic structures.

First, the provided input is parsed (step 1). Second, for every parsed residue its atoms are identified and, if needed, atom names are corrected and missing atoms are added (step 2). Third, mappings are taken from the library and a resolution transformation to the required output resolution is performed (step 3). Fourth, intra-residue interactions are added from blocks taken from the library, and inter-residue interactions are added from links taken from the library (step 4). Fifth, optionally, post-processing is performed to add e.g. an elastic network (step 5). Finally, the produced topology is written to output files (step 6).
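
The six steps above can be read as a chain of stages that each take the current state and return an updated one. The following is a minimal sketch under that assumption, not the martinize2 implementation: every function name, the dictionary-based state, and the layout of the toy library are hypothetical and chosen only for illustration.

```python
def parse_input(state):
    """Step 1: turn raw (name, atoms) records into residue dicts."""
    residues = [{"name": name, "atoms": list(atoms)} for name, atoms in state["raw"]]
    return {**state, "residues": residues}


def repair_residues(state):
    """Step 2: add atoms that the reference lists but the input lacks."""
    reference = state["library"]["reference_atoms"]
    repaired = []
    for res in state["residues"]:
        missing = [a for a in reference[res["name"]] if a not in res["atoms"]]
        repaired.append({**res, "atoms": res["atoms"] + missing, "added": missing})
    return {**state, "residues": repaired}


def apply_mappings(state):
    """Step 3: replace each residue by the beads its mapping prescribes."""
    mappings = state["library"]["mappings"]
    beads = [{"residue": r["name"], "beads": list(mappings[r["name"]])}
             for r in state["residues"]]
    return {**state, "beads": beads}


def add_interactions(state):
    """Step 4: blocks give intra-residue terms, links give inter-residue terms."""
    blocks = state["library"]["blocks"]
    interactions = []
    for index, res in enumerate(state["beads"]):
        for term in blocks[res["residue"]]:
            interactions.append(("block", index, term))
    # Hypothetical simplification: one link between each pair of neighbors.
    for index in range(len(state["beads"]) - 1):
        interactions.append(("link", index, index + 1))
    return {**state, "interactions": interactions}


def write_output(state):
    """Step 6: serialize the interactions as plain text lines."""
    lines = [" ".join(str(field) for field in item) for item in state["interactions"]]
    return {**state, "output": "\n".join(lines)}


def run_pipeline(raw, library, post_processors=()):
    """Run steps 1-4, any optional step 5 post-processors, then step 6."""
    state = {"raw": raw, "library": library}
    stages = [parse_input, repair_residues, apply_mappings, add_interactions,
              *post_processors, write_output]
    for stage in stages:
        state = stage(state)
    return state
```

Passing an extra callable in `post_processors` corresponds to the optional step 5; leaving it empty skips that step without changing the rest of the chain.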

Organization of the vermouth library.

The vermouth library defines five types of data structures (blue) to store molecular and force field information. For convenience, it also defines two collection classes (orange), each composed of several data-structure instances. Data structures are instantiated or populated by parsers, which read six types of data files (see Table S1 for more details on file types). The central data structures are Molecule and System. These are changed, updated, or transformed by so-called Processor classes, which take force field data as input. Parsers, data structures, and Processors depend on only three libraries, as shown. At the moment vermouth also exposes four types of writers (not shown here) to go along with the parsers (see Table S2).
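
The relationship between the central data structures and Processors can be sketched as follows. This is a stand-alone illustration of the pattern described in the caption, not the vermouth API: the class bodies, the method names `run_molecule` and `run_system`, the `RenameAtoms` processor, and the `aliases` entry in the force field dictionary are all assumptions made for this example.

```python
class Molecule:
    """Toy stand-in for a molecule: just an ordered list of atom names."""

    def __init__(self, atoms):
        self.atoms = list(atoms)


class System:
    """Toy stand-in for a system: several molecules plus force field data."""

    def __init__(self, molecules, force_field):
        self.molecules = list(molecules)
        self.force_field = force_field


class Processor:
    """Base class: transforms every molecule in a system."""

    def run_molecule(self, molecule, force_field):
        # Default behavior: leave the molecule unchanged.
        return molecule

    def run_system(self, system):
        system.molecules = [
            self.run_molecule(molecule, system.force_field)
            for molecule in system.molecules
        ]
        return system


class RenameAtoms(Processor):
    """Hypothetical processor that replaces aliased atom names."""

    def run_molecule(self, molecule, force_field):
        aliases = force_field.get("aliases", {})
        molecule.atoms = [aliases.get(name, name) for name in molecule.atoms]
        return molecule
```

In this sketch a workflow is a sequence of `run_system` calls on the same System, so each Processor only has to implement the per-molecule transformation.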

Illustration of atom recognition, mapping, and linking in topology generation.

a) To identify all atoms in the input molecule (black and orange), every residue in the molecule is overlaid with its canonical reference (blue and green). Atoms are recognized when they overlap with atoms in the reference (green). Atoms that are missing from the molecule are also identified (blue) and will be added later. Finally, atoms in the molecule that are not described by the canonical reference are labeled (orange) so that they may be identified later. b) Identification of the terminal atoms that are not part of the canonical residues. The modification templates are depicted in blue and the atoms they match in orange. The cysteine does not participate since it does not carry any unexpected atoms, and is depicted in grey for clarity. c) Mappings (blue, red, and green) describe a molecular fragment at two different resolutions and the correspondence between their particles. The correspondence is depicted only approximately here. The mappings are applied to the molecule (black). d) Example of applying a Link. The link depicted (dark blue) adds an angle potential over CG backbone beads.
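
The three atom categories in panel a) and the particle correspondence in panel c) can be illustrated with a small sketch. It makes two simplifying assumptions that are not taken from the source: atoms are compared by name only rather than by structural overlay, and a bead is placed at the unweighted center of geometry of the atoms mapped to it. The function names and data layout are hypothetical.

```python
def classify_atoms(residue_atoms, reference_atoms):
    """Sort atom names into recognized, missing, and unexpected (panel a)."""
    residue = set(residue_atoms)
    reference = set(reference_atoms)
    return {
        "recognized": sorted(residue & reference),   # green in the figure
        "missing": sorted(reference - residue),      # blue: to be added later
        "unexpected": sorted(residue - reference),   # orange: labeled for later
    }


def map_to_beads(positions, mapping):
    """Place each bead at the mean position of its mapped atoms (panel c).

    positions: atom name -> (x, y, z)
    mapping:   bead name -> list of atom names (assumed unweighted)
    """
    beads = {}
    for bead, atoms in mapping.items():
        coords = [positions[name] for name in atoms]
        beads[bead] = tuple(sum(axis) / len(coords) for axis in zip(*coords))
    return beads
```

Atoms reported as unexpected are the ones that a later step, such as the modification matching in panel b), would try to explain.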

Workflows for identifying protonation states or PTMs exemplified on protonated histidine.

In route a), the residue name of the protonated histidine extracted from the atomistic coordinates matches the residue name in the library, and the residue matches the corresponding fragment. Hence the protonation state is correctly picked up. In route b), the residue name matches that of neutral histidine in the library. A mismatch between the fragments is recognized and the extra hydrogen is labeled. Subsequently, by matching the extra hydrogen to a modification of the histidine block, the protonated histidine is recognized as neutral histidine plus a protonation modification, and the correct parameters for protonated histidine are generated.
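
Route b) can be sketched as a two-stage lookup: first find the atoms that the neutral block does not explain, then look for modifications whose atoms account for them. The sketch below is name-based and purely illustrative; the function name, the modification dictionary, and the atom names used in the usage example are assumptions, not the library's data.

```python
def identify_modifications(residue_atoms, block_atoms, modifications):
    """Explain atoms that are absent from the block via known modifications.

    Returns the names of the matching modifications and any atoms that
    remain unexplained after matching.
    """
    extra = set(residue_atoms) - set(block_atoms)
    matched = [
        name for name, atoms in modifications.items()
        if atoms and set(atoms) <= extra
    ]
    explained = set().union(*(modifications[name] for name in matched))
    return matched, sorted(extra - explained)


# Usage: a neutral histidine block plus one extra ring hydrogen.
neutral_his = ["N", "CA", "C", "O", "CB", "CG", "ND1", "CD2", "CE1", "NE2", "HE2"]
observed = neutral_his + ["HD1"]
known_modifications = {
    "protonation": ["HD1"],
    "phosphorylation": ["P", "O1P"],
}
matched, unexplained = identify_modifications(observed, neutral_his, known_modifications)
```

An empty list of unexplained atoms indicates that the residue is fully described by the block plus the matched modifications.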

Example of automated identification of PTMs.

CG Martini model of a phosphorylated tyrosine found in the EGFR kinase activation loop. The mapped structure of the phosphorylated residue is shown as beads overlaid on the atomistic structure.

Fine-tuning options for the elastic network.

a) Elastic network and backbone bonds within the human insulin dimer when generated with the molecule or all option. The dimer consists of two chains, colored in red and orange, which are connected by two disulfide bridges shown in purple. EN bonds are generated both between the two chains and within each chain. b) Elastic network and backbone bonds within the insulin dimer when generated with the chain option. In this case, no elastic bonds are generated between the two chains. They are connected only by the disulfide bridges and non-bonded interactions. c) Elastic network within the FtsZ protein when generated for both the intrinsically disordered tail domains (orange) and the structural domain (red). d) Elastic network within the FtsZ protein when the EN is generated only within the structural domain by defining the EN unit as going from resid 12 to 320.
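
The effect of the different EN units can be illustrated with a simplified pair search. This sketch assumes a single upper distance cutoff and ignores details such as a lower cutoff or the exclusion of directly bonded neighbors; the function name, the `unit` and `resid_range` parameters, and the bead dictionaries are hypothetical and do not reproduce the martinize2 options.

```python
import math
from itertools import combinations


def elastic_bonds(beads, cutoff, unit="all", resid_range=None):
    """Return index pairs of beads that receive an elastic bond.

    beads:       list of dicts with keys "chain", "resid", "pos"
    cutoff:      maximum distance for an elastic bond
    unit:        "all" allows bonds between chains, "chain" only within chains
    resid_range: optional (first, last) residue ids; both beads must lie inside
    """
    bonds = []
    for (i, a), (j, b) in combinations(enumerate(beads), 2):
        if unit == "chain" and a["chain"] != b["chain"]:
            continue
        if resid_range is not None:
            first, last = resid_range
            if not (first <= a["resid"] <= last and first <= b["resid"] <= last):
                continue
        if math.dist(a["pos"], b["pos"]) <= cutoff:
            bonds.append((i, j))
    return bonds


# Usage: two beads in chain A and one in chain B, all within the cutoff.
toy_beads = [
    {"chain": "A", "resid": 1, "pos": (0.0, 0.0, 0.0)},
    {"chain": "A", "resid": 2, "pos": (0.5, 0.0, 0.0)},
    {"chain": "B", "resid": 1, "pos": (0.0, 0.5, 0.0)},
]
all_bonds = elastic_bonds(toy_beads, cutoff=0.9, unit="all")
chain_bonds = elastic_bonds(toy_beads, cutoff=0.9, unit="chain")
```

With `unit="chain"` the inter-chain pairs are dropped, which mirrors panel b); restricting `resid_range` mirrors panel d), where only the structural domain carries an elastic network.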

Ligands, cofactors, and polymers transformed to CG Martini level.

a) Flavin reductase with two FMN cofactors and one NDP cofactor bound in the reference all-atom state and mapped to Martini CG, as indicated by the spheres. The inset shows a close-up of FMN; b) Lysozyme with a benzene ligand bound in the reference all-atom structure and mapped to Martini CG resolution; c) Crown ether with Martini beads shown on top of the all-atom structure; d) Branched polyethylene at all-atom resolution (left) and Martini resolution (right), with the linear chain part shown in gray and the branches in yellow.

Summary of the successes and failures of the high-throughput pipeline.

We ran the pipeline on the 87084 structures from the template library used by the I-TASSER (ref. 68) protein prediction software, of which 73% could be converted with martinize2. The other 26.4% failed mostly due to missing coordinates and unrecognized residues. For all of the converted structures, a GROMACS run input file (i.e., a tpr file) could be generated, and for all but 13 of the converted structures, an energy minimization could be performed.

Two examples of problematic atomistic protein structures flagged by martinize2.

a) A cysteine residue with O-O and O-C distances that are too small, which leads to superfluous bonds being recognized. b) Incorrect interatomic distances in the histidine ring lead to missing bonds (transparent) and to an erroneous O-N bond connecting the histidine to a neighboring asparagine. Additionally, a nitrogen atom is switched for an oxygen atom in the asparagine.
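
Both failure modes arise when bonds are inferred from interatomic distances. The sketch below shows the general idea under stated assumptions: it uses one distance threshold for all atom pairs (real tools typically use element-dependent criteria), distances are taken to be in nm, and the function names, the default threshold, and the coordinates in the usage example are invented for illustration.

```python
import math
from itertools import combinations


def infer_bonds(positions, max_bond_length):
    """Guess bonds as all atom pairs closer than max_bond_length."""
    return {
        frozenset((a, b))
        for a, b in combinations(positions, 2)
        if math.dist(positions[a], positions[b]) <= max_bond_length
    }


def flag_bond_problems(positions, expected_bonds, max_bond_length=0.18):
    """Compare distance-based bonds with the bonds a reference expects."""
    found = infer_bonds(positions, max_bond_length)
    expected = {frozenset(bond) for bond in expected_bonds}
    return {
        "superfluous": sorted(tuple(sorted(b)) for b in found - expected),
        "missing": sorted(tuple(sorted(b)) for b in expected - found),
    }


# Usage: two carboxylate oxygens placed too close to each other.
distorted = {
    "C": (0.0, 0.0, 0.0),
    "O": (0.12, 0.0, 0.0),
    "OXT": (0.10, 0.08, 0.0),
}
report = flag_bond_problems(distorted, [("C", "O"), ("C", "OXT")])
```

A non-empty `superfluous` list corresponds to the situation in panel a), a non-empty `missing` list to the transparent bonds in panel b).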