Introduction

The volume of data available in biology has increased tremendously (Marx, 2013; Stephens et al., 2015), through the emergence of high-throughput experimental technologies, often referred to as omics, and the development of efficient computational techniques, associated with high-performance computing resources. The Open Access (OA) movement to make research results free and available to anyone (including e.g. the Budapest Open Access Initiative and the Berlin Declaration on Open Access to Knowledge) has led to an explosive growth of research data made available by scientists (Wilson et al., 2021). The FAIR (Findable, Accessible, Interoperable and Reusable) principles (Wilkinson et al., 2016) have emerged to structure the sharing of these data, with the goals of fostering the reuse of research data and contributing to scientific reproducibility. This leads to a world where research data have become widely available and exploitable, and consequently new applications based on artificial intelligence (AI) have emerged. One example is AlphaFold (Jumper et al., 2021), which enables the construction of a structural model of any protein from its sequence. However, it is important to be aware that the development of AlphaFold was only possible because of the existence of extremely well annotated and cleaned open databases of protein structures (wwPDB, Berman et al. (2003)) and sequences (UniProt, Consortium (2022)). Similarly, accurate predictions of NMR chemical shifts and chemical-shift-driven structure determination were only made possible via a community-driven collection of NMR data in the Biological Magnetic Resonance Data Bank (Hoch et al., 2023). One can easily imagine novel possibilities of AI and deep learning reusing previous research data in other fields, if that data is curated and made available at a large scale (Fan and Shi, 2022; Mahmud et al., 2021).

Molecular Dynamics (MD) is an example of a well-established research field where simulations give valuable insights into dynamic processes, ranging from biological phenomena to material science (Perilla et al., 2015; Hollingsworth and Dror, 2018; Yoo et al., 2020; Alessandri et al., 2021; Krishna et al., 2021). By unraveling motions at levels of detail and timescales invisible to the eye, this technique complements numerous experimental approaches (Bottaro and Lindorff-Larsen, 2018; Marklund and Benesch, 2019; Fawzi et al., 2021). Nowadays, large amounts of MD data can be generated when modelling large molecular systems (Gupta et al., 2022) or when applying biased sampling methods (Hénin et al., 2022). Most of these simulations are performed to decipher specific molecular phenomena, but typically they are only used for a single publication. We have to confess that many of us used to believe that it was not worth the storage to collect all simulations (in particular since all might not have the same quality), but in hindsight this was wrong. Storage is exceptionally cheap compared to the resources used to generate simulation data, and these data represent a potential goldmine of information for researchers wanting to reanalyze them (Antila et al., 2021), in particular since modern machine-learning methods are typically limited by the amount of training data. In the era of open and data-driven science, it is critical to render the data generated by MD simulations not only technically available but also practically usable by the scientific community. In this endeavor, discussions started a few years ago (Abraham et al., 2019; Abriata et al., 2020; Merz et al., 2020), and the MD data sharing trend has been accelerated by the effort of the MD community to release simulation results related to the COVID-19 pandemic (Amaro and Mulholland, 2020; Mulholland and Amaro, 2020) in a centralized database (https://covid.bioexcel.eu). Specific databases have also been developed to store sets of simulations related to protein structures (MoDEL: Meyer et al. (2010)), membrane proteins in general (MemProtMD: Stansfeld et al. (2015); Newport et al. (2018)), G-protein coupled receptors in particular (GPCRmd: Rodríguez-Espigares et al. (2020)), or lipids (Lipidbook: Domański et al. (2010), NMRLipids Databank: Kiirikki et al. (2023)).

Despite previous attempts (Tai et al., 2004; Meyer et al., 2010), there is, as of now, no central data repository that could host all kinds of MD simulation files. This is not only due to the huge volume of data and its heterogeneity, but also because the limited interoperability of the many file formats used adds to the complexity. Thus, faced with the deluge of biosimulation data (Hospital et al., 2020), researchers often share their simulation files in multiple generalist data repositories. This makes it difficult to search and find available data on, for example, a specific protein or a given set of parameters. We qualify this scattered data as the dark matter of MD, and we believe it is essential to shed light onto this overlooked but high-potential volume of data. When unlocked, publicly available MD files will gain more visibility. This will help people to access and reuse these data more easily and, overall, by making MD simulation data more FAIR (Wilkinson et al., 2016), it will also improve the reproducibility of MD simulations (Elofsson et al., 2019; Porubsky et al., 2020; The PLUMED consortium, 2019).

In this work, we have employed a search strategy to index scattered MD simulation files deposited in generalist data repositories. With a focus on the files generated by the Gromacs MD software, we performed a proof-of-concept large-scale analysis of publicly available MD data. We revealed the high value of these data and highlighted the different categories of the simulated molecules, as well as the biophysical conditions applied to these systems. Based on these results and our annotations, we proposed a search engine prototype to easily explore this dark matter of MD. Finally, building on this experience, we provide simple guidelines for data sharing to gradually improve the FAIRness of MD data.

Results

With the rise of open science, researchers increasingly share their data and deposit them into generalist data repositories, such as Zenodo (https://zenodo.org), Figshare (https://figshare.com), Open Science Framework (OSF, https://osf.io), and Dryad (https://datadryad.org/). In this first attempt to find out how many files related to MD are deposited in data repositories, we focused our exploration on three major data repositories: Figshare (∼3.3 million files, ∼112 TB of data, as of January 2023), OSF (∼2 million files, as of November 2022), and Zenodo (∼9.9 million files, ∼1.3 PB of data, as of December 2022; Panero and Benito (2022)).

One immediate strategy to index MD simulation files available in data repositories is to perform a text-based, Google-like search, querying these repositories with keywords such as ‘molecular dynamics’ or ‘Gromacs’. Unfortunately, we experienced many false positives with this search strategy. This could be explained by the strong discrepancy we observed in the quantity and quality of the metadata (title, description) accompanying datasets and queried in text-based searches. For instance, description texts ranged from a couple of words to more than 1,200 words. Metadata is provided by the user depositing the data, with no incentive to provide relevant details supporting the understanding of the simulation. For the three data repositories studied, no human curation other than that of the providers is performed when data is submitted. It is also worth mentioning that titles and descriptions are provided as free text and do not abide by any controlled vocabulary, such as a specific MD ontology.

To circumvent this issue, we developed an original and specific search strategy that we called Explore and Expand (Ex2) (see Fig. 1-A and Materials and Methods section), which relies on a combination of file type and keyword queries. In the Explore phase, we searched for files based on their file types (for instance: .xtc, .gro, etc.) combined with MD-related keywords (for instance: ‘molecular dynamics’, ‘Gromacs’, ‘Martini’, etc.). Each of these hit files belonged to a dataset, which we further screened in the Expand phase. There, we indexed all files found in a dataset identified in the previous Explore phase, this time without restriction on file type (see Fig. 1-A and details on the data scraping procedure in the Materials and Methods section).
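To illustrate, below is a minimal sketch of how the Explore and Expand phases can be implemented against the Zenodo REST API. The query syntax follows Zenodo's public search API; the keyword and file type shown are illustrative (the exact lists we used are in the query.yml file referenced in the Materials and Methods), and response field names may differ between API versions.

```python
import requests

ZENODO_API = "https://zenodo.org/api/records"

def explore(keyword: str, file_type: str, size: int = 100) -> list:
    """Explore phase: find datasets combining an MD keyword and a file type."""
    # Zenodo supports Elasticsearch-style queries, e.g. filetype:"xtc".
    query = f'"{keyword}" AND filetype:"{file_type}"'
    response = requests.get(ZENODO_API, params={"q": query, "size": size})
    response.raise_for_status()
    return response.json()["hits"]["hits"]

def expand(record: dict) -> list:
    """Expand phase: index *all* files of a hit dataset, whatever their type."""
    return [
        {"dataset_id": record["id"], "file_name": f["key"], "size_bytes": f["size"]}
        for f in record.get("files", [])
    ]

files = []
for record in explore("molecular dynamics", "xtc"):
    files.extend(expand(record))
print(f"Indexed {len(files)} files")
```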

(A) Explore and Expand (Ex2) strategy used to index and collect MD-related files. In the Explore phase, we search the respective data repositories for datasets that contain specific keywords (e.g. “molecular dynamics”, “md simulation”, “namd”, “martini”…) in conjunction with specific file extensions (e.g. “mdp”, “psf”, “parm7”…), selected depending on their uniqueness and the level of trust that they do not report false positives (i.e. files not related to MD). In the Expand phase, the content of the identified datasets is fully cataloged, including files that individually could result in false positives (such as e.g. “.log” files). (B) Number of deposited files in generalist data repositories, identified by our Ex2 strategy.

Globally, we indexed about 250,000 files and 2,000 datasets, representing 14 TB of data deposited between August 2012 and March 2023 (see Table 1). One major difficulty was the numerous files stored in zip archives, about seven times more than the files directly available in datasets (see Table 1). While this choice is very convenient for depositing files (as one just needs to upload one big zip file to the data repository server), it hinders the analysis of MD files, as data repositories only provide a limited preview of the content of zip archives, and it completely inhibits, for example, data streaming for remote analysis and visualization. Files within zip archives are not indexed and cannot be searched individually. The use of zip archives also hampers the reusability of MD data, since a specific file cannot be downloaded individually: one has to download the entire zip archive (sometimes with a size of up to several gigabytes) to extract the one file of interest.

Statistics of the MD-related datasets and files found in the data repositories Figshare, OSF, and Zenodo.

The first MD-related dataset we found was deposited in August 2012 in Figshare and corresponds to the work of Fuller et al. (Fuller et al., 2012) (see Table 1), but we may consider the start of more substantial deposition of MD data to be 2016, with more than 20,000 files deposited, mainly in Figshare (see Fig. 1-B). While the number of files deposited in Zenodo was at first relatively limited, the last few years (2020-2022) saw a steep increase, passing from a few thousand files in 2018 to almost 50,000 files in 2022 (see Fig. 1-B). In 2018, the number of MD files deposited in OSF was similar to those in the two other data repositories, but deposition there did not take off as much. Zenodo seems to be favored by the MD community since 2019, even though Figshare also saw a sharp increase in deposited MD files in 2022. The preference for Zenodo could also be explained by the fact that it is a publicly funded repository developed under the European OpenAIRE program and operated by CERN (European Organization For Nuclear Research and OpenAIRE, 2013). Overall, the trend showed a rise of deposited data with a steep increase in 2022 (Fig. 1-B). We believe that this trend will continue in the coming years, leading to a greater amount of MD data available. It is thus urgent to deploy a strategy to index this vast amount of data and to allow the MD community to easily explore and reuse such a gigantic resource. The following describes what is already feasible in terms of meta-analysis, in particular what types of data are deposited in data repositories and the simulation setup parameters used by MD experts who have deposited their data.

With our Ex2 strategy (see Fig. 1-A), we assigned the deposited files to the MD packages AMBER (Ferrer et al., 2012), DESMOND (Bowers et al., 2006), Gromacs (Berendsen et al., 1995; Abraham et al., 2015), and NAMD/CHARMM (Phillips et al., 2020; Brooks et al., 2009), based on their corresponding file types (see Materials and Methods section). In the case of NAMD/CHARMM, file extensions were mostly identical, which prevented us from distinguishing between files from these two MD programs. With 87,204 files deposited, the Gromacs program was the most represented (see Fig. 2-A), followed by NAMD/CHARMM, AMBER, and DESMOND. This statistic is limited as it does not consider more specific databases related to a particular MD program. For example, the DE Shaw Research website contains a large amount of simulation data related to SARS-CoV-2 that has been generated using the ANTON supercomputer (https://www.deshawresearch.com/downloads/download_trajectory_sarscov2.cgi/), as well as other extensively simulated systems of interest to the community. However, this in itself might also serve as a good example, since few automated search strategies will be able to find custom stand-alone web servers as valuable repositories. Here, our goal was not to compare the availability of all data related to each MD program but to give a snapshot of the type of data available at a given time (i.e. March 2023) in generalist data repositories. Interestingly, many files (> 133,000) were not directly associated with any MD program (see Fig. 2-A, label ‘Unknown’). We categorized these files based on their extensions (see Fig. 2-B). While 10 % of these files had no file extension (Fig. 2-B, column none), we found numerous files corresponding to structure coordinates, such as .pdb (∼12,000) and .xyz (∼6,800) files. We also found images (.tiff files) and graphics (.xvg files). Finally, we found many text files, such as .txt, .dat, and .out, which can potentially hold details about how simulations were performed. Focusing further on files related to the Gromacs program, currently the most represented in the studied data repositories, we demonstrate in the following the possibilities to retrieve a wealth of information related to deposited MD simulations.

Categorization of indexed files based on their file types and assigned MD engine. (A) Distribution of files among MD simulation engines. (B) Expansion of the MD engine category “Unknown” from (A) into the 10 most observed file types.

First, we were interested in which file types researchers deposit, and thereby consider potentially of great value to share. We therefore quantified the types of files generated by Gromacs (Fig. 3-A). The most represented file type is the .xtc file (28,559 files, representing 8.6 TB). This compressed (binary) file is used to store the trajectory of an MD simulation and is an important source of information to characterize the evolution of the simulated molecular system as a function of time. It is thus logical to mainly find this type of file shared in data repositories, as it is of great value for reuse and new analyses. Nevertheless, it is not directly readable and needs to be read by a third-party program, such as Gromacs itself, a molecular viewer like VMD (Humphrey et al., 1996), or an analysis library such as MDAnalysis (Michaud-Agrawal et al., 2011; Gowers et al., 2016). In addition, this trajectory file can only be of use in combination with a matching coordinate file, in order to correctly access the dynamics information it stores. Thus, as it is, this file is not easily mineable to extract useful information, especially if multiple .xtc and coordinate files are available in one dataset. Interestingly, we found 1,406 .trr files, which contain trajectory data but also additional information such as velocities and energies of the system. While this file type is especially useful in terms of reusability, its large size (up to several hundred GB) limits its deposition in most data repositories. For instance, a file cannot usually exceed 50 GB in Zenodo, 20 GB in Figshare (for free accounts) and 5 GB in OSF. Altogether, Gromacs trajectory files represented about 30,000 files in the three explored generalist repositories (34% of Gromacs files). This is a large number in comparison to the trajectories stored in known databases dedicated to MD, with 1,700 MD trajectories available in MoDEL, 1,737 trajectories (as of November 2022) available in GPCRmd, 5,971 trajectories (as of January 2022) available in MemProtMD, and 726 trajectories (as of March 2023) available in the NMRLipids Databank. Although fewer in count, these numbers correspond to manually or semi-automatically curated trajectories of specific systems, mostly proteins and lipids. Thus, the ∼30,000 MD trajectories available in generalist data repositories may represent a wider spectrum of simulated systems but need to be further analyzed and filtered to separate usable data from less interesting trajectories such as minimization or equilibration runs.

Content analysis of .xtc and .gro files. (A) Number of Gromacs-related files available in the searched data repositories. In red, files used for further analyses. (B) Simple analysis of a subset of .xtc files, with the cumulative distribution of the number of frames (in green) and the system size (in orange). (C) Cumulative distribution of the system sizes extracted from .gro files. (D) Upset plot of systems grouped by molecular composition, inferred from the analysis of .gro files. For this figure, 3D structures of representative systems are displayed, including soluble proteins such as TonB and T4 Lysozyme, membrane proteins such as Kir channels and the Gasdermin prepore, protein/RNA and G-quadruplex systems, and other non-protein molecules.

Given the large volume of data represented by .xtc files (see above), we could only scratch the surface of the information stored in these trajectory files by analyzing a subset of 779 .xtc files - one per dataset in which this type of file was found. We were able to get the size of the molecular systems and the number of frames available in these files (Fig. 3-B). Of note, the number of frames does not directly translate to the simulation runtime - more information deposited in other files (e.g. .mdp files) is needed to determine the complete runtime of the simulation. The system size went up to more than one million atoms for a simulation of the TonB protein (Virtanen et al., 2020). The cumulative distribution of the number of frames showed that half of the files contain more than 10,000 frames. This conformational sampling can be very useful for research fields beyond the MD community that study, for instance, protein flexibility or protein engineering, where diverse backbones can be of value. We found one .xtc file containing more than 5 million frames, where the authors probed the picosecond–nanosecond dynamics of T4 lysozyme and guided the MD simulation with NMR relaxation data (Kümmerer et al., 2021). Extending this analysis to all 28,559 detected .xtc files would be of great interest for a more holistic view, but this would require an initial step of careful checking and cleaning to be sure that these files are analyzable.
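As a sketch of how such trajectory statistics can be extracted programmatically, the snippet below reads an .xtc file with MDAnalysis; note that a matching coordinate file from the same dataset is required (both file names here are placeholders):

```python
import MDAnalysis as mda

# An .xtc file only becomes readable together with a matching coordinate
# file; both file names below are placeholders.
u = mda.Universe("system.gro", "trajectory.xtc")

n_atoms = u.atoms.n_atoms      # size of the molecular system
n_frames = len(u.trajectory)   # number of frames stored in the trajectory
print(f"{n_atoms} atoms, {n_frames} frames")
```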

These results offer a first explanation of why there is no single special-purpose repository for MD trajectory files. Databases dedicated to molecular structures, such as the Protein Data Bank (Berman et al., 2000; Kinjo et al., 2017; Armstrong et al., 2019), or even the recent PDB-Dev (Burley et al., 2017), designed for integrative models, cannot accept such large files, let alone complete trajectories with unreduced numbers of frames. Accepting them would also require implementing extra steps of data curation and quality control. In addition, the size of the IT infrastructure and the human skills required for data curation represent a significant cost that could probably not be supported by a single institution.

Subsequently, our interest shifted towards exploring which systems are being investigated by MD researchers who deposit their files. We found 9,718 .gro files, which are text files that contain the number of particles and the Cartesian coordinates of the modelled system. By parsing the number of particles and the residue types, we were able to give an overview of all deposited Gromacs systems (Fig. 3-C,D). In terms of system size, they ranged from very small, such as two coarse-grain (CG) particles of graphite (Piskorz et al., 2019), a water molecule (3 atoms) (Ivanov et al., 2017), a CG model of benzene (3 particles) (Dandekar and Mondal, 2020) and an atomistic model of ammonia (4 atoms) (Kelly and Smith, 2020), up to atomistic and coarse-grain systems composed of more than 3 million particles (Duncan et al., 2020; Schaefer and Hummer, 2022) (Fig. 3-C). Interestingly, the system sizes in .gro files exceeded those of the analyzed .xtc files (Fig. 3-B). Even if we cannot exclude that the limited number of .xtc files analyzed (779 selected from the 28,559 indexed) could explain this discrepancy, an alternative hypothesis is that, since the size of an .xtc file also depends on the number of frames stored, researchers reduce the size of the .xtc files they deposit not only by removing frames but also by removing parts of the system, such as water molecules. As a consequence for reusability, this solvent removal could limit the number of suitable datasets available for researchers interested in re-analysing the simulation with respect to, in this case, water diffusion. While the sizes of the systems extracted from .gro files were homogeneously spread, we observed a clear bump around system sizes of circa 8,500 atoms/particles. This enrichment could be explained by the deposition of ∼340 .gro files related to the simulation of peptide translocation through a membrane (Fig. 3-C) (Kabelka et al., 2021). Beyond 1 million particles/atoms, the number of systems is, for the moment, very limited.

We then analyzed the residues in .gro files and inferred different types of molecular systems (see Fig. 3-D). Two of the most represented system types contained lipid molecules. This may be related to the NMRLipids initiative (http://nmrlipids.blogspot.com). For several years, this consortium has been actively working on lipid modelling with a strong data-sharing policy and has contributed numerous datasets of membrane systems. As illustrated in Fig. 3-D, a variety of membrane systems, especially membrane proteins, were deposited. This highlights the vitality of this research field and the willingness of this community to share its data. We also found numerous systems containing solvated proteins. This type of data, combined with .xtc trajectory files (see above), could be invaluable to describe protein dynamics and potentially train new artificial intelligence models that go beyond the current static representation of protein structure (Lane, 2023). There was also a good proportion of systems containing nucleic acids, alone or in interaction with proteins (1,237 systems). At this time, we found only a few systems containing carbohydrates; these also contained proteins and corresponded to a single study modelling hyaluronan–CD44 interactions (Vuorio et al., 2017). One reason for this limited number may be that systems containing sugars are often modelled using the AMBER force field (Ferrer et al., 2012) in combination with GLYCAM (Kirschner et al., 2008). A future study of the ∼10,200 deposited AMBER files could retrieve more data related to carbohydrate-containing systems. Given the current developments to model glycans (Fadda, 2022), we expect to see more deposited systems with carbohydrates in the coming years.

Finally, we found 1,029 .gro files that did not belong to the categories previously described. These files were mostly related to models of small molecules, or molecules used in organic chemistry (Young et al., 2020) and material science (Zheng et al., 2022; Piskorz et al., 2019) (see central panel, Fig. 3-D). Several datasets contained lists of small molecules used for calculating free energies of binding (Aldeghi et al., 2015), solubilities (Liu et al., 2016), or osmotic coefficients (Zhu, 2019). We also identified models of nanoparticles (Kyrychenko et al., 2012; Pohjolainen et al., 2016), polymers (Sarkar et al., 2020; Karunasena et al., 2021; Gertsen et al., 2020), and drug molecules like EPI-7170, which binds disordered regions of proteins (Zhu et al., 2022). Finally, an interesting case from material science was the modelling of the PTEG-1 molecule, a fulleropyrrolidine functionalized with a polar triethylene glycol (TEG) group (see central panel, Fig. 3-D). This molecule was synthesized to improve semiconductors (Jahani et al., 2014). We found several models related to this particular molecule and its derivatives, both atomistic (Qiu et al., 2017; Sami et al., 2022) and coarse-grained (Alessandri et al., 2020). With a good indexing of the data and appropriate metadata to identify the modelled molecules, a simple search, which was missing prior to this study, could easily retrieve different models of the same molecule, to compare them or to run multi-scale dynamics simulations.

Another important category of deposited files comprises those containing information about the topology of the simulated molecules, with file extensions such as .itp and .top. These files are often the result of long parametrization processes (Wang et al., 2004; Vanommeslaeghe and MacKerell, 2012; Souza et al., 2021) and therefore of significant value for reusability. Based on our analysis, we indexed almost 20,000 topology files, which could spare the MD community countless efforts if these files could be easily found, annotated and reused. Interestingly, the number of .itp files was high (13,058 files) with a total size of 2 GB, while there were fewer .top files (7,009 files) with a total size of 17 GB. Thus, .itp files seem to contain much less information than .top files. Among the remaining file types, .tpr files contain all the information needed to potentially directly run a simulation. Here, we found 4,987 .tpr files, meaning that it would be virtually possible to re-run almost 5,000 simulations without the burden of setting up the systems to simulate. Finally, the 3,730 .log files are also a source of useful information, as it is relatively easy to parse these text files to extract details on how MD simulations were run, such as the version of Gromacs, the command line used to run the simulation, etc.
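As an illustration of how easily such details can be mined, the sketch below extracts the Gromacs version and the command line from a .log file. It assumes the ‘GROMACS version:’ and ‘Command line:’ header lines written by recent Gromacs versions, whose exact format may vary across releases.

```python
import re
from pathlib import Path

def parse_gromacs_log(path: str) -> dict:
    """Extract the Gromacs version and command line from an MD .log file."""
    info = {}
    lines = Path(path).read_text(errors="ignore").splitlines()
    for i, line in enumerate(lines):
        match = re.search(r"GROMACS version:\s*(\S+)", line)
        if match:
            info["version"] = match.group(1)
        # The actual command is printed on the line after "Command line:".
        if line.strip() == "Command line:" and i + 1 < len(lines):
            info["command_line"] = lines[i + 1].strip()
    return info

print(parse_gromacs_log("md.log"))  # placeholder file name
```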

Our next step was to gain insight into the parameter settings employed by the MD community, which may help us identify preferences in MD setups and a potential need for further education to avoid suboptimal or outdated configurations. We therefore analyzed 10,055 .mdp files stored in the different data repositories. These text files contain the input parameters used to run the simulations, such as the integrator, the number of steps, the algorithms used for the barostat and thermostat, etc. (for more details see: https://manual.gromacs.org/documentation/current/user-guide/mdp-options.html).

We determined the expected simulation time, corresponding to the product of two parameters found in .mdp files: the number of steps and the time step. Here, we acknowledge that one can set up a very long simulation time and stop the simulation before the end or, on the contrary, use a limited time (especially when calculations are performed on HPC resources with wall-time limits) and then extend the simulation for a longer duration. Using only the .mdp file, we cannot know whether the simulation ran to completion; a comparison with an .xtc file from the same dataset may help answer this specific question. However, in this study, we were interested in MD setup practices, in particular what simulation time researchers set their systems up with - likely in the mindset of reaching that end time. We restricted this analysis to the 4,623 .mdp files that used the md or sd integrator and had a simulation time above 1 ns. We found that the majority of the .mdp files were used for simulations of 50 ns or less (see Fig. 4-A). Further, we identified 697 .mdp files with simulation times set up between 50 ns and 1 μs and 585 .mdp files with simulation times above 1 μs. As the analysis of .gro files had shown a good proportion of coarse-grained models (Fig. 3-B,C), we discriminated the simulation setups for these two types of models using the time step as a simple cutoff: we considered that a time step greater than 10 fs (i.e. dt=0.01) corresponded to MD setups for coarse-grained models (Ingólfsson et al., 2014). Globally, we found that setups for atomistic simulations were largely dominant. However, for simulations with a simulation time above 1 μs specifically, coarse-grain simulations represented 86 % of all setups.
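A minimal sketch of this analysis is given below; it assumes the standard ‘key = value’ .mdp syntax and applies the 10 fs time-step cutoff described above (the file name and default values are illustrative):

```python
def parse_mdp(path: str) -> dict:
    """Parse a Gromacs .mdp file into a key/value dictionary."""
    params = {}
    with open(path, errors="ignore") as mdp_file:
        for line in mdp_file:
            line = line.split(";")[0].strip()  # drop comments
            if "=" in line:
                key, value = line.split("=", 1)
                params[key.strip().lower()] = value.strip()
    return params

params = parse_mdp("production.mdp")     # placeholder file name
dt = float(params.get("dt", 0.001))      # time step in ps
nsteps = int(params.get("nsteps", 0))    # number of integration steps
simulation_time_ns = dt * nsteps / 1000  # expected simulation time in ns

# Heuristic used here: dt > 10 fs (0.01 ps) indicates a coarse-grain setup.
model = "coarse-grain" if dt > 0.01 else "all-atom"
print(f"{simulation_time_ns:.1f} ns, {model}, integrator={params.get('integrator')}")
```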

Content analysis of .mdp files. (A) Cumulative distribution of .mdp files versus the simulation time for all-atom and coarse-grain simulations. (B) Sankey graph of the repartition between different values of thermostat and barostat. (C) Temperature distribution, full scale in inner panel.

We then looked into the combinations of thermostat and barostat (see Fig. 4-B) in 9,199 .mdp files. By far the most used thermostat is V-rescale (Bussi et al., 2007), often associated with the Parrinello-Rahman barostat (Parrinello and Rahman, 1981). This thermostat was also used with the Berendsen barostat (Berendsen et al., 1984). In a few cases, we observed the use of the V-rescale thermostat with the very recently developed C-rescale barostat (Bernetti and Bussi, 2020). A total of 2,021 .mdp files specified neither a thermostat nor a barostat, which means they would not be used for production runs. These could correspond to setups used for energy minimization, for adding ions to the system (with the genion command), or for molecular mechanics with Poisson–Boltzmann and surface area solvation (MM/PBSA) and molecular mechanics with generalised Born and surface area solvation (MM/GBSA) calculations (Genheden and Ryde, 2015).

Finally, we analyzed the range of starting temperatures used to perform simulations (see Fig. 4-C). We found a clear peak at temperatures of 298 K - 310 K, which corresponds to the range between ambient room temperature (298 K - 25 °C) and physiological temperature (310 K - 37 °C). Nevertheless, we also observed lower temperatures, which often relate to studies of specific organic systems or simulations of Lennard-Jones models (Jeon et al., 2016). Interestingly, we noticed the appearance of several peaks at 400 K, 600 K, and 800 K, which were not present before the end of 2022. These peaks all corresponded to a single study on the stability of hydrated crystals (Dybeck et al., 2023). Overall, this analysis revealed that a wide range of temperatures has been explored, starting mostly from 100 K and going up to 800 K.

To encourage further analysis of the collected files, we shared our data collection with the community on Zenodo (see Data and code availability section). The data scraping procedure and data analysis are available on GitHub with detailed documentation. To let researchers have a quick glance at this data collection and explore it, we created a prototype web application called MDverse data explorer, available at https://mdverse.streamlit.app/ and illustrated in Fig. 5-A. With this web application, it is easy to use keywords and filters to access interesting datasets for all MD engines, as well as .gro and .mdp files. Furthermore, when available, a description of the found data is provided and searchable by keywords (Fig. 5-A, on the left sidebar). The sets of data found can then be exported as a tab-separated values (.tsv) file for further analysis (Fig. 5-B).

Snapshots of the MDverse data explorer, a prototype search engine to explore the collected files and datasets. (A) General view of the web application. (B) Focus on the .mdp and .gro file data sets exported as .tsv files. The web application also includes links to the original repository of each entry.

Towards a better sharing of MD data

With this work, we have shown that it is possible not only to retrieve MD data from the generalist data repositories Zenodo, Figshare and OSF, but also to shed light on the dark matter of MD data in terms of learning about current scientific practice, extracting valuable topology information, and analysing how the field is developing. Our objective was not to assess the quality of the data but only to show what kind of data is available. The Ex2 strategy to find files related to MD simulations relied on the fact that many MD software packages output files with specific file extensions. This strategy could not be applied in research fields where data exhibit non-specific file types. We experienced this limitation while indexing zip archives related to MD simulations, where we were able to decide whether a zip archive was pertinent to this work only by accessing the list of files contained in the archive. This valuable feature is provided by data repositories like Zenodo and Figshare, with some caveats, though.

As of March 2023, we managed to index 245,756 files from 1,979 datasets, representing altogether 14 TB of data. This is only a fraction of all files stored in data repositories. For instance, as of December 2022, Zenodo hosted about 9.9 million files for ∼1.3 PB of data (Panero and Benito, 2022). All these files are stored on servers available 24/7. This high availability costs human resources, IT infrastructure and energy. Even if MD data represents only 1 % of the total volume of data stored in Zenodo, we believe it is our responsibility, as a community, to develop better sharing and reuse of MD simulation files - and this need be neither particularly cumbersome nor expensive. To this end, we propose two solutions: first, improve practices for sharing and depositing MD data in data repositories; second, improve the FAIRness of already available MD data, notably by improving the quality of the current metadata.

Guidelines for better sharing of MD simulation data

Without a community-approved methodology for depositing MD simulation files in data repositories, and based on the current experience we described here, we propose a few simple guidelines when sharing MD data to make them more FAIR (Findable, Accessible, Interoperable and Reusable):

  • Avoid zip or tar archives whose content cannot be properly indexed by data repositories. As much as possible, deposit original data files directly.

  • Describe the MD dataset with extensive metadata. Provide adequate information along with your dataset, such as:

    • - The scope of the study, e.g. investigating conformational dynamics, benchmarking a force field, …

    • - The method on a basic (e.g. quantum mechanics, all-atom, coarse-grain) or advanced (accelerated, metadynamics, well-tempered) level.

    • - The MD software: name, version (tag) and whether modifications have been made.

    • - The simulation settings (for each of the steps, including minimization, equilibration and production): temperature(s), thermostat, barostat, time step, total runtime (simulation length), force field, additional force field parameters.

    • - The composition of the system, with the precise names of the molecules and their numbers, if possible also PDB, UniProt or Ensembl identifiers, and whether the default structure has been modified.

    • - Any post-processing of the uploaded files (e.g. truncation or stripping of the trajectory), including before and after values of what has been modified, e.g. the number of frames or atoms of the uploaded files.

    • - Especially valuable data, highlighted as such, e.g. molecules parameterized with expensive QM calculations, and their parameter files.

    Store this metadata in the description of the dataset. An adaptation of the Minimum Information About a Simulation Experiment (MIASE) guidelines (Waltemath et al., 2011) in the context of MD simulations would be useful to define required metadata.

  • Link the MD dataset to other associated resources, such as:

    • - The research article (if any) for which these data have been produced. Datasets are usually mentioned in research articles, but rarely the other way around, since deposition has to be done prior to publication. However, it is eminently possible to submit a revised version, and providing a link to the related research paper in the updated metadata of the MD dataset will ease referencing of the original publication upon data reuse.

    • - The code used to analyze the data, ideally deposited in the repository to guarantee availability, or in a GitHub or GitLab repository.

    • - Any other datasets that belong to the same study.

  • Provide sufficient files to reproduce simulations and use a clear naming convention to make explicit links between related files. For instance, for the Gromacs MD engine, trajectory .xtc files could share the same names as structure .gro files (e.g. proteinA.gro & proteinA.xtc).

  • Revisit your data deposition after paper acceptance and update information if necessary. Zenodo and Figshare provide a DOI for every new version of a dataset as well as a ‘master’ DOI that always refers to the latest version available.

These guidelines are complementary to the reliability and reproducibility checklist for molecular dynamics simulations (Commun Biol, 2023). Eventually, they could be implemented in machine-actionable Data Management Plans (maDMPs) (Miksa et al., 2019). So far, MD metadata is formalized as free text. We advocate for the creation of a standardized and controlled vocabulary to describe the artifacts and properties of MD simulations. Normalized metadata will, in turn, enable scientific knowledge graphs (Auer, 2018; Färber and Lamprecht, 2021) that could link MD data, research articles and MD software in a rich network of research outputs.

Improving metadata of current MD data

While indexing about 2,000 MD datasets, we found that the titles and descriptions accompanying these datasets were very heterogeneous in terms of quality and quantity and were difficult for machines to process automatically. It was sometimes impossible to find even basic information such as the identity of the simulated molecular system, the temperature or the length of the simulation. Without appropriate metadata, sharing data is pointless, and its reuse is doomed to fail (Musen, 2022). It is thus important to close the gap between the availability of MD data and its discoverability and description through appropriate metadata. We could gradually improve the metadata by following two strategies. First, since MD engines produce normalized and well-documented files, we could extract simulation parameters by parsing specific files. We already explored this path with Gromacs, by extracting the molecular size and composition from .gro files and the simulation time (with some limitations), thermostat and barostat from .mdp files. We could go even further, for instance by extracting the Gromacs version from .log files (if provided) or by identifying the simulated system from its atomic topology stored in .gro files. This strategy can in principle be applied to files produced by other MD engines. A second approach that we are currently exploring uses data mining and named entity recognition (NER) methods (Perera et al., 2020) to automatically identify the molecular system, the temperature, and the simulation length from existing textual metadata (dataset title and description), provided they are of sufficient length. Finally, the possibilities afforded by large language models supplemented by domain-specific tools (Bran et al., 2023) might help interpret the heterogeneous metadata that is often associated with the simulations.

Future works

In the future, it is desirable to go further in terms of analysis and to integrate other data repositories, such as Dryad and Dataverse instances (for example, Recherche Data Gouv in France). The collaborative platform for source code GitHub could also be of interest. Although dedicated to source code and not designed to host large binary files, GitHub handles small to medium-size text files, like tabular .csv and .tsv data files, and has been extensively used to record cases of the Ebola epidemic in 2014 (Perkel, 2016) and the Covid-19 pandemic (https://github.com/CSSEGISandData/COVID-19). Thus, GitHub could probably host small text-based MD simulation files. For Gromacs, we have already found 70,000 parameter .mdp files and 55,000 structure .gro files there. Scripts found alongside these files could also provide valuable insights into how a given MD analysis was performed. Finally, GitHub repositories might also be an entry point to find other datasets by linking to simulation data, such as institutional repositories (see for instance Pesce and Lindorff-Larsen (2023)). However, one potential point of concern is that platforms like GitHub or GitLab do not make any promises about the long-term availability of repositories, in particular ones not under active development. Archiving of these repositories could be achieved in Zenodo (for data-centric repositories) or Software Heritage (Di Cosmo and Zacchiroli, 2017) (for source-code-centric repositories).

An obvious next step is the enrichment of metadata, with the hope of rendering open MD data more findable, accessible and ultimately reusable. Possible strategies have already been detailed earlier in this paper. We could also go further by connecting MD data to the research ecosystem. For this, two obvious resources need to be linked to MD datasets: their associated research papers, to mine more information and to establish a connection with the scientific context, and their simulated biomolecular systems, which ultimately could cross-reference MD datasets to reference databases such as UniProt (Consortium, 2022), the PDB (Berman et al., 2000) or Lipid Maps (Sud et al., 2007). For already deposited datasets, the enrichment of metadata can only be achieved via systematic computational approaches, while for future depositions, a clear and uniformly used ontology and a dedicated metadata reference file (as used by PLUMED-NEST: Bonomi et al. (2019)) would facilitate this task.

Eventually, front-end solutions such as the MDverse data explorer tool can evolve to be more user-friendly by interfacing the structures and dynamics with interactive 3D molecular viewers (Tiemann et al., 2017; Kampfrath et al., 2022; Martinez and Baaden, 2021).

Conclusion

In this work, we showed that sharing data generated from MD simulations is now a common practice. From Zenodo, Figshare and OSF alone, we indexed about 250,000 files from 2,000 datasets, and we showed that this trend is increasing. These data bring incentives and opportunities at different levels. First, for researchers who cannot access high-performance computing (HPC) facilities, or who do not want to rerun a costly simulation, to save time and energy, simulations of many systems are already available. These simulations could be useful to reanalyze existing trajectories, to extend simulations from already equilibrated systems, or to compare simulations of a given molecular system modelled with different settings. Second, building annotated and highly curated datasets for artificial intelligence will be invaluable to develop dynamic generative deep-learning models. Third, improving the metadata accompanying available data will foster their reuse and will, in turn, increase the reproducibility of MD simulations. Finally, we see here an opportunity to push for good practices in the setup and production of MD simulations.

Materials and Methods

Initial data collection

We searched for MD-related files in the data repositories Zenodo, Figshare and Open Science Framework (OSF). Queries were designed with a combination of file types and, optionally, keywords, depending on whether a given file type was exclusively associated with MD simulations. We therefore built a list of manually curated and cross-checked file types and keywords (https://github.com/MDverse/mdws/blob/main/params/query.yml). All queries were automated by Python scripts that utilized the Application Programming Interfaces (APIs) provided by the data repositories. Since the APIs offered by the data repositories differed, all implementations were performed in dedicated Python (van Rossum, 1995) (version 3.9.16) scripts with the NumPy (Oliphant, 2007) (version 1.24.2), Pandas (Wes McKinney, 2010) (version 1.5.3) and Requests (version 2.28.2) libraries.

We made the assumption that files deposited by researchers in data repositories are coherent and all related to the same research project. Therefore, when an MD-related file was found in a dataset, all files belonging to this dataset were indexed, regardless of whether their file types were themselves identified as MD simulation files. This is the core of the Explore and Expand (Ex2) strategy we applied in this work, illustrated in Fig. 1.

When a zip file was found in a dataset, its content was extracted from a preview provided by Zenodo and Figshare. This preview is not provided through the APIs, but as HTML code, which we parsed using the Beautiful Soup library (version 4.11.2). Note that the zip file preview in Zenodo is limited to the first 1,000 files. To avoid false-positive files collected from zip archives, a final cleaning step was performed to remove all datasets that did not share at least one file type with the file type list mentioned above. In the case of OSF, no preview is available for zip files, so their content was not retrieved.
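A simplified sketch of this extraction step is shown below; the preview URL and the HTML markup parsed here are deliberately generic assumptions, as the actual markup differs between Zenodo and Figshare and may change over time.

```python
import requests
from bs4 import BeautifulSoup

def list_zip_content(preview_url: str) -> list:
    """Extract file names from the HTML preview of a zip archive.

    The markup parsed here (one file name per table row) is an assumption;
    the real scripts handle repository-specific HTML.
    """
    response = requests.get(preview_url)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    return [
        row.find("td").get_text(strip=True)
        for row in soup.find_all("tr")
        if row.find("td")
    ]
```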

Gromacs files

After the initial data collection, Gromacs .mdp and .gro files were downloaded with the Pooch library (version 1.6.0). When a .mdp or .gro file was found to be in a zip archive, the archive was downloaded and the targeted .mdp or .gro file was selectively extracted from it. The same procedure was applied for a subset of .xtc files consisting of about one .xtc file per Gromacs dataset.

Once downloaded, .mdp files were parsed to extract the following parameters: integrator, time step, number of steps, temperature, thermostat, and barostat. Values for the thermostat and barostat were normalized according to the values listed in the Gromacs documentation. For the simulation time analysis, we selected .mdp files with the md or sd integrator and a simulation time above 1 ns, to exclude most minimization and equilibration runs. For the thermostat and barostat analysis, only files with non-missing values and with values listed in the Gromacs documentation were considered.

The .gro files were parsed with the MDAnalysis library (Michaud-Agrawal et al., 2011) to extract the number of particles of each system. Values found in the residue name column were also extracted and compared to a list of residues we manually associated with the following categories: protein, lipid, nucleic acid, glucid, and water or ions (https://github.com/MDverse/mdws/blob/main/params/residue_names.yml).
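A minimal sketch of this categorization step is shown below; the residue-to-category mapping is a tiny illustrative subset of the full residue_names.yml file, and the file name is a placeholder.

```python
import MDAnalysis as mda

# Illustrative subset of the full mapping stored in residue_names.yml.
RESIDUE_CATEGORIES = {
    "ALA": "protein", "GLY": "protein", "LYS": "protein",
    "POPC": "lipid", "DPPC": "lipid",
    "DA": "nucleic acid", "DG": "nucleic acid",
    "SOL": "water or ions", "NA": "water or ions", "CL": "water or ions",
}

u = mda.Universe("system.gro")  # placeholder file name
n_particles = u.atoms.n_atoms
categories = {
    RESIDUE_CATEGORIES.get(resname, "other")
    for resname in set(u.atoms.residues.resnames)
}
print(f"{n_particles} particles, categories: {sorted(categories)}")
```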

The .xtc files were analyzed using the gmx check command (https://manual.gromacs.org/current/onlinehelp/gmx-check.html) to extract the number of particles and the number of frames.
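In practice, this amounts to running gmx check as a subprocess and parsing its report, along the lines of the sketch below (the output format of gmx check may vary between Gromacs versions, so the regular expressions are assumptions):

```python
import re
import subprocess

def check_xtc(path: str) -> dict:
    """Run `gmx check` on an .xtc file and parse atom and frame counts."""
    result = subprocess.run(
        ["gmx", "check", "-f", path],
        capture_output=True,
        text=True,
    )
    output = result.stderr  # gmx check writes its report to stderr
    atoms = re.search(r"#\s*Atoms\s+(\d+)", output)
    frames = re.search(r"Last frame\s+(\d+)", output)
    return {
        "n_atoms": int(atoms.group(1)) if atoms else None,
        # Frame indices reported by gmx check are 0-based, hence the +1.
        "n_frames": int(frames.group(1)) + 1 if frames else None,
    }

print(check_xtc("trajectory.xtc"))  # placeholder file name
```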

MDverse data explorer web app

The MDverse data explorer web application was built in Python with the Streamlit library. Data was downloaded from Zenodo (see the Data and code availability section).
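The core of such an application fits in a few lines of Streamlit, as sketched below; the file and column names are placeholders for the actual Parquet files shared on Zenodo.

```python
import pandas as pd
import streamlit as st

# Placeholder file and column names; the real application loads the
# Parquet files shared on Zenodo (see Data and code availability).
datasets = pd.read_parquet("datasets.parquet")

keyword = st.sidebar.text_input("Search datasets for keywords")
if keyword:
    mask = datasets["description"].str.contains(keyword, case=False, na=False)
    datasets = datasets[mask]

st.dataframe(datasets)
st.download_button(
    label="Export results as .tsv",
    data=datasets.to_csv(sep="\t", index=False),
    file_name="results.tsv",
)
```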

System visualization and molecular graphics

Molecular graphics were performed with VMD (Humphrey et al., 1996) and Chimera (Pettersen et al., 2004). For all visualizations, .gro files containing molecular structure were used. In the case of the two structures in Fig. 3-B, .xtc files were manually assigned to their corresponding .gro (for the TonB protein) or .tpr (for the T4 Lysozyme) files based on their names in their datasets.

Origin of the structures displayed in this work:

TonB

Dataset URL: https://zenodo.org/record/3756664

Publication (DOI): https://doi.org/10.1039/D0CP03473H

T4 Lysozyme

Dataset URL: https://zenodo.org/record/3989044

Publication (DOI): https://doi.org/10.1021/acs.jctc.0c01338

Benzene

Dataset URL: https://figshare.com/articles/dataset/Capturing_Protein_Ligand_Recognition_Pathways_in_Coarse-Grained_Simulation/12517490/1

Publication (DOI): https://doi.org/10.1021/acs.jpclett.0c01683

Ammonia

Dataset URL: https://figshare.com/articles/dataset/Alchemical_Hydration_Free-Energy_Calculations_Using_Molecular_Dynamics_with_Explicit_Polarization_and_Induced_Polarity_Decoupling_An_On_the_Fly_Polarization_Approach/11702442

Publication (DOI): https://doi.org/10.1021/acs.jctc.9b01139

Peptide with membrane

Dataset URL: https://zenodo.org/record/4371296

Publication (DOI): https://doi.org/10.1021/acs.jcim.0c01312

Kir channels

Dataset URL: https://zenodo.org/record/3634884

Publication (DOI): https://doi.org/10.1073/pnas.1918387117

Gasdermin

Dataset URL: https://zenodo.org/record/6797842

Publication (DOI): https://doi.org/10.7554/eLife.81432

Protein-RNA

Dataset URL: https://zenodo.org/record/1308045

Publication (DOI): https://doi.org/10.1371/journal.pcbi.1006642

G-quadruplex

Dataset URL: https://zenodo.org/record/5594466

Publication (DOI): https://doi.org/10.1021/jacs.1c11248

Ptb

Dataset URL: https://osf.io/4aghb/

Publication (DOI): https://doi.org/10.1073/pnas.2116543119

EPI-7170

Dataset URL: https://zenodo.org/record/7120845

Publication (DOI): https://doi.org/10.1038/s41467-022-34077-z

Gold nanoparticle

Dataset URL: https://acs.figshare.com/articles/dataset/Fluorescence_Probing_of_Thiol_Functionalized_Gold_Nanoparticles_Is_Alkylthiol_Coating_of_a_Nanoparticle_as_Hydrophobic_as_Expected_/2481241

Publication (DOI): https://doi.org/10.1021/jp3060813

Gd(DOTA)

Dataset URL: https://acs.figshare.com/articles/dataset/Modeling_Gd_sup_3_sup_Complexes_for_Molecular_Dynamics_Simulations_Toward_a_Rational_Optimization_of_MRI_Contrast_Agents/20334621

Publication (DOI): https://doi.org/10.1021/acs.inorgchem.2c01597

Metallo cage

Dataset URL: https://acs.figshare.com/articles/dataset/Rationalizing_the_Activity_of_an_Artificial_Diels-Alderase_Establishing_Efficient_and_Accurate_Protocols_for_Calculating_Supramolecular_Catalysis/11569452

Publication (DOI): https://doi.org/10.1021/jacs.9b10302

Al1

Dataset URL: https://acs.figshare.com/articles/dataset/Nucleation_Mechanisms_of_Self-Assembled_Physisorbed_Monolayers_on_Graphite/8846045

Publication (DOI): https://doi.org/10.1021/acs.jpcc.9b01234

PTEG-1 (all-atom)

Dataset URL: https://figshare.com/articles/dataset/PTEG-1_PP_and_N-DMBI_atomistic_force_fields/5458144

Publication (DOI): https://doi.org/10.1039/C7TA06609K

PTEG-1 (coarse-grain)

Dataset URL: https://figshare.com/articles/dataset/Neat_and_P3HT-Based_Blend_Morphologies_for_PCBM_and_PTEG-1/12338633

Publication (DOI): https://doi.org/10.1002/adfm.202004799

Theophylline

Dataset URL: https://figshare.com/articles/dataset/A_Comparison_of_Methods_for_Computing_Relative_Anhydrous_Hydrate_Stability_with_Molecular_Simulation/21644393

Publication (DOI): https://doi.org/10.1021/acs.cgd.2c00832

Data and code availability

Data files produced from the data collection and processing are shared in Parquet format in the Zenodo repository: https://zenodo.org/record/7856806

Python scripts to search and index MD files, and to download and parse .mdp and .gro files are open-source, freely available on GitHub (https://github.com/MDverse/mdws) and archived in Software Heritage (swh:1:dir:4d30b00345a732dcf9f79d3c8bfae38b35b8f2c4). A detailed documentation is provided along the scripts to easily reproduce the data collection and processing.

Jupyter notebooks used to analyze results and create the figures of this paper are open-source, freely available on GitHub (https://github.com/MDverse/mdda) and archived in Software Heritage (swh:1:dir:1f8497f72134cef0a9724c955bb03c751f52cccd).

The code of the MDverse data explorer web application is open-source, freely available on GitHub (https://github.com/MDverse/mdde) and archived in Software Heritage (swh:1:dir:1fc8b8eaabf4a9087e6d5b0ec5ed97031482bcbf).

Acknowledgements

We thank Lauri Mesilaakso, Bryan White, Jorge Hernansanz Biel, Zihwei Li and Kirill Baranov for their participation in the Copenhagen BioHackathon 2020, whose results showcased the need for a more advanced search strategy. We acknowledge Massimiliano Bonomi, Giovanni Bussi, Patrick Fuchs and Elise Lehoux for helpful discussions and suggestions. We also thank the Zenodo, Figshare and OSF support teams for providing figures on the content of their respective data repository and for their help in using APIs.

This work was supported by the Institut français du Danemark (Blåtand program, 2021) and the Data Intelligence Institute of Paris (diiP, IdEx Université Paris Cité, ANR-18-IDEX-0001, 2023). JKST and KL-L acknowledge funding by the Novo Nordisk Foundation [NNF18OC0033950 to KL-L], and workshops funded by the BioExcel Center of Excellence (grant agreements 823830, 101093290).

Author contributions

The original idea was conceived by EL together with JKST, MC, RH and LD. JKST, MC and PP supervised the project. JKST, PP, MC and SG conceived the search strategy. PP, JKST and LB implemented the search strategy. PP performed the analysis and interpreted the results with MS, JKST, MC and KL-L. PP and MO generated the MDverse web interface. JKST, PP and MC discussed all designs and results. MC and PP designed the figures. JKST, MB, MC and PP wrote the manuscript with input from all authors.