Reproducibility: How replicable is biomedical science in Brazil?

The results of a project to estimate the reproducibility of research in Brazil have just been published.

Efforts to estimate the reproducibility of published research results have focused on specific areas of research to date, such as cancer biology and psychology. The Brazilian Reproducibility Initiative is different in that it was set up to explore the reproducibility of research done in a particular country.

To identify the experiments that would be repeated for the initiative, the project team reviewed a random sample of life sciences articles from Brazil and drew up a list of 10 experimental methods commonly used with rodents or cell lines. Then, following a public call for labs in Brazil that could repeat such studies, the team selected three classes of experiment for replication: experiments that used the MTT reduction assay; experiments that used the reverse transcription polymerase chain reaction (RT-PCR); and experiments that used the elevated plus maze (EPM) test of anxiety.

In a preprint posted in April, the team reported that replication rates varied between 15 and 45%, depending on which of five predefined criteria was used. In this interview, members of the coordinating team for the project – Olavo Amaral, Clarissa Carneiro, Kleber Neves, Bruna Valério and Mariana Abreu – discuss what they found and how the reproducibility of research in Brazil could be improved.

Olavo Amaral (left), Kleber Neves, Mariana Abreu and Bruna Valério of the Brazilian Reproducibility Initiative. Photo credit: Patricia Bado.

How many experiments/articles did you attempt to replicate? And what was the overall cost of the initiative?

We initially attempted to replicate sixty experiments – twenty for each of the three methods – with three labs attempting each replication, giving a total of 180 planned replications. In the end we obtained data for 143 replications of 56 experiments, with 97 replications of 47 experiments considered valid. Our total budget was around 1.2 million Brazilian reais (currently around 208,000 US dollars), funded by the Serrapilheira Institute, a recently established philanthropic funder from Brazil.

How were the protocols for each replication attempt written, reviewed and revised?

The coordinating team extracted methodological information from each original article and converted it into a protocol draft, typically including a lot of gaps to be filled, which was sent to each lab performing the replication. Labs were supposed to independently fill in these gaps in order to arrive at their best shot at a direct replication – meaning we ended up with three non-identical replication protocols for each experiment. Original authors were contacted for missing information, but their responses were not used in protocol development, because we were aiming for a naturalistic estimate of how often labs could successfully replicate a finding based solely on the information contained in published articles. All the protocols were peer reviewed by other members of the Initiative – but not by the lab that did the original experiment – and pre-registered at the Open Science Framework.

How often did the experimental groups run into difficulties when repeating the experiments?

Methodological difficulties were frequent, and we did our best to separate failures in getting the method to work from failures to replicate the results. We did this through multiple strategies. First, all labs were blind to the original results, as well as to the identity of the group that performed the original experiment. Second, they were asked to set up predefined criteria for methodological validity, although doing this was not easy for most labs. Third, we established a validation committee that judged whether each replication was methodologically valid based on a report on the execution of the experiment and any issues that came up in it, but without access to the results. Still, some difficulties only became apparent once the results were in – for example, a lack of PCR amplification in particular experimental units – and these cases were dealt with on a case-by-case basis following a set of general guidelines.

How were the data collected?

We built detailed spreadsheets for each experiment, based on the developed protocols, to allow for an external assessment of how the experiment was performed. These spreadsheets included details on multiple steps of the experiment: the handling of animals and culturing of cells; when reagents were delivered, when they were opened and how they were stored; who performed each step of the protocol; and the results in raw and processed form. This allowed us to perform an external validation of the replications after they were completed.
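To make this concrete, here is a minimal, hypothetical sketch of the kind of structured record such a data collection sheet might capture. The field names and values are invented for illustration only and are not the Initiative's actual template.

```python
# Hypothetical sketch of a per-replication data collection record (field names are
# assumptions for illustration, not the Initiative's actual template).
from dataclasses import dataclass, field

@dataclass
class ReplicationRecord:
    experiment_id: str            # which original experiment is being replicated
    replicating_lab: str          # anonymized identifier of the replicating lab
    handling: dict                # animal housing or cell culture conditions
    reagents: list                # per reagent: delivery date, opening date, storage
    personnel_per_step: dict      # who performed each step of the protocol
    protocol_deviations: list = field(default_factory=list)
    raw_results: list = field(default_factory=list)
    processed_results: dict = field(default_factory=dict)

# Example entry (all values invented)
record = ReplicationRecord(
    experiment_id="MTT-012",
    replicating_lab="Lab-07",
    handling={"cell_line": "example line", "passage": 12, "medium": "DMEM"},
    reagents=[{"name": "MTT", "delivered": "2021-03-01", "opened": "2021-03-10",
               "storage": "-20C"}],
    personnel_per_step={"cell culture": "researcher A", "absorbance reading": "researcher B"},
)
```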

What happened next?

After replications were finished, data collection sheets were sent to the coordinating team, which compiled a document listing protocol deviations and other issues arising in the experiment. This was sent back to the laboratory so that it could review these observations and clarify any outstanding issues. Each lab was also asked to rate how much it thought the experiment had deviated from the registered protocol.

After this step, labs received the original article with the results for the first time. They were asked to answer whether they thought they had successfully replicated the protocol and results, and to provide justifications for their answers.

In parallel, summaries of protocol deviations were sent to a validation committee, consisting of members of the coordinating team and of independent labs in the initiative, who were asked to judge whether the experiment was a valid replication based on these summaries. If any of the reviewers raised concerns about the replication, it was discussed by the committee to decide whether it should be included in the analysis.

Almost a third of replications were invalidated due to various issues, such as protocol deviations, insufficient biological variation between experimental units or incomplete documentation. Labs then received this assessment and were asked about their agreement with it, as well as about the reasons for protocol deviations when they occurred.

Please explain the five criteria used to assess the replication experiments, and summarize the main conclusions.

We used five different predefined criteria to assess replication success, yielding success rates between 15 and 45%: two criteria were based on effect size comparisons, two were based on statistical significance, and one was subjective.

Going through the criteria one by one: first, did the original effect size fall within the 95% prediction interval of a meta-analysis of the replications? We found that 45% of the original effect sizes fell within this interval. This criterion – which takes into account the variability of the replications – was only applicable if we had more than one replication for the experiment.

Second, was the replication estimate within the 95% confidence interval of the original result? Here we found that 26% of the replication estimates fell within this confidence interval. This criterion assessed whether the estimate obtained by aggregating the replications was within the “margin of error” of the original experiment.

Third, for what fraction of experiments was the replication estimate statistically significant at p<0.05? This widely used statistical criterion evaluates whether the aggregate of the available replications shows evidence of a non-zero effect in the same direction as the original experiment: a significant result means that an effect at least as large as the one observed would have a 5% or smaller chance of occurring if the true effect were null. We found that 19% of the replication estimates were statistically significant at p<0.05.

Fourth, what fraction of experiments had at least half of their replications significant at p<0.05? This criterion evaluates whether at least half of the individual replications of an experiment were statistically significant on their own. We found this to be the case for 15% of experiments.

Fifth, what fraction of experiments had at least half of their replications judged successful by the labs performing them? For this criterion, we provided the team performing the replication with the original result after the experiment was concluded, and asked whether they thought it was successfully replicated. Labs were free to use whatever criteria they wished for this decision. This criterion was fulfilled by 40% of experiments.
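To make the four statistical criteria concrete, the sketch below checks them for a single hypothetical experiment. This is not the Initiative's analysis code: the effect sizes and standard errors are invented, a DerSimonian-Laird random-effects model is assumed for aggregating the replications, and the prediction interval follows the standard Higgins et al. formula.

```python
import numpy as np
from scipy import stats

# Hypothetical data for one experiment: original effect size (e.g., a standardized
# mean difference) with its standard error, plus three replication estimates.
orig_es, orig_se = 0.80, 0.25
rep_es = np.array([0.35, 0.10, 0.55])
rep_se = np.array([0.20, 0.22, 0.18])
k = len(rep_es)

# Random-effects meta-analysis of the replications (DerSimonian-Laird estimator).
w = 1 / rep_se**2                               # inverse-variance (fixed-effect) weights
es_fixed = np.sum(w * rep_es) / np.sum(w)
q = np.sum(w * (rep_es - es_fixed)**2)          # Cochran's Q
c = np.sum(w) - np.sum(w**2) / np.sum(w)
tau2 = max(0.0, (q - (k - 1)) / c)              # between-replication variance
w_re = 1 / (rep_se**2 + tau2)
es_meta = np.sum(w_re * rep_es) / np.sum(w_re)  # aggregated replication estimate
se_meta = np.sqrt(1 / np.sum(w_re))

# Criterion 1: original estimate within the 95% prediction interval of the
# replication meta-analysis (t distribution with k-2 degrees of freedom).
t_crit = stats.t.ppf(0.975, df=k - 2)
crit1 = abs(orig_es - es_meta) <= t_crit * np.sqrt(tau2 + se_meta**2)

# Criterion 2: aggregated replication estimate within the original 95% CI.
z = stats.norm.ppf(0.975)
crit2 = orig_es - z * orig_se <= es_meta <= orig_es + z * orig_se

# Criterion 3: aggregated replication estimate significant at p < 0.05,
# in the same direction as the original effect.
p_meta = 2 * stats.norm.sf(abs(es_meta) / se_meta)
crit3 = p_meta < 0.05 and np.sign(es_meta) == np.sign(orig_es)

# Criterion 4: at least half of the individual replications significant on their own.
p_each = 2 * stats.norm.sf(np.abs(rep_es) / rep_se)
crit4 = np.mean(p_each < 0.05) >= 0.5

print(f"criterion 1: {crit1}, 2: {crit2}, 3: {crit3}, 4: {crit4}")
```

With these invented numbers, the first three criteria are met while the fourth is not, which illustrates how the different criteria can disagree for the same experiment.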

What were, in your opinion, the main lessons to be learnt from the initiative from the point of view of researchers, institutions, funders and journals?

One thing we learned is that reasons for irreproducibility vary a lot according to research field and type of experiment. We came into the project influenced by discussions from fields such as psychology, which give a lot of attention to issues such as analysis flexibility and publication bias as sources of irreproducibility.

In our replications, however, there was a lot of heterogeneity across labs, even though they were blinded and had no incentives towards specific results, suggesting that part of the irreproducibility was likely due to the technical complexity of the methods and procedures. Preregistering these kinds of experiments was also complicated – labs had a hard time defining their methods before having direct experience with the biological model, and ended up breaking their own protocols quite frequently.

A more general takeaway is that most academic labs are not used to working in a confirmatory fashion, with a commitment to predefined rules and protocols – and that we cannot just summon this capacity from scratch. If we want to have the capacity to rigorously confirm findings, we must build up teams that are used to working with a higher degree of coordination, planning and rigor, either in particular institutions or in distributed consortia. Not all research has to be done this way – it is probably fine if most labs keep working in an exploratory fashion. But it’s important that we can confirm important findings robustly when needed, and this requires building up this kind of expertise within research institutions.

How did the research community in Brazil react to the initiative?

We were very well received by the Brazilian community from the start, and the project managed to raise a lot of interest. We had 97 labs across the country signing up to participate, with 75 joining at some point and 56 making it to the end – which exceeded our most optimistic predictions.

There hasn’t been much backlash against the project either. People tend to think that replication efforts will inevitably raise controversy, but that was not the case in ours – perhaps because we took care not to point fingers at particular results or researchers. I think this speaks to the fact that most people in lab biology acknowledge that we have a problem in replicating results from the literature and think that addressing it is a worthy effort.

One question we have been asked more than once, however, is “why do it in Brazil?” This is usually asked out of concern that low replication rates may give the impression that the problem is specific to Brazilian science – even though this is likely not the case – and that this will jeopardize public support for science in the country. But we feel that studying the issue and talking about it openly is our best bet for achieving change – which largely depends on national actors, such as institutions and funders. Tackling the problem at a local level maximizes the chances that we can make a positive impact on the scientific environment around us.

Lastly, if you were to do the project again, what would you do differently?

I think we underestimated the difficulties of managing such a large, decentralized project and the number of ways things can go wrong along the way. Going back, we would probably have done more replications of each experiment – say 5 or 6 – to allow for some of them not getting done, even if this meant having to select fewer experiments. We’d also have clearer guidelines on how to deal with methods going wrong and experiments requiring repetition – something that we ended up building ad hoc as things came up.

We’d also want to have had more direct contact with labs at the start and over the course of the project – but this would probably have required a smaller consortium of very engaged labs. It took us a long time to realize that following labs as they worked could be more revealing than anything we could find out about the literature. Although we were able to learn a lot from what labs reported after finishing the replications, we could probably have been more proactive in asking questions about how the work was performed from the start.

We would also have been less naive about being able to preregister experiments without tinkering with the experimental model first. We’d probably allow labs a chance to run a pilot experiment before trying to reach a final, confirmatory protocol, so that methods could be optimized.

Finally, we’d probably seek a wider range of expert advice when starting out. This includes technical advice about the chosen methods, statistical help in developing the analysis plan, and management advice from people who are used to running big team projects. We know more about these issues now, but had to learn the hard way by hitting a lot of walls. Hopefully our experience can also help future replication projects in lab biology – and let us know if you need help in planning one!

Interview by Peter Rodgers, Chief Magazine Editor, eLife.