Predictive performance of multi-model ensemble forecasts of COVID-19 across European nations

  1. Katharine Sherratt (corresponding author)
  2. Hugo Gruson
  3. Rok Grah
  4. Helen Johnson
  5. Rene Niehus
  6. Bastian Prasse
  7. Frank Sandmann
  8. Jannik Deuschel
  9. Daniel Wolffram
  10. Sam Abbott
  11. Alexander Ullrich
  12. Graham Gibson
  13. Evan L Ray
  14. Nicholas G Reich
  15. Daniel Sheldon
  16. Yijin Wang
  17. Nutcha Wattanachit
  18. Lijing Wang
  19. Jan Trnka
  20. Guillaume Obozinski
  21. Tao Sun
  22. Dorina Thanou
  23. Loic Pottier
  24. Ekaterina Krymova
  25. Jan H Meinke
  26. Maria Vittoria Barbarossa
  27. Neele Leithauser
  28. Jan Mohring
  29. Johanna Schneider
  30. Jaroslaw Wlazlo
  31. Jan Fuhrmann
  32. Berit Lange
  33. Isti Rodiah
  34. Prasith Baccam
  35. Heidi Gurung
  36. Steven Stage
  37. Bradley Suchoski
  38. Jozef Budzinski
  39. Robert Walraven
  40. Inmaculada Villanueva
  41. Vit Tucek
  42. Martin Smid
  43. Milan Zajicek
  44. Cesar Perez Alvarez
  45. Borja Reina
  46. Nikos I Bosse
  47. Sophie R Meakin
  48. Lauren Castro
  49. Geoffrey Fairchild
  50. Isaac Michaud
  51. Dave Osthus
  52. Pierfrancesco Alaimo Di Loro
  53. Antonello Maruotti
  54. Veronika Eclerova
  55. Andrea Kraus
  56. David Kraus
  57. Lenka Pribylova
  58. Bertsimas Dimitris
  59. Michael Lingzhi Li
  60. Soni Saksham
  61. Jonas Dehning
  62. Sebastian Mohr
  63. Viola Priesemann
  64. Grzegorz Redlarski
  65. Benjamin Bejar
  66. Giovanni Ardenghi
  67. Nicola Parolini
  68. Giovanni Ziarelli
  69. Wolfgang Bock
  70. Stefan Heyder
  71. Thomas Hotz
  72. David E Singh
  73. Miguel Guzman-Merino
  74. Jose L Aznarte
  75. David Morina
  76. Sergio Alonso
  77. Enric Alvarez
  78. Daniel Lopez
  79. Clara Prats
  80. Jan Pablo Burgard
  81. Arne Rodloff
  82. Tom Zimmermann
  83. Alexander Kuhlmann
  84. Janez Zibert
  85. Fulvia Pennoni
  86. Fabio Divino
  87. Marti Catala
  88. Gianfranco Lovison
  89. Paolo Giudici
  90. Barbara Tarantino
  91. Francesco Bartolucci
  92. Giovanna Jona Lasinio
  93. Marco Mingione
  94. Alessio Farcomeni
  95. Ajitesh Srivastava
  96. Pablo Montero-Manso
  97. Aniruddha Adiga
  98. Benjamin Hurt
  99. Bryan Lewis
  100. Madhav Marathe
  101. Przemyslaw Porebski
  102. Srinivasan Venkatramanan
  103. Rafal P Bartczuk
  104. Filip Dreger
  105. Anna Gambin
  106. Krzysztof Gogolewski
  107. Magdalena Gruziel-Slomka
  108. Bartosz Krupa
  109. Antoni Moszyński
  110. Karol Niedzielewski
  111. Jedrzej Nowosielski
  112. Maciej Radwan
  113. Franciszek Rakowski
  114. Marcin Semeniuk
  115. Ewa Szczurek
  116. Jakub Zielinski
  117. Jan Kisielewski
  118. Barbara Pabjan
  119. Kirsten Holger
  120. Yuri Kheifetz
  121. Markus Scholz
  122. Biecek Przemyslaw
  123. Marcin Bodych
  124. Maciej Filinski
  125. Radoslaw Idzikowski
  126. Tyll Krueger
  127. Tomasz Ozanski
  128. Johannes Bracher
  129. Sebastian Funk
  1. London School of Hygiene & Tropical Medicine, United Kingdom
  2. European Centre for Disease Prevention and Control (ECDC), Sweden
  3. Karlsruhe Institute of Technology, Germany
  4. Robert Koch Institute, Germany
  5. University of Massachusetts Amherst, United States
  6. Boston Children’s Hospital and Harvard Medical School, United States
  7. Third Faculty of Medicine, Charles University, Czech Republic
  8. Ecole Polytechnique Federale de Lausanne, Switzerland
  9. Éducation nationale, France
  10. Eidgenössische Technische Hochschule, Switzerland
  11. Forschungszentrum Jülich GmbH, Germany
  12. Frankfurt Institute for Advanced Studies, Germany
  13. Fraunhofer Institute for Industrial Mathematics, Germany
  14. Heidelberg University, Germany
  15. Helmholtz Centre for Infection Research, Germany
  16. IEM, Inc, United States
  17. Independent researcher, Austria
  18. Independent researcher, United States
  19. Institut d’Investigacions Biomèdiques August Pi i Sunyer, Universitat Pompeu Fabra, Spain
  20. Institute of Computer Science of the CAS, Czech Republic
  21. Institute of Information Theory and Automation of the CAS, Czech Republic
  22. Inverence, Spain
  23. Los Alamos National Laboratory, United States
  24. LUMSA University, Italy
  25. Masaryk University, Czech Republic
  26. Massachusetts Institute of Technology, United States
  27. Max-Planck-Institut für Dynamik und Selbstorganisation, Germany
  28. Medical University of Gdansk, Poland
  29. Paul Scherrer Institute, Switzerland
  30. Politecnico di Milano, Italy
  31. Technical University of Kaiserslautern, Germany
  32. Technische Universität Ilmenau, Germany
  33. Universidad Carlos III de Madrid, Spain
  34. Universidad Nacional de Educación a Distancia (UNED), Spain
  35. Universitat de Barcelona, Spain
  36. Universitat Politècnica de Catalunya, Spain
  37. Universität Trier, Germany
  38. University of Cologne, Germany
  39. University of Halle, Germany
  40. University of Ljubljana, Slovenia
  41. University of Milano-Bicocca, Italy
  42. University of Molise, Italy
  43. University of Oxford, United Kingdom
  44. University of Palermo, Italy
  45. University of Pavia, Italy
  46. University of Perugia, Italy
  47. University of Rome "La Sapienza", Italy
  48. University of Rome "Tor Vergata", Italy
  49. University of Southern California, United States
  50. University of Sydney, Australia
  51. University of Virginia, United States
  52. University of Warsaw, Poland
  53. University of Bialystok, Poland
  54. University of Wroclaw, Poland
  55. Universität Leipzig, Germany
  56. Warsaw University of Technology, Poland
  57. Wroclaw University of Science and Technology, Poland

Decision letter

  1. Amy Wesolowski
    Reviewing Editor; Johns Hopkins Bloomberg School of Public Health, United States
  2. Neil M Ferguson
    Senior Editor; Imperial College London, United Kingdom
  3. Jeffrey L Shaman
    Reviewer; Columbia University, United States
  4. Sen Pei
    Reviewer; Columbia University Medical Center, United States

Our editorial process produces two outputs: (i) public reviews designed to be posted alongside the preprint for the benefit of readers; (ii) feedback on the manuscript for the authors, including requests for revisions, shown below. We also include an acceptance summary that explains what the editors found interesting or important about the work.

Decision letter after peer review:

Thank you for submitting your article "Predictive performance of multi-model ensemble forecasts of COVID-19 across European nations" for consideration by eLife. Your article has been reviewed by 2 peer reviewers, and the evaluation has been overseen by a Reviewing Editor and Neil Ferguson as the Senior Editor. The following individuals involved in the review of your submission have agreed to reveal their identity: Jeffrey L Shaman (Reviewer #1); Sen Pei (Reviewer #2).

The reviewers have discussed their reviews with one another, and the Reviewing Editor has drafted this to help you prepare a revised submission.

Essential revisions:

The primary comment concerned the novelty of, and the additional insights gained from, this work. While both reviewers noted that the methodology was sound, other papers report very similar findings in this setting (and others), and the added value of this work in particular was not clear. The authors are encouraged to articulate this added value more clearly.

Reviewer #1 (Recommendations for the authors):

I guess my main question is: do we need another report on multi-model 'ensembling'? I'm not sure. This work is more substantive and validated than multi-model scenario efforts (i.e., it is forecasting, not scenario play), which are often wildly speculative and in many instances shouldn't be published in high-profile journals (and I've been on a few of those papers).

I will let the editor decide.

A few other comments.

The authors use a flexible submission structure that does not appear to be strictly regularized. They write: 'Teams could express their uncertainty around any single forecast target by submitting predictions for up to 23 quantiles (from 0.01 to 0.99) of the predictive probability distribution. Teams could also submit a single-point forecast.' Were there any issues arising from this? For instance, given that there was flexibility in what was submitted (with some teams submitting only some quantiles, or only a point prediction, leading to variable missingness across quantiles), are there instances where the average mean or median value does not increase monotonically with quantile?
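
For illustration, a minimal Python sketch of the monotonicity check the reviewer describes, using hypothetical data; the intermediate quantile levels are assumed here from hub convention, and this is not the Forecast Hub's actual validation code:

```python
import numpy as np

# The 23 quantile levels from 0.01 to 0.99 (0.01, 0.025, 0.05, 0.10, ...,
# 0.90, 0.95, 0.975, 0.99); the intermediate levels are an assumption.
quantile_levels = np.concatenate(([0.01, 0.025],
                                  np.arange(0.05, 0.951, 0.05),
                                  [0.975, 0.99]))

def quantiles_are_monotonic(predicted):
    """Check that predicted values never decrease as the quantile level rises."""
    return bool(np.all(np.diff(predicted) >= 0))

# A hypothetical forecast, first valid, then with one quantile crossing.
rng = np.random.default_rng(1)
predicted = np.sort(rng.normal(1000, 200, size=quantile_levels.size))
print(quantiles_are_monotonic(predicted))   # True: sorted values are valid
predicted[12] = predicted[11] - 50           # introduce a violation
print(quantiles_are_monotonic(predicted))    # False: would fail such a check
```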

'Coverage' is an evocative term; in weather, they more typically use 'reliability', defined as the correspondence between the forecast probability of an event and the observed frequency of that event. Consider at least noting that coverage, as defined here, is reliability. Calibration is used to describe reliability, and I note this is used in the text.
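
As an aside for readers, a minimal sketch of how empirical coverage (the reviewer's 'reliability') can be computed against a nominal interval level; the data and predictive distribution are hypothetical, and this is not the hub's scoring code:

```python
import numpy as np
from scipy.stats import poisson

def empirical_coverage(lower, upper, observed):
    """Fraction of observations falling inside [lower, upper].

    For a reliable (well-calibrated) 90% central prediction interval,
    this fraction should be close to 0.90.
    """
    observed = np.asarray(observed)
    return float(np.mean((observed >= lower) & (observed <= upper)))

# Hypothetical forecast: a Poisson(500) predictive distribution, whose 90%
# central interval is bounded by its 0.05 and 0.95 quantiles.
rng = np.random.default_rng(0)
observed = rng.poisson(500, size=1000)
lower, upper = poisson.ppf(0.05, 500), poisson.ppf(0.95, 500)
print(empirical_coverage(lower, upper, observed))  # close to 0.90
```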

The use of relative WIS is nice.
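
For concreteness, a sketch of the weighted interval score (WIS) computed from quantile forecasts via its equivalent quantile-score form (Bracher et al., 2021); the distribution and observation are hypothetical, and this is not the hub's production scoring code:

```python
import numpy as np
from scipy.stats import norm

def weighted_interval_score(quantile_levels, predicted, observed):
    """WIS as the mean quantile (pinball) score over all submitted quantiles.

    Each term is 2 * (1{y <= q} - tau) * (q - y); averaged over quantiles,
    this equals the weighted sum of interval scores plus the absolute error
    of the median (Bracher et al., 2021).
    """
    tau = np.asarray(quantile_levels)
    q = np.asarray(predicted)
    loss = 2 * ((observed <= q).astype(float) - tau) * (q - observed)
    return float(loss.mean())

# Hypothetical example: a Normal(1000, 100) predictive distribution reported
# at 23 quantile levels, scored against a single observation.
levels = np.concatenate(([0.01, 0.025], np.arange(0.05, 0.951, 0.05), [0.975, 0.99]))
predicted = norm.ppf(levels, loc=1000, scale=100)
print(weighted_interval_score(levels, predicted, observed=1100))
```

The relative WIS used in the paper then compares models via pairwise ratios of mean WIS values; that rescaling step is omitted in this sketch.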

I believe your definition of F^{-1}(α), which is confusing as I reflexively read this as a matrix inverse (perhaps use G(α) instead), is for the supremum (least upper bound), not the infimum (greatest lower bound), i.e. G(α) = sup{t : F_i(t) ≥ α}. If not, I think the ≥ should be ≤.

Note that in US flu forecasting, there is an expectation of observation revisions. Forecasts are validated against final revised observations, despite what was available in real time.

Reviewer #2 (Recommendations for the authors):

This is a solid and well-written work summarizing the efforts of the European COVID-19 Forecast Hub.

https://doi.org/10.7554/eLife.81916.sa1

Author response

Comments are grouped by theme.

Novelty

Other papers report very similar findings in this setting (and others), and the added value of this work in particular was not clear.

I guess my main question is: do we need another report on multi-model 'ensembling'?

We agree with reviewers that our findings add depth rather than breadth to the evidence base for multi-model ensembles in real-time infectious disease forecasting. As mentioned by Reviewer #2, this work was unique and unprecedented for European policy makers in spanning multiple countries while aiming to inform continent-wide public health, and we believe it holds particular value in highlighting the relevance of forecasting at multiple policy-making scales (national, regional, and international).

We have added commentary on the specific value of this effort to European policy makers as well as forecast producers in both the background and discussion sections.

Methodological limitations

‘Teams could also submit a single-point forecast.' Were there any issues arising from this?

Several teams did submit point forecasts, or forecasts with fewer than the full set of quantiles (5 of the 29 models evaluated here). We have historically reported absolute error for all models in real time, but in this paper these models are excluded from the evaluations using the interval score, which relies on the full set of quantiles.

It was not clear from the paper text that forecasts without the full set of quantiles were excluded from the ensemble. We have now updated the text to make this exclusion explicit (Methods section, under “Forecast evaluation”, and restated for clarity in the Results).

Note that in US flu forecasting, there is an expectation of observation revisions. Forecasts are validated against final revised observations, despite what was available in real time.

As discussed in the text (Discussion, page 14), we excluded forecasts whose target observations were later revised. As we noted: “More generally it is unclear if the expectation of observation revisions should be a feature built into forecasts. Further research is needed to understand the perspective of end-users of forecasts in order to assess this.” In the context of this paper we felt the fairest approach was to exclude such forecasts, while recognising that evaluating forecasts against updated data is also a valid approach.
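
For illustration, a hedged sketch of one way such an exclusion rule can be implemented, flagging forecast targets whose initially reported values were later revised beyond a relative tolerance; the 5% threshold, data, and column structure here are hypothetical, not necessarily the hub's actual rule:

```python
import pandas as pd

def flag_revised_targets(realtime: pd.Series, final: pd.Series,
                         tolerance: float = 0.05) -> pd.Series:
    """Mark targets whose observation changed by more than `tolerance`
    (relative) between the real-time and final data snapshots."""
    relative_change = (final - realtime).abs() / realtime.abs()
    return relative_change > tolerance

# Hypothetical snapshots of weekly case counts for three forecast targets.
realtime = pd.Series({"2021-W10": 1000, "2021-W11": 800, "2021-W12": 1200})
final = pd.Series({"2021-W10": 1010, "2021-W11": 950, "2021-W12": 1205})
print(flag_revised_targets(realtime, final))  # only 2021-W11 exceeds 5%
```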

We have added a note on this alternative approach in the discussion.

Definitions and clarifications

'Coverage' is an evocative term; in weather, they more typically use 'reliability', defined as the correspondence between the forecast probability of an event and the observed frequency of that event. Consider at least noting that coverage, as defined here, is reliability. Calibration is used to describe reliability, and I note this is used in the text.

Thank you for suggesting this; we had not been aware of this difference in usage. Here we used “coverage” in line with previous work discussing interval scores for quantile forecasts.

We have added a note and reference in the Methods section for our use of “coverage”.

I believe your definition of F^{-1}(α), which is confusing as I reflexively read this as a matrix inverse (perhaps use G(α) instead), is for the supremum (least upper bound), not the infimum (greatest lower bound), i.e. G(α) = sup{t : F_i(t) ≥ α}. If not, I think the ≥ should be ≤.

We apologise if the notation was perceived as ambiguous. We followed the notation of the original reference cited in the paper [1] (eq. 1.1). As an example, the 0.05 quantile of a distribution with cumulative distribution function F(x) would be the infimum, i.e. the greatest lower bound, of the set of all t that fulfil F(t) ≥ 0.05, which is the definition as written (it would be the supremum if the direction of the inequality were reversed). We could replace it with the minimum here (the lowest t with F(t) ≥ 0.05) for all practical intents and purposes, but decided to stay with the original notation so that it can be traced to the given reference.
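
To make the definition fully explicit for readers, the quantile function under discussion can be written out as follows (a LaTeX rendering of the infimum definition described above):

```latex
% Generalized inverse (quantile function) of the CDF F_i, following the
% notation of Genest (1992), eq. 1.1:
F_i^{-1}(\alpha) = \inf\{\, t \in \mathbb{R} : F_i(t) \ge \alpha \,\},
\qquad \alpha \in (0, 1).
% Because F_i is right-continuous, the infimum is attained, so for
% \alpha = 0.05 this equals the smallest t with F_i(t) \ge 0.05.
```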

We have clarified reference to notation in the text.

1. Genest C. “Vincentization Revisited.” The Annals of Statistics 20(2):1137–1142, 1992. https://www.jstor.org/stable/2242003

https://doi.org/10.7554/eLife.81916.sa2

Cite this article

Sherratt et al. (2023) Predictive performance of multi-model ensemble forecasts of COVID-19 across European nations. eLife 12:e81916. https://doi.org/10.7554/eLife.81916