Predictive performance of multi-model ensemble forecasts of COVID-19 across European nations

  1. Katharine Sherratt (corresponding author)
  2. Hugo Gruson
  3. Rok Grah
  4. Helen Johnson
  5. Rene Niehus
  6. Bastian Prasse
  7. Frank Sandmann
  8. Jannik Deuschel
  9. Daniel Wolffram
  10. Sam Abbott
  11. Alexander Ullrich
  12. Graham Gibson
  13. Evan L Ray
  14. Nicholas G Reich
  15. Daniel Sheldon
  16. Yijin Wang
  17. Nutcha Wattanachit
  18. Lijing Wang
  19. Jan Trnka
  20. Guillaume Obozinski
  21. Tao Sun
  22. Dorina Thanou
  23. Loic Pottier
  24. Ekaterina Krymova
  25. Jan H Meinke
  26. Maria Vittoria Barbarossa
  27. Neele Leithauser
  28. Jan Mohring
  29. Johanna Schneider
  30. Jaroslaw Wlazlo
  31. Jan Fuhrmann
  32. Berit Lange
  33. Isti Rodiah
  34. Prasith Baccam
  35. Heidi Gurung
  36. Steven Stage
  37. Bradley Suchoski
  38. Jozef Budzinski
  39. Robert Walraven
  40. Inmaculada Villanueva
  41. Vit Tucek
  42. Martin Smid
  43. Milan Zajicek
  44. Cesar Perez Alvarez
  45. Borja Reina
  46. Nikos I Bosse
  47. Sophie R Meakin
  48. Lauren Castro
  49. Geoffrey Fairchild
  50. Isaac Michaud
  51. Dave Osthus
  52. Pierfrancesco Alaimo Di Loro
  53. Antonello Maruotti
  54. Veronika Eclerova
  55. Andrea Kraus
  56. David Kraus
  57. Lenka Pribylova
  58. Bertsimas Dimitris
  59. Michael Lingzhi Li
  60. Soni Saksham
  61. Jonas Dehning
  62. Sebastian Mohr
  63. Viola Priesemann
  64. Grzegorz Redlarski
  65. Benjamin Bejar
  66. Giovanni Ardenghi
  67. Nicola Parolini
  68. Giovanni Ziarelli
  69. Wolfgang Bock
  70. Stefan Heyder
  71. Thomas Hotz
  72. David E Singh
  73. Miguel Guzman-Merino
  74. Jose L Aznarte
  75. David Morina
  76. Sergio Alonso
  77. Enric Alvarez
  78. Daniel Lopez
  79. Clara Prats
  80. Jan Pablo Burgard
  81. Arne Rodloff
  82. Tom Zimmermann
  83. Alexander Kuhlmann
  84. Janez Zibert
  85. Fulvia Pennoni
  86. Fabio Divino
  87. Marti Catala
  88. Gianfranco Lovison
  89. Paolo Giudici
  90. Barbara Tarantino
  91. Francesco Bartolucci
  92. Giovanna Jona Lasinio
  93. Marco Mingione
  94. Alessio Farcomeni
  95. Ajitesh Srivastava
  96. Pablo Montero-Manso
  97. Aniruddha Adiga
  98. Benjamin Hurt
  99. Bryan Lewis
  100. Madhav Marathe
  101. Przemyslaw Porebski
  102. Srinivasan Venkatramanan
  103. Rafal P Bartczuk
  104. Filip Dreger
  105. Anna Gambin
  106. Krzysztof Gogolewski
  107. Magdalena Gruziel-Slomka
  108. Bartosz Krupa
  109. Antoni Moszyński
  110. Karol Niedzielewski
  111. Jedrzej Nowosielski
  112. Maciej Radwan
  113. Franciszek Rakowski
  114. Marcin Semeniuk
  115. Ewa Szczurek
  116. Jakub Zielinski
  117. Jan Kisielewski
  118. Barbara Pabjan
  119. Kirsten Holger
  120. Yuri Kheifetz
  121. Markus Scholz
  122. Biecek Przemyslaw
  123. Marcin Bodych
  124. Maciej Filinski
  125. Radoslaw Idzikowski
  126. Tyll Krueger
  127. Tomasz Ozanski
  128. Johannes Bracher
  129. Sebastian Funk
  1. London School of Hygiene & Tropical Medicine, United Kingdom
  2. European Centre for Disease Prevention and Control (ECDC), Sweden
  3. Karlsruhe Institute of Technology, Germany
  4. Robert Koch Institute, Germany
  5. University of Massachusetts Amherst, United States
  6. Boston Children’s Hospital and Harvard Medical School, United States
  7. Third Faculty of Medicine, Charles University, Czech Republic
  8. Ecole Polytechnique Federale de Lausanne, Switzerland
  9. Éducation nationale, France
  10. Eidgenössische Technische Hochschule, Switzerland
  11. Forschungszentrum Jülich GmbH, Germany
  12. Frankfurt Institute for Advanced Studies, Germany
  13. Fraunhofer Institute for Industrial Mathematics, Germany
  14. Heidelberg University, Germany
  15. Helmholtz Centre for Infection Research, Germany
  16. IEM, Inc, United States
  17. Independent researcher, Austria
  18. Independent researcher, United States
  19. Institut d’Investigacions Biomèdiques August Pi i Sunyer, Universitat Pompeu Fabra, Spain
  20. Institute of Computer Science of the CAS, Czech Republic
  21. Institute of Information Theory and Automation of the CAS, Czech Republic
  22. Inverence, Spain
  23. Los Alamos National Laboratory, United States
  24. LUMSA University, Italy
  25. Masaryk University, Czech Republic
  26. Massachusetts Institute of Technology, United States
  27. Max-Planck-Institut für Dynamik und Selbstorganisation, Germany
  28. Medical University of Gdansk, Poland
  29. Paul Scherrer Institute, Switzerland
  30. Politecnico di Milano, Italy
  31. Technical University of Kaiserslautern, Germany
  32. Technische Universität Ilmenau, Germany
  33. Universidad Carlos III de Madrid, Spain
  34. Universidad Nacional de Educación a Distancia (UNED), Spain
  35. Universitat de Barcelona, Spain
  36. Universitat Politècnica de Catalunya, Spain
  37. Universität Trier, Germany
  38. University of Cologne, Germany
  39. University of Halle, Germany
  40. University of Ljubljana, Slovenia
  41. University of Milano-Bicocca, Italy
  42. University of Molise, Italy
  43. University of Oxford, United Kingdom
  44. University of Palermo, Italy
  45. University of Pavia, Italy
  46. University of Perugia, Italy
  47. University of Rome "La Sapienza", Italy
  48. University of Rome "Tor Vergata", Italy
  49. University of Southern California, United States
  50. University of Sydney, Australia
  51. University of Virginia, United States
  52. University of Warsaw, Poland
  53. University of Bialystok, Poland
  54. University of Wroclaw, Poland
  55. Universität Leipzig, Germany
  56. Warsaw University of Technology, Poland
  57. Wroclaw University of Science and Technology, Poland

Decision letter

  1. Amy Wesolowski
    Reviewing Editor; Johns Hopkins Bloomberg School of Public Health, United States
  2. Neil M Ferguson
    Senior Editor; Imperial College London, United Kingdom
  3. Jeffrey L Shaman
    Reviewer; Columbia University, United States
  4. Sen Pei
    Reviewer; Columbia University Medical Center, United States

Our editorial process produces two outputs: (i) public reviews designed to be posted alongside the preprint for the benefit of readers; (ii) feedback on the manuscript for the authors, including requests for revisions, shown below. We also include an acceptance summary that explains what the editors found interesting or important about the work.

Decision letter after peer review:

Thank you for submitting your article "Predictive performance of multi-model ensemble forecasts of COVID-19 across European nations" for consideration by eLife. Your article has been reviewed by 2 peer reviewers, and the evaluation has been overseen by a Reviewing Editor and Neil Ferguson as the Senior Editor. The following individuals involved in the review of your submission have agreed to reveal their identity: Jeffrey L Shaman (Reviewer #1); Sen Pei (Reviewer #2).

The reviewers have discussed their reviews with one another, and the Reviewing Editor has drafted this to help you prepare a revised submission.

Essential revisions:

The primary comment concerned the novelty of, and the additional insights gained from, this work. While both reviewers noted that the methodology was sound, other papers report very similar findings in this setting (and others), and the added value of this work in particular was not clear. The authors are encouraged to articulate this added value more clearly.

Reviewer #1 (Recommendations for the authors):

I guess my main question is: do we need another report on multi-model 'ensembling'? I'm not sure. This work is more substantive and validated than multi-model scenario efforts (i.e., it is forecasting, not scenario play), which are often wildly speculative and in many instances shouldn't be published in high-profile journals (and I've been on a few of those papers).

I will let the editor decide.

A few other comments.

The authors use a flexible submission structure that does not appear to be strictly regularized. They write: 'Teams could express their uncertainty around any single forecast target by submitting predictions for up to 23 quantiles (from 0.01 to 0.99) of the predictive probability distribution. Teams could also submit a single-point forecast.' Were there any issues arising from this? For instance, given that there was flexibility in what was submitted (with some teams submitting only some quantiles, or only a point prediction, leading to variable missingness across quantiles), are there instances where the average mean or median value does not increase monotonically with quantile?
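
For illustration, a minimal Python sketch of the monotonicity check the reviewer describes, using hypothetical data; the intermediate quantile levels are assumed here from hub convention, and this is not the Forecast Hub's actual validation code:

```python
import numpy as np

# The 23 quantile levels from 0.01 to 0.99 (0.01, 0.025, 0.05, 0.10, ...,
# 0.90, 0.95, 0.975, 0.99); the intermediate levels are an assumption.
quantile_levels = np.concatenate(([0.01, 0.025],
                                  np.arange(0.05, 0.951, 0.05),
                                  [0.975, 0.99]))

def quantiles_are_monotonic(predicted):
    """Check that predicted values never decrease as the quantile level rises."""
    return bool(np.all(np.diff(predicted) >= 0))

# A hypothetical forecast, first valid, then with one quantile crossing.
rng = np.random.default_rng(1)
predicted = np.sort(rng.normal(1000, 200, size=quantile_levels.size))
print(quantiles_are_monotonic(predicted))   # True: sorted values are valid
predicted[12] = predicted[11] - 50           # introduce a violation
print(quantiles_are_monotonic(predicted))    # False: would fail such a check
```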

'Coverage' is an evocative term; in weather, they more typically use 'reliability', defined as the correspondence between the forecast probability of an event and the observed frequency of that event. Consider at least noting that coverage, as defined here, is reliability. Calibration is used to describe reliability, and I note this is used in the text.
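
As an aside for readers, a minimal sketch of how empirical coverage (the reviewer's 'reliability') can be computed against a nominal interval level; the data and predictive distribution are hypothetical, and this is not the hub's scoring code:

```python
import numpy as np
from scipy.stats import poisson

def empirical_coverage(lower, upper, observed):
    """Fraction of observations falling inside [lower, upper].

    For a reliable (well-calibrated) 90% central prediction interval,
    this fraction should be close to 0.90.
    """
    observed = np.asarray(observed)
    return float(np.mean((observed >= lower) & (observed <= upper)))

# Hypothetical forecast: a Poisson(500) predictive distribution, whose 90%
# central interval is bounded by its 0.05 and 0.95 quantiles.
rng = np.random.default_rng(0)
observed = rng.poisson(500, size=1000)
lower, upper = poisson.ppf(0.05, 500), poisson.ppf(0.95, 500)
print(empirical_coverage(lower, upper, observed))  # close to 0.90
```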

The use of relative WIS is nice.
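
For concreteness, a sketch of the weighted interval score (WIS) computed from quantile forecasts via its equivalent quantile-score form (Bracher et al., 2021); the distribution and observation are hypothetical, and this is not the hub's production scoring code:

```python
import numpy as np
from scipy.stats import norm

def weighted_interval_score(quantile_levels, predicted, observed):
    """WIS as the mean quantile (pinball) score over all submitted quantiles.

    Each term is 2 * (1{y <= q} - tau) * (q - y); averaged over quantiles,
    this equals the weighted sum of interval scores plus the absolute error
    of the median (Bracher et al., 2021).
    """
    tau = np.asarray(quantile_levels)
    q = np.asarray(predicted)
    loss = 2 * ((observed <= q).astype(float) - tau) * (q - observed)
    return float(loss.mean())

# Hypothetical example: a Normal(1000, 100) predictive distribution reported
# at 23 quantile levels, scored against a single observation.
levels = np.concatenate(([0.01, 0.025], np.arange(0.05, 0.951, 0.05), [0.975, 0.99]))
predicted = norm.ppf(levels, loc=1000, scale=100)
print(weighted_interval_score(levels, predicted, observed=1100))
```

The relative WIS used in the paper then compares models via pairwise ratios of mean WIS values; that rescaling step is omitted in this sketch.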

I believe your definition of F^{-1}(α), which is confusing as I reflexively read this as a matrix inverse (perhaps use G(α) instead), is for the supremum (least upper bound), not the infimum (greatest lower bound), i.e. G(α) = sup{t : F_i(t) ≥ α}. If not, I think the ≥ should be ≤.

Note that in US flu forecasting, there is an expectation of observation revisions. Forecasts are validated against final revised observations, despite what was available in real time.

Reviewer #2 (Recommendations for the authors):

This is a solid and well-written work summarizing the efforts of the European COVID-19 Forecast Hub.

https://doi.org/10.7554/eLife.81916.sa1

Author response

Comments are grouped by theme.

Novelty

Other papers report very similar findings in this setting (and others), and the added value of this work in particular was not clear.

I guess my main question is: do we need another report on multi-model 'ensembling'?

We agree with reviewers that our findings add depth rather than breadth to the evidence base for multi-model ensembles in real-time infectious disease forecasting. As mentioned by Reviewer #2, this work was unique and unprecedented for European policy makers in spanning multiple countries while aiming to inform continent-wide public health, and we believe it holds particular value in highlighting the relevance of forecasting at multiple policy-making scales (national, regional, and international).

We have added commentary on the specific value of this effort to European policy makers as well as forecast producers in both the background and discussion sections.

Methodological limitations

‘Teams could also submit a single-point forecast.' Were there any issues arising from this?

Several teams did submit point forecasts, or forecasts with fewer than the full set of quantiles (5 of the 29 models evaluated here). We have historically reported absolute error for all models in real time, but in this paper these models are excluded from the evaluations using the interval score, which relies on the full set of quantiles.

It was not clear from the paper text that forecasts without the full set of quantiles were excluded from the ensemble. We have now updated the text to make this exclusion explicit (Methods section, under “Forecast evaluation”, and restated for clarity in the Results).

Note that in US flu forecasting, there is an expectation of observation revisions. Forecasts are validated against final revised observations, despite what was available in real time.

As discussed in the text (Discussion, page 14), we excluded forecasts whose target observations were later revised. As we noted: “More generally it is unclear if the expectation of observation revisions should be a feature built into forecasts. Further research is needed to understand the perspective of end-users of forecasts in order to assess this.” In the context of this paper we felt the fairest approach was to exclude such forecasts, while recognising that evaluating forecasts against updated data is also a valid approach.
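
For illustration, a hedged sketch of one way such an exclusion rule can be implemented, flagging forecast targets whose initially reported values were later revised beyond a relative tolerance; the 5% threshold, data, and column structure here are hypothetical, not necessarily the hub's actual rule:

```python
import pandas as pd

def flag_revised_targets(realtime: pd.Series, final: pd.Series,
                         tolerance: float = 0.05) -> pd.Series:
    """Mark targets whose observation changed by more than `tolerance`
    (relative) between the real-time and final data snapshots."""
    relative_change = (final - realtime).abs() / realtime.abs()
    return relative_change > tolerance

# Hypothetical snapshots of weekly case counts for three forecast targets.
realtime = pd.Series({"2021-W10": 1000, "2021-W11": 800, "2021-W12": 1200})
final = pd.Series({"2021-W10": 1010, "2021-W11": 950, "2021-W12": 1205})
print(flag_revised_targets(realtime, final))  # only 2021-W11 exceeds 5%
```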

We have added a note on this alternative approach in the discussion.

Definitions and clarifications

'Coverage' is an evocative term; in weather, they more typically use 'reliability', defined as the correspondence between the forecast probability of an event and the observed frequency of that event. Consider at least noting that coverage, as defined here, is reliability. Calibration is used to describe reliability, and I note this is used in the text.

Thank you for suggesting this; we had not been aware of this difference in usage. Here we used “coverage” in line with previous work discussing interval scores for quantile forecasts.

We have added a note and reference in the Methods section for our use of “coverage”.

I believe your definition of F^{-1}(α), which is confusing as I reflexively read this as a matrix inverse (perhaps use G(α) instead), is for the supremum (least upper bound), not the infimum (greatest lower bound), i.e. G(α) = sup{t : F_i(t) ≥ α}. If not, I think the ≥ should be ≤.

We apologise if the notation was perceived as ambiguous. We followed the notation of the original reference cited in the paper [1] (eq. 1.1). As an example, the 0.05 quantile of a distribution with cumulative distribution function F(x) would be the infimum, i.e. the greatest lower bound, of the set of all t that fulfil F(t) ≥ 0.05, which is the definition as written (it would be the supremum if the direction of the inequality were reversed). We could replace it with the minimum here (the lowest t with F(t) ≥ 0.05) for all practical intents and purposes, but decided to stay with the original notation so that it can be traced to the given reference.
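
To make the definition fully explicit for readers, the quantile function under discussion can be written out as follows (a LaTeX rendering of the infimum definition described above):

```latex
% Generalized inverse (quantile function) of the CDF F_i, following the
% notation of Genest (1992), eq. 1.1:
F_i^{-1}(\alpha) = \inf\{\, t \in \mathbb{R} : F_i(t) \ge \alpha \,\},
\qquad \alpha \in (0, 1).
% Because F_i is right-continuous, the infimum is attained, so for
% \alpha = 0.05 this equals the smallest t with F_i(t) \ge 0.05.
```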

We have clarified reference to notation in the text.

1. Genest C. “Vincentization Revisited.” The Annals of Statistics 20(2):1137–1142, 1992. https://www.jstor.org/stable/2242003

https://doi.org/10.7554/eLife.81916.sa2

Cite this article

Sherratt et al. (2023) Predictive performance of multi-model ensemble forecasts of COVID-19 across European nations. eLife 12:e81916. https://doi.org/10.7554/eLife.81916