Arguments for the biological and predictive relevance of the proportional recovery rule

  1. Jeff Goldsmith  Is a corresponding author
  2. Tomoko Kitago
  3. Angel Garcia de la Garza
  4. Robinson Kundert
  5. Andreas Luft
  6. Cathy Stinear
  7. Winston D Byblow
  8. Gert Kwakkel
  9. John W Krakauer
  1. Department of Biostatistics, Columbia Mailman School of Public Health, United States
  2. Burke Neurological Institute, United States
  3. Weill Cornell Medicine, United States
  4. Cereneo, Center for Neurology and Rehabilitation, Switzerland
  5. Vascular Neurology and Neurorehabilitation, Department of Neurology, University Hospital Zurich, University of Zurich, Switzerland
  6. Department of Medicine, University of Auckland, New Zealand
  7. Department of Exercise Sciences, University of Auckland, New Zealand
  8. Rehabilitation Research Centre, Reade, Netherlands
  9. Rehabilitation Medicine, Amsterdam UMC - Location VUMC, Amsterdam Movement Sciences, Netherlands
  10. Department of Neurology, Johns Hopkins University, United States
  11. Department of Neuroscience, Johns Hopkins University, United States
  12. Department of Physical Medicine and Rehabilitation, United States
  13. Santa Fe Institute, United States

Abstract

The proportional recovery rule (PRR) posits that most stroke survivors can expect to reduce a fixed proportion of their motor impairment. As a statistical model, the PRR explicitly relates change scores to baseline values – an approach that arises in many scientific domains but has the potential to introduce artifacts and flawed conclusions. We describe approaches that can assess associations between baseline and changes from baseline while avoiding artifacts due either to mathematical coupling or to regression to the mean. We also describe methods that can compare different biological models of recovery. Across several real datasets in stroke recovery, we find evidence for non-artifactual associations between baseline and change, and support for the PRR compared to alternative models. We also introduce a statistical perspective that can be used to assess future models. We conclude that the PRR remains a biologically relevant model of stroke recovery.

Editor's evaluation

This fundamental work provides a comprehensive look at validity of the Proportional Recovery Rule, which states that patients will recover a fixed proportion of lost function after stroke. By undertaking a thorough investigation of the statistical properties of the analysis of change and baseline values the authors elucidate the statistical framework that can be used regardless of the topic of study. In a compelling model comparison across several large sets of data, the authors confirm support for the Proportional Recovery Rule over other models of recovery.

https://doi.org/10.7554/eLife.80458.sa0

Introduction

Intuition, experience, and data suggest that patients with worse motor impairment in the immediate post-stroke period will also typically see the largest absolute reductions in impairment during the first 3–6 months of recovery. However, rigorously quantifying this observation has proved challenging. The proportional recovery rule (PRR) was an early attempt to describe the relationship between initial impairment and recovery through the investigation of upper-extremity Fugl-Meyer assessments (FMA-UE) at baseline and at subsequent follow-up visits, with recovery defined as the change over time (Prabhakaran et al., 2008; Krakauer and Marshall, 2015). This work indicated that, on average, a large subset of patients recovered roughly 70% of the maximal potential recovery from impairment, but a biologically distinct and smaller subgroup recovered much less (‘non-recoverers’). Since its introduction, the PRR has been applied across several neurological domains, implemented in various ways across studies, and evaluated using several different statistical metrics (Lazar et al., 2010; Winters et al., 2015; Winters et al., 2017; Veerbeek et al., 2018; Byblow et al., 2015). In a recent article, we sought to reestablish the original conceptualization of the PRR as a regression model for describing of recovery from upper limb impairment (Kundert et al., 2019).

The PRR has, in recent years, come under fire. The criticisms echo concerns raised anytime change as an outcome is related to the baseline value, particularly before and after an intervention – whether the outcome of interest is blood pressure, CD4 cell count, or behavior. The concern hinges on statistical questions about the relationship between baseline values and change scores – a relationship that is often fraught, counterintuitive, and has confounded researchers and statisticians for decades (Oldham, 1962; Gill et al., 1985; Tu, 2016). Bringing these issues to the fore and illustrating the ways they affect the specific case of stroke recovery has spurred an important and informative debate. Specifically, there are questions about the usefulness of correlations in the case when there is mathematical coupling from inclusion of the baseline value in the change score, the distinction between population-level descriptions and patient-level predictions; the usefulness of the PRR in functional domains other than upper-extremity motor control; the identification of ‘non-recoverers’, both prospectively and retrospectively; and the possibility that ceiling effects exaggerate associations or introduce non-linearity not accounted for by the PRR (Hope et al., 2018; Hawe et al., 2018; Bonkhoff et al., 2020; Bowman et al., 2021).

Although the critiques of the PRR have been rigorous and careful, never asserting that the PRR was entirely irrelevant for recovery, the tone and content of some of this work might make it understandable for readers to infer otherwise. In ‘Recovery after stroke: not so proportional after all?’ (Hope et al., 2018), the authors say in their discussion that they ‘are not claiming that the proportional recovery rule is wrong’, only that they do not see evidence that the rule holds and that other models for recovery are possible. Hawe et al., 2018, included some similar caveats in the discussion of a paper titled ‘Taking proportional out of recovery’. Senesh and Reinkensmeyer, 2019 (title: ‘Breaking proportional recovery after stroke’) summarized these two papers in their abstract by saying they ‘explained that [proportional recovery] should be expected because of mathematical coupling between the baseline and change score’. Some of the authors of the original criticisms have since moderated their stance and important points of consensus are emerging; see, for example, Bonkhoff et al., 2020, and Bonkhoff et al., 2022. Against this rapidly evolving and potentially confusing backdrop, we hope to bring together all the thorny issues so that readers can reach some degree of closure on the PRR.

The growing meta-literature often discusses the PRR in an abstract way, focusing on a baseline x, a single follow-up y, the change δ=y-x, and all the various correlations among them. Readers could be forgiven for asking how an intuitive formula for expected recovery spawned its own cottage industry, and why arguments about the PRR have become so esoteric. We suspect that many, by now, would prefer a simple judgment on the truth of the PRR without a winding statistical detour. We’re sympathetic to this perspective, but find the detour necessary as the nuanced and sometimes counterintuitive statistical arguments are critical to get right for the sake of furthering our understanding of the biological mechanisms of recovery.

Our goals are to synthesize and discuss the statistical issues relevant to the PRR clearly, to describe appropriate analysis techniques, to identify areas of emerging consensus, and to resolve arguments where possible. We discuss valid but non-standard hypothesis tests for correlations, framed to distinguish true signal from artifact. This is not the same as distinguishing between the PRR and other recovery mechanisms that induce strong correlations; we therefore evaluate competing models based on their ability to predict outcomes. Much of this discussion assumes that ‘recoverers’ and ‘non-recoverers’ can be differentiated, and mainly focuses on models for recoverers. That said, the existence of distinct recovery groups complicates the quantification of recovery. We discuss the hazards of applying advanced statistical methods in settings where they aren’t justified by the data, and discuss ways of testing whether observed data are consistent with a hypothesized mechanism. Throughout, we use generated datasets and data taken from several published studies of recovery to illustrate our statistical points. In some places, the discussion is unavoidably technical, but the aim is to focus on the details that are pertinent to the core question: Is there a true systematic relationship between impairment and change in impairment after stroke? If the answer is yes, then this would imply that there is important biological work to be done to explain the mechanism for this phenomenological regularity.

Results

Limitations of the correlation between baseline and change as a measure of association

Although the PRR is best understood as a regression model (Kundert et al., 2019), correlations between baseline and follow-up, and between baseline and change, have often been used to summarize data and are presented as evidence for the rule. A focus on cor(x,δ), where δ=y-x is the change between follow-up (y) and baseline (x), is intuitive in the context of recovery. The value of this correlation can be affected in unexpected ways due to mathematical coupling, most broadly defined as the setting where one value (x) is also included in the definition of the second (δ). The following relationship is known to hold:

(1) corx,δ=σycorx,y-σxσy2+σx2-2σxσycorx,y= kcorx,y-11+k-2kcorx,y

This equation shows the dependence of cor(x,δ) on cor(x,y) and the variance ratio k=σy2σx2 . The relationship between these quantities, visualized first by Bartko and Pettigrew, 1968, and more recently as a three-dimensional surface by Hope et al., 2018, is shown as a contour plot in Figure 1. For reasons that will become clear shortly, we highlight the contour corresponding to k=1, where baseline and follow-up have equal variance. Other contour lines in this figure correspond to fixed values of k ranging between 0.01 and 4.

The left panels show four generated datasets, labeled A, B, C, and D.

In the top row, panels show outcome values and baseline (x) and follow-up (y). In the bottom row, panels show change (δ) against initial impairment (66 − x). The right panel shows a contour plot of Equation 1, with contours corresponding to values of the variance ratio k and the contour for k=1 highlighted. Points on this surface show correlation values obtained for Datasets A through D. A figure similar to the contour plot is shown in Hope et al., 2018, with different axes and orientations.

An immediate observation from this plot is that the range of possible values for cor(x,δ) depends on cor(x,y). When cor(x,y)=0, for example, cor(x,δ) is restricted to lie in [-1,0] rather than [-1,1]. By itself, this suggests that usual tests for the significance of a correlation in which the null value is assumed to be 0 are inappropriate for investigations of recovery. However, as we’ll see shortly, this does not mean that hypothesis tests are impossible – only that more appropriate ones are necessary.

The dependence of cor(x,δ) on cor(x,y) and k=σy2σx2 is the basis for two related criticisms of using the correlation between baseline and change as a statistical measure of recovery. First, the canonical example of coupling is the setting in which baseline (x) and follow-up (y) are uncorrelated and have the same variance. This situation is represented by the point on the surface where cor(x,y)=0 and the variance ratio k=σy2σx2=1. In this setting cor(x,δ)=-0.71 – a value that, for some, is unexpectedly high and casts doubt on any large correlation between baseline and change (Hope et al., 2018; Hawe et al., 2018).

A broader argument relates to settings where measurements at follow-up have much lower variance than initial values, as is typically the case for studies of recovery. In cases where k is small, cor(x,δ) may be ‘(non-trivially) stronger than cor(x,y)’ and therefore spurious or misleading (Hope et al., 2018). Settings with small values of k have been described as ‘degenerate’ in that cor(x,δ) will approach −1 (Bowman et al., 2021); as Figure 1 makes clear, this is a concern for any value of cor(x,y).

The recognition of these statistical issues, and the role they have played in understanding recovery, reveals some limitations of using correlation to measure the association between baseline and change and produce evidence for the significance of this association. By itself, cor(x,δ) will give at best an incomplete understanding of recovery, and traditional hypothesis tests (focusing on 0 as a null value) are inappropriate. That said, these criticisms don’t invalidate the PRR – they aren’t even directly relevant to the PRR. Instead, they clarify the importance of understanding the relationship between cor(x,δ), cor(x,y), and the variance ratio k=σy2σx2 ; the importance of each these in the studies of recovery; and the use of appropriate summaries of the data.

Distinguishing true and artifactual signals

To paraphrase the previous section, when the variance ratio is small, large values of cor(x,δ) can arise from a wide range of cor(x,y). It has been argued that high correlations between baseline and change are invalid unless they are accompanied by high correlations between baseline and follow-up.

Refutation of this view has a long history in the statistical literature. Oldham, 1962, argues that a variance ratio other than 1 is evidence for some real effect or process: ‘Unless some agent has caused a change of standard deviation between the two occasions, σx2 will equal σy2 ’. This argument is slightly more complicated when the outcome measure is bounded, a setting Oldham did not consider but that arises in stroke recovery. Heterogeneous recovery that depends on initial impairment could be the agent that causes a reduction in variance; alternatively, variance may be reduced because recovery is homogeneous but subject to ceiling effects or because the impairment scale is non-linear. In any case, when k<1 the amount of recovery is related to the baseline value in a way that is not attributable to mathematical coupling, and differentiating between explanations is necessary after ruling out coupling. Oldham’s method, which derives from the role of the variance ratio in studies where baseline and change are important, formalizes this concept and is a commonly used approach for understanding the relationship between baseline and change; it will be presented in the next section.

The value of cor(x,δ) depends on the variance ratio k and on cor(x,y). The variance ratio can be used as a measure of the extent to which δ depends on baseline values, regardless of the value of cor(x,y), with the understanding that this dependence can arise from the PRR or other recovery mechanisms. Meanwhile, cor(x,y) indicates whether follow-up values are related to baseline, regardless of k. Thus cor(x,y) is relevant for questions of patient-level prediction although, as we’ll argue later, correlations are less useful than direct measures of prediction accuracy.

We simulate four datasets with different values of cor(x,y) and k:

  • A: cor(x,y)=0 and k = 1

  • B: cor(x,y)=0.9 and k = 1

  • C: cor(x,y)=0 and k=0.16

  • D: cor(x,y)=0.9 and k=0.16

All datasets have x values generated from a Normal distribution with mean 30 and standard deviation 14, and consist of 30 generated subjects. Datasets A and B have follow-up values (y) with mean 30, while Datasets C and D have follow-up values with mean 53 (variances at follow-up are determined by the variance ratio). Generated datasets did not include values below 0 or above 66. These datasets were generated to clarify the relationship between cor(x,y), k, and cor(x,δ): they are not assumed to follow the PRR or any other explicit model for recovery. The left panels in Figure 1 show baseline and follow-up values (top row) and scatterplots of δ against initial impairment (bottom row), with initial impairment defined as FMmaxx=66x. The right panel indicates the placement of these datasets on the contour plot of cor(x,δ).

Dataset A is a canonical example of mathematical coupling, a setting that results in cor(x,δ)=-0.71. Like Dataset A, Dataset C has cor(x,y)=0, but the variance ratio is lower. Given our emphasis on k and on cor(x,y), any debate about whether cor(x,δ) for this dataset might be called ‘spurious’ is less relevant than (i) the true reduction in variance that results from recovery and (ii) the true association between the baseline value and the magnitude of change. Indeed, despite the inability of the baseline value to usefully predict follow-up in Dataset C, these data represent a case in which baseline values can be used to predict change in a non-artifactual way, which is a setting that Oldham and many others since have argued is important (Oldham, 1962; Tu and Gilthorpe, 2007).

In contrast, Datasets B and D have large cor(x,y), which suggests an ability to predict follow-up using baseline with some degree of accuracy. In Dataset B, change from baseline to follow-up is constant with some patient-level noise, and accurate predictions at follow-up are a simple byproduct of that constancy. Dataset B is also, arguably, irrelevant: whether due to measurements that are truly non-linear, ceiling effects, or proportionality, motor impairment recovery is marked by heterogenous recovery across subjects.

Dataset D represents the least controversial scenario among our generated data: there is recovery heterogeneity that is predictable based on initial values and results in a variance ratio that is less than one, and the initial values meaningfully predict outcomes at follow-up. The variance ratio and correlations don’t arise either from mathematical coupling or from regression to the mean due to measurement error. Real data that are similar to these are suggestive of an underlying biological recovery process in which baseline values predict change and final outcomes.

Taken together, the generated datasets in this section illustrate the relationship between cor(x,y), k, and corx,δ , as well as the kinds of observed data that can give rise to various combinations of these values. The examples highlight discrepancies between cor(x,y) and corx,δ to illustrate their shortcomings when viewed individually. We also identify a setting, typified by Dataset D, in which each measure suggests the presence of a relevant association. However, it is not the case that data like these necessarily imply that recovery follows the PRR. Other biological models, the presence of strong ceiling effects, or other mechanisms could produce data similar to Dataset D, and how to compare competing models will be considered in later sections.

Recasting Oldham’s method

Equation (1) and Figure 1 show that cor(x,y) and k determine the value of cor(x,δ) in ways that can be counterintuitive. When cor(x,y)=0, for example, an appropriate null value for a hypothesis test of cor(x,δ) is –0.71 rather than 0, because this corresponds to ‘no recovery’ or k=1; a table in the appendix provides null values for hypothesis tests of cor(x,δ) under a range of values of cor(x,y). However, like others, we think the possible confusion around cor(x,δ) as a measure of evidence make it less suitable than other approaches.

Oldham’s method suggests to use cor(x+y2,x-y), or simply cor(x+y,x-y), in place of cor(x,δ). This correlation is zero, rather than −0.71, in the canonical example of mathematical coupling. Indeed, this correlation is zero if and only if k=1 – regardless of cor(x,y), and even in many cases where measurement error or other processes might affect the ability to reliably measure outcomes at baseline and follow-up. Thus, Oldham’s method often guards against false conclusions due to mathematical coupling and regression to the mean due to measurement error.

Instead of cor(x+y,x-y), we prefer to focus on the variance ratio k. Values of k that differ from 1 suggest associations between baseline and change that do not arise from mathematical coupling or regression to the mean. In parallel, we examine the correlation cor(x,y), which is equal to zero under the null hypothesis that follow-up values are uncorrelated with baseline. These are important but distinct; a suggestion that very small values of k can only produce spurious correlations cor(x,y) would stem from conflating the two. We again emphasize that these tests are intended to assess whether correlations are ‘artifactual’ (arising from mathematical coupling or regression to the mean), but not to evaluate support for the PRR in comparison to competing models. For instance, as noted by both (Hope et al., 2018 and Hawe et al., 2018), Oldham’s method does not address the possibility of ceiling effects; in our view, determining which process gives rise to observed correlations comes after assessing whether those correlations are artifacts driven by coupling.

Parametric hypothesis tests are available for both k and cor(x,y), but depend on assumptions that may be unmet in the context of stroke recovery. We also argue for specific null values of both k and cor(x,y), although other choices are possible; one might instead choose ‘random recovery’ described in Lohse et al., 2021, as a null distribution and identify corresponding values. Instead of a parametric or simulation-based approach (Tu et al., 2005), we suggest a resampling-based one that can compare to the null values we suggest, and we describe how this method can be used for other null distributions. See Methods for details. A web-based app is available to carry out this analysis (https://jeff-goldsmith.shinyapps.io/prr_dashboard/).

The statistical significance of the recovery proportion and R2 values have often been used as evidence for the PRR. Appendix 1 provides a detailed description of the connections between these and cor(x,δ), k, and cor(x,y). In short, null-value hypothesis testing for the recovery proportion and the interpretation of R2 values can be difficult for reasons that are analogous to those for correlations: appropriate null values are available but non-standard, and the range of possible values for R2 can be limited. We urge caution in interpreting these quantities and so prefer to use other measures of association.

Comparing distinct biological models for recovery

The PRR was developed as a model to explain recovery from motor impairment after stroke. Quantitative results, illustrated in Dataset C, show it’s possible for recovery, defined as the change from baseline to follow-up, to be related to baseline values even when follow-up is not predicted well by baseline. But the ability to predict patient outcomes at follow-up is important in itself, and predictive performance can form a basis for comparing the PRR to alternative biological models for recovery. We suggest cross validation (CV) as a means to compare models, using median absolute prediction error (MAPE) in held-out data to measure predictive accuracy (lower MAPE values reflect higher prediction accuracy). Details are available in Methods.

To illustrate how prediction accuracy might be used to distinguish between competing models for recovery, we generate three datasets under different recovery mechanisms and use CV to evaluate candidate models. The first two datasets, which we’ll label C200 and D200, are generated using the same process as Datasets C and D above but consist of 200 patients. Dataset E200 is similar to Dataset B, but has a latent follow-up mean of 70 rather than 30. Follow-up values greater than 66 are subject to strict threshold. That is, Dataset E200 implements large, constant recovery (with some noise) and imposes a ceiling.

We consider several models for the association between y and x. First, we assume that y values are randomly distributed around a common mean value, and do not depend on x; this is an intercept-only regression model, and the mean of observed y values is used to predict future outcomes. We implement constant recovery with a ceiling effect as a special case of Tobit regression, assuming y=min{x+c+ϵ, 66}, where the only free parameter is the constant c. This uses all data points, including those at ceiling, when estimating model parameters. We adopt an exponential form for recovery, motivated by van der Vliet et al., 2020, which includes an explicit ceiling. Next, we assume that y depends on x, but allow the association to be smooth and non-linear using an additive model (generalized additive model [GAM]); details of this model are given in Appendix 1. Finally, we implement the PRR to estimate δ given x, with y taken to be x+δ. In all cases, we use available data to estimate model parameters. Note that except for the PRR, models focus on predicting the follow-up value directly. Also, the additive model includes several other models as special cases, but may overfit due to its flexibility.

The top row in Figure 2 shows baseline and follow-up values, and the bottom row shows scatterplots of observed (ceiled) y against x. The panels in the bottom row of Figure 2 clarify the relationship between the data generating mechanism and the ability of x to predict y; in each panel, fitted values from each of the candidate models are overlaid. Unsurprisingly, for Dataset C200, there is no apparent relationship between x and y – these data are generated under cor(x,y)=0. Dataset D200 is a case when baseline values are predictive of change and outcomes. In Dataset E200, the expected y increases linearly with x until reaching a plateau, and then is uniformly near the maximum value.

The top panels show three generated datasets with true outcome values and baseline (x) and follow-up (y); a horizontal line indicates a ceiling on observed values.

In the bottom row, panels show the observed (ceiled) value at follow-up against the baseline value. Fitted values from an intercept-only model (Int), a generalized additive model (GAM), and the proportional recovery rule (PRR) are shown in the bottom row.

Figure 3 shows the result of our CV procedure applied to Datasets C200, D200, and E200; panels show the distribution of MAPE across repeated training/testing splits. For Dataset C200, the intercept-only, exponential, and additive models are the best performers; the PRR suffers from the lack of an intercept, which produces a bias in the predictions, while constant recovery with a ceiling is not well suited to this mechanism. For Dataset D200, the intercept-only and constant recovery models are the worst performers, while the (true) PRR, exponential, and additive models are similar. Finally, for Dataset E200, the (true) constant recovery model performs best. The additive model is flexible enough to capture the underlying non-linear association between y and x, and very slightly underperforms compared to constant recovery. The exponential model is a reasonable approximation, but the PRR makes relatively poor predictions.

Each panel shows the distribution of median absolute prediction errors (MAPE) obtained using cross validation for each of five models.

Panels correspond to the generated datasets shown in Figure 2. Models compared are an intercept-only model (Int), a generalized additive model (GAM), and the proportional recovery rule (PRR).

These results suggest that CV can effectively identify a best (or worst) model for prediction accuracy. When two or more models are similarly accurate, other considerations may be relevant. In Datasets C200 and D200, for example, the additive model contains the true, simpler model as a special case and is needlessly complex. We also emphasize a limitation of CV, which is the exclusive focus on prediction accuracy. This is analogous to using only cor(x,y) to understand recovery, rather than cor(x,y) and k together. As we have discussed elsewhere, dismissing observations like those in Dataset C200 as uninteresting because baseline does not predict follow-up would be a mistake: the correlation between baseline and change does not arise from mathematical coupling and may reflect important recovery mechanisms. We therefore suggest CV as one component of a careful analysis.

Results for reported datasets

We now evaluate three previously reported datasets, described in the Methods section, using the statistical techniques given above.

Figure 4 illustrates these example datasets. In the left panels, we show observed FM values for patients at baseline and follow-up (top row), and scatterplots of change against baseline (bottom row). Throughout, we differentiate recoverers and non-recoverers as specified in the original analyses. The right panel shows the contour plot of cor(x,δ), with points corresponding to observed values among recoverers for each dataset. These values cluster in the bottom right corner, with reasonably large values of cor(x,y) and k<1.

The left panels show three real datasets.

In the top row, panels show outcome values and baseline (x) and follow-up (y); points are colored to indicate recoverers (purple) and non-recoverers (yellow) using the definitions from each paper describing the data. In the bottom row, panels show change (delta) against initial impairment (66 − x), again separating recoverers and non-recoverers. The right panel shows a contour plot of Equation 1, with contours corresponding to values of the variance ratio k and the contour for k=1 highlighted. Points on this surface show correlation values obtained for the real datasets.

We next conducted a bootstrap analysis on the subsample of recoverers to obtain inference for k and cor(x,y). We display the results using 1000 bootstrap samples for each dataset in the contour plot in Figure 5, showing the null value corresponding to random recovery for reference, and summarize the results using 95% confidence intervals in Table 1. For each dataset, there is strong evidence that k and cor(x,y) differ from our suggested null values (1 and 0, respectively), and from those corresponding to random recovery. That is, we have evidence both that recovery is related to baseline values in a way that reduces variance at follow-up, and that baseline values are predictive of follow-up. These results inform conclusions about the statistical significance of observed associations between baseline, change, and follow-up values using plausible null hypotheses, and specifically address concerns about high correlations induced by canonical mathematical coupling. However, several mechanisms (including the PRR and constant recovery in the presence of strong ceiling effects) could give rise to these correlations.

Each panel shows the results of the bootstrap procedure used to obtain inferences about the value of correlations and variance ratio.

Red points are the values obtained for the full dataset, and blue points are values obtained in each of 1000 bootstrap samples; points are overlaid on the contour plot of Equation 1.

Table 1
Values and 95% confidence intervals for k and cor(x,y) for each of three datasets.

Confidence intervals are obtained through a bootstrap procedure with 1000 bootstrap samples.

Dataset namekcor(x,y)
Stinear and Byblow0.39 [0.27, 0.54]0.73 [0.66, 0.79]
Winters0.07 [0.05, 0.09]0.59 [0.49, 0.69]
Zarahn0.13 [0.03, 0.24]0.75 [0.55, 0.92]

We next compared the performance of five models in terms of predictive accuracy using CV. As for generated data, we considered an intercept-only model, a model assuming constant recovery with a ceiling effect, an additive model, an exponential model, and the PRR. For each dataset, we generated 1000 training/testing splits and summarize prediction accuracy using MAPEs. The plots below show the distribution of MAPE across splits for each model and dataset. In these examples, constant recovery with a ceiling underperforms against competing methods. The exponential model, additive model, and PRR prediction accuracies are often comparable, although the PRR appears superior for the Stinear and Byblow and Zarahn datasets. A figure in Appendix 1 shows fitted values for each model applied to training datasets. In keeping with the results for prediction accuracy, the constant recovery model is visually a poor fit. In some instances, the additive model is too flexible, especially for the Zarahn data. The additive and exponential models are similar to the PRR in terms of fitted values despite their additional flexibility and complexity.

To provide some frame of reference for the MAPEs reported in Figure 6, in Table 2, we provide values for the percent of outcome variation explained by the PRR for each dataset.

Table 2
The proportion of follow-up (y) variation explained by the proportional recovery rule (PRR) for each of three datasets.

Follow-up fitted y^PRR values are obtained by adding predicted change from the PRR to observed baseline values, and we compute 1(yiy^ iPRR)2(yiy¯)2.

Dataset namePercent of outcome variation explained
Stinear and Byblow0.55
Winters0.35
Zarahn0.56
Each panel shows the distribution of median absolute prediction errors (MAPEs) obtained using cross validation for each of five models.

Panels correspond to the datasets shown in Figure 4. Models compared are an intercept-only model, a generalized additive model, and the proportional recovery rule (PRR).

The preceding results suggest that data among recoverers is consistent with the PRR for each study. The strong correlations between baseline and follow-up are not driven by canonical forms of mathematical coupling or regression to the mean. While similar values of k and cor(x,y) can arise from a variety of mechanisms, the PRR clearly outperforms a model that assumes constant recovery in the presence of a ceiling effect in terms of prediction accuracy. Competing models for recovery, specifically those allowing for non-linear association between baseline and follow-up, do not produce more accurate predictions than the PRR, and are similar in their fitted values.

Distinguishing between recoverers and non-recoverers

To this point, we have focused on recoverers under the implicit assumption that a biologically distinct group of non-recoverers exists and can be identified. This assumption is supported by prior studies of upper limb motor control and deficits in other domains (Prabhakaran et al., 2008; Byblow et al., 2015; Winters et al., 2016; Zandvliet et al., 2020), and promising work suggests that it is possible to identify non-recoverers using baseline characteristics (Byblow et al., 2015). However, the identification of non-recoverers has not been approached in the same way across studies, and there is concern that data-driven methods can produce misleading results in some circumstances (Hawe et al., 2018; Lohse et al., 2021). It is necessary to understand the limitations of some methods for data-driven subgroup identification before assessing related arguments about recovery and the PRR.

Clustering refers to a collection of methods for unsupervised learning that has several weaknesses in the context of recovery. It is unclear what clustering approach is best, and results can be sensitive to this choice. Perhaps more critically, determining the ‘true’ number of clusters present is imprecise and open to interpretation. Although tools like the Gap statistic (Tibshirani et al., 2002; James et al., 2013) can provide guidance, the choice between one, two, or more clusters is not supported by hypothesis tests, confidence intervals, or other methods for statistical inference. Practitioners are encouraged to try several numbers of clusters and cautioned to be aware that there is rarely a single best selection (James et al., 2013). Results are typically considered exploratory and assessed for validity based on visual inspection or additional supporting information. As such, clustering can provide only limited evidence for or against the existence of recoverers and non-recoverers (or finer partitions of patients) when examining a specific dataset. At worst, one might simply assume that distinct recoverer and non-recoverer clusters exist and can be separated using clustering. Failure to critically examine this assumption using visual and quantitative data checks can yield misleading results.

The weaknesses inherent to cluster identification can lead to flawed scientific conclusions. Hawe et al., 2018, and Lohse et al., 2021, show how errors can arise using ‘random’ recovery. Put briefly, random recovery assumes that follow-up values y are uniformly distributed between the baseline x and the maximum possible value. If data arise via this mechanism and it is assumed that two groups exist, the use of clustering will often uncover one cluster that appears to follow the PRR with a recovery proportion of roughly 0.75. Building on this observation, Lohse et al., 2021, propose an approach for comparing the PRR to random recovery. Data under random recovery can be generated by drawing baseline values x from the observed values with replacement (to mimic the distribution of baseline values in the sample) and then drawing follow-up values y from a uniform distribution between the baseline x and the maximum possible value. This generated data can be analyzed through clustering followed by regression to obtain an observed slope in the ‘recoverer’ cluster. By repeating this process many times, one can obtain the distribution of slopes in the ‘recoverer’ cluster under random recovery. Treating this as a null distribution, Lohse et al., 2021, argue, provides a way to quantify whether a slope obtained through the same analysis of real data is consistent with random recovery. The authors assert that ‘current data do not support the claim that recovery is proportional any more than that recovery is random’.

However, this narrow focus on the distribution of slopes that arises from random recovery bypasses a critical question – whether the results of the clustering analysis themselves are consistent with random recovery. Put differently, one must first ask whether there is evidence for the existence of distinct clusters in observed data using random recovery to generate a null distribution. Viewing data generated under random recovery alongside data obtained in a study of upper limb motor control recovery, as in Figure 7, emphasizes this difference. While clustering can misleadingly identify a ‘recoverer’ group when data actually follow random recovery, visual inspection suggests that an obviously different mechanism underlies the Winters dataset. The null distribution of within-cluster dispersion, rather than slopes, provides a way to quantify this difference; see Methods for details. Given the contrast between panels in Figure 7, it is not surprising that this approach rejects the null of random recovery for these data (p<0.001).

Both panels show change between baseline and follow-up against initial impairment (66 – baseline).

Left panel shows data generated under a ‘random recovery’ process, in which outcome values are drawn from a uniform distribution over the baseline value and ceiling. Right panel presents again data from Winters et al., 2015.

From this, we conclude that an inappropriate application of unsupervised learning (i.e. one that presumes two clusters exists) can produce misleading results. This problem is exacerbated by the lack of concrete statistical guidance for determining the number of clusters in a sample. (Similar issues can arise in other ways, for example from a definition of non-recoverers as those who recover less than expected under the PRR.) Simulations of random recovery, like those in Hawe et al., 2018, and Lohse et al., 2021, are useful for illustrating these issues but should not be misinterpreted: the ability to produce slopes like the PRR through analyses of generated data does not refute the PRR (or any other model of recovery). Nor does it argue against the existence of meaningful recoverer and non-recoverer groups. In many studies of upper limb motor control recovery, a distinct group of non-recoverers is clear from a visualization of the data (and supported through appropriate hypothesis tests) or can be identified through a distinct biomarker, and does not artifactually arise through clustering.

Finally, we distinguish the identification of a biologically distinct group comprising non-recoverers from the stratification of patients into severe and non-severe groups based on initial impairment. Zarahn et al., 2011, for example, used a subjective approach to classify patients with baseline FM ≤ 10 as severely affected. More recently, Bonkhoff et al., 2022, used Bayesian hierarchical models and formal model selection approaches to identify a threshold (also found to be FM = 10) above and below which patients exhibit different recovery patterns. Our point in this section, meanwhile, is that the severely affected population contains biologically meaningful subgroups – recoverers and non-recoverers – and that these can be identified using appropriate unsupervised learning methods.

Discussion

The PRR was discovered while searching for a possible regularity in the relationship between initial impairment, as measured by the FM-UE, and recovery in the context of upper limb paresis in the time shortly after stroke (Krakauer and Marshall, 2015). It was understood as a description of the biological change process that underlies observed recovery, and subsequently evaluated in other recovery settings. Although it was not intended to form the basis for patient-level predictions, the strong correlations between initial impairment and recovery has suggested that accurate predictions are possible. At its core, the PRR models the association between baseline and change; this is always a fraught statistical problem, and recent publications on the PRR have revived longstanding concerns in the context of stroke recovery (Hope et al., 2018; Hawe et al., 2018; Bonkhoff et al., 2020).

We have revisited the arguments for the PRR as a descriptive and predictive model, focusing on key statistical questions at each step. We identified scenarios in which observed correlations are ‘artifactual’ – induced either by mathematical coupling or by regression to the mean due to measurement error – versus those when they are real signals, emphasizing the variance ratio and tests similar to Oldham’s method. For non-artifactual signals, we used CV to compare models for recovery (e.g. PRR versus constant recovery with a ceiling); this also provides a concrete metric for the clinical usefulness of predictions made by each method. Finally, we considered the problem of distinguishing recoverers from non-recoverers, and the limitations of unsupervised learning for this problem.

Our findings in these datasets suggest that the association between initial impairment and change is non-artifactual, and the PRR is better as a biological and predictive model than several competing models, including constant recovery with a ceiling effect. These data also suggest that a biologically distinct group of non-recoverers exists. We distinguish between the usefulness of the PRR as a biological and a predictive model deliberately: biological models are important for understanding the recovery process itself, and accurate predictions are important in clinical care. Although the PRR is useful for both, it isn’t necessary that a single model address both considerations simultaneously.

We acknowledge several limitations and caveats. The statistical considerations for recovery are nuanced and often counterintuitive. Our arguments deviate from usual null-value hypothesis testing, and recognize that zero is often not the appropriate null value. This is related to the important observation that ‘large’ correlation values can arise in a variety of settings, which complicates but does not invalidate statistical approaches. While we distinguish between the usefulness of a model as either biological or predictive, our preferred method for comparing models is based only on their predictive accuracy. CV can identify models that have better or worse predictive performance, but in itself does not examine the validity of underlying model assumptions or ensure that better-performing models are accurate enough to make meaningful clinical predictions. Recoverers and non-recoverers should be identified in a way that is not artificial, and that does not serve to introduce evidence for a model of recovery that would not otherwise exist. Lastly, we recognize that not every dataset will be similar to those we presented; in those cases, we hope the tools we suggest will be used to evaluate the PRR against other models.

Indeed, a recent large-scale cohort study examining the PRR presents data that differ in notable ways from the studies we consider (Lee et al., 2021). The authors are concerned about ceiling effects in the FMA-UE, but use an analysis approach that differs from the framework presented here. Lee et al., 2021, use a logistic regression to model the probability of achieving full recovery; they observe that more than half of all patients, and more than 60% of patients with baseline FMA-UE above 46, recover to the maximum possible value. These data mimic those in our generated Dataset E200, and might benefit from the comparison between the PRR and a non-linear model in terms of predictive accuracy. Certainly, the FMA-UE lacks resolution, especially at higher values, and a refined outcome measure that ameliorates such ceiling effects is needed. There is nothing in this study that precludes the possibility that PRR would apply for this new measure. Meanwhile, we’re skeptical of broader claims made regarding the non-generalizability of the PRR. The cohort in Lee et al., 2021, is much more mildly impaired at baseline than data we have observed; this could have several causes, but the choice to exclude patients with FMA-UE hand subscore of 0 at day 7 likely removes many moderate and severely affected patients who are unlikely to recover to ceiling. Thus, in a sense the study selected for a cohort that is not only clinically non-representative, but one that makes it hard to derive the PRR by looking for changes at the high-end of the FMA-UE, a range known to fail to detect subtle residual deficits.

The PRR provides an appealingly simple model for understanding recovery, even if the statistical approaches for evaluating it are not direct. We argue that the PRR is and will continue to be relevant as a model for recovery, but we agree with others that more complex analytic strategies are necessary to move beyond it. Recovery depends on factors other than initial impairment. The PRR estimates an average recovery proportion, but individuals will recover more or less than that average; determining whether this is ‘noise’ or heterogeneity that can be predicted using other forms of measure at baseline is an important next step. Similarly, reliable techniques to identify non-recoverers at baseline are needed. Different outcome measures and different recovery settings may not be well described by the PRR. A specific focus on predicting long-term outcomes is an important goal; supervised learning methods may be helpful in this, although many of these methods have unclear interpretations.

More complex models than the PRR do not inherently resolve the statistical issues we’ve raised and may in fact make them harder to identify. For example, it’s been suggested that mixed models could account for differences in baseline values and change across subjects in a way that avoids mathematical coupling by modeling baseline and follow-up values directly (van der Vliet et al., 2020). This is not the case. Consider a model for outcome values at baseline (time = 0) and a single follow-up (time = 1) that includes both random intercepts and random slopes. When such a model is applied to Datasets A, C, and D, the random intercepts and slopes will be correlated; low random intercepts will suggest steeper random slopes in a way that mimics baseline and change scores. Put differently, to predict a follow-up value from baseline, one must use the baseline value to predict a random slope using the correlation between random effects. Indeed, for Dataset A that correlation will equal –0.71 – the same as cor(x,δ). Distinguishing between mathematical coupling and recovery that changes the variance ratio remains a problem, but now one that involves the correlation of random effects. Alternative coding for the time variable can induce a model that mimics Oldham’s method, and avoids coupling based on similar arguments (Blance et al., 2005).

Our point is not that mixed models are a wrong choice – in fact, we are enthusiastic about work that combines serial measurement, non-linear trends, random effects, and mixture models (van der Vliet et al., 2020). Taken together, these elements can overcome many of the limitations introduced by studies that include only baseline and single follow-up observations. Serial measurements and careful modeling provide insight into patients’ recovery trajectory, reduce effects of measurement error, and identify non-recoverers early in the recovery process. Nonetheless, non-linear mixed models fit to serial measurements are not a panacea, and do not avoid the challenges described in this paper. The same fundamental issues that arise from within-subject correlation and changing variances over time due to recovery and measurement ceilings should be considered when using this or any other model framework. Indeed, more complex models may serve to mask these basic issues by making them implicit rather than explicit.

Allowing for differences in datasets, implementations, and performance criteria, our results are consistent with those reported in recent papers examining the PRR as a predictive model. Bonkhoff et al., 2020, fit several models to a subset of patients selected to mitigate ceiling effects and avoid including non-recoverers, and found that that PRR was among the best performing models in terms of out-of-sample prediction accuracy. The percent of outcome variation explained was lower in their work than in our analyses (21% versus 35–56%), leading the authors to suggest that the PRR may be better than other models but insufficient for making clinical predictions in the context of precision neurology. Our view is a less pessimistic take on the same information: that the PRR outperforms competing models (including those intended to account for ceiling effects) attests to its value, and the percent of outcome variation explained is suggestive of an important biological mechanism that should be investigated and understood — even if R2 values are not as high as has been suggested. After all, if it were found that a risk factor accounted for greater than 20% of the variance in the chance of getting a disease it would immediately be investigated and prevention attempted. From the standpoint of clinical prediction, it seems likely that accurate models developed in the future will include initial impairment as a covariate, and may include a term that reflects proportional recovery.

There is growing consensus around the need for careful comparisons of different models for recovery and for analytic strategies that minimize the impact of ceiling effects. Bonkhoff et al., 2020, and Bonkhoff et al., 2022, use a statistically rigorous model selection approach based on leave-one-out cross-validated deviances, which is particularly well suited to comparing Bayesian hierarchical models. These papers also consider only patients with initial impairments below 45 to avoid ceiling effects induced by mildly affected stroke patients. We use cross-validated MAPE because it explicitly focuses on prediction accuracy, and the results can therefore be interpreted in the context of single-subject predictions. We also hesitate to discard a substantial fraction of our data and miss the opportunity to model recovery for mildly affected patients at baseline, although we recognize that the narrower outcome distribution for patients at or near ceiling can affect measures of model performance. These analytic preferences reflect different approaches to questions and challenges whose importance is increasingly agreed upon.

In light of this, we stress again that the PRR is intended as a model for recovery from a specific form of impairment as measured by the FM, and is best understood as an attempt to mathematically capture a component of the recovery process. The emphasis on change between baseline and follow-up is deliberate: biological recovery is a process that causes a change, which then leads to the final value. Understanding whether recovery varies across patients and what mechanism might drive such variability is a fundamental scientific question. That the PRR also has some value for predicting final outcomes is to be welcomed but not necessary for its biological importance.

Applications of the PRR in studies of upper limb motor control recovery have often found that, averaging across recoverers, roughly 70% of lost function is regained in the time shortly after stroke. We’ve argued that this is attributable to spontaneous biological recovery (Cramer, 2008; Zeiler and Krakauer, 2013; Cassidy and Cramer, 2017). The existence of such a mechanism does not imply that behavioral interventions are unable to improve patient outcomes. Instead, future clinical trials should seek to improve on the proportion of impairment reduction, reduce the proportion of non-recoverers, or induce changes that are distinct from (and better than) those expected under the PRR.

Conclusions

Consideration of associations between baseline, follow-up, and change continues to be of interest across scientific domains despite the statistical challenges, and for good reason: these can be and often are related in ways that are not due to artifacts. Our goal here was to clarify statistical reasoning concerning the problem of relating baseline to change in the context of stroke recovery. It is for the reader to decide whether the work presented here is either a bumpy statistical detour or an interesting scenic route. In either case, rigor about this issue is essential for continued biological and clinical progress, and it would be unfortunate if the recent spate of papers leads to dismissal of the PRR out of either confusion or exhaustion.

As a model that relates baseline values to change, the PRR requires the application of careful, and sometimes counterintuitive, statistical techniques. We have described well-established but non-standard approaches to distinguish between artifacts due to coupling from signals that arise from true associations, introduced tools to compare competing models for recovery based on their predictive accuracy, and elaborated on issues that can arise when using some data-driven methods to distinguish recoverers and non-recoverers. In our analyses of real data, we found evidence for signals not attributable to coupling, and obtained better predictions using the PRR than a model that assumes constant recovery (with some noise) up to a ceiling. This suggests that the PRR is non-artifactual and remains relevant, at the very least, as a model for recovery. Future work should seek to both explain the mechanistic basis for PRR and improve upon it as a predictive model.

Methods

Reported datasets

Details of inclusion and exclusion criteria are available in the referenced literature. All patients received usual care according to evidence-based stroke guidelines for physical therapists, but no systematic interventions.

  • Stinear and Byblow: combined data from two studies. First, Byblow et al., 2015, assessed an a priori prediction of proportional recovery based on corticospinal tract integrity using transcranial magnetic stimulation to determined motor evoked potential status (MEP+, MEP-) for 93 patients within 1 week of first-ever ischemic stroke stroke. FMA-UE were obtained at 2, 6, 12, and 26 weeks. Second, Stinear et al., 2017, added data from recurrent ischemic stroke and intracerebral hemorrhage patients with new upper limb weakness to form a larger dataset of 157 patients, all with known MEP status. Following this work, we define recoverers and non-recoverers as MEP+ and MEP-, respectively.

  • Winters: first-ever ischemic stroke patients were recruited for the prospective cohort study entitled Early Prediction of functional Outcome after Stroke (EPOS) (Nijland et al., 2010; Veerbeek et al., 2011). Data comprise baseline (day 2 post-stroke) and follow-up (day 187 post-stroke) FMA-UE observations for 223 patients; 211 were originally reported in Winters et al., 2015, with recoverers and non-recoverers identified using a hierarchical clustering based on average pairwise Mahalanobis distances.

  • Zarahn: data consist of patients with first time ischemic stroke with some degree of clinical hemiparesis (NIH stroke scale for the arm ≥ 1). FMA-UE was assessed both at ~2 days post-stroke and ~3 months post-stroke and were originally reported in Zarahn et al., 2011. We focus on the 30 patients in the imaged subsample with publicly available FMA-UE values. Recoverers and non-recoverers are determined using baseline FMA-UE >10.

Ethical approvals were obtained for each dataset; see referenced literature for details. The Winters and Zarahn datasets are included in supplementary materials. The Stinear and Byblow data were collected under ethical approvals that do not permit placing data online. Data can be made available under reasonable request to the author.

Resampling approach for inference on cor(x,y) and k

We suggest a bootstrap procedure to obtain confidence intervals for cor(x,y) and k. A single bootstrap sample can be constructed by selecting subjects (pairs of both x and y values) from a full dataset with replacement. Next, we compute the values k and cor(x,y) for the bootstrap sample. This is repeated a large number of times (e.g. 1000) to produce an empirical distribution for the quantities of interest, which can be used to derive corresponding confidence intervals. Simulations evaluating this approach suggest that coverage of 95% CIs is roughly 0.95 for moderate sample sizes.

Our suggested null values for k and cor(x,y) are 1 and 0, respectively, but other values can be used. For null hypotheses framed in terms of data generating mechanisms rather than parameter values, such as the ‘random’ recovery null hypothesis (Lohse et al., 2021), it can be difficult to derive the corresponding null values. In these cases, we suggest to obtain empirical values by generating a large dataset under the null and computing the k and cor(x,y) directly.

CV to compare models

Data are divided into training and testing sets; training data are used to fit models, and these models are applied to data in the testing set to obtain predicted outcomes. The difference between predicted and observed outcomes in the testing data reflects each model’s ability to make accurate predictions of data not used in model development. Comparing models in terms of their MAPE is an established technique for choosing among candidate models, and the MAPE provides a measure of the anticipated prediction error for a given model.

CV has several possible implementations; here, random training and testing splits are comprised of 80% and 20% of the data, respectively, and the training data is used to fit each model. Given model fits, predictions are made for the testing dataset and compared to actual outcomes, and the difference is summarized using the MAPE. This process is then repeated 1000 times, so that the distribution of MAPE across training and testing splits is obtained.

Defining a null distribution for within-cluster dispersion

Although the Gap statistic does not allow for inference to determine the number of clusters within a sample, it provides a useful metric for comparing observed data to a null hypothesis of random recovery. Intuitively, for a given number of clusters k, the Gap statistic Gap(k) compares the observed degree of within-cluster similarity to the expected degree of within-cluster similarity under a reference distribution. Suppose a clustering analysis has produced clusters C1, C2, , Ck . Assuming the Euclidean distance is used to measure distance between observations in the same cluster, we measure overall within-cluster dispersion using the pooled within cluster sum of squares around the cluster mean:

Wk=r=1kzi Crzi- μr2.

Here, μr is the within-cluster mean and zi is the vector of observed values for subject 1in (e.g. zi=(xi, yi)).

The value of Wk decreases as k increases, regardless of whether additional true clusters are identified. The Gap statistic therefore standardizes log(Wk) to provide guidance on the choice of k within a dataset. To standardize log(Wk), one compares the observed value to its expectation under a reference distribution:

Gapn(k)=En{log(Wk)}log(Wk).

The expectation En*logWk is approximated through the analysis of multiple datasets generated under the reference distribution. For each dataset one computes log(Wk), and the expected value is taken as the average across these. The reference distribution in the calculation of the Gap statistic is often uniform over the range of each feature in the clustering analysis. In the context of recovery, this suggests a square over all possible values of x and y rather than the triangular region defined by random recovery, but the spirit is similar.

We use the Gap statistic to measure the strength within-cluster dispersion against that expected under a reference distribution. To compare observed data to a null distribution, we suggest a resampling-based approach to construct a null distribution for Gapn(k). For random recovery, to construct a single resampled dataset, we suggest to sample observed x values with replacement; generate corresponding y values under random recovery; and obtain Gapn(k). This process is repeated many times, and the observed value Gapn(k) is compared to the distribution of Gapn(k). This process is very similar to one suggested by Lohse et al., 2021, with the exception that we focus on evidence for clustering through Gapn(k) rather that the distribution of slope values obtained in analyses of resulting clusters.

The Gap statistic can be computed for any clustering method. By extension, our method for comparing the observed amount of within-cluster dispersion to that expected under a null distribution can as well. In our analyses we follow recent literature, and use hierarchical clustering based on Mahalanobis distances between points. Simulations suggest that our testing approach achieves nominal size: when data are in fact generated under random recovery, we reject the null hypothesis 5% of the time under α=0.05.

Implementation and reproducibility

All analyses were implemented in R; supplementary materials contain source code and RMarkdown documents to reproduce all figures and analyses.

Appendix 1

Table of null values

In the table below, we provide values of cor(x,δ) for a range of input values cor(x,y), keeping k fixed at 1. These are a result of Equation 1 in the manuscript, and are intended to provide context for correlations between baseline and change that arise through coupling rather than as a result of a recovery process that affects the variance ratio.

kcor(x,y)cor(x,δ)
10–0.707
10.1–0.671
10.2–0.632
10.3–0.592
10.4–0.548
10.5–0.5
10.6–0.447
10.7–0.387
10.8–0.316
10.9–0.224

Correlations, variance ratios, and regression

A simple linear regression of δ on ii=max-x, where max is the maximum possible value of the scale and max-x is initial impairment, can be written

δ=β0+β1ii+ϵ.

The following expressions relate regression parameters and diagnostics to corx,δ,cor(x,y) and k:

  • β^1=1cor(x,y)k

  • R2=cor(x,δ)2=(β^1k2β^11)2

The next subsection contains derivations of these expressions. From these, we conclude first that cor(x,y)=0 implies β^1=1. This suggests that usual hypothesis tests of the slope which assume a null value of 0 should instead assume a null value of 1. Moreover, this test measures the association between baseline and follow-up, and slopes that differ significantly from one suggest that baselines can be used to predict follow-up values. Second, we see that the amount of variation explained depends on both cor(x,y), through the estimated slope, and on the variance ratio; for example, in the canonical example of mathematical coupling, R2=0.5. Resampling-based tests for the slope and R2 can be performed analogously to those for k and cor(x,y) using these (or other) null values.

Like cor(x,δ), R2 should be interpreted with caution: the preceding expression provides a way to determine the expected or null R2 for a given slope and k=1, which can be used as a frame of reference for observed R2 values. R2 values that depart from the null value may suggest k1 and, by extension, changes that are related to baseline in a way that systematically reduces the variance ratio. As in the previous section, based on this analysis it will remain unclear whether a statistically significant difference is attributable to proportional recovery, ceiling effects, or some other process; distinguishing between models is the subject of later sections.

The above expression for R2 is in the context of a regression of change on initial impairment, which will differ from the R2 arising from a regression of the follow-up value on the baseline observation. Because these R2 values are the square of cor(x,δ) and of cor(x,y), respectively, much of our previous discussion applies here. The initial value x may explain a higher proportion of variation in δ than in y; this setting is not necessarily artifactual, and high R2 in the regression of change on baseline can be important even when R2 in the regression of follow-up on baseline is low, just as high cor(x,δ) can be important even when cor(x,y) is low.

The inclusion of the intercept β0 in the simple linear regression differs from the usual formulation of the PRR and helps to establish clear connections between a correlation-based and a regression-based perspective. This is helpful for evaluating the results of past studies, especially results expressed in terms of percent variation explained, and refines the use of both correlations and regressions as evidence for the PRR.

Similar to correlations and variance ratios, results from a regression-based analysis can, when understood correctly, help assess evidence for the PRR in a given dataset: establishing statistical significance using appropriate tests, along with evaluation of regression diagnostics, will suggest whether data are consistent with the PRR. However, all these approaches measure only linear associations, and neither compare the PRR to alternative models for recovery nor evaluate the ability of the PRR to make accurate predictions about patient outcomes.

Derivation of expressions for β^1 and R2

In the preceding section, we discuss a simple linear regression of δ on ii=max-x, where max is the maximum possible value of the scale and max-x is initial impairment. This regression can be written

δ=β0+β1ii+ϵ.

The inclusion of an intercept in the simple linear regression differs from the usual formulation of the PRR, but we find model helpful because it connects correlations to regression parameters and summaries. Specifically, the following expressions relate regression parameters and diagnostics to corx,y and k:

  • β^1=1cor(x,y)k

  • R2=cor(x,δ)2=(β^1k2β^11)2

In a simple linear regression, the OLS estimate of the intercept is the ratio of the covariance of the predictor and response and the variance of the predictor. In this specific regression, we have:

β^1=cov(δ,ii)var(ii)=cov(yx,maxx)var(maxx)=cov(y,x)+cov(x,x)var(x)=var(x)var(x)cov(y,x)var(x)=1 cor(x,y)σxσyvar(x)=1cor(y,x)σyσx=1 cor(y,x)k

Next, we note that in a simple linear regression, R2 is the squared correlation between the outcome and response. Then:

R2=cor(ii,δ)2=cor(maxx,δ)2=cor(x,δ)2

Lastly, for linear regressions that include an intercept, R2 is the squared correlation between the outcome and fitted values obtained from the model. Starting from this, we find:

R2=cor(β^0+β^1ii,δ)2=(cov(β^0+β^1ii,δ)var(β^0+β^1ii)var(δ))2=(cov(β^0+β^1(maxx),yx)var(β^0+β^1(maxx))var(yx))2=(cov(β^1x,yx)var(β^1x)var(yx))2=(cov(β^1x,yx)var(β^1x)var(yx))2=(β^1cov(x,yx)β^1var(x)var(yx))2=(var(x)cov(x,y)var(x)[var(x)+var(y)2cov(x,y)])2=(var(x)cor(x,y)var(x)var(y)var(x)[var(x)+var(y)2cor(x,y)var(x)var(y)])2=(var(x)cor(x,y)var(x)kvar(x)[var(x)+var(x)k2cor(x,y)var(x)k])2=(var(x)(1cor(x,y)k)var(x)1+k2cor(x,y)k)2=(1cor(x,y)kk+22cor(x,y)k1)2=(β^1k2β^11)2

Specification of the GAM

We use a GAM to allow smooth, non-linear associations between baseline x and follow-up y. This is intended to provide a flexible candidate model for comparison with other mechanistic model implementations, including the PRR and constant recovery in the presence of strong ceiling effects.

Our implementation uses the gam function in the mgcv R package (Wood, 2012). A thorough overview of the theoretical and practical underpinnings of this widely used package can be found in related monograph (Wood, 2017). Briefly, the default gam approach estimates non-linear associations using a rich thin-plate spline expansion with an explicit penalization to enforce smoothness in the result; for intuition, a high degree of penalization results in linear fits, while less penalization allows for greater non-linearity. The smoothing parameter(s) are selected using generalized CV as part of the model-fitting procedure.

Code supplements contain all model fitting procedures; for clarity, we note that our implementations are of the form: mgcv::gam (y ~ s(x), data = winters_df).

Fitted values across training/testing splits

We use CV to compare the performance of five models in terms of predictive accuracy. We considered an intercept-only model, a model assuming constant recovery with a ceiling effect, an additive model, an exponential model, and the PRR. For each dataset, we generated 1000 training/testing splits; results for prediction accuracy using MAPEs are shown in Figure 6 in the main paper. The Appendix 1—figure 1 shows fitted values for each model (except the intercept-only model) applied to training datasets generated in the CV procedure. The constant recovery model is visually a poor fit, and occasionally the additive model is too flexible, especially for the Zarahn data. The additive and exponential models yield fitted values that are similar to the PRR despite their additional flexibility and complexity.

Appendix 1—figure 1
Fitted values for each of four models, obtained within separate training / testing splits as part of the cross validation procedure.

Data availability

In this work, we reanalyze previously reported datasets. Two of these (referred to as "Winters" and "Zarahn") are included with the submission as a data file. A third (referred to as "Stinear & Byblow") contain data that were collected under ethical approvals that do not permit placing data online. Data can be made available under reasonable request to the author. Researchers interested specifically in the data referred to as "Stinear & Byblow" should contact Cathy Stinear <c.stinear@auckland.ac.nz> and / or Winston Byblow <w.byblow@auckland.ac.nz>, who serve as custodians of the data on behalf of study participants. Please include a description of the planned analyses and a justification for the use of these data specifically. Commercial research using these data is not permitted. All code used in the analyses of all datasets is provided as part of the submission, and all data that can be made publicly available (i.e. "Winters" and "Zarahn") are included with the submission.

References

  1. Software
    1. Lee H
    2. Kim D
    3. Sohn M
    4. Shin Y
    5. Oh G
    6. Lee Y
    7. Joo M
    (2021)
    Revisiting the proportional recovery model in view of the ceiling effect of fugl-meyer assessment
    STROKEAHA-120.
    1. Lohse K
    2. Rachel H
    3. Sean D
    4. Stephen S
    (2021)
    Statistical limitations on drawing inferences about proportional recovery
    Neurorehabilitation and Neural Repair 35:10–22.
  2. Book
    1. Wood S
    (2012)
    Mgcv: Mixed GAM Computation Vehicle with GCV/AIC/REML Smoothness Estimation
    University of Bath.

Decision letter

  1. Taraz Lee
    Reviewing Editor; University of Michigan, United States
  2. Michael J Frank
    Senior Editor; Brown University, United States
  3. Howard Bowman
    Reviewer; University of Kent, United Kingdom
  4. Yu-Kang Tu
    Reviewer; Institute of Epidemiology and Preventive Medicine, College of Public Health, National Taiwan University, Taiwan
  5. Thomas Hope
    Reviewer

In the interests of transparency, eLife publishes the most substantive revision requests and the accompanying author responses.

Decision letter after peer review:

[Editors’ note: the authors submitted for reconsideration following the decision after peer review. What follows is the decision letter after the first round of review.]

Thank you for submitting the paper "The proportional recovery rule redux: Arguments for its biological and predictive relevance" for consideration by eLife. Your article has been reviewed by 2 peer reviewers, and the evaluation has been overseen by a Reviewing Editor and a Senior Editor. The following individuals involved in review of your submission have agreed to reveal their identity: Howard Bowman (Reviewer #3).

Comments to the Authors:

We are sorry to say that, after consultation with the reviewers, we have decided that this work will not be considered further for publication by eLife.

The reviewers had significant concerns about both the clarity of the manuscript and the methods employed. They felt that there remained confounds in the analyses that cannot be sufficiently ruled out. Ultimately, there was not sufficient enthusiasm that the manuscript would have broad appeal to the readership of eLife.

Reviewer #2 (Recommendations for the authors):

This paper addresses the inherent potential of motor recovery after stroke. The authors would like to validate the concept of proportional recovery which assumes that stroke patient typically show a fixed amount of recovery of about 70% irrespective of where they start. The authors suggest that after controlling for possibly confounding variables like ceiling effects or mathematical coupling, they still could validate the proportional recovery rule. The Results section is difficult to read for a non-statistician. However, the findings reported in the paper are rather incremental than novel. Most importantly, there are no data that explain the neurobiological basis of proportional recovery. Hence, the novelty of this submission remains low apart from more sophisticated statistical models which ultimately come to a similar conclusion than reported years before.

As stated above, the Results section is very difficult to read and appears somewhat lengthy. Lesion analyses or any other data about the possible neurobiological foundations of the proportional recovery are missing. I acknowledge the relatively large sample size, but the Fugl-Meyer score based view without any further para-clinical evidence is somewhat meager. Therefore, I do not see that this paper has a major impact on the field, as in the best case it confirms the proportional recovery rule which has been published many years before.

Reviewer #3 (Recommendations for the authors):

This paper presents a defence of the proportional recovery theory of how stroke patients with upper limb motor impairments recover. This is a theory that has come under attack due to confounds associated with the analyses used to assess the theory. The paper acknowledges these confounds, which include compression (to ceiling) enhanced mathematical coupling and over-fitting in the differentiation of recovers and non-recovers. The paper then presents a number of methods: use of non-zero nulls; focus on the variance-ratio; bootstrapping to assess variability in parameter and model fits; comparison of model fits; etc. These certainly improve upon the methods classically used in the literature to justify the proportional recovery hypothesis. Additionally, we appreciate the constructive tone of this paper and its effort to arrive at a consensus position on the proportional recovery issue. Indeed, the overall conclusion that there is a proportional (to lost) recovery pattern amongst the moderately impaired group is a position that we have sympathy with, although that pattern is much weaker than some previous claims in the literature.

However, there seem to be remaining problems with the analyses employed, although, it is difficult to definitively judge the methods employed without more detail than the paper currently offers.

The two most important issues that seem to be weaknesses are as follows.

1) Remaining confounds with key measures: Goldsmith et al. argue that particular combinations of correlation between baseline and follow-up, i.e. cor(x,y); correlation between baseline and change, i.e. cor(x,δ); and the variance-ratio, i.e. k, is diagnostic of proportional recovery (PPR). In particular, they argue that the combination of cor(x,y), cor(x,δ) and k shown in table 1 and figure 5 justify their claim that the three datasets they consider, Stinear and Byblow, Winters and Zarahn, definitively exhibit a PPR pattern. However, the simulations presented in Bonkhoff et al. (2020), show that this same combination of measures can be exhibited by a constant recovery model in the presence of ceiling. Indeed, this is the fundamental confound described by Hope et al., Bonkhoff et al. and Bowman et al., which led to the conclusion that the only way the ceiling confound could be avoided is by throwing data points at ceiling away. (This is actually what I believe you should do with the data in this paper. Note, in Bonkhoff et al. (2020), we verified with model recovery simulations that the throwing away procedure enables the correct model to be recovered.) To make this completely clear, I have prepared a figure, which I am assured will be attached with this review.

https://submit.elifesciences.org/eLife_files/2021/09/17/00098237/00/98237_0_attach_15_32737_convrt.pdf

This shows table 1 and figure 5 from Goldsmith et al. on the left and figure 6 from Bonkhoff et al. on the right. I have added annotations to figure 6, which show where the Stinear and Byblow and Winters data sets would sit in the panels for constant recovery. The Zarahn data set does not so obviously correspond to a vertical line, but fits with the general combination of measurements.

2) Limitations of model fits: certainly, the effort to fit models to the three data sets is welcome. However, there is a concern that this model comparison is not a completely fair test of competing theories to proportional recovery. The model fitting comparison in figure 6 is interesting to see. Although, I do find it difficult to interpret the findings, for the following reasons: (a) a criterion has not been given to judge when one model is, in a statistical sense, doing better than another model. I would have expected an assessment of where the MAPE for the best model sits in the bootstrap distribution of MAPEs for each of the other models. This would enable an inference of the kind that the MAPE of the best model is beyond the 95% confidence level of the second best model, and similar for the third. Something of this kind was done in Bonkhoff et al. (2020). (b) I suspect the reason that the Generalised Additive Model does not win is because it is too flexible. It would have been very useful to have seen (perhaps in an appendix) the exact functional form of the Generalised Additive Model. So, I cannot be sure, but I suspect it was not set up to be specifically for saturation at ceiling patterns, with a positive slope leading to a saturation on the right. Consequently, I suspect that the model overfits for many of the bootstrap samples. van der Vliet et al. (2020) give a strong precedent for using a simple exponential, which I suspect would fit the data much better. (c) when I look at figure 4, left panels, lower row, I feel that I do see a strong ceiling pattern in the data. However, this would be much easier to assess if simple x against y (baseline against followup) scatter plots were also presented and model fits for each model were depicted on the scatter plot. This could for example go into an appendix.

van der Vliet, R., Selles, R. W., Andrinopoulou, E. R., Nijland, R., Ribbers, G. M., Frens, M. A., … and Kwakkel, G. (2020). Predicting upper limb motor impairment recovery after stroke: a mixture model. Annals of neurology, 87(3), 383-393.

I have the following more specific comments for the authors.

Abstract: the following statement is made," We describe approaches that can assess associations between baseline and changes from baseline while avoiding artifacts either due to mathematical coupling or regression to the mean due to measurement error". For reasons justified above, and below, I am doubtful that this statement and similar statements made elsewhere in this paper are appropriate.

Introduction section, paragraph beginning "The growing meta-literature discusses …", statement: "the nuanced and sometimes counterintuitive statistical arguments are critical to get right for the sake of furthering our understanding of the biological mechanisms of recovery." I could not agree more than I do with this statement.

Results section, paragraph beginning "A broader argument relates to settings..", with regard to the reference to Bonkhoff et al. (2020) in this paragraph, the reference to "degenerate" is actually most explicitly made in Bowman et al. (2021).

Results section, subsection "Distinguishing true and artifactual signals", paragraph beginning "The value of cor(x, δ) depends on the variance ratio k and..", statement: "The variance ratio can be used as a measure of the extent to which recovery depends on baseline values, regardless of the value of cor(x, y)." As indicated above, if this statement is suggesting that the variance-ratio can be used to unambiguously identify a proportional recovery pattern, we do not agree. Indeed, the variance-ratio could be small even when there is no recovery at all; that is, if the mean at follow-up is the same, or indeed below, the mean at baseline.

Results section, subsection "Distinguishing true and artifactual signals", paragraph beginning "All datasets have x values generated from a Normal distribution…", how is ceiling handled in these simulations – I would have expected that sometimes a followup score there would be, "by chance", above 66. Can you add a explanation of what happens to these?

Results section, subsection "Distinguishing true and artifactual signals", a minor frustration is that the contour plot on the right of figure 1 is the opposite way around to the surface plot in Hope et al., with a baseline by follow-up correlation of one furthest to the right. At the least, could this discrepancy be mentioned in the caption to help the reader.

Results section, subsection "Distinguishing true and artifactual signals", para beginning "Dataset A is a canonical example of mathematical coupling", I am unsure about quite a lot of the points made in this paragraph. It seems to me that a lot of what is said in this paragraph does not sit well with point 1 above.

Results section, subsection "Distinguishing true and artifactual signals", para beginning "Dataset D represents the least controversial..": I would agree that dataset D is a clear example of proportional recovery. However, our central point in previous publications is that on the basis of the methods typically used, including in this paper, that judgment has to be made informally on the basis of visual inspection. This is because one can obtain exactly the same values of statistical measures (r(X,Y), r(x,δ), k), in a dataset without PRR, but which contains a strong ceiling effect, and indeed, datasets do typically exhibit strong ceiling effects; that is, from what I have seen, empirically collected datasets do not typically look like dataset D, because of the ceiling effect.

Results section, subsection "Distinguishing true and artifactual signals", para beginning "Taken together, the..", text: "We also identify a setting, typified by Dataset D, in which each measure suggests the presence of a relevant association. It is not the case that data like these necessarily imply that recovery follows the PRR. Other biological models could produce data similar to Dataset D, and how to compare competing models will be considered in later sections." Just to say, I was a little confused here. What you are saying here seems to be inconsistent with what you said in the previous paragraph in re. what can be deduced from dataset D.

Results section, subsection "Recasting Oldham's method", paragraph beginning "Instead of cor(x + y, xy), we prefer to focus…": again, some of what is said in this paragraph seems to ignore the confounds that arise from ceiling effects.

Results section, subsection "Correlations, variance ratios, and regression", paragraph beginning "The following expressions relate..", Would it be possible to see a derivation of these two equations? These are not completely standard since \δ = y-x and ii = max-x. This could go in an appendix.

Results section, subsection "Comparing distinct biological models for recovery", paragraph beginning "To illustrate how prediction accuracy…", last sentence: can you give more details of how ceiling is enforced?

Results section, subsection "Comparing distinct biological models for recovery", paragraph beginning "We consider three models for the association …", you say that "Second, we implement the PRR (without intercept) to estimate δ given x, with y taken to be x + δ.", Don't you need to tell us the slope in this eqn, in order to know it is proportional to lost? This is required to rule out proportional to spared or constant recovery. Also, you refer to using a "generalized additive model". Could you give more details of the functional form of this model, perhaps in an appendix? At the least, could you include a relevant reference?

Results section, subsection "Results for reported datasets", paragraph beginning "We next conducted the bootstrap analysis on the subsample…": again, the confound created by ceiling effects impacts this paragraph.

Results section, subsection "Results for reported datasets", paragraph beginning "We compared the performance of three models …": this is where my second point above applies.

Results section, subsection "Results for reported datasets", table 2 caption. Sorry for being dumb, but could you give more explanation of how the R-squared is being calculated here. In particular, could you confirm whether the R-squared equation given in the "Correlations, variance ratios, and regression" subsection is the one being applied here. If it is not this one, can you give the equation that is being used. This is a key issue, since the R-squared values here are much smaller than those given in relevant papers published for some of these data sets.

Results section, subsection "Results for reported datasets", paragraph beginning "The preceding results for …": as previously discussed, the central finding of the work of Hope et al. and Bonkhoff et al. is that, in the presence of ceiling, recovery patterns completely different to the PRR, for example proportional to spared function, can look like PRR, when the measures considered in this paper, cor(\δ, x), cor(x,y), var ratio, are taken. How has the work presented here countered that criticism? I cannot see that it has.

Discussion section, paragraph beginning "Indeed, a recent large-scale cohort study …", it is stated that, "Where Hope et al. (2018) and Bowman et al. (2021) note ceiling effects can compress scores in a way that induces a variance reduction, Lee et al. (2021) observe more than half of all patients, and more than 60% of patients with baseline FMA-UE above 46, recover to the maximum possible value." I do not understand what is being said here. Lee et al. is put in opposition to Hope et al. and Bowman et al., but isn't the Lee et al. ceiling effect going to exactly lead to a reduction in variance at followup?

Discussion section, paragraph beginning "More complex models than the PRR do.." What is being said in this paragraph about Bowman et al. (2020) is not completely clear to me and this may well be a presentational failure on our part when writing Bowman et al. The only place in Bowman et al. (2020) where we refer to mixed models is the following statement, "In particular, van der Vliet et al. present an impressive Bayesian mixture modeling of Fugl-Meyer upper extremity measurements following stroke. Importantly, the authors avoid the confounded correlation of initial scores with change by simply fitting their models to behavioral time-series, that is, raw Fugl-Meyer upper extremity scores, without involving the recovery measure."

However, in the next paragraph, you commend van der Vliet et al. for the same piece of work that we are referring to. Indeed, we were under the impression that the above quoted statement in Bowman et al. (2020), was only a reiteration of what van der Vliet et al. say themselves, which is as follows, "Our current longitudinal mixture model of FM-UE recovery, as opposed to the proportional recovery model, cannot be confounded by mathematical coupling. Hope et al. showed that the correlations between baseline FMUE score (distribution X) and the amount of recovery defined as endpoint FM-UE minus baseline FM-UE (distribution Y-X) found in proportional recovery research could be inflated by mathematical coupling. However, because mathematical coupling applies to correlations of data points (baseline and endpoint FM-UE) and not to models of longitudinal data, the recovery coefficients in our research represent nonconfounded measures of recovery as a proportion of potential recovery. In addition, mathematical coupling does not apply to the outcomes of the cross-validation, as we report correlations between the model predictions and the observed values for endpoint FM-UE and ΔFM-UE rather than correlations of the form X and Y-X."

This leaves us confused, as to what about Bowman et al. (2020) is being criticised in your paragraph beginning "More complex models ….".

[Editors’ note: further revisions were suggested prior to acceptance, as described below.]

Thank you for resubmitting your work entitled "The proportional recovery rule redux: Arguments for its biological and predictive relevance" for further consideration by eLife. Your revised article has been evaluated by Michael Frank (Senior Editor) and a Reviewing Editor.

The manuscript has been improved but there are some remaining issues that need to be addressed for further consideration, as outlined below:

Essential revisions:

1) Please expand your discussion of the similarities between the current work and previous critiques of the PRR. This should include a brief comment on the justification of some of the analysis choices (e.g. inclusion/exclusion of patients at ceiling) and the relevance of the findings presented here to other areas of research.

2) Address the statistical concerns raised by Reviewer #2

3) Based on some of the reviewers' comments, you may decide to move some of the analyses to a supplement. This is at your discretion.

Reviewer #1 (Recommendations for the authors):

The authors have completed an extensive reworking of their paper, which does a good job of responding to the majority of my comments. Perhaps most importantly, the authors have convincingly countered the argument that I made in my first review that their findings could have arisen from a Constant model with a ceiling. It is excellent to see this rebuttal of my concern. So, I am basically happy with this revision.

It is again good to see the emphasis in this paper on model comparison – that certainly seems to be the right way to go. I also note that the findings in this paper are somewhat consistent with what we have reported in Bonkhoff et al., 2020 – a PRR effect, although considerably weaker than had previously been claimed. So, as the authors point out, there does seem to be convergence in the field, which is excellent.

It remains a pity that the claim that PRR is definitively better on the basis of the presented model fits is still a weak claim. In particular, violin plots for GAM, Exp and PRR are heavily overlapping in figure 6. As you surely know, what is needed is a more sophisticated model comparison, most likely of a Bayesian variety, using something like the model evidence, free energy or deviance information criterion.

The key inference suggested in the paper hinges on an Occams razor argument that if one cannot distinguish between GAM, Exp and PRR, one should pick PRR, since it is more simple. [Although note, I do not believe that Exp is a supermodel of PRR, i.e. I don't think one can get the PRR pattern from Exp, which slightly complicates this line of argument.]

I think that in the absence of a more sophisticated model comparison, of the kind I just highlighted, the Occams razor argument probably is the best one can do. However, since the cross validation is out-of-sample, the flexibility of the model is actually taken into account there.

Obviously, it would be great if this issue could be resolved with a more sophisticated model comparison in a further re-write, however, I don't think this is fair to require at this stage of the review process, but it would be good to acknowledge this weakness in the discussion, and note that competitor approaches are doing better in this respect.

Specific points:

I did not pick this up in the first review, but I spent a lot of time on this review being confused by the right panel of figure 1, which re-represents the surface plot from Hope et al. (2018). I'm now pretty sure your colour bar needs to be inverted. As it is, the yellow contour corresponds to the degenerate region in Hope et al's surface plot, i.e. it does not matter what the value of cor(x,y) is, you always get a correlation of x with change near to -1. However, shouldn't that be for low values of the variability ratio (i.e. log(k) substantially smaller than zero)? Your yellow contour corresponds to log(k) much *bigger* than zero. This issue is also present in Figures 4 and 5.

Top of page 8: "which when baseline (x) and follow-up (y) are uncorrelated and have the same variance." Should the "when" be deleted here?

Page 9, towards bottom: "relationship between of".

Page 10, 2nd para: "data represent in which baseline values".

Page 17, beginning of last para: you talk about there being five models, but in the next sentence you only list 4.

Page 26: you reference Lohse et al. 2021, but I couldn't find this in the bibliography.

Page 38 top of page: there are a couple of typos here. Also, you again talk about there being five models, but in the next sentence you only list 4.

Reviewer #2 (Recommendations for the authors):

1. I think this paper could be made more succinct. It is quite long because the authors used many examples to demonstrate their arguments. Although I feel this approach is well-suited to clinical audience, I am not sure whether the targeted readers may get lost in the technical discussion. For example, I am not sure using bootstrapping is necessary, as we can either directly test the spurious correlation against a more appropriate null hypothesis or using simulation (see the paper in Eur J Oral Sci 2005; 113: 279-288).

2. There are two major issues involving in assessing the relation between change and the baseline. The first is the hypothesis testing and the other is the prediction. Because of mathematical coupling, the usual null hypothesis that the correlation coefficient is zero is inappropriate. This is because the distribution of correlation coefficient is in a restricted space defined by the correlation between x and y, as shown in Figure 1. By the way, I think the paper by Bartko, J. J. and Pettigrew, K. D. (The Teacher's Corner: A Note on the Correlation of Parts with Wholes. Am Stat 22, 41-41, 1968) should be cited, as they are the first to produce such a figure for the correlation between change and the baseline. Another paper of interest is the one by Sutherland, T. M. (The correlation between feed efficiency and rate of gain, a ratio and its denominator. Biometrics 21, 739-749, 1965). Oldham's method is to address this issue of wrong null hypothesis by testing the difference in the variances of X and Y. Other methods are also available as discussed in my previous paper (Tu and Gilthorpe 2007) cited by the authors. As discussed by the authors, Oldham's method or any others based on the same idea has its limitation, if data may be truncated. A possible solution is to design a more sensitive tool to measure the outcome.

3. For the prediction, I think we need to be cautious to use R2 for comparing the performance of different studies. Because R2 is the square of r, i.e. the correlation coefficient. Consequently, the range of R2 is also restricted by the same conditions as r is. Therefore, if different studies have different correlations between X and Y, I do not think their R2 are directly comparable.

4. Moreover, a high R2 does not necessarily mean that patients' recovery can be precisely predicted. Suppose a zero correlation between X and Y, the R2 for using X to predict Y – X will be 0.71^2 = 0.5. Although half of the change's variance can be predicted by the model (i.e. by the baseline value X), this model is useless. This is because R2=0.5 is what we would expect from two sets of random numbers. The baseline and the follow-up values behave like two random variables with zero correlation, this means that X has no use for predicting Y.

5. Regarding to subgroups of patients with different responses, the authors are correct in giving warnings on identifying clusters of patients by using unsupervised methods. Due to regression to the mean, patients with greater baseline diseases are more likely to be identified as good responders, while those with milder diseases are more likely to be identified as poor responders. In fact, the response is actually a continuous spectrum.

6. Finally, I feel a little uneasy that equating the relation between change and the baseline values to the proportional recovery rule. I feel the latter is more akin to the percentage change. Please see the paper: Testing the relation between percentage change and baseline value. Sci Rep 6, 23247 (2016).

I think the authors made great efforts to clarify various misconceptions about testing the relation between change and the baseline. To get your messages across the intended readers, I suggest making your discussion more focused. For example, I do not feel that introducing bootstrapping or GAM is necessary. In contrast, the key concepts of mathematical coupling and regression to the mean were not explained in the paper. In my experience, they are among the most different concepts to comprehend even for statisticians.

Reviewer #3 (Recommendations for the authors):

This is a defence of the Proportional Recovery Rule (PRR), which asserts that stroke survivors recover a fixed proportion of lost function after stroke, from recent, formal/statistical inspired criticisms. The analyses and data show that the PRR is likely to be relevant to ongoing efforts to understand and model post-stroke recovery, and the initial severity of upper limb motor impairments (hemiparesis) after stroke predicts 35-55% of the variance in their subsequent recovery from those deficits.

This manuscript includes a lot of careful analyses that seem reasonable to me and are novel as far as I know in this specific domain. The authors' conclusions also appear to be supported by their methods and data – and their data are impressive, with three large, relevant datasets. I note that other reviewers have identified some detailed issues with the analysis and the text, but in the main the authors have done a very good job of addressing those concerns. The one exception, in my view at least, concerns the best strategy for dealing with patients at or near ceiling in the first two weeks post-stroke. This issue was raised by another reviewer, with whom I co-authored a paper outlining our favoured response, which is to exclude these patients from analyses seeking to quantify the explanatory power of the PRR (Bonkhoff et al., 2021, JNNP). By failing to exclude those acute-near-ceiling patients, the authors have left me somewhat sceptical of the reliability of the variances explained that they report. But this is perhaps more a matter of taste than of statistical rigour, and I can certainly appreciate the authors' reluctance to exclude hard-won empirical data. In their place, I might reference this issue as a possible limitation in the manuscript.

In other words, this paper is not 'broken' in my view, so I have no objection to its publication – and no real requirements for revisions.

That said, I also do not believe that this paper makes a particularly compelling contribution to the field: it 'defeats' a straw man caricature of the criticisms made of the PRR, and offers evidence in support of a position with which even the rule's most critical commentators already agree (and indeed for which some of them have already published supporting evidence).

The original criticism of the PRR was that the analyses used to support it would yield apparently enormous effect sizes entirely regardless of whether the PRR was really relevant to recovery (Hope et al., 2019, Brain). There was never any assertion that the PRR was irrelevant to recovery: that conclusion could never have been justified merely from the recognition that the empirical support for the PRR was unreliable. A follow-up analysis with a large dataset (albeit not as large as that used here) suggested that the PRR was indeed really relevant to recovery, but that it explained a less of the variance in recovery than prior empirical analyses had suggested (Bonkhoff et al., 2020, Brain). That latter work shares much in common with the analyses presented here: i.e., it is a model comparison analysis, with the PRR as one of the considered models. The current work is also a model comparison analysis, with the PRR as one of the considered models, but implemented with different methods and tested against a lot more data. In other words, this paper is in my view a conceptual replication of the earlier study: using different methods and data to run a broadly similar analysis, which yields broadly similar conclusions to those reported previously. Similarly, while the lengthy discussion of clustering is well done, it adds little (in my view) to the conclusions already drawn in Bonkhoff et al., 2022, JNNP.

In other words, much of what this manuscript claims to prove, in response to the critics, has already been reported by some of those same critics. In this sense, the paper is akin to a conceptual replication; using excellent data and novel methods (for the domain at least) to draw conclusions that converge with what has come before, and expressing it all in an accessible manner. These are all strengths in my view: I would merely ask that the links to prior results be made more explicit, at least so that others can follow the timeline of the debate more easily. Or indeed if it's the authors' view that I am wrong, that this be justified more explicitly.

Finally, I am concerned that this issue is rather too narrow to appeal to the readership of eLife. To bolster the more general appeal, I would recommend adding a paragraph or two on variants of this debate that have raged in other subjects – examples of which the authors themselves have given in other papers on this topic.

https://doi.org/10.7554/eLife.80458.sa1

Author response

[Editors’ note: the authors resubmitted a revised version of the paper for consideration. What follows is the authors’ response to the first round of review.]

Reviewer #2 (Recommendations for the authors):

This paper addresses the inherent potential of motor recovery after stroke. The authors would like to validate the concept of proportional recovery which assumes that stroke patient typically show a fixed amount of recovery of about 70% irrespective of where they start. The authors suggest that after controlling for possibly confounding variables like ceiling effects or mathematical coupling, they still could validate the proportional recovery rule. The Results section is difficult to read for a non-statistician. However, the findings reported in the paper are rather incremental than novel. Most importantly, there are no data that explain the neurobiological basis of proportional recovery. Hence, the novelty of this submission remains low apart from more sophisticated statistical models which ultimately come to a similar conclusion than reported years before.

We appreciate the time taken by the reviewers to evaluate our work but disagree that the results presented here are incremental.

Proportional recovery has shaped thought about recovery from stroke across several domains for nearly a decade; recently, the statistical foundations of work focused on the PRR have been critiqued from a variety of perspectives. Our goal is not necessarily to “validate” the PRR, but to provide a comprehensive rejoinder to critiques that contain some merit but overstate their conclusions and miss important insights. We have made efforts to provide this in a way that is accessible, recognizing that the arguments are undoubtedly nuanced, and have suggested several novel techniques for assessing a range of statistical and mechanistic models for recovery. We have proposed new tools for evaluating hypotheses regarding the existence of multiple recovery groups. Finally, we have used these tools to reevaluate existing data in light of the recent criticisms by testing alternative models for recovery.

Given the current discourse around the PRR, it is critical that this perspective be available to interested readers. Otherwise, it is likely that the field might conclude the PRR has been invalidated or lacks a robust statistical framework. Neither is true; our results (and, indeed, frequently the results of those offering critiques) suggest that the PRR is among the best predictive models for recovery currently available.

As stated above, the Results section is very difficult to read and appears somewhat lengthy. Lesion analyses or any other data about the possible neurobiological foundations of the proportional recovery are missing. I acknowledge the relatively large sample size, but the Fugl-Meyer score based view without any further para-clinical evidence is somewhat meager. Therefore, I do not see that this paper has a major impact on the field, as in the best case it confirms the proportional recovery rule which has been published many years before.

We appreciate this feedback. Specifically, our goal in this work is not to produce new clinical evidence or develop a new model framework. Instead, we collate and address the range of statistical critiques of the PRR that have appeared in recent years. We disagree that our contribution is support a rule that has been published and studied, and therefore is incremental at best; rather, this manuscript introduces a range of novel statistical approaches that can used to understand results related to recovery. This nuanced statistical understanding is relevant beyond any specific argument regarding the PRR, because relating baseline predictors to follow-up values is a fundamental issue in clinical practice and scientific discovery.

Reviewer #3 (Recommendations for the authors):

This paper presents a defence of the proportional recovery theory of how stroke patients with upper limb motor impairments recover. This is a theory that has come under attack due to confounds associated with the analyses used to assess the theory. The paper acknowledges these confounds, which include compression (to ceiling) enhanced mathematical coupling and over-fitting in the differentiation of recovers and non-recovers. The paper then presents a number of methods: use of non-zero nulls; focus on the variance-ratio; bootstrapping to assess variability in parameter and model fits; comparison of model fits; etc. These certainly improve upon the methods classically used in the literature to justify the proportional recovery hypothesis. Additionally, we appreciate the constructive tone of this paper and its effort to arrive at a consensus position on the proportional recovery issue. Indeed, the overall conclusion that there is a proportional (to lost) recovery pattern amongst the moderately impaired group is a position that we have sympathy with, although that pattern is much weaker than some previous claims in the literature.

Thank you for the careful reading of our work. Your assessment of our contributions here – new methods that improve on those that have typically been used to evaluate the PRR and other models for recovery – is in line with our own. As you note, our goal is to be constructive: there are a number of valid critiques of traditional methods, and our work arose after careful consideration of recent criticisms. And it seems we agree (to some extent) about the usefulness of proportional recovery for at least some groups, as well as about the issues in previously reported effect sizes and significances.

However, there seem to be remaining problems with the analyses employed, although, it is difficult to definitively judge the methods employed without more detail than the paper currently offers.

The two most important issues that seem to be weaknesses are as follows.

1) Remaining confounds with key measures: Goldsmith et al. argue that particular combinations of correlation between baseline and follow-up, i.e. cor(x,y); correlation between baseline and change, i.e. cor(x,δ); and the variance-ratio, i.e. k, is diagnostic of proportional recovery (PPR). In particular, they argue that the combination of cor(x,y), cor(x,δ) and k shown in table 1 and figure 5 justify their claim that the three datasets they consider, Stinear and Byblow, Winters and Zarahn, definitively exhibit a PPR pattern. However, the simulations presented in Bonkhoff et al. (2020), show that this same combination of measures can be exhibited by a constant recovery model in the presence of ceiling. Indeed, this is the fundamental confound described by Hope et al., Bonkhoff et al. and Bowman et al., which led to the conclusion that the only way the ceiling confound could be avoided is by throwing data points at ceiling away. (This is actually what I believe you should do with the data in this paper. Note, in Bonkhoff et al. (2020), we verified with model recovery simulations that the throwing away procedure enables the correct model to be recovered.) To make this completely clear, I have prepared a figure, which I am assured will be attached with this review.

https://submit.elifesciences.org/eLife_files/2021/09/17/00098237/00/98237_0_attach_15_32737_convrt.pdf

This shows table 1 and figure 5 from Goldsmith et al. on the left and figure 6 from Bonkhoff et al. on the right. I have added annotations to figure 6, which show where the Stinear and Byblow and Winters data sets would sit in the panels for constant recovery. The Zarahn data set does not so obviously correspond to a vertical line, but fits with the general combination of measurements.

The statistical issues surrounding the PRR often begin with concerns about mathematical coupling – correlations that can be induced when relating baseline to change scores, even when baseline and follow-up are independent. Additionally, it is possible that observing different outcome variances at baseline and follow-up can produce stronger than correlations between baseline and change, regardless of the correlation between baseline and follow-up. We found it necessary to discuss the interdependency of cor(x,y), cor(x, δ) and k as a way of introducing important statistical concepts, providing groundwork for some of our proposed techniques, and differentiating between “mathematical coupling” and the effects of compressing the outcome distribution. These concepts are revisited when we note many similar issues arise even when alternative statistical methods, including mixed models, are used.

That said, we do not intend these metrics to be “definitive” or even “diagnostic” in support of the PRR. Indeed, we are careful to say that multiple mechanisms could give rise to these observations, including constant recovery with a strong ceiling effect. We will comment in more detail on our suggestions for comparing methods in our response to your next point, including methods for addressing the ceiling confound. We have also reiterated that the correlations and variance ratio is not intended as definitive in our revised work.

The Figure you provided is a useful way to frame results. We are not surprised that one could find correlations and variance ratios similar to those found in Byblow and Stinear and Winters et al. through constant recovery and a ceiling effect. However, this figure suggests a constant recovery of 35 and 45 points in those respective studies. That would suggest that nearly all mildly and moderately affected stroke patients are expected to recover to ceiling, which is not consistent with the data or our experience. Meanwhile, we note that one could add similar annotations to the “Proportional to Lost” column in Bonkhoff et al. 2020, and conclude a recovery proportion of roughly 60% and 80% for Byblow and Stinear and Winters et al., respectively.

2) Limitations of model fits: certainly, the effort to fit models to the three data sets is welcome. However, there is a concern that this model comparison is not a completely fair test of competing theories to proportional recovery. The model fitting comparison in figure 6 is interesting to see. Although, I do find it difficult to interpret the findings, for the following reasons: (a) a criterion has not been given to judge when one model is, in a statistical sense, doing better than another model. I would have expected an assessment of where the MAPE for the best model sits in the bootstrap distribution of MAPEs for each of the other models. This would enable an inference of the kind that the MAPE of the best model is beyond the 95% confidence level of the second best model, and similar for the third. Something of this kind was done in Bonkhoff et al. (2020). (b) I suspect the reason that the Generalised Additive Model does not win is because it is too flexible. It would have been very useful to have seen (perhaps in an appendix) the exact functional form of the Generalised Additive Model. So, I cannot be sure, but I suspect it was not set up to be specifically for saturation at ceiling patterns, with a positive slope leading to a saturation on the right. Consequently, I suspect that the model overfits for many of the bootstrap samples. van der Vliet et al. (2020) give a strong precedent for using a simple exponential, which I suspect would fit the data much better. (c) when I look at figure 4, left panels, lower row, I feel that I do see a strong ceiling pattern in the data. However, this would be much easier to assess if simple x against y (baseline against followup) scatter plots were also presented and model fits for each model were depicted on the scatter plot. This could for example go into an appendix.

van der Vliet, R., Selles, R. W., Andrinopoulou, E. R., Nijland, R., Ribbers, G. M., Frens, M. A., … and Kwakkel, G. (2020). Predicting upper limb motor impairment recovery after stroke: a mixture model. Annals of neurology, 87(3), 383-393.

The discussion of model comparisons is critical because, as we noted in response to your previous comment, we intend this as a more helpful metric for distinguishing between models for recovery than an examination of correlations and the variance ratio. We agree that trend toward explicit model comparison is helpful, and respond to your three points below.

a)We have not provided a statistical metric that allows claims of statistical significance because these are generally not available. Cross validation provides a way to compare prediction accuracy, but does not examine a model parameter or pose a statistical hypothesis to be tested; confidence intervals or p-values for comparing models are therefore not available from this procedure. Instead, the results provide an understanding of the distribution of prediction errors that can guide discussions of model fit. Often, when multiple models provide similar fits, the most parsimonious model is preferred. Our results suggest that the distribution of prediction errors for several models are similar, and we suggest the PRR as a simple model that is comparable (or perhaps slightly better than) more flexible fits.

b) Our implementation of the GAM uses the mgcv package in R, which is one of the best known tools for modeling non-linear associations. The general modeling framework is described in “Generalized additive models: an introduction with R”, and the software contains many methodological and computational improvements since that book was written. mcgv uses a rich b-spline basis with explicit smoothness penalties to prevent overfitting. We include a more complete specification in the appendix, and show a random sample of GAM model fits in CV splits for each dataset. The GAM includes the PRR as a special case, and so in some sense the worse performance is because fits are “too flexible” – but this stems from minor fluctuations rather than dramatic overfitting, and the GAM fits are often similar to those obtained from the PRR.

The GAM model does not specifically introduce a saturation point or a positive slope. Van der Vliet suggests an exponential model for long-term recovery trajectories rather than as a model relating baseline to follow-up, but this is a realistic parametric form and we have included it in updated results.

Additionally, constant recovery with a ceiling effect of the form y=min{x+c+ϵ, 66} can be viewed as a special case of Tobit regression in which the only free parameter is the constant c. Tobit regression includes all data points, including those at ceiling, when estimating model parameters, but assumes a latent observation that follows the specified linear form. We have implemented this special case of Tobit regression and included it as a competing model.

Updated model comparison results are included in the revised manuscript. In short, constant recovery with a ceiling substantially underperforms against competing methods. The GAM, exponential, and PRR prediction accuracies are often comparable, although the PRR appears superior for the Stinear and Byblow and Zarahn datasets. These results more completely substantiate our claim that “competing models for recovery, specifically, a non-linear association between baseline and follow-up, which might arise from near-constant recovery in the presence of a ceiling effect, does not produce more accurate predictions than the PRR.”

c) We have added the suggested figure, to the appendix. This shows fits produced for the training datasets in our CV procedure for each model. In keeping with the results for prediction accuracy, the constant recovery model is visually a poor fit for these datasets. In some instances, the GAM is indeed too flexible, especially for the Zarahn data. That said, the GAM and exponential models are – unsurprisingly, given the results for prediction accuracy – similar to the PRR in terms of fitted values despite their additional flexibility and complexity.

I have the following more specific comments for the authors.

Abstract: the following statement is made," We describe approaches that can assess associations between baseline and changes from baseline while avoiding artifacts either due to mathematical coupling or regression to the mean due to measurement error". For reasons justified above, and below, I am doubtful that this statement and similar statements made elsewhere in this paper are appropriate.

In our response to previous points and your comments below, we argue that this claim can be made. Meanwhile, we have attempted to be more clear that some tools for assessing association, especially those based on correlations and the variance ratio, are sensitive to ceiling effects.

Introduction section, paragraph beginning "The growing meta-literature discusses …", statement: "the nuanced and sometimes counterintuitive statistical arguments are critical to get right for the sake of furthering our understanding of the biological mechanisms of recovery." I could not agree more than I do with this statement.

As elsewhere, it’s welcome to have points of agreement.

Results section, paragraph beginning "A broader argument relates to settings..", with regard to the reference to Bonkhoff et al. (2020) in this paragraph, the reference to "degenerate" is actually most explicitly made in Bowman et al. (2021).

Thank you for noting this. We have updated the reference as suggested.

Results section, subsection "Distinguishing true and artifactual signals", paragraph beginning "The value of cor(x, δ) depends on the variance ratio κ and..", statement: "The variance ratio can be used as a measure of the extent to which recovery depends on baseline values, regardless of the value of cor(x, y)." As indicated in our first point in the Public Review section, if this statement is suggesting that the variance-ratio can be used to unambiguously identify a proportional recovery pattern, we do not agree. Indeed, the variance-ratio could be small even when there is no recovery at all; that is, if the mean at follow-up is the same, or indeed below, the mean at baseline.

We have edited this statement to reiterate your point (and ours), namely that the variance ratio can be small under a variety of data generating mechanisms.

Results section, subsection "Distinguishing true and artifactual signals", paragraph beginning "All datasets have x values generated from a Normal distribution…", how is ceiling handled in these simulations – I would have expected that sometimes a followup score there would be, "by chance", above 66. Can you add a explanation of what happens to these?

In these simulations, we initialized random number generation using seed values that produced no observations above 66 or below 0. We made this choice because our focus here was on issues that arise from mathematical coupling and small variance ratios, while ceiling effects as a possible driver of the variance ratio is deferred to later sections. However, the same conclusions can be drawn for datasets in which ceilings and floors are explicitly enforced.

Results section, subsection "Distinguishing true and artifactual signals", a minor frustration is that the contour plot on the right of figure 1 is the opposite way around to the surface plot in Hope et al., with a baseline by follow-up correlation of one furthest to the right. At the least, could this discrepancy be mentioned in the caption to help the reader.

Our formulation of Equation (1) has cor(x, δ) as a function of cor(x, y) and k; we therefore prefer to have cor(x, δ) on the y axis and cor(x, y) on the x axis in Figure 1. We have included your suggested note of the discrepancy to orient readers in comparing figures.

Results section, subsection "Distinguishing true and artifactual signals", para beginning "Dataset A is a canonical example of mathematical coupling", I am unsure about quite a lot of the points made in this paragraph. It seems to me that a lot of what is said in this paragraph does not sit well with point 1.

In this paragraph and elsewhere, we have tried to clarify that our points in this section are intended not as definitive evidence of the PRR and acknowledge that other mechanisms can give rise to correlations and variance ratios similar to those obtained when data are in fact generated by the PRR. Indeed, in this section we do not use the PRR to simulate data, instead focusing on the relationship between of cor(x,y), k, and cor(x,δ). We maintain that beginning with these statistical observations is important for understanding later arguments, and in fact are not in conflict with many of the recent critiques of the PRR.

Results section, subsection "Distinguishing true and artifactual signals", para beginning "Dataset D represents the least controversial..": I would agree that dataset D is a clear example of proportional recovery. However, our central point in previous publications is that on the basis of the methods typically used, including in this paper, that judgment has to be made informally on the basis of visual inspection. This is because one can obtain exactly the same values of statistical measures (r(X,Y), r(x,δ), k), in a dataset without PRR, but which contains a strong ceiling effect, and indeed, datasets do typically exhibit strong ceiling effects; that is, from what I have seen, empirically collected datasets do not typically look like dataset D, because of the ceiling effect.

There has been a broad collection of publications critical of the PRR in recent years, some by the referee and co-authors and some by other groups. Often these begin with a straightforward discussion of mathematical coupling, and have moved on to critiques that involve a reduction in outcome, clustering, and other statistical issues. Our goal in this manuscript is to provide a broad reflection on and rejoinder to these critiques; we felt it necessary to begin with mathematical coupling, correlations, and variance ratios, but in this section explicitly note that several mechanisms can give rise to the same observed values. We introduce simulated datasets primarily as illustrations, and in later sections propose methods to distinguish between mechanisms that underlie observed data.

Please note as well that we referred to this as the least controversial among the four datasets we presented, rather than as an example of the PRR, because it is a setting where both cor(x,y) and cor(x,δ) are large.

Results section, subsection "Distinguishing true and artifactual signals", para beginning "Taken together, the..", text: "We also identify a setting, typified by Dataset D, in which each measure suggests the presence of a relevant association. It is not the case that data like these necessarily imply that recovery follows the PRR. Other biological models could produce data similar to Dataset D, and how to compare competing models will be considered in later sections." Just to say, I was a little confused here. What you are saying here seems to be inconsistent with what you said in the previous paragraph in re. what can be deduced from dataset D.

We have attempted to clarify the text throughout this section in response to your comments, including this one. This text was intended to convey a sentiment it seems you agree with – namely that the PRR can give rise to a collection of correlations and variance ratios, but that other mechanisms can as well. Tools in later sections are intended to differentiate between these, and our purpose in this section and elsewhere is to provide relevant background and our perspective on the statistical issues that have been raise elsewhere.

Results section, subsection "Recasting Oldham's method", paragraph beginning "Instead of cor(x+ γ, x − γ), we prefer to focus…": again, some of what is said in this paragraph seems to ignore the confounds that arise from ceiling effects.

Results section, subsection "Correlations, variance ratios, and regression", paragraph beginning "The following expressions relate..", Would it be possible to see a derivation of these two equations? These are not completely standard since \δ = y-x and ii = max-x. This could go in an appendix.

We have broadly revised our text to acknowledge that ceiling confounds can give rise to correlations and variance ratios similar to those that are produce by the PRR. Meanwhile, in our original submission this paragraph included the following text:

“For instance, as noted by both (Hope et al. 2018) and (Hawe, Scott, and Dukelow 2019), Oldham’s method does not address the possibility of ceiling effects; in our view, determining which process gives rise to observed correlations comes after assessing whether those correlations are artifacts driven by coupling.”

This was intended specifically to address this point.

Results section, subsection "Comparing distinct biological models for recovery", paragraph beginning "To illustrate how prediction accuracy…", last sentence: can you give more details of how ceiling is enforced?

We have revised the text in this sentence as requested. In short, data above the ceiling are set to the ceiling value. We note that in our first submission data at ceiling were taken to be the maximum value minus an error generated from an exponential distribution, while in our current submission we generate data under a strict ceiling effect.

Results section, subsection "Comparing distinct biological models for recovery", paragraph beginning "We consider three models for the association …", you say that "Second, we implement the PRR (without intercept) to estimate δ given x, with y taken to be x + δ.", Don't you need to tell us the slope in this eqn, in order to know it is proportional to lost? This is required to rule out proportional to spared or constant recovery. Also, you refer to using a "generalized additive model". Could you give more details of the functional form of this model, perhaps in an appendix? At the least, could you include a relevant reference?

Our implementation of the PRR is, we believe, described as “proportional to lost” in Bonkhoff et al. 2020 and elsewhere, and we have made that explicit in our revisions. We have also clarified that we estimate the proportion based on observed data.

As noted in response to part (b) of your second major concern, we use the framework introduced in “Generalized additive models: an introduction with R” and implemented in the mgcv package. This estimates non-linear associations using a flexible spline expansion with explicit penalties to enforce smoothness in a data-driven way. We have included these details and references in our revisions.

Results section, subsection "Results for reported datasets", paragraph beginning "We next conducted the bootstrap analysis on the subsample…": again, the confound created by ceiling effects impacts this paragraph.

Our conclusion in this paragraph is “we have evidence both that recovery is related to baseline values in a way that reduces variance at follow-up, and also that baseline values are predictive of follow-up” but not that a particular mechanism drives this finding. As elsewhere, we have clarified that several data generating mechanisms could give rise to results like this, and suggest alternatives for distinguishing between those mechanisms.

Meanwhile, we think these results (and the discussion of correlations and variance ratios more broadly) are important for understanding issues that arise when relating baseline, follow-up, and changes scores. In particular, these help to contextualize observed correlations and address concerns about canonical forms of mathematical coupling.

Results section, subsection "Results for reported datasets", paragraph beginning "We compared the performance of three models …": this is where my second point above applies.

As noted in our response above, we have substantially edited this section. The main changes are a clearer interpretation of the meaning of CV results; comparison to additional models for recovery, including constant recovery with a ceiling effect (implemented as a special case of Tobit regression) and an exponential model; and a more complete discussion of competing approaches.

Results section, subsection "Results for reported datasets", table 2 caption. Sorry for being dumb, but could you give more explanation of how the R-squared is being calculated here. In particular, could you confirm whether the R-squared equation given in the "Correlations, variance ratios, and regression" subsection is the one being applied here. If it is not this one, can you give the equation that is being used. This is a key issue, since the R-squared values here are much smaller than those given in relevant papers published for some of these data sets.

Thank you for noting this inconsistency. In “Correlations, variance ratios, and regression” we derive an analytical form for R2 when regressing δ on x and show that this is cor(x,δ)2, as part of a larger discussion of how large correlations and R2 values are expected in certain circumstances. In “ Results for reported datasets”, however, we are interested in understanding variability in the outcome y, and how much can be explained by baseline values x. The R2, like the MAPE, measures the strength of that association. We have updated our text to clarify this distinction.

Results section, subsection "Results for reported datasets", paragraph beginning "The preceding results for …": as previously discussed, the central finding of the work of Hope et al. and Bonkhoff et al. is that, in the presence of ceiling, recovery patterns completely different to the PRR, for example proportional to spared function, can look like PRR, when the measures considered in this paper, cor(\δ, x), cor(x,y), var ratio, are taken. How has the work presented here countered that criticism? I cannot see that it has.

We have edited this paragraph to more carefully describe our results and conclusions, and made similar edits elsewhere in the paper.

We agree that similar correlations and variance ratios can arise through the PRR, constant recovery and a ceiling, proportional to spared function, or other mechanisms, and our goal is not to counter that criticism in Hope et al. or Bonkhoff et al. Correlations have often been used to measure association in studies of recovery, but can have very counterintuitive properties. Our discussion and results show that some common statistical issues – like a canonical form of coupling, which is a common example in papers critiquing the PRR – can be ruled out using appropriate “null” values.

Meanwhile, the cross validation results that are also included in this paragraph can form a basis of comparison for competing models. It is in reference to these results that we claim “competing model for recovery … do not produce more accurate predictions than the PRR”, which is supported by our findings.

Discussion section, paragraph beginning "Indeed, a recent large-scale cohort study …", it is stated that, "Where Hope et al. (2018) and Bowman et al. (2021) note ceiling effects can compress scores in a way that induces a variance reduction, Lee et al. (2021) observe more than half of all patients, and more than 60% of patients with baseline FMA-UE above 46, recover to the maximum possible value." I do not understand what is being said here. Lee et al. is put in opposition to Hope et al. and Bowman et al., but isn't the Lee et al. ceiling effect going to exactly lead to a reduction in variance at followup?

We phrased this poorly, and have attempted to clarify our meaning in the revised manuscript. Our intended point was that Hope et al. (2018) and Bowman et al. (2021) are largely focused on the impact ceiling effects can have on observed correlations, while Lee et al. (2021) model the probability of full recovery as a function of initial impairment and other variables using a logistic regression. We believe that evaluating the data presented in Lee et al. (2021) using a range of competing models to evaluate their predictive accuracy would be a useful contribution.

Discussion section, paragraph beginning "More complex models than the PRR do.." What is being said in this paragraph about Bowman et al. (2020) is not completely clear to me and this may well be a presentational failure on our part when writing Bowman et al. The only place in Bowman et al. (2020) where we refer to mixed models is the following statement, "In particular, van der Vliet et al. present an impressive Bayesian mixture modeling of Fugl-Meyer upper extremity measurements following stroke. Importantly, the authors avoid the confounded correlation of initial scores with change by simply fitting their models to behavioral time-series, that is, raw Fugl-Meyer upper extremity scores, without involving the recovery measure."

However, in the next paragraph, you commend van der Vliet et al. for the same piece of work that we are referring to. Indeed, we were under the impression that the above quoted statement in Bowman et al. (2020), was only a reiteration of what van der Vliet et al. say themselves, which is as follows, "Our current longitudinal mixture model of FM-UE recovery, as opposed to the proportional recovery model, cannot be confounded by mathematical coupling. Hope et al. showed that the correlations between baseline FMUE score (distribution X) and the amount of recovery defined as endpoint FM-UE minus baseline FM-UE (distribution Y-X) found in proportional recovery research could be inflated by mathematical coupling. However, because mathematical coupling applies to correlations of data points (baseline and endpoint FM-UE) and not to models of longitudinal data, the recovery coefficients in our research represent nonconfounded measures of recovery as a proportion of potential recovery. In addition, mathematical coupling does not apply to the outcomes of the cross-validation, as we report correlations between the model predictions and the observed values for endpoint FM-UE and ΔFM-UE rather than correlations of the form X and Y-X."

This leaves us confused, as to what about Bowman et al. (2020) is being criticised in your paragraph beginning "More complex models ….".

Our attribution of this claim – that longitudinal models avoid the statistical issues inherent in relating baseline and change scores by posing a model for outcomes y rather than change scores – to Bowman et al. (2021) and Lohse et al. (2021) was incorrect. We apologize for the oversight and have corrected the mistake.

The point made in that paragraph is another example of ways in which statistical intuition can fail. It is reasonable to think that a longitudinal model “does indeed avoid mathematical coupling since the change variable plays no part”, as it is put in Bowman et al. (2021). But if one uses a mixed model for the canonical example of mathematical coupling, the resulting random intercepts and slopes will have a correlation of -0.717 – exactly the same as the counter-intuitive correlation between X and Y. Put differently, the concerns about correlations, variance ratios, and model comparisons persist even when data and models are more complex, and may become even harder to spot.

[Editors’ note: further revisions were suggested prior to acceptance, as described below.]

The manuscript has been improved but there are some remaining issues that need to be addressed for further consideration, as outlined below:

Essential revisions:

1) Please expand your discussion of the similarities between the current work and previous critiques of the PRR. This should include a brief comment on the justification of some of the analysis choices (e.g. inclusion/exclusion of patients at ceiling) and the relevance of the findings presented here to other areas of research.

In our revisions we have more carefully tracked the critiques of the PRR and the more recent convergence in the field. As part of this, we note similarities and differences in analysis approaches and include justifications for our model choices. Finally, to broaden the appeal of this manuscript to eLife’s readership, we comment on the relevance of this work to other domains.

2) Address the statistical concerns raised by Reviewer #2.

We value input from this reviewer, who has clear expertise in this area, and appreciate that the reviewer recognizes our “great efforts to clarify various misconceptions about testing the relation between change and the baseline.” We agree with the majority of points raised in these comments, and have revised accordingly; please see our responses below for details.

3) Based on some of the reviewers' comments, you may decide to move some of the analyses to a supplement. This is at your discretion.

We appreciate the reviewers’ suggestion to focus on succinctly introducing the statistical issues and our contributions, and to avoid unnecessary detail or digressions. At the same time, a careful consideration of the PRR, its critiques, and our insights requires some nuance and detail. In response to this comment and those from the reviewers, we have deferred some discussions to the appendix and proofread the manuscript for conciseness.

Reviewer #1 (Recommendations for the authors):

The authors have completed an extensive reworking of their paper, which does a good job of responding to the majority of my comments. Perhaps most importantly, the authors have convincingly countered the argument that I made in my first review that their findings could have arisen from a Constant model with a ceiling. It is excellent to see this rebuttal of my concern. So, I am basically happy with this revision.

It is again good to see the emphasis in this paper on model comparison – that certainly seems to be the right way to go. I also note that the findings in this paper are somewhat consistent with what we have reported in Bonkhoff et al., 2020 – a PRR effect, although considerably weaker than had previously been claimed. So, as the authors point out, there does seem to be convergence in the field, which is excellent.

The first review of our work was thoughtful, and we believe the careful revisions undertaken to address the comments raised at that time improved the quality of our work. The inclusion of constant recovery in the presence of a ceiling effect is, as you note, an important element of the revised manuscript: the referee’s prior work illustrated that this data generating mechanisms could result in findings consistent with the PRR, and explicit model comparisons in observed data was warranted.

We’re happy to see the growing convergence in the field around a number of points, and continue to highlight these in our revised manuscript. We further agree that careful model comparisons are critical for understanding statistical models and biological mechanisms of recovery, and that the results of our analyses are broadly consistent with those reported in Bonkhoff et al. 2020. Meanwhile, we would argue that the contributions of this manuscript are broader than model comparisons or validations of previously reported results; please see our response to the third reviewer for a more complete discussion of this point.

It remains a pity that the claim that PRR is definitively better on the basis of the presented model fits is still a weak claim. In particular, violin plots for GAM, Exp and PRR are heavily overlapping in figure 6. As you surely know, what is needed is a more sophisticated model comparison, most likely of a Bayesian variety, using something like the model evidence, free energy or deviance information criterion.

The key inference suggested in the paper hinges on an Occams razor argument that if one cannot distinguish between GAM, Exp and PRR, one should pick PRR, since it is more simple. [Although note, I do not believe that Exp is a supermodel of PRR, i.e. I don't think one can get the PRR pattern from Exp, which slightly complicates this line of argument.]

I think that in the absence of a more sophisticated model comparison, of the kind I just highlighted, the Occams razor argument probably is the best one can do. However, since the cross validation is out-of-sample, the flexibility of the model is actually taken into account there.

We use cross validation to compare models because it explicitly focuses on prediction accuracy, and the results can therefore be interpreted in the context of single-subject predictions. This implicitly takes model complexity into account by separating the training and testing datasets.

The more formal approaches you list here and their application in Bonkhoff et al. (2020), Bonkhoff et al. (2022), and others, are valuable contributions to this literature. They explicitly balance goodness-of-fit with model complexity, and the Bayesian framework provides a complete assessment of all forms of model uncertainty. In our revisions, we emphasize the importance of these tools for model comparison and highlight their use.

The broadly overlapping violin plots suggest that models achieve similar prediction accuracies, which makes model selection difficult. They are reminiscent of the deviances shown in Figure 7E of Bonkhoff et al. 2022, where Standard Form Regression and Proportional to Lost have very similar performances. Even with sophisticated model comparisons, settings where data are not sufficient to identify an obviously preferred model will exist, with resulting ambiguity for model selection and inference.

Among the models we consider, the PRR is simplest – it is linear model controlled by a single, interpretable parameter. This simplicity and interpretability is one reason we favor the PRR over competing models, but the examination of fitted values suggested in the first review is another. If more complex non-linear models yielded similar prediction accuracies but obviously different fitted values, that might suggest competing biological mechanisms are equally well supported by the data. The PRR is not a special case of the exponential we’ve implemented, but the fits of the PRR, GAM, and exponentials models are similar enough that we favor the simpler and more easily interpretable PRR. As an aside, if two models produced with distinct fitted values produced similar prediction accuracies, leveraging the unique information they provide via Bayesian model averaging or ensemble prediction could be explored.

Obviously, it would be great if this issue could be resolved with a more sophisticated model comparison in a further re-write, however, I don't think this is fair to require at this stage of the review process, but it would be good to acknowledge this weakness in the discussion, and note that competitor approaches are doing better in this respect.

We see value in pursuing sophisticated model comparisons that include a range of candidate models in future work, and appreciate that the review agrees that work may not be necessary for our contributions at this stage. We have expanded our discussion of this point and acknowledged prior work that uses more sophisticated techniques for model comparison.

Specific points:

I did not pick this up in the first review, but I spent a lot of time on this review being confused by the right panel of figure 1, which re-represents the surface plot from Hope et al. (2018). I'm now pretty sure your colour bar needs to be inverted. As it is, the yellow contour corresponds to the degenerate region in Hope et al's surface plot, i.e. it does not matter what the value of cor(x,y) is, you always get a correlation of x with change near to -1. However, shouldn't that be for low values of the variability ratio (i.e. log(k) substantially smaller than zero)? Your yellow contour corresponds to log(k) much *bigger* than zero. This issue is also present in Figures 4 and 5.

You are correct, and we appreciate you noting this issue. We have corrected our underlying plot functions and resolved the inversion of the color bar in all plots.

Top of page 8: "which when baseline (x) and follow-up (y) are uncorrelated and have the same variance." Should the "when" be deleted here?

Page 9, towards bottom: "relationship between of".

Page 10, 2nd para: “data represent in which baseline values”.

Thanks for noting these language mistakes; we have corrected these and proofread the rest of the paper.

Page 17, beginning of last para: you talk about there being five models, but in the next sentence you only list 4.

You are correct; our list had omitted the exponential model motivated by van der Vliet. This is corrected here and on page 38, noted below.

Page 26: you reference Lohse et al. 2021, but I couldn't find this in the bibliography.

This entry in the bibliograph was appended to the preceding item, Lee et al. (2021), rather than appearing as a new citation. We have fixed this issue in the revision.

Page 38 top of page: there are a couple of typos here. Also, you again talk about there being five models, but in the next sentence you only list 4.

We have fixed these typos, corrected the list here as well.

Reviewer #2 (Recommendations for the authors):

1. I think this paper could be made more succinct. It is quite long because the authors used many examples to demonstrate their arguments. Although I feel this approach is well-suited to clinical audience, I am not sure whether the targeted readers may get lost in the technical discussion. For example, I am not sure using bootstrapping is necessary, as we can either directly test the spurious correlation against a more appropriate null hypothesis or using simulation (see the paper in Eur J Oral Sci 2005; 113: 279-288).

As the reviewer is aware, there is an inherent challenge in making the nuanced statistical issues that arise in this setting and their solutions accessible. Our manuscript also synthesizes and contributes to an ongoing discussion in this literature; we provide necessary background to frame our discussions and reinterpret some prior examples in light of our perspective. Throughout, we strive to balance informative examples with a need to be succinct.

We agree that direct testing is possible when sample sizes are large and other conditions are met, and have emphasized this in our revision. Simulations, as presented in the paper you note, provide a mechanism for obtaining null values when sample sizes are small (as they often are both in the periodontal studies that motivate that paper and in studies of stroke recovery). The simulation approach depends on generating and then drawing repeated samples from paired data with an observed correlation; like the bootstrap, this is an inherently iterative procedure. While both are plausible, it is not obvious to us that one is more accessible that the other and we prefer to retain the bootstrap for constructing confidence intervals.

Your broader point about brevity is well-taken. In particular, in response to concerns you raise below (which echo broader critiques in this literature), we have moved the digression on R2 and regression coefficients to a supplement. We have also edited our manuscript with an eye to removing unnecessary components.

2. There are two major issues involving in assessing the relation between change and the baseline. The first is the hypothesis testing and the other is the prediction. Because of mathematical coupling, the usual null hypothesis that the correlation coefficient is zero is inappropriate. This is because the distribution of correlation coefficient is in a restricted space defined by the correlation between x and y, as shown in Figure 1. By the way, I think the paper by Bartko, J. J. and Pettigrew, K. D. (The Teacher's Corner: A Note on the Correlation of Parts with Wholes. Am Stat 22, 41-41, 1968) should be cited, as they are the first to produce such a figure for the correlation between change and the baseline. Another paper of interest is the one by Sutherland, T. M. (The correlation between feed efficiency and rate of gain, a ratio and its denominator. Biometrics 21, 739-749, 1965). Oldham's method is to address this issue of wrong null hypothesis by testing the difference in the variances of X and Y. Other methods are also available as discussed in my previous paper (Tu and Gilthorpe 2007) cited by the authors. As discussed by the authors, Oldham's method or any others based on the same idea has its limitation, if data may be truncated. A possible solution is to design a more sensitive tool to measure the outcome.

We appreciate you noting these papers, and have added a citation to the first visualization of the surface we present in Figure 1. We also continue to emphasize prior work by Tu and coauthors that introduce methods to appropriately test null hypotheses in the presence of mathematical coupling.

We also agree, with the reviewer and with others working in this area, that more sensitive tools may alleviate concerns due to truncation and the resulting ceiling effects. Indeed, our prior work using kinematic data developed an average scaled distance from the center of a healthy control population (Cortes et al., 2017). In some sense, though, there is an inherent upper limit on recovery – meaning that the insights in our manuscript will be relevant even when better measures of function are available.

3. For the prediction, I think we need to be cautious to use R2 for comparing the performance of different studies. Because R2 is the square of r, i.e. the correlation coefficient. Consequently, the range of R2 is also restricted by the same conditions as r is. Therefore, if different studies have different correlations between X and Y, I do not think their R2 are directly comparable.

We agree with this point completely. In our first submission, we made the connection between R2 and correlations explicit for exactly this reason. Because R2 can be difficult to interpret, and indeed has been the focus of confusion and misinterpretation in the past, we have urged caution in using R2 but deferred this section to an appendix, and removed R2 from our reported results.

4. Moreover, a high R2 does not necessarily mean that patients' recovery can be precisely predicted. Suppose a zero correlation between X and Y, the R2 for using X to predict Y – X will be 0.71^2 = 0.5. Although half of the change's variance can be predicted by the model (i.e. by the baseline value X), this model is useless. This is because R2=0.5 is what we would expect from two sets of random numbers. The baseline and the follow-up values behave like two random variables with zero correlation, this means that X has no use for predicting Y.

We agree here as well – in fact, this example has been used to illustrate why reported strong correlations between baseline and change may not mean that recovery is predictable from baseline values. We are concerned, though, that this correct observation has been taken too far in some cases: while a high R2 doesn’t guarantee that recovery can be predicted, it is possible that recovery can be predicted well when R2 is high.

As in the previous comment, we think it is important to remain more focused on correlations and variance ratios as measures of association, and prediction error as a measure of prediction accuracy. We have therefore moved most of our discussion of R2 to the supplement.

5. Regarding to subgroups of patients with different responses, the authors are correct in giving warnings on identifying clusters of patients by using unsupervised methods. Due to regression to the mean, patients with greater baseline diseases are more likely to be identified as good responders, while those with milder diseases are more likely to be identified as poor responders. In fact, the response is actually a continuous spectrum.

The inappropriate use of unsupervised methods can suggest clusters that do not in fact exist in the data. As discussed in Hawe et al. (2018), using the results of an inappropriate clustering analysis can misleadingly suggest the presence of proportional recovery. The focus in that paper was on “random” recovery but, as you note, regression to the mean is an important phenomenon that could drive similar results. We have added this to our discussion of clustering.

We think it is important to reiterate warnings about the inappropriate use of unsupervised methods. A major contribution of our work, we think, is the development of an inferential procedure to test against the specific null hypothesis considered in Hawe et al. (2018). While “random” recovery and regression to the mean can give the illusion of proportional recovery after inappropriate clustering, our results suggest that in at least some dataset there are biologically meaningful subgroups with different recovery patterns. Confirming the existing of these groups, finding ways to accurately classify patients early, and developing targeted therapies to promote recovery is important future work.

6. Finally, I feel a little uneasy that equating the relation between change and the baseline values to the proportional recovery rule. I feel the latter is more akin to the percentage change. Please see the paper: Testing the relation between percentage change and baseline value. Sci Rep 6, 23247 (2016).

Thank you for pointing out this useful paper, and we can see where confusion can arise. The proportional recovery rule is a regression of change (y-x) on initial impairment (66-x), where 66 is the maximum value for the outcome measure and is presumed to be 66 for all patients before stroke. The regression does not include an intercept, and the slope on initial impairment is typically interpreted as the proportion of function lost due to stroke that is expected to be regained during recovery. Put differently, this is the expected change in the amount of recovery for each one unit change in initial impairment. Thus, the slope is interpreted as a proportion or percent – but the regression uses change as an outcome rather than modeling the percent γxX as a function of the baseline.

I think the authors made great efforts to clarify various misconceptions about testing the relation between change and the baseline. To get your messages across the intended readers, I suggest making your discussion more focused. For example, I do not feel that introducing bootstrapping or GAM is necessary. In contrast, the key concepts of mathematical coupling and regression to the mean were not explained in the paper. In my experience, they are among the most different concepts to comprehend even for statisticians.

Thank you for the positive assessment of our work, and your suggestions on how it can be improved. As we discussed in our response to your first comment, we have worked to provide the background necessary to synthesize a wide-ranging debate and provide novel insights and contributions. We have chosen to retain the GAM model and the bootstrap in our revisions: the GAM can smoothly capture non-linear associations when they exist, and the bootstrap is a now common inferential procedure. The technical details of the GAM are deferred to a simulation, and we have tried to explain the bootstrap in an accessible way. Meanwhile, in light of your suggestion to make our discussion more focused, we have revised other elements of our paper. Most notably, we have shifted our consideration of R2 to the supplement, and have trimmed text for succinctness where possible. We have also provided a clearer definition of coupling and regression to the mean to frame our discussions.

Reviewer #3 (Recommendations for the authors):

This is a defence of the Proportional Recovery Rule (PRR), which asserts that stroke survivors recover a fixed proportion of lost function after stroke, from recent, formal/statistical inspired criticisms. The analyses and data show that the PRR is likely to be relevant to ongoing efforts to understand and model post-stroke recovery, and the initial severity of upper limb motor impairments (hemiparesis) after stroke predicts 35-55% of the variance in their subsequent recovery from those deficits.

We appreciate the time taken to evaluate our manuscript. We were motivated by recent well-founded statistical critiques of the PRR, which rightly pointed out issues that can lead to misleading effect sizes, exaggerated claims of statistical significance, or challenges in distinguishing proportional recovery from other recovery mechanisms. Our goal, discussed in detail below, was to synthesize and contribute new insights to this ongoing statistical debate; we then compare several models for recovery in using a prediction-oriented measure of model performance, and conclude that the PRR is likely to be biologically and statistically relevant in models for poststroke recovery moving forward.

This manuscript includes a lot of careful analyses that seem reasonable to me and are novel as far as I know in this specific domain. The authors' conclusions also appear to be supported by their methods and data – and their data are impressive, with three large, relevant datasets. I note that other reviewers have identified some detailed issues with the analysis and the text, but in the main the authors have done a very good job of addressing those concerns. The one exception, in my view at least, concerns the best strategy for dealing with patients at or near ceiling in the first two weeks post-stroke. This issue was raised by another reviewer, with whom I co-authored a paper outlining our favoured response, which is to exclude these patients from analyses seeking to quantify the explanatory power of the PRR (Bonkhoff et al., 2021, JNNP). By failing to exclude those acute-near-ceiling patients, the authors have left me somewhat sceptical of the reliability of the variances explained that they report. But this is perhaps more a matter of taste than of statistical rigour, and I can certainly appreciate the authors' reluctance to exclude hard-won empirical data. In their place, I might reference this issue as a possible limitation in the manuscript.

In other words, this paper is not 'broken' in my view, so I have no objection to its publication – and no real requirements for revisions.

Critiques of the PRR and subsequent analyses have been statistically rigorous; we appreciate the reviewer’s comment that our work here is consistent with the care shown in these studies, and that our conclusions are supported by the methods and results.

We apologize for overlooking the suggestion for dealing with ceiling effects raised in the first round of review, and for not responding more fully at that time. The strategy presented is to exclude patients whose baseline scores are above a predetermined threshold. Because this group typically has lower outcome variance at follow-up, the reviewer is correct that omitting these patients would be expected to decrease single-subject prediction accuracy and the proportion of outcome variance explained.

We agree that how best to handle patients at or near ceiling is a matter of preference and analysis goals. In the data we consider, many or most patients have baseline FM scores >= 45. This is a large patient population, and understanding recovery in this group is important scientifically; we hesitate to discard a substantial fraction of our data and miss the opportunity to model recovery for mild to moderate affected patients at baseline. A possible strategy for future work is to incorporate the Tobit-like structure we implemented to evaluate constant recovery into other models for recovery. In any case, it is important to recognize that the narrower outcome distribution for patients at or near ceiling at baseline will affect measures of model performance. We have made this limitation more clear in our revised manuscript.

That said, I also do not believe that this paper makes a particularly compelling contribution to the field: it 'defeats' a straw man caricature of the criticisms made of the PRR, and offers evidence in support of a position with which even the rule's most critical commentators already agree (and indeed for which some of them have already published supporting evidence).

The original criticism of the PRR was that the analyses used to support it would yield apparently enormous effect sizes entirely regardless of whether the PRR was really relevant to recovery (Hope et al., 2019, Brain). There was never any assertion that the PRR was irrelevant to recovery: that conclusion could never have been justified merely from the recognition that the empirical support for the PRR was unreliable. A follow-up analysis with a large dataset (albeit not as large as that used here) suggested that the PRR was indeed really relevant to recovery, but that it explained a less of the variance in recovery than prior empirical analyses had suggested (Bonkhoff et al., 2020, Brain). That latter work shares much in common with the analyses presented here: i.e., it is a model comparison analysis, with the PRR as one of the considered models. The current work is also a model comparison analysis, with the PRR as one of the considered models, but implemented with different methods and tested against a lot more data. In other words, this paper is in my view a conceptual replication of the earlier study: using different methods and data to run a broadly similar analysis, which yields broadly similar conclusions to those reported previously. Similarly, while the lengthy discussion of clustering is well done, it adds little (in my view) to the conclusions already drawn in Bonkhoff et al., 2022, JNNP.

In other words, much of what this manuscript claims to prove, in response to the critics, has already been reported by some of those same critics. In this sense, the paper is akin to a conceptual replication; using excellent data and novel methods (for the domain at least) to draw conclusions that converge with what has come before, and expressing it all in an accessible manner. These are all strengths in my view: I would merely ask that the links to prior results be made more explicit, at least so that others can follow the timeline of the debate more easily. Or indeed if it's the authors' view that I am wrong, that this be justified more explicitly.

Broad points of agreement around the proportional recovery rule have indeed emerged, and some of our results are similar to those reported by critics of the PRR in recent work. That said, our goal is not to defeat a straw man argument against the PRR, and the contributions of this manuscript go beyond a conceptual replication of previous work.

Critiques of the PRR were careful and limited in their conclusions. For example, Hope et al. (2019) say in their discussion that they “are not claiming that the proportional recovery rule is wrong”, only that they do not see evidence that the rule holds and that other models for recovery are possible. But the title of that paper is “Recovery after stroke: not so proportional after all?”; Hawe et al. (2018) included some similar caveats in the discussion of a paper titled “Taking proportional out of recovery.” Senesh and Reinkensmeyer (2019) (title: “Breaking Proportional Recovery After Stroke”) summarized these papers in their abstract by saying they “explained that [proportional recovery] should be expected because of mathematical coupling between the baseline and change score”. Authors may not have asserted the PRR was irrelevant for recovery, but readers could be forgiven for thinking they had.

Those critiques and more recent work contain important points that we agree with: that early top-line effect sizes were overly optimistic due to counter-intuitive statistical issues; that proportional recovery is useful for understanding recovery, although it explains less outcome variance than one might hope for single-subject prediction; and that different subpopulations exhibit different recovery outcomes. Within those points of agreement, interpretations of results and points of emphasis can differ. For instance, Bonkhoff et al. (2020) summarizes some analysis results by noting that the “first model comparison … pointed to a combination of proportion to lost function and constant recovery” among fitters; our view of the same results is that standard regression and proportional recovery are virtually indistinguishable, and both far outperform constant or proportional to spared models for recovery. Both statements are true, and yet highlighting that different conclusions can be drawn may be useful to readers of this literature.

If the present manuscript contained only different interpretations of past results and analyses of new datasets, it might be considered a conceptual replication of other work. Our manuscript is broader than that, however. We synthesize a broad collection of critiques, draw new statistical insights, and examine those insights in previously reported datasets. Our discussion of correlations as measures of association, for example, illustrates that statistical evidence for non-artifactual associations can be obtained even in the “degenerate” region described in Hope et al. (2019). Several papers have highlighted the difficulty in obtaining accurate null distributions in the presence of coupling and ceiling effects; we have made specific proposals in response. And our work on clustering does indeed differ from and add to the conclusions in Bonkhoff et al. (2022). That paper thresholds initial impairment to separate severe and non-severe cases, and uses Bayesian hierarchical models and model selection approaches to understand whether these groups have different recovery patterns. Our point, meanwhile, is that the severely affected population contains biologically meaningful subgroups – recoverers and non-recoverers – and that these can be identified using appropriate unsupervised learning methods. Hawe et al. (2018) had noted that inappropriately clustering when recovery is random can produce results that mimic proportional recovery; one of our contributions is a hypothesis test for the null of “random recovery” and evidence supporting the separation of recoverers and non-recoverers (rather than severe and non-severe patients at baseline).

Critics of the PRR introduced a range of valid statistical concerns and have produced results that refine the field’s understanding of recovery. Our work attempts to acknowledge those contributions, and in our revisions we have tried to more explicitly discuss the timeline of this debate and highlight prior findings. We have also tried to be more clear about the ways in which our work goes beyond a conceptual replication of past work.

Finally, I am concerned that this issue is rather too narrow to appeal to the readership of eLife. To bolster the more general appeal, I would recommend adding a paragraph or two on variants of this debate that have raged in other subjects – examples of which the authors themselves have given in other papers on this topic.

Thank you for raising this point. While our primary motivation is the appropriate analysis of data that arise in the context of stroke recovery, the same problems are and will continue to be present across a wide range of domains. We agree that being clear about this will broaden the appeal of this paper, and have made this point in the abstract and introduction.

References

Bonkhoff, Anna K, Thomas Hope, Danilo Bzdok, Adrian G Guggisberg, Rachel L Hawe, Sean P Dukelow, Anne K Rehme, Gereon R Fink, Christian Grefkes, and Howard Bowman. 2020. “Bringing proportional recovery into proportion: Bayesian modelling of post-stroke motor impairment.” Brain.

Bonkhoff, A. K., Hope, T., Bzdok, D., Guggisberg, A. G., Hawe, R. L., Dukelow, S. P., … and Bowman, H. (2022). Recovery after stroke: the severely impaired are a distinct group. Journal of Neurology, Neurosurgery and Psychiatry, 93(4), 369-378.

Cortes, J. C., Goldsmith, J., Harran, M. D., Xu, J., Kim, N., Schambra, H. M., … and Kitago, T. (2017). A short and distinct time window for recovery of arm motor control early after stroke revealed with a global measure of trajectory kinematics. Neurorehabilitation and neural repair, 31(6), 552-560.

Hawe, Rachel L, Stephen H Scott, and Sean P Dukelow. 2019. “Taking Proportional Out of Stroke Recovery.” Stroke 50 (1): 204–11.

Hope, Thomas M H, Karl Friston, Cathy J Price, Alex P Leff, Pia Rotshtein, and Howard Bowman. 2018. “Recovery after stroke: not so proportional after all?” Brain 142 (1): 15–22.

Senesh, Merav R., and David J. Reinkensmeyer. "Breaking proportional recovery after stroke." Neurorehabilitation and neural repair 33.11 (2019): 888-901.

Tu, Yu-Kang. "Testing the relation between percentage change and baseline value." Scientific reports 6.1 (2016): 1-8.

https://doi.org/10.7554/eLife.80458.sa2

Article and author information

Author details

  1. Jeff Goldsmith

    Department of Biostatistics, Columbia Mailman School of Public Health, New York, United States
    Contribution
    Conceptualization, Data curation, Software, Formal analysis, Visualization, Methodology, Writing - original draft
    For correspondence
    ajg2202@cumc.columbia.edu
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-6150-8997
  2. Tomoko Kitago

    1. Burke Neurological Institute, White Plains, United States
    2. Weill Cornell Medicine, New York, United States
    Contribution
    Conceptualization, Writing - review and editing
    Competing interests
    No competing interests declared
  3. Angel Garcia de la Garza

    Department of Biostatistics, Columbia Mailman School of Public Health, New York, United States
    Contribution
    Visualization, Writing - review and editing
    Competing interests
    No competing interests declared
  4. Robinson Kundert

    Cereneo, Center for Neurology and Rehabilitation, Vitznau, Switzerland
    Contribution
    Conceptualization, Formal analysis, Writing - review and editing
    Competing interests
    No competing interests declared
  5. Andreas Luft

    1. Cereneo, Center for Neurology and Rehabilitation, Vitznau, Switzerland
    2. Vascular Neurology and Neurorehabilitation, Department of Neurology, University Hospital Zurich, University of Zurich, Zurich, Switzerland
    Contribution
    Conceptualization, Writing - review and editing
    Competing interests
    No competing interests declared
  6. Cathy Stinear

    Department of Medicine, University of Auckland, Auckland, New Zealand
    Contribution
    Conceptualization, Data curation, Writing - review and editing
    Competing interests
    No competing interests declared
  7. Winston D Byblow

    Department of Exercise Sciences, University of Auckland, Auckland, New Zealand
    Contribution
    Conceptualization, Data curation, Writing - review and editing
    Competing interests
    No competing interests declared
  8. Gert Kwakkel

    1. Rehabilitation Research Centre, Reade, Amsterdam, Netherlands
    2. Rehabilitation Medicine, Amsterdam UMC - Location VUMC, Amsterdam Movement Sciences, Amsterdam, Netherlands
    Contribution
    Conceptualization, Data curation, Writing - review and editing
    Competing interests
    No competing interests declared
  9. John W Krakauer

    1. Department of Neurology, Johns Hopkins University, Baltimore, United States
    2. Department of Neuroscience, Johns Hopkins University, Baltimore, United States
    3. Department of Physical Medicine and Rehabilitation, Baltimore, United States
    4. Santa Fe Institute, Santa Fe NM, United States
    Contribution
    Conceptualization, Data curation, Investigation, Writing - original draft, Writing - review and editing
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-4316-1846

Funding

National Institute of Neurological Disorders and Stroke (R01 NS097423)

  • Jeff Goldsmith

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

We acknowledge the EXPLICIT-stroke consortium for collecting data. This data collection was supported by funding from the Royal Dutch Society of Physical Therapy and the Netherlands Organization for Health Research and Development (ZonMw; Grant No. 89000001). JG’s work was supported in part by NIH-funded R01NS097423.

Ethics

Our work reanalyzes several previously reported datasets. These were collected under protocols that ensured informed consent and consent to publish was obtained; details of the protocols and approvals are available in papers originally reporting these datasets.

Senior Editor

  1. Michael J Frank, Brown University, United States

Reviewing Editor

  1. Taraz Lee, University of Michigan, United States

Reviewers

  1. Howard Bowman, University of Kent, United Kingdom
  2. Yu-Kang Tu, Institute of Epidemiology and Preventive Medicine, College of Public Health, National Taiwan University, Taiwan
  3. Thomas Hope

Publication history

  1. Preprint posted: May 21, 2021 (view preprint)
  2. Received: May 20, 2022
  3. Accepted: October 17, 2022
  4. Accepted Manuscript published: October 18, 2022 (version 1)
  5. Version of Record published: November 10, 2022 (version 2)

Copyright

© 2022, Goldsmith et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

Metrics

  • 412
    Page views
  • 104
    Downloads
  • 0
    Citations

Article citation count generated by polling the highest count across the following sources: Crossref, PubMed Central, Scopus.

Download links

A two-part list of links to download the article, or parts of the article, in various formats.

Downloads (link to download the article as PDF)

Open citations (links to open the citations from this article in various online reference manager services)

Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)

  1. Jeff Goldsmith
  2. Tomoko Kitago
  3. Angel Garcia de la Garza
  4. Robinson Kundert
  5. Andreas Luft
  6. Cathy Stinear
  7. Winston D Byblow
  8. Gert Kwakkel
  9. John W Krakauer
(2022)
Arguments for the biological and predictive relevance of the proportional recovery rule
eLife 11:e80458.
https://doi.org/10.7554/eLife.80458

Further reading

    1. Neuroscience
    Kevin Vinberg, Jörgen Rosén ... Fredrik Ahs
    Research Article Updated

    Understanding the neural basis for individual differences in the skin conductance response (SCR) during discriminative fear conditioning may inform on our understanding of autonomic regulation in fear-related psychopathology. Previous region-of-interest (ROI) analyses have implicated the amygdala in regulating conditioned SCR, but whole brain analyses are lacking. This study examined correlations between individual differences in SCR during discriminative fear conditioning to social stimuli and neural activity throughout the brain, by using data from a large functional magnetic resonance imaging study of twins (N = 285 individuals). Results show that conditioned SCR correlates with activity in the dorsal anterior cingulate cortex/anterior midcingulate cortex, anterior insula, bilateral temporoparietal junction, right frontal operculum, bilateral dorsal premotor cortex, right superior parietal lobe, and midbrain. A ROI analysis additionally showed a positive correlation between amygdala activity and conditioned SCR in line with previous reports. We suggest that the observed whole brain correlates of SCR belong to a large-scale midcingulo-insular network related to salience detection and autonomic-interoceptive processing. Altered activity within this network may underlie individual differences in conditioned SCR and autonomic aspects of psychopathology.

    1. Neuroscience
    Rong Zhao, Stacy D Grunke ... Joanna L Jankowsky
    Research Article

    Neurodegenerative diseases are characterized by selective vulnerability of distinct cell populations; however, the cause for this specificity remains elusive. Here we show that entorhinal cortex layer 2 (EC2) neurons are unusually vulnerable to prolonged neuronal inactivity compared with neighboring regions of the temporal lobe, and that reelin+ stellate cells connecting EC with the hippocampus are preferentially susceptible within the EC2 population. We demonstrate that neuronal death after silencing can be elicited through multiple independent means of activity inhibition, and that preventing synaptic release, either alone or in combination with electrical shunting, is sufficient to elicit silencing-induced degeneration. Finally, we discovered that degeneration following synaptic silencing is governed by competition between active and inactive cells, which is a circuit refinement process traditionally thought to end early in postnatal life. Our data suggests that the developmental window for wholesale circuit plasticity may extend into adulthood for specific brain regions. We speculate that this sustained potential for remodeling by entorhinal neurons may support lifelong memory but renders them vulnerable to prolonged activity changes in disease.