The influence of task outcome on implicit motor learning

  1. Hyosub E Kim  Is a corresponding author
  2. Darius E Parvin
  3. Richard B Ivry
  1. University of California, Berkeley, United States
  2. University of Delaware, United States
Research Communication
Cite this article as: eLife 2019;8:e39882 doi: 10.7554/eLife.39882

Abstract

Recent studies have demonstrated that task success signals can modulate learning during sensorimotor adaptation tasks, primarily through engaging explicit processes. Here, we examine the influence of task outcome on implicit adaptation, using a reaching task in which adaptation is induced by feedback that is not contingent on actual performance. We imposed an invariant perturbation (rotation) on the feedback cursor while varying the target size. In this way, the cursor either hit or missed the target, with the former producing a marked attenuation of implicit motor learning. We explored different computational architectures that might account for how task outcome information interacts with implicit adaptation. The results fail to support an architecture in which adaptation operates in parallel with a model-free operant reinforcement process. Rather, task outcome may serve as a gain on implicit adaptation or provide a distinct error signal for a second, independent implicit learning process.

Editorial note: This article has been through an editorial process in which the authors decide how to respond to the issues raised during peer review. The Reviewing Editor's assessment is that all the issues have been addressed (see decision letter).

https://doi.org/10.7554/eLife.39882.001

Introduction

Multiple learning processes contribute to successful goal-directed actions in the face of changing physiological states, body structures, and environments (Taylor et al., 2014; Huberdeau et al., 2015; McDougle et al., 2016). Among these processes, implicit sensorimotor adaptation is of primary importance for maintaining appropriate calibration of sensorimotor maps over both short and long timescales. A large body of work has focused on how sensory prediction error (SPE), the difference between predicted and actual sensory feedback, drives sensorimotor adaptation (Shadmehr et al., 2010). In addition, there is growing appreciation of the contribution of other processes to sensorimotor learning, including strategic aiming and reward-based learning (Taylor et al., 2014; Wu et al., 2014; Bond and Taylor, 2015; Galea et al., 2015; Nikooyan and Ahmed, 2015; Summerside et al., 2018). In terms of the latter, several recent studies have shown that rewarding successful actions alone is sufficient to learn a new sensorimotor mapping (Izawa and Shadmehr, 2011; Therrien et al., 2016; Therrien et al., 2018).

Little is known about how feedback about task outcome impacts adaptation from SPE; indeed, the literature presents an inconsistent picture of how reward impacts performance in sensorimotor adaptation tasks. For example, two recent visuomotor rotation studies using similar tasks and reward structures led to divergent conclusions: One reported that reward enhanced retention of the adapted state, but had no effect on the rate of adaptation (Galea et al., 2015), whereas the other reported a beneficial effect of rewards specifically on adaptation rate (Nikooyan and Ahmed, 2015). More recently, Leow and colleagues (Leow et al., 2018) created a situation in which task outcome was experimentally manipulated by shifting the target on-line to either intersect a rotated cursor or move away from the cursor. Task success, artificially imposed by allowing the displaced cursor to intersect the target, led to attenuated adaptation.

One factor that may contribute to these inconsistencies is highlighted by studies showing that, even in relatively simple sensorimotor adaptation tasks, overall behavior reflects a combination of explicit and implicit processes (Taylor and Ivry, 2011; Taylor et al., 2014). That is, while SPE is thought to drive adaptation (Tseng et al., 2007), participants are often consciously aware of the perturbation and strategically aim as one means to counteract the perturbation. It may be that reward promotes the activation of such explicit processes (Bond and Taylor, 2015). Consistent with this hypothesis, Codol and colleagues (Codol et al., 2017) showed that at least one of the putative effects of reward, the strengthening of motor memories (Shmuelof et al., 2012), is primarily the result of re-instantiating an explicit aiming strategy rather than of a direct modulation of adaptation. As explicit processes are more flexible than implicit processes (Bond and Taylor, 2015), differential demands on strategies may contribute toward the inconsistent effects reported across previous studies manipulating reward (Holland et al., 2018).

We recently introduced a new method, referred to as clamped visual feedback, designed to isolate implicit adaptation (Morehead et al., 2017; Kim et al., 2018). During the clamp, the angular trajectory of the feedback cursor is invariant with respect to the target location and thus spatially independent of hand position (Shmuelof et al., 2012; Vaswani et al., 2015; Morehead et al., 2017; Kim et al., 2018; Vandevoorde and Orban de Xivry, 2018). Participants are informed of the invariant nature of the visual feedback and instructed to ignore it. In this way, explicit aiming should be eliminated and, thus, allow for a clean probe of implicit learning (Morehead et al., 2017).

Here, we employ the clamp method to revisit how task outcome, even when divorced from actual performance, influences implicit adaptation. In a series of three experiments, the clamp angle was held constant and only the target size was manipulated. We assume that the clamp angle, defined with respect to the centers of the target and feedback cursor, specifies the SPE. In contrast, by varying the target size, we independently manipulate the information regarding task outcome, comparing conditions in which the feedback cursor signals the presence or absence of a target error (TE), defined in a binary manner by whether the cursor misses or hits the target. Given that the participants are aware that they have no control over the feedback cursor, the effect of this task outcome information would presumably operate in an implicit, automatic manner, similar to how we assume the clamped feedback provides an invariant SPE signal.

Our experiments show that hitting the target has a strong effect on performance, attenuating the rate and magnitude of learning. Through computational modeling, we explore hypotheses that might account for this effect, considering three models in which implicit learning is driven by both SPE and task outcome information. In the first two models, hitting the target serves as an intrinsic reward signal that either reinforces associated movements or directly modulates adaptation. In the third model, hitting or missing the target serves as a task-outcome feedback signal that drives a second implicit learning process, one that operates in parallel with implicit adaptation.

Results

In all experiments, we used clamped visual feedback, in which the angular trajectory of a feedback cursor is invariant with respect to the target location and thus spatially independent of hand position (Morehead et al., 2017). The instructions (see Supplementary file 1-Target Size Experiment Instructions) emphasized that the participant’s behavior would not influence the cursor trajectory: They were to ignore this stimulus and always aim directly for the target. This method allows us to isolate implicit learning from an invariant error, eliminating potential contributions from explicit aiming that might be used to reduce task performance error.

Experiment 1

In Experiment 1, we asked if the task outcome, defined in terms of whether or not the cursor hit the target, would modulate learning under conditions in which the feedback is not contingent on behavior. We tested three groups of participants (n = 16/group) with a 3.5° clamp for 80 cycles (eight targets per cycle). The purpose of this experiment was to examine the effects of three different relationships between the clamp and target while holding the visual error (defined as the center-to-center distance between the cursor and target) constant (Figure 1b): Hit Target (when the terminal position of the clamped cursor is fully embedded within a 16 mm diameter target), Straddle Target (when roughly half of the cursor falls within a 9.8 mm target, with the remaining part outside the target), Miss Target (when the cursor is fully outside a 6 mm target).

Figure 1. Hitting the target attenuates the behavioral change from clamped feedback.

(a) During clamped visual feedback, the angular deviation of the cursor feedback is held constant throughout the perturbation block, and participants are fully informed of the manipulation. (b) The clamp angle was equal across all three conditions tested in Experiment 1, with only the target size varying between conditions. (c) Block design for experiment. (d) As in previous studies with clamped feedback, the manipulation elicits robust changes in hand angle. However, the effect was attenuated in the Hit Target condition, observed in the (e) rate of early adaptation, and, more dramatically, in (f) late learning. (g) The proportion of learning retained over the no feedback block following the clamp did not differ between groups. Dots represent individuals; shading and error bars denote SEM.

https://doi.org/10.7554/eLife.39882.002

Hitting the target reduced the overall change in behavior (Figure 1d). Statistically, there was a marginal difference in the rate of initial adaptation (one-way ANOVA: F(2,45)=2.67, p=0.08, η2 = 0.11; permutation test: p=0.08; Figure 1e) and a significant effect on late learning (F(2,45)=4.44, p=0.016, η2 = 0.17; Figure 1f). For the latter measure, the value for the Hit Target group was approximately 35% lower than for the Straddle and Miss Target groups, with post-hoc comparisons confirming the substantial differences in late learning between the Hit Target and both the Straddle Target (95% CI [−16.13°, −2.34°], t(30)=-2.73, p=0.010, d = 0.97) and Miss Target (95% CI [−16.76°, −2.79°], t(30)=-2.86, p=0.008, d = 1.01) groups. These differences were also evident in the aftereffect measure, taken from the first cycle of the no feedback block (see Materials and methods). The learning functions for the Straddle and Miss Target groups were remarkably similar throughout the entire clamp block and reached similar magnitudes of late learning (95% CI [−7.90°, 8.97°], t(30)=0.13, p=0.898, d = 0.05).

As seen in Figure 1d, the change in hand angle from the final cycle of the clamp block to the final cycle of the no feedback block was less for the Hit than the Straddle and Miss groups (one-way ANOVA: F(2,45)=4.42, p=0.018, η2 = 0.16; Hit vs Miss: 95% CI [1.47°, 8.00°], t(30)=2.96, p=0.006, d = 1.05; Hit vs Straddle: 95% CI [1.06°, 8.74°], t(30)=2.61, p=0.014, d = 0.92). This result indicates that retention was strongest in the Hit group. However, retention is generally analyzed as a relative, rather than absolute measure, especially when the amount of learning differs between groups. We thus re-analyzed the change in hand angle across the no feedback block, but now as the ratio of the last no-feedback cycle relative to the last clamp cycle. In this analysis, there was no difference between the three groups (Figure 1g; F(2,45)=2.06, p=0.139, η2 = 0.08; permutation test: p=0.138).
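The distinction between the absolute and the relative retention measure can be made concrete with a minimal sketch. The function name and the hand angles below are hypothetical, chosen only to illustrate how two groups can differ in absolute change while retaining the same proportion of what they learned.

```python
def retention_measures(last_clamp_deg, last_no_feedback_deg):
    """Return (absolute change, retention ratio) for one group.

    absolute change: drop in hand angle from the last clamp cycle
                     to the last no-feedback cycle
    retention ratio: last no-feedback cycle relative to last clamp cycle
    """
    absolute_change = last_clamp_deg - last_no_feedback_deg
    ratio = last_no_feedback_deg / last_clamp_deg
    return absolute_change, ratio


# Hypothetical values (degrees), not data from the experiment.
small_learner = retention_measures(15.0, 12.0)
large_learner = retention_measures(25.0, 20.0)
print(small_learner, large_learner)
```

In this illustration the group that learned less also shows the smaller absolute change, yet both groups retain the same proportion, which is why the ratio is the more appropriate measure when the amount of learning differs.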

Interestingly, the results from this experiment are qualitatively different to those observed when manipulating the angular deviation of the clamp. Our previous study using clamped visual feedback demonstrated that adaptation in response to errors of varying size, which was assessed by manipulating the clamp angle, results in different early learning rates, but produces the same magnitude of late learning (Kim et al., 2018). In contrast, the results in Experiment 1 show that hitting the target attenuates learning, with the effect becoming pronounced after prolonged exposure to the perturbation. Furthermore, the effect of task outcome appears to be categorical, as it was only observed for the condition in which the cursor was fully embedded within the target (Hit Target), and not when the terminal position of the cursor fell partially outside the target (Straddle Target).

Experiment 2

Experiment 2 was designed to extend the results of Experiment 1 in two ways: First, to verify that the effect of hitting a target generalized to other contexts, we changed the size of the clamp angle. We tested two groups of participants (n = 16/group) with a small 1.75° clamp. For the Hit Target group (Figure 2a), we used the large 16 mm target, and thus, the cursor was fully embedded. For the Straddle Target group, we used the small 6 mm diameter target, resulting in an endpoint configuration in which the cursor was approximately half within the target and half outside the target. We did not test a Miss Target condition because having the clamped cursor land fully outside the target would have necessitated an impractically small target (~1.4 mm). Moreover, the results of Experiment 1 indicate that this condition is functionally equivalent to the Straddle Target group. The second methodological change was made to better assess asymptotic learning. We increased the number of clamped reaches to each location to 220 (reducing the number of target locations to four to keep the experiment within a 1.5 hr session). This resulted in a nearly three-fold increase in the number of clamped reaches per location.
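The relationship between clamp angle, target size, and task outcome can be checked with a short geometric sketch. The target distance (80 mm) and cursor diameter (3.5 mm) used here are assumptions, since neither value is stated in this section; they are used because they reproduce the ~1.4 mm figure quoted above. The function names are hypothetical.

```python
import math

# ASSUMED display geometry (not given in this section).
TARGET_DISTANCE_MM = 80.0
CURSOR_DIAMETER_MM = 3.5


def center_offset_mm(clamp_deg, distance_mm=TARGET_DISTANCE_MM):
    """Chord length between cursor and target centers at the target radius."""
    return 2.0 * distance_mm * math.sin(math.radians(clamp_deg) / 2.0)


def classify(clamp_deg, target_diameter_mm):
    """Label the endpoint configuration as 'hit', 'straddle', or 'miss'."""
    offset = center_offset_mm(clamp_deg)
    r_target = target_diameter_mm / 2.0
    r_cursor = CURSOR_DIAMETER_MM / 2.0
    if offset + r_cursor <= r_target:
        return "hit"       # cursor fully embedded in the target
    if offset - r_cursor >= r_target:
        return "miss"      # cursor fully outside the target
    return "straddle"      # cursor overlaps the target edge


def largest_miss_target_mm(clamp_deg):
    """Largest target diameter for which the cursor still lands fully outside."""
    return 2.0 * (center_offset_mm(clamp_deg) - CURSOR_DIAMETER_MM / 2.0)


for clamp, diameter in [(3.5, 16), (3.5, 9.8), (3.5, 6), (1.75, 16), (1.75, 6)]:
    print(clamp, diameter, classify(clamp, diameter))
print(round(largest_miss_target_mm(1.75), 2))
```

Under these assumed dimensions, the three target sizes of Experiment 1 map onto the Hit, Straddle, and Miss configurations, and a full miss with the 1.75° clamp would require a target of roughly 1.4 mm.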

Figure 2. The effects of hitting a target generalize to a different context and remain consistent at asymptote.

The attenuation of adaptation caused by hitting the target (a) generalizes to a different clamp angle and is stable over an extended clamp block (b). As in Experiment 1, there was (c) a marginal difference in early adaptation rate that became (d) a more dramatic difference in late learning. (e) Again, there was no difference in the proportion of retention, this time during a 0° clamp block. Dots represent individuals; shading and error bars denote SEM.

https://doi.org/10.7554/eLife.39882.004

Consistent with the results of Experiment 1, the Hit Target group showed an attenuated learning function compared to the Straddle Target group (Figure 2b). Statistically, there was again only a marginal difference in the rate of early adaptation (95% CI [−0.52°/cycle, 0.01°/cycle], t(30)=-1.96, p=0.06, d = 0.69; Figure 2c), whereas the difference in late learning was more pronounced (95% CI [−11.38°, −1.25°], t(30)=-2.54, p=0.016, d = 0.90; permutation test: p=0.007; Figure 2d). Indeed, the 35% attenuation in asymptote for the Hit Target group compared to the Straddle Target group is approximately equal to that observed in Experiment 1.

We used a different approach to examine retention in Experiment 2, having participants complete 10 cycles with a 0° clamp following the extended 1.75° clamp block (Shmuelof et al., 2012). We opted to use this alternative method since the presence of the 0° clamp would create less contextual change when switching from the clamp to the retention block, compared to the no feedback block of Experiment 1. In terms of absolute change across the 0° clamp block, there was a trend for greater retention in the Hit group compared to the Straddle group (95% CI [−0.27°, 3.53°], t(30)=1.75, p=0.090, d = 0.62). However, when analyzed as a proportional change, the difference was not reliable (95% CI [−0.06, 0.27], t(30)=1.27, p=0.21, d = 0.45).

The results of these first two experiments converge in showing that learning from an invariant error is attenuated when the cursor hits the target, relative to conditions in which at least part of the cursor falls outside the target. This effect replicated across two experiments that used different clamp sizes.

Attenuated behavioral changes are not due to differences in motor planning

Although we hypothesized that manipulating target size in Experiments 1 and 2 would influence learning mechanisms that respond to the differential task outcomes (i.e., hit or miss), it is also important to consider alternative explanations for the effect of target size on learning. Figure 3 provides a schematic of the core components of sensorimotor adaptation. The figure highlights that changes in adaptation might arise because target size alters the inputs on which learning operates, rather than from a change in the operation of the learning process itself. For example, increasing the target size may increase perceptual uncertainty, creating a weaker error signal. We test this hypothesis in a control condition in Experiment 3.

Figure 3. Target size could affect adaptation due to increased perceptual uncertainty or greater variability in motor planning.

In the case of perceptual uncertainty, the feedback signal is weakened, thus leading to a weaker SPE signal. In the case of noisy motor planning, the forward model prediction would also be more variable and effectively weaken the SPE.

https://doi.org/10.7554/eLife.39882.006

Another hypothesis centers on how variation in target size might alter motor planning. Assuming target size influences response preparation, participants in the Hit Target groups had reduced accuracy demands relative to the other groups, given that they were reaching to a larger target (Soechting, 1984). If the accuracy demands were reduced for these large targets, then the motor command could be more variable, resulting in more variable sensory predictions from a forward model, and thus a weaker SPE (Körding and Wolpert, 2004). While we do not have direct measures of planning noise, a reasonable proxy can be obtained by examining movement variability during the unperturbed baseline trials (data from clamped trials would be problematic given the induced change in behavior). If there is substantially more noise in the plan for the larger target, then the variability of hand angles should be higher in this group (Churchland et al., 2006). In addition, one may expect faster movement times (or peak velocities) and/or reaction times for reaches to the larger target, assuming a speed-accuracy tradeoff (Fitts, 1992).

Examination of kinematic and temporal variables (see Appendix 1) did not support the noisy motor plan hypothesis. In Experiment 1, average movement variability across the eight targets during cycles 2–10 of the veridical feedback baseline block was not reliably different between groups (variability: F(2,45)=2.32, p=0.110, η2 = 0.093). Movement times did not differ across groups (F(2,45)=2.19, p=0.123, η2 = 0.089). However, we did observe a difference in baseline RTs (F(2,45)=4.48, p=0.017, η2 = 0.166), with post hoc t-tests confirming that the large target (Hit) group had faster RTs than the small target (Miss) group (95% CI [−108 ms, −16 ms], t(30)=-2.74, p=0.010, d = 0.97) and medium target (Straddle) group (95% CI [−66 ms, −10 ms], t(30)=-2.76, p=0.010, d = 0.97). The RTs of the medium target (Straddle) and small target (Miss) groups were not reliably different (95% CI [−74 ms, 26 ms], t(30)=-0.984, p=0.333, d = 0.348). This baseline difference in RTs was only observed in this experiment (see Appendix 1), and there was no correlation between baseline RT and late learning for the large target group (r = 0.09, p=0.73), suggesting that RTs are not associated with the magnitude of learning.

During baseline trials with veridical feedback in Experiment 2, mean spatial variability, measured in terms of hand angle, was actually lower for the group reaching to the larger target (Hit Target group: 3.09° ± 0.18°; Straddle Target group: 3.56° ± 0.16°; t(30)=-1.99, p=0.056, d = 0.70). Further supporting the argument that planning was no different across conditions, neither reaction times (Hit Target: 378 ± 22 ms; Straddle Target: 373 ± 12 ms) nor movement times (Hit Target: 149 ± 8 ms; Straddle Target: 157 ± 8 ms) differed between the groups (t(30)=-0.183, p=0.856, d = 0.06 and t(30)=0.71, p=0.484, d = 0.25, respectively).

One reason for not observing consistent effects of target size on accuracy or temporal measures could be the constraints of the task. Studies showing an effect of target size on motor planning typically utilize point-to-point movements (Soechting, 1984; Knill et al., 2011) in which accuracy requires planning of both movement direction and extent. In our experiments, we utilized shooting movements, thus minimizing demands on the control of movement extent. Endpoint variability is generally larger for movement extent compared to movement direction (Gordon et al., 1994). It is also possible that participants are near ceiling-level performance in terms of hand angle variability.

Theoretical analysis of the effect of task outcome on implicit learning

Having ruled out a motor planning account of the differences in performance in Experiments 1 and 2, we next considered different ways in which target error could affect the rate and asymptotic level of learning. Adaptation from SPE can be thought of as recalibrating an internal model that learns to predict the sensory outcome of a motor command (Figure 3). Here, we model adaptation with a single rate state-space equation of the following form:

(1) x(n+1)=Ax(n)+U(e)

where x represents the motor output on trial n, A is a retention factor, and U represents the update/correction size (or, learning rate) as a function of the error (clamp) size, e. This model is mathematically equivalent to a standard single rate state-space model (Thoroughman and Shadmehr, 2000), with the only modification being the replacement of the error sensitivity term, B, with a correction size function, U (Kim et al., 2018). Unlike standard adaptation studies where error size changes over the course of learning, e is a constant with clamped visual feedback and thus, U(e) can be estimated as a single parameter. We refer to this model as the motor correction variant of the standard state space model. The first two experiments make clear that a successful model must account for the differences between hitting and missing the target, even while holding the error term in Equation (1) (clamp angle) constant.
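Because e is constant under clamped feedback, Equation (1) reduces to a linear recursion with a fixed update, whose fixed point is U/(1 − A). The following minimal sketch simulates that recursion and compares it with the closed-form asymptote. The function names and the parameter values (A = 0.98, U = 0.5°) are illustrative assumptions, not fitted estimates.

```python
def simulate_clamp(A, U, n_trials, x0=0.0):
    """Iterate x(n+1) = A * x(n) + U with a constant update U.

    Returns the list of states x(0), ..., x(n_trials).
    """
    x = [x0]
    for _ in range(n_trials):
        x.append(A * x[-1] + U)
    return x


def asymptote(A, U):
    """Fixed point of the recursion: x* = A * x* + U  =>  x* = U / (1 - A)."""
    return U / (1.0 - A)


# Hypothetical parameters for illustration only.
A, U = 0.98, 0.5
trace = simulate_clamp(A, U, n_trials=2000)
print(trace[-1], asymptote(A, U))
```

In this form, the asymptote depends on both the update size and the retention factor, so a change in late learning could in principle arise from either parameter.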

We consider three variants to the basic model that might account for how task outcome influences learning. The first model is motivated by previous studies that have considered how reinforcement processes might operate in sensorimotor adaptation tasks, and in particular, the idea that task outcome information impacts a model-free operant reinforcement process (Huang et al., 2011; Shmuelof et al., 2012). We can extend this idea to the clamp paradigm, considering how the manipulation of target size affects reward signals: When the clamp hits the target, the feedback generates a positive reinforcement signal; when the clamp misses (or straddles) the target, this reinforcement signal is absent. We refer to the positive outcome as an intrinsic reward given that it is not contingent on the participant’s behavior. This signal could strengthen the representation of its associated movement (Gonzalez Castro et al., 2011; Shmuelof et al., 2012), and thus increase the likelihood that future movements will be biased in a similar direction.

We combine this idea with the state space model to create a Movement Reinforcement model (Figure 4a). Here, a model-free reinforcement-based process is combined with a model-based adaptation process. Intuitively, this model accounts for the attenuated learning functions for the Hit conditions in Experiments 1 and 2 because the effect of movement reinforcement resists the directional change in hand angle induced by SPEs. In this model, intrinsic reward has no direct effect on SPE-driven adaptation. That is, reward and error-based learning are assumed to operate independently of each other, with the final movement being the sum of these two processes.

Figure 4. Three models of how intrinsic reward or target error could affect learning.

(a) In the Movement Reinforcement model, reward signals cause reinforcement learning processes to bias future movements toward previously rewarded movements. The adaptation process is sensitive only to SPE and not reward. The overall movement reflects a composite of the two processes. (b) In the Adaptation Modulation model, reward directly attenuates adaptation to SPE. (c) In the Dual Error model, a second, independent implicit learning process, one driven by TE, combines with SPE-based adaptation to modify performance.

https://doi.org/10.7554/eLife.39882.007

Here, we formalize a Movement Reinforcement model, taking this as illustrative of a broad class of operant reinforcement models in which the reinforcement process acts in parallel to a traditional state space model of sensorimotor adaptation. The motor output, y, is a weighted sum of a model-free reinforcement process and an adaptation process, x:

(2) y(n)=(1−Vl(n))x(n)+Vl(n)Vd(n)

where a population vector (Georgopoulos et al., 1986), V, indicates the current bias of motor representations within the reinforcement system (see Materials and methods). The direction of this vector (Vd) corresponds to the mean preferred direction resulting from the reinforcement history, with the length (Vl) corresponding to the strength of this biasing signal. The length of this vector can thus be viewed as a weight on the movement reinforcement process (0 = no weight, 1 = full weight), relative to the adaptation process.

In sum, the Movement Reinforcement model entails four parameters, composed of separate update and retention parameters for the reinforcement learning process and the adaptation process (see Materials and methods). The former is model-free, dependent on an operant conditioning process by which a task outcome signal modifies movement biases, whereas the latter is model-based, using SPE to recalibrate an internal model of the sensorimotor map. Importantly, the predictions of this model are not dependent on whether we model the effect of reinforcement as operating on discrete units, as we have done here, or as basis functions (Donchin et al., 2003; Tanaka et al., 2012; Taylor et al., 2013).
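The output stage of this model can be sketched as follows, reading Equation (2) as a weighted sum in which the vector length Vl weights the reinforced direction Vd and 1 − Vl weights the adaptation state x, as the text describes. Only this mixing step is shown; the update rules for V are given in Materials and methods and are not reproduced here. The function name and the example values are hypothetical.

```python
def motor_output(x, v_length, v_direction):
    """Weighted sum of adaptation state and reinforced direction.

    x           : adaptation state (degrees)
    v_length    : length of the population vector, used as a weight in [0, 1]
    v_direction : direction of the population vector (degrees)
    """
    if not 0.0 <= v_length <= 1.0:
        raise ValueError("v_length must lie in [0, 1]")
    return (1.0 - v_length) * x + v_length * v_direction


# Hypothetical example: adaptation has reached 20 degrees, while the
# reinforcement system is biased toward the baseline direction (0 degrees).
print(motor_output(x=20.0, v_length=0.25, v_direction=0.0))
```

With a reinforcement bias anchored near baseline, any nonzero weight pulls the output below the adaptation state, which is the intuition for the attenuated learning functions in the Hit conditions.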

The second model entails a single process whereby the task outcome directly modulates the adaptation process. For example, an intrinsic reward signal associated with hitting the target could modulate adaptation, attenuating the trial-to-trial change induced by the SPE (Figure 4b). In this Adaptation Modulation model, the reward signal can be interpreted as a gain controller, similar to previous efforts to model the effect of explicit rewards and punishments on adaptation (Galea et al., 2015). In Experiments 1 and 2, hitting the target presumably reduces the gain on adaptation, thus leading to attenuated learning.

We formalize the Adaptation Modulation model as follows:

(3) x(n+1)=γAAx(n)+γuU(e)

where γA and γu are gains on the retention and update parameters, respectively. In the current implementation, we set γA and γu to one on miss trials and estimate the values of γA and γu for the hit trials. Although this could be reversed (e.g. set gains to one on hit trials and estimate values on miss trials), our convention seems more consistent with previous modeling studies of adaptation where the movements generally miss the target. We impose no additional constraint on the gain parameters; the effect of retention or updating can be larger or smaller on hit trials compared to miss trials. As with the Movement Reinforcement model, the Adaptation Modulation model has four free parameters.
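A minimal sketch of Equation (3) under a constant clamp is given below. The gains are set to one on miss trials and applied only on hit trials, following the convention described above. All parameter values are hypothetical; the update gain of 0.65 is chosen simply to mimic an attenuation of about 35% at asymptote and is not a fitted estimate.

```python
def simulate_adaptation_modulation(A, U, gamma_A, gamma_U, hits, x0=0.0):
    """Iterate x(n+1) = gA * A * x(n) + gU * U.

    hits is a sequence of booleans, one per trial. The gains (gA, gU)
    equal (gamma_A, gamma_U) on hit trials and (1, 1) on miss trials.
    """
    x = [x0]
    for hit in hits:
        g_a, g_u = (gamma_A, gamma_U) if hit else (1.0, 1.0)
        x.append(g_a * A * x[-1] + g_u * U)
    return x


# Hypothetical parameters for illustration only.
A, U = 0.98, 0.5
n = 3000
miss = simulate_adaptation_modulation(A, U, 1.0, 0.65, [False] * n)
hit = simulate_adaptation_modulation(A, U, 1.0, 0.65, [True] * n)
print(miss[-1], hit[-1], hit[-1] / miss[-1])
```

On hit trials the asymptote becomes γu·U / (1 − γA·A), so attenuated late learning can arise from a reduced update gain, a reduced retention gain, or both.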

The third model we consider here, the Dual Error model, postulates that learning is the composite of two implicit learning processes that operate on different error signals. The first is an adaptation process driven by SPE (as in Equation (1)). The second process operates in the same manner as adaptation, but here the error signal is sensitive to the task outcome. This idea of a TE-sensitive process stems from previous studies in which an error is produced, not by perturbing the visual feedback of hand position, but rather by displacing the visual feedback of the target position (Magescas and Prablanc, 2006; Cameron et al., 2010a; Cameron et al., 2010a; Schmitz et al., 2010). The resulting mismatch between the hand position and displaced target position can be viewed as a TE rather than SPE, under the assumption that the veridical feedback of hand position roughly matches the predicted hand position (see Discussion). When this error signal is consistent (e.g. target is displaced in the same direction on every trial), a gradual change in heading angle is observed, similar to that seen in studies of visuomotor adaptation. Moreover, this form of learning is implicit: By shifting the target position during a saccade, just prior to the reach, the participants are unaware of the target displacement.

In the Dual Error model, the motor output is the sum of two processes:

(4) xtotal(n)=xspe(n)+xte(n)

where

(5) xspe(n+1)=Aspexspe(n)+Uspe(SPE)
(6) xte(n+1)=Atexte(n)+Ute(TE)

Equation (5) is the same as in the other two models, describing adaptation from a sensory prediction error, but with the notation modified here to explicitly contrast with the second process. Equation (6) describes a second implicit learning process, but one that is driven by the target error.

The SPE-sensitive process updates from the error term on every trial given that the SPE is always present, even on hit trials. In contrast, the TE-sensitive process only updates from the error term on miss trials. The error component of Equation (6) is absent on hit trials. This would account for the attenuated learning observed in the large target (Hit) conditions in Experiments 1 and 2. In the context of our clamp experiments, TE is modeled as a step function (Figure 4c), set to 0 when the cursor hits the target and 1 when the cursor misses or straddles the target. However, if the cursor position varied (as in studies with contingent feedback), TE might take on continuous, signed values, similar to SPE.

We note that the Dual Error model is similar to the influential two-process state space model of adaptation introduced by Smith and colleagues (Smith et al., 2006). In their model, dual-adaptation processes have different learning rates and retention factors, resulting in changes that occur over different time scales. Here, the different learning rates and retention factors are related to the different error signals, TE and SPE. Whereas the dual-rate model imposes a constraint on the parameters (i.e. process with faster learning must also have faster forgetting), the four parameters in the Dual Error model are unconstrained relative to each other.
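Equations (4) to (6) can be sketched as two parallel recursions whose states are summed. In this sketch the TE-driven update is written as a constant U_te multiplied by a binary TE (0 on hits, 1 otherwise), which is one simple way to implement the step function described above and should be read as an assumption. Parameter values and names are hypothetical.

```python
def simulate_dual_error(A_spe, U_spe, A_te, U_te, hits):
    """Sum of an SPE-driven and a TE-driven process.

    x_spe(n+1) = A_spe * x_spe(n) + U_spe        (SPE present on every trial)
    x_te(n+1)  = A_te  * x_te(n)  + U_te * TE    (TE = 0 on hits, 1 otherwise)
    Returns x_total for trials 0..len(hits).
    """
    x_spe, x_te = 0.0, 0.0
    total = [x_spe + x_te]
    for hit in hits:
        te = 0.0 if hit else 1.0
        x_spe = A_spe * x_spe + U_spe
        x_te = A_te * x_te + U_te * te
        total.append(x_spe + x_te)
    return total


# Hypothetical parameters for illustration only.
params = dict(A_spe=0.98, U_spe=0.3, A_te=0.95, U_te=0.5)
n = 3000
miss = simulate_dual_error(hits=[False] * n, **params)
hit = simulate_dual_error(hits=[True] * n, **params)
print(miss[-1], hit[-1])
```

When every trial is a hit, the TE-sensitive state never leaves zero, so the total asymptote reflects the SPE-sensitive process alone.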

Experiment 3

The experimental design employed in Experiments 1 and 2 cannot distinguish between these three models because all make qualitatively similar predictions. In the Movement Reinforcement model, the attenuated asymptote in response to Hit conditions arises because movements are rewarded throughout, including during early learning, biasing future movements toward baseline. The Adaptation Modulation model predicts a lower asymptote during the Hit condition because the adaptation system is directly attenuated by reward. The Dual Error model similarly predicts a lower asymptote because only one of two learning processes is active when there is no target error.

In contrast to the single perturbation blocks used in Experiments 1 and 2, a transfer design in which the target size changes after an initial adaptation phase affords an opportunity to contrast the three models. In Experiment 3, we tested two groups of participants (n = 12/group) with a 1.75° clamp, varying the target size between the first and second halves of the experiment (Figure 5a). The key manipulation centered on the order of when the target was large (hit condition) or small (straddle condition).

Within-subject transfer design to evaluate models of the impact of task outcome on implicit motor learning.

(a) Using a transfer design, (b) the models diverge in their behavioral predictions for the Straddle-to-Hit group following transfer. The Movement Reinforcement model predicts a persistent asymptote following transfer, whereas the Adaptation Modulation and Dual Error models predict a decay in hand angle. During the acquisition phase, we again observed differences between the Hit and Straddle groups in the (c) early adaptation rate as well as (d) late learning. All participants in both groups demonstrated changes in reach angle consistent with the Adaptation Modulation and Dual Error models. (e) The learning functions were inconsistent with the Movement Reinforcement model. Note that the rise in hand angle for the Hit-to-Straddle group is consistent with all three models. Dots represent individuals; shading and error bars denote SEM.

https://doi.org/10.7554/eLife.39882.008

For the Straddle-to-Hit group, a small target was used in an initial acquisition phase (first 120 clamp cycles). Based on the results of Experiments 1 and 2, we expect to observe a relatively large change in hand angle at the end of this phase since the outcome is always an effective ‘miss’. The key test comes during the transfer phase (final 80 clamp cycles), in which the target size is increased such that the invariant clamp now results in a target hit. For the Movement Reinforcement model, hitting the target will produce an intrinsic reward signal, reinforcing the associated movement. Therefore, there should be no change in performance (hand angle) following transfer since the SPE remains the same and the current movements are now reinforced (Figure 5b). In contrast, both the Adaptation Modulation and Dual Error models predict that, following transfer to the large target, there will be a drop in hand angle, relative to the initial asymptote. For the former, hitting the target will attenuate the adaptation system; for the latter, hitting the target will shut down learning from the process that is sensitive to target error.

We also tested a second group in which the large target (hit) was used in the acquisition phase and the small target (effective ‘miss’) in the transfer phase (Hit-to-Straddle group). All three models make the same qualitative predictions for this group. At the end of the acquisition phase, there should be a smaller change in hand angle compared to the Straddle-to-Hit group, due to the persistent target hits. Following transfer, all three models predict an increase in hand angle, relative to the initial asymptote. For the Movement Reinforcement model, the reduction in target size removes the intrinsic reward signal, which over time, lessens the contribution of the reinforcement process as the learned movement biases decay in strength. The Adaptation Modulation model predicts that hand angle will increase due to the removal of the attenuating effect on adaptation following transfer. The Dual Error model also predicts an increase in hand angle, but here the effect occurs because the introduction of a target error activates the second implicit learning process. Although the Hit-to-Straddle group does not provide a discriminative test between the three models, the inclusion of this group does provide a second test of each model, as well as an opportunity to rule out alternative hypotheses for the behavioral effects at transfer. For example, the absence of a change at transfer might be due to reduced sensitivity to the clamp following a long initial acquisition phase.
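The transfer predictions for a gain-type account can be illustrated with a short simulation. This is a sketch under stated assumptions rather than the authors' model code: task outcome simply selects which retention factor and per-cycle update apply on a given cycle, and the parameter values are illustrative, not fitted estimates. The 120 and 80 cycle phase lengths follow the design described above.

```python
import numpy as np

def simulate_adaptation_modulation(is_hit, a_miss, u_miss, a_hit, u_hit):
    """Hand angle (deg) per cycle when task outcome sets the retention and update terms."""
    hits = np.asarray(is_hit, dtype=bool)
    x = 0.0
    hand = np.empty(hits.size)
    for n, hit in enumerate(hits):
        hand[n] = x
        # Outcome-dependent parameters: the update is attenuated after a hit
        a, u = (a_hit, u_hit) if hit else (a_miss, u_miss)
        x = a * x + u
    return hand

ACQ_CYCLES, TRANSFER_CYCLES = 120, 80
straddle_to_hit = np.r_[np.zeros(ACQ_CYCLES, dtype=bool), np.ones(TRANSFER_CYCLES, dtype=bool)]
hit_to_straddle = ~straddle_to_hit

# Hypothetical parameter values chosen only to show the qualitative pattern
params = dict(a_miss=0.95, u_miss=1.0, a_hit=0.975, u_hit=0.4)
s2h = simulate_adaptation_modulation(straddle_to_hit, **params)
h2s = simulate_adaptation_modulation(hit_to_straddle, **params)
```

With these values the Straddle-to-Hit curve drops after transfer and the Hit-to-Straddle curve rises, which is the qualitative pattern this class of model predicts.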

Experiment 3 – behavioral analyses

For our analyses, we first examined performance during the acquisition phase. Consistent with the results from Experiments 1 and 2, the Hit-to-Straddle Target group adapted slower than the Straddle-to-Hit group (95% CI [−0.17°/cycle, −0.83°/cycle], t(22)=-3.15, p=0.005, d = 1.29; Figure 5c) and reached a lower asymptote (95% CI [−5.25°, −15.29°], t(22)=-4.24, p=0.0003, d = 1.73; permutation test: p=0.0003; Figure 5d). The reduction at asymptote was approximately 45%.

We next examined performance during the transfer phase, in which the target size was reversed for the two groups. Our primary measure of behavioral change for each subject was the difference in late learning (average hand angle over the last 10 cycles) between the end of the acquisition phase and the end of the transfer phase. As seen in Figure 5d, the two groups showed opposite changes in behavior in the transfer phase, as evidenced by the strong (group x phase) interaction (F(2,33)=43.1, p<10^−7, partial η² = 0.72). A within-subjects t-test showed that the Hit-to-Straddle group exhibited a marked increase in hand angle following the decrease in target size (95% CI [4.9°, 9.1°], t(11)=7.42, p<0.0001, dz = 2.14; Figure 5e), consistent with the predictions for all three models.
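The late-learning change measure can be written compactly. The sketch below assumes the data are stored as an array of hand angles with one row per participant and one column per clamp cycle (baseline cycles excluded); the function names and this layout are assumptions made for illustration.

```python
import numpy as np

def late_learning(hand_angle, phase_end, window=10):
    """Mean hand angle over the last `window` cycles before index `phase_end`."""
    hand_angle = np.asarray(hand_angle, dtype=float)
    return hand_angle[..., phase_end - window:phase_end].mean(axis=-1)

def transfer_change(hand_angle, acq_cycles=120, transfer_cycles=80, window=10):
    """Late learning at the end of transfer minus late learning at the end of acquisition."""
    end_transfer = acq_cycles + transfer_cycles
    return (late_learning(hand_angle, end_transfer, window)
            - late_learning(hand_angle, acq_cycles, window))

# Toy data: two participants, 10 deg during acquisition, 4 deg during transfer
toy = np.hstack([np.full((2, 120), 10.0), np.full((2, 80), 4.0)])
change = transfer_change(toy)
```

A negative value indicates a drop in hand angle after transfer and a positive value indicates a rise.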

The Straddle-to-Hit group’s transfer performance provides an opportunity to compare differential predictions, and in particular, to pit the Movement Reinforcement model against the other two models. Following the switch to the large target, there was a decrease in hand angle. Applying the same statistical test, the mean decrement in hand angle was 5.7° from the final cycles of the training phase to the final cycles of the transfer phase (95% CI [−3.1°, −8.2°], t(11)=-4.84, p=0.0005, dz = 1.40; Figure 5e). This result is consistent with the prediction of the Adaptation Modulation and Dual Error models. In contrast, the reduction in hand angle cannot be accounted for by the Movement Reinforcement model.
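For readers who want to relate the reported t and dz values, the sketch below computes a within-subjects t statistic and an effect size from per-participant difference scores. It assumes dz is defined in the conventional way (mean difference divided by the standard deviation of the differences); the source does not spell out its formula, and the function name is hypothetical.

```python
import math
import numpy as np

def paired_t_and_dz(differences):
    """Return (t, dz) for a set of within-subject difference scores."""
    d = np.asarray(differences, dtype=float)
    n = d.size
    sd = d.std(ddof=1)                   # sample SD of the differences
    dz = d.mean() / sd                   # standardized mean difference
    t = d.mean() / (sd / math.sqrt(n))   # one-sample t on the differences
    return t, dz

t_toy, dz_toy = paired_t_and_dz([1.0, 2.0, 3.0])
```

Under this definition dz equals t divided by the square root of n, so with n = 12 the reported |t| of 4.84 corresponds to a dz of about 1.40, matching the magnitude given above.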

Experiment 3 – modeling results

We evaluated the three models by simultaneously fitting group-averaged data for both groups. As depicted in Figure 6, all three models capture the initial plateau followed by increased learning of the Hit-to-Straddle group. However, the quality of the fits diverges for the Straddle-to-Hit group, where the Movement Reinforcement model cannot produce a decrease in hand angle once the large target is introduced. Instead, the best-fit parameters for this model result in an asymptote that falls between the hand angle values observed during the latter part of each phase. In contrast, the Adaptation Modulation and Dual Error models both predict the drop in hand angle during the second phase of the experiment for the Straddle-to-Hit group.

Figure 6
Model fits of the learning functions from Experiment 3.

The failure of the (a) Movement Reinforcement model to qualitatively capture the decay in hand angle following transfer in the Straddle-to-Hit condition argues against the idea that the effect of task outcome arises solely from a model-free learning process that operates independent of model-based adaptation. In contrast, both the (b) Adaptation Modulation and (c) Dual Error models accurately predict the changes in hand angle following transfer in both the Hit-to-Straddle and Straddle-to-Hit conditions.

https://doi.org/10.7554/eLife.39882.010

Consistent with the preceding qualitative observations, the Movement Reinforcement model yielded a lower R² value and higher Akaike Information Criterion (AIC) score (higher AIC indicates relatively worse fit) than the Adaptation Modulation and Dual Error models (Table 1). A comparison of the latter two shows that the Dual Error model provides the best account of the results. This model yielded a lower AIC score and accounted for 90% of the variance in the group-averaged data compared to 86% for the Adaptation Modulation model.
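The two fit metrics can be computed from observed and model-predicted group-averaged hand angles. The sketch below is illustrative only: this section does not state which form of AIC was used, so the function assumes the common least-squares form (Gaussian residuals), and the function names are hypothetical.

```python
import numpy as np

def r_squared(observed, predicted):
    """Proportion of variance in `observed` accounted for by `predicted`."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    ss_res = np.sum((observed - predicted) ** 2)
    ss_tot = np.sum((observed - observed.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def aic_least_squares(observed, predicted, n_params):
    """AIC under an assumed Gaussian error model: n * ln(RSS / n) + 2k."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    n = observed.size
    rss = np.sum((observed - predicted) ** 2)
    return n * np.log(rss / n) + 2 * n_params

obs = [1.0, 2.0, 3.0, 4.0]
pred = [1.0, 2.0, 3.0, 5.0]
r2_toy = r_squared(obs, pred)
aic_toy = aic_least_squares(obs, pred, n_params=4)
```

In this form each additional free parameter adds 2 to the score, so a six-parameter model is only preferred over a four-parameter one if it reduces the residual error enough to offset that penalty.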

Table 1
Model evaluations.
https://doi.org/10.7554/eLife.39882.012
Basic models                                     # of free parameters   R-squared   AIC
Movement Reinforcement                           4                      0.824       363
Adaptation Modulation                            4                      0.861       269
Dual Error                                       4                      0.895       156
Hybrid models
Movement Reinforcement + Adaptation Modulation   6                      0.945       −100
Movement Reinforcement + Dual Error              6                      0.945       −97

To better understand the effects of target size on learning and retention, we examined the parameter estimates for the Adaptation Modulation and Dual Error models. We first generated 1000 bootstrapped samples of group-averaged behavior by resampling with replacement from each group. We then fit each of the bootstrapped samples simultaneously and report the results here in terms of 95% confidence intervals. For the Adaptation Modulation model, the estimates of γu*U were larger during miss than hit conditions, with no overlap of the confidence intervals ([.693, 1.302] vs [.182, .573], respectively); thus, the error-driven adjustment in the state of the internal model was much larger after a miss than a hit. For the Dual Error model, the estimates of Uspe were larger than for Ute, again with no overlap of the confidence intervals ([.414, 1.08] vs [.157, .398]), indicating that the state change was more strongly driven by SPE than TE. For each model, the process that produced a larger error-based update also had the lower retention factor, although here there was overlap in the 95% confidence intervals for the latter (γr*A for Miss: [.939, .969] vs Hit: [.961, .989]; Aspe: [.900, .972] vs Ate: [.938, .993]). In sum, our model fits suggest the impact of task outcome (hit or miss) was primarily manifest in the estimates of the learning rate parameters. However, this interpretation is tempered by the correlations observed between certain parameters (Cheng and Sabes, 2006) (see Figure 6—figure supplement 1).
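The bootstrap procedure described above can be sketched as follows. This is an illustration under assumptions, not the authors' code: `fit_fn` is a hypothetical placeholder for whatever routine fits a model to the group-averaged curves and returns parameter estimates, and the percentile method for the interval is assumed rather than stated in the source.

```python
import numpy as np

def bootstrap_ci(groups, fit_fn, n_boot=1000, ci=95.0, seed=0):
    """Percentile confidence intervals for parameters fit to group-averaged data.

    groups: list of arrays, each shaped (participants, cycles).
    fit_fn: hypothetical callable mapping a list of group-mean curves
            to a 1-D sequence of parameter estimates.
    """
    rng = np.random.default_rng(seed)
    groups = [np.asarray(g, dtype=float) for g in groups]
    estimates = []
    for _ in range(n_boot):
        means = []
        for g in groups:
            # Resample participants with replacement within each group
            idx = rng.integers(0, g.shape[0], size=g.shape[0])
            means.append(g[idx].mean(axis=0))
        # Fit all resampled group means together
        estimates.append(np.atleast_1d(fit_fn(means)))
    estimates = np.vstack(estimates)
    tail = (100.0 - ci) / 2.0
    lower, upper = np.percentile(estimates, [tail, 100.0 - tail], axis=0)
    return lower, upper

# Toy check with a stand-in "fit" that reads off the final value of each curve
g1 = np.full((12, 5), 3.0)
g2 = np.full((12, 5), 7.0)
lower_toy, upper_toy = bootstrap_ci([g1, g2], lambda m: [m[0][-1], m[1][-1]], n_boot=50)
```

Non-overlapping intervals for two parameters, as reported for the update terms above, would then indicate a reliable difference under this resampling scheme.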

The behavioral pattern observed in Experiment 3, complemented by the modeling results, is problematic for the Movement Reinforcement model, challenging the idea that the effect of task outcome arises solely from a model-free learning process that operates independent of model-based adaptation. However, this does not exclude the possibility that task outcome information influences both model-free and model-based processes. For example, hitting the target might not only reinforce an executed movement, but might also modulate adaptation. Formally, this hypothesis would correspond to a hybrid model that combines the Adaptation Modulation and Movement Reinforcement models. Indeed, hybrids that combine the Movement Reinforcement model with either the Adaptation Modulation or Dual Error models (see Materials and methods) yield improved model fits and lower AIC values, with the two hybrids producing comparable values (see Table 1).

Control group for testing perceptual uncertainty hypothesis

Across the three experiments, the amount of learning induced by clamped visual feedback was attenuated when participants reached to the large target. We considered if this effect could be due, in part, to the differences between the Hit and Straddle/Miss conditions in terms of perceptual uncertainty. For example, the reliability of the visual error signal might be weaker if the cursor is fully embedded within the target; in the extreme, failure to detect the angular offset might lead to the absence of an error signal on some percentage of the trials.

To evaluate this perceptual uncertainty hypothesis, we tested an additional group in Experiment 3 with a large target, but modified the display such that a bright line, aligned with the target direction, bisected the target (Figure 7). With this display, the feedback cursor remained fully embedded in the target, but was clearly off-center. If the attenuation associated with the large target is due to perceptual uncertainty, then the inclusion of the bisecting line should produce an adaptation effect similar to that observed with small targets. Alternatively, if perceptual uncertainty does not play a prominent role in the target size effect, then the adaptation effects would be similar to that observed with large targets.

Effect of large target is not due to perceptual uncertainty.

(a) A control group was tested in a transfer design in which the large target was used in both phases, but a bisection line was present only during the acquisition phase. (b) The behavior of the control group (magenta) during the acquisition phase was not significantly different than that observed for the group that was tested with the (non-bisected) large target in the acquisition phase of Experiment 3 (re-plotted here in green), suggesting that perceptual uncertainty did not make a substantive contribution to the effects of hitting the target. Note that we do not display transfer data for the large target group since the target size changed for this group. (c) No change in asymptote was observed when going from the bisected target to the standard large target.

https://doi.org/10.7554/eLife.39882.013

Consistent with the second hypothesis, performance during the acquisition phase for the group reaching to a bisected target was similar to that of the group reaching to the standard large target (Hit-to-Straddle, Figure 7). To provide support for this observation, we first performed an omnibus one-way ANOVA on the late learning data at the end of the acquisition phase, given our analysis plan entailed multiple planned pair-wise comparisons. There was a significant effect of group (F(2,33)=9.33, p=0.0006, η² = 0.36). Subsequent planned pair-wise comparisons showed no significant differences between the bisected target and standard large target (Hit-to-Straddle) groups (early adaptation: 95% CI [−0.34°/cycle, 0.22°/cycle], t(22)=-0.47, p=0.64, d = 0.19; late learning: 95% CI [−7.80°, 1.19°], t(22)=-1.52, p=0.14, d = 0.62). In contrast, the group reaching to bisected targets showed slower early adaptation rates (95% CI [−0.81°/cycle, −0.07°/cycle], t(22)=-2.49, p=0.02, d = 1.02) and lower magnitudes of late learning (95% CI [−12.58°, −1.35°], t = −2.57, p=0.017, d = 1.05) when compared with the group reaching to small targets (Straddle-to-Hit).

During the transfer phase, the target size for the perceptual uncertainty group remained large, but the bisection line was removed. If perceptual uncertainty underlies the effect we have attributed to hitting the target, we would expect to observe a decrease in hand angle following transfer, since uncertainty would increase. However, following transfer to the non-bisected large target, there was no change in asymptote (95% CI [−0.87°, 2.32°], t(11)=1.0, p=0.341, dz = 0.29). In sum, the results from this control group indicate that the attenuated adaptation observed when the cursor is fully embedded within the target is not due to perceptual uncertainty.

Discussion

Models of sensorimotor adaptation have emphasized that this form of learning is driven by sensory prediction errors, the difference between the observed and predicted sensory consequences of a movement. In this formulation, task outcome, defined as hitting or missing the target, is not part of the equation (although in most adaptation tasks, the sensory prediction is at the target, thus conflating SPE and task outcome). While a number of recent studies have demonstrated that task outcome signals can influence overall performance in these tasks (Galea et al., 2015; Reichenthal et al., 2016; Leow et al., 2018; van der Kooij et al., 2018), it is unclear whether these reinforcement signals impact sensorimotor adaptation (Shmuelof et al., 2012; Galea et al., 2015), or whether they are exploited by other learning systems, distinct from SPE-driven implicit adaptation (Codol et al., 2018; Holland et al., 2018).

The interpretation of the results from these studies is complicated by the fact that the experimental tasks may conflate different learning processes. In the present study, we sought to avoid this complication by employing a new method to study implicit learning, one in which participants are specifically instructed to ignore an invariant visual error signal, thus eliminating explicit processes (Morehead et al., 2017). Using this clamp method, we observed a striking difference between conditions in which the final position of the cursor was fully embedded in the target compared to conditions in which the cursor either terminated outside or straddled the target: When the cursor was fully embedded, the rate of learning was reduced and the asymptotic level of learning was markedly attenuated.

Characterizing the information associated with task outcome

We manipulated task outcome by varying the size of the target, and, across experiments, manipulated SPE by varying the clamp size. Although the experimental instructions remained unchanged, these stimulus changes might be expected to also influence the perception of the error or motor planning processes. However, the behavioral differences arising from the manipulation of task outcome did not appear to arise from these factors. Movement kinematics were essentially the same when reaching to the different sized targets, and the perceptual control condition showed that reducing perceptual uncertainty did not influence performance. Moreover, the finding in Experiment 1 that the Straddle group performed similarly to the Miss group suggests that the effect of target size is, to some degree, categorical rather than continuous.

With clamped visual feedback, participants have no control over the invariant task outcome. In our earlier work with this method, we hypothesized that the cursor feedback is interpreted by the adaptation system as an error signal. We assume the adaptation system is ‘fooled’ by the temporal correlation between the motion of the hand and feedback signal, even though the participants are fully aware that the angular position of the cursor is causally unrelated to their behavior (Morehead et al., 2017). This hypothesis is consistent with earlier work showing that SPEs will drive implicit adaptation, even at the cost of reduced task success (Mazzoni and Krakauer, 2006; Taylor and Ivry, 2011).

One interpretation of the effect of task outcome is that an automatic signal is generated when the cursor hits the target; that is, this outcome is intrinsically rewarding (Huang et al., 2011; Leow et al., 2018), even though the participant is aware that the outcome does not depend on the accuracy of their movements. In two of our proposed models, we assume that hitting the target leads to the automatic generation of a positive reinforcement signal. In the Movement Reinforcement model, this signal strengthens associated movement representations, producing a bias on behavior. In the Adaptation Modulation model, this signal directly attenuates adaptation. Alternatively, one could emphasize the other side of the coin, namely, that the absence of reward (i.e. missing the target) results in a negative reinforcement signal, or what we refer to here as target error. Consideration of two types of error signals is, of course, central to the Dual Error model. We could also reframe the Adaptation Modulation model: Rather than view adaptation as being attenuated following a positive task outcome, it may be that adaptation is enhanced following a negative task outcome.

With the current procedure, we do not have evidence, independent of the behavior, that the task outcome with non-contingent feedback results in a reinforcement signal (either positive or negative). Methods such as fMRI (Daw et al., 2011) or pupillometry (Manohar et al., 2017) could provide an independent means to assess the presence of well-established signatures of reward. Nonetheless, our results indicate, more generally, that task outcome is an important factor mediating the rate and magnitude of implicit motor learning.

Modeling the influence of task outcome on implicit changes in performance

Our modeling analysis makes clear that parallel, independent activity of sensorimotor adaptation and task outcome-driven operant reinforcement processes cannot account for the behavioral changes observed in the present set of experiments. In particular, the Movement Reinforcement model fails to predict the change in reach direction observed when the target size was increased in the Straddle-to-Hit condition of Experiment 3. In this model, the Straddle-to-Hit group’s asymptotic learning during the acquisition phase is due to the isolated operation of the adaptation system, given that none of the reaches are rewarded. The SPE signal would be expected to persist following transfer, maintaining this asymptote. Moreover, movements in this direction would be further strengthened given that, with the introduction of the large target, they would be reinforced by an intrinsic reward signal. Importantly, the predicted absence of behavioral change following transfer should hold for all models in which a model-free reinforcement-based process is combined with a task outcome-insensitive model-based adaptation process. For example, the prediction is independent of whether the reinforcement process follows a different time course than adaptation (e.g. faster or slower), or if we model the effect of reinforcement as basis functions (Donchin et al., 2003; Tanaka et al., 2012; Taylor et al., 2013) rather than discrete units. Thus, we propose that any model in which adaptation and reinforcement processes act independently will fail to show the observed decrease in hand angle following transfer from a miss condition to a hit condition.

The failure of the Movement Reinforcement model requires that we consider alternatives in which information about the task outcome interacts with model-based processes. The Adaptation Modulation model postulates that a signal associated with the task outcome directly modulates the adaptation process. In the current instantiation, we propose that hitting the target results in an intrinsic reward signal that reduces the gain on adaptation (Leow et al., 2018), although an alternative interpretation would be that missing the target results in an error signal that amplifies the gain. This model was able to account for the reduced asymptote observed in the Straddle-to-Hit condition of Experiment 3, outperforming the Movement Reinforcement model.

The Adaptation Modulation model makes explicit assumptions of previous work in which reward was proposed to act as a gain controller on the adaptation process (Galea et al., 2015; Nikooyan and Ahmed, 2015). In terms of the standard state space model, the results indicate that the main effect of task outcome was on the learning rate parameter. Hitting the target reduced the learning rate by approximately 40%, consistent with other studies showing reduced behavioral changes when hitting the target (Reichenthal et al., 2016; Leow et al., 2018).

Galea et al. (2015) also used a model-based approach to examine the influence of reinforcement on adaptation, comparing conditions in which participants received or lost money during a standard visuomotor rotation task. Their results indicated that reward had a selective effect on the retention parameter in the state space model, suggesting the effect was on memory rather than learning. We also observed higher retention parameters when the cursor hit the target, although the effect size here was a relatively smaller ~3% increase and not reliably different from the miss/straddle condition, based on bootstrapped parameter estimates. We suspect that the effect on retention in Galea et al. (2015) was, in large part, not due to a change in the adaptation process itself, but rather the residual effects of an aiming strategy induced by the reward. That is, the monetary rewards might have reinforced a strategy during the rotation block, and this carried over into the washout block. Indeed, the idea that reward impacts strategic processes has been advanced in studies comparing conditions in which the performance could be enhanced by re-aiming (Codol et al., 2018; Holland et al., 2018). By using non-contingent clamped feedback, we eliminate strategy use and thus provide a purer assessment of how reward influences adaptation.

We recognize that the hypothesized modulation of sensorimotor adaptation by task outcome is, at least superficially, contrary to previous conjectures concerning the independent effects of SPE and TE (Mazzoni and Krakauer, 2006; Taylor and Ivry, 2011; Taylor et al., 2014; Morehead et al., 2017; Kim et al., 2018). One argument for independence comes from a visuomotor adaptation task in which participants are instructed to use an aiming strategy to compensate for a large visuomotor rotation (Mazzoni and Krakauer, 2006; Taylor and Ivry, 2011). By using the instructed strategy, the cursor immediately intersects the target, eliminating the target error. However, over the course of subsequent reaches, the participants’ performance deteriorates, an effect attributed to the persistence of an SPE, the difference between the aiming location and cursor position. Taylor and Ivry (2011) modeled this behavior by assuming the operation of two independent learning processes, adaptation driven by SPE and strategy adjustment driven by TE. In light of the present results, it is important to note that there were very few trials in which target hits actually occurred, given that the large SPE on the initial reaches resulted in target misses on almost all trials. In addition, the strength of a task success signal may fall off with larger SPEs (Cashaback et al., 2017). As such, the current study, in which SPE and task outcome are held constant throughout learning, provides a much stronger assessment of the effect of task outcome on sensorimotor adaptation.

The Dual Error model suggests an alternative account of the effect of task outcome on performance. This model assumes that performance is the composite of two independent error-based processes, an adaptation system that is sensitive to SPE, and a second implicit process that is sensitive to target error. Of the three models tested here, the Dual Error model provided the best account of the behavior in Experiment 3, accounting for 90% of the variance when the group-averaged data from both the Straddle-to-Hit and Hit-to-Straddle conditions of Experiment 3 were fit simultaneously.

Interestingly, in previous work, TE was thought to be a driving signal for explicit learning, and in particular, for adjusting a strategic aiming process that can lead to rapid improvements in performance (Taylor and Ivry, 2011; Taylor et al., 2014; McDougle et al., 2015; Day et al., 2016). Conceptualizing TE-based learning as supporting an explicit process does not appear warranted here. We have no evidence, either based on performance or verbal reports obtained during post-experiment debriefing sessions (Kim et al., 2018), that participants employ a strategy to counteract the clamp. Rather, all the observed changes in behavior are implicit.

Alternatively, we can consider whether the TE-based process constitutes a form of implicit aiming. The notion of implicit aiming has previously been suggested in work showing that, with extended practice, strategic aiming may become automatized (Huberdeau et al., 2017). One interpretation of this effect is that aiming strategies eventually become ‘cached’ and are automatically retrieved during response preparation (Haith and Krakauer, 2018). While the idea of a cached strategy may be reasonable in the context of traditional sensorimotor perturbation studies, it does not seem to offer a reasonable psychological account of the effect of task outcome in the current context. Given that participants do not employ a strategy to counteract the clamp, there is no strategy to cache. Furthermore, parameter estimates for the Dual Error model indicate that the TE-sensitive process learned at a slower rate and retained more than the SPE-sensitive process. Were implicit aiming to share core features of explicit aiming, the modeling results would be inconsistent with previous work indicating that explicit aiming from TE is faster (McDougle et al., 2015) and more flexible (Bond and Taylor, 2015; Hutter and Taylor, 2018) than adaptation from SPE. Despite the arguments against an implicit aiming interpretation, the current results and those from other studies (Magescas and Prablanc, 2006; Cameron et al., 2010a; Cameron et al., 2010b; Schmitz et al., 2010) suggest that there may exist another form of implicit error-based learning, one driven by TE rather than SPE.

Although the Dual Error model provided a better fit of the behavioral results compared to the Adaptation Modulation model, the challenge for future research is to design experiments that can evaluate their unique predictions. In the current study, we manipulated TE by varying the size of the target, with SPE held constant. An alternative method to manipulate TE is to ‘jump’ the target during the movement; Leow et al. (2018) shifted the target in the same direction as a visuomotor rotation, ensuring that the feedback cursor landed in the target. Their results showed attenuated adaptation relative to a condition in which the target position does not change. Future studies could employ the target jump method, varying the size of the target, with a 0° clamp. In this way, SPE is eliminated, but task outcome, that is miss or hit, will depend on the size of the target and its displacement. The Dual Error model, as presently formulated, would predict learning during miss trials, and no learning during hit trials. The Adaptation Modulation model, on the other hand, would predict no learning in either case since there is no SPE.
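The contrasting predictions for a 0° clamp can be made concrete with a small sketch. It assumes the same linear state update used for illustration elsewhere, with the SPE-driven update set to zero because there is no SPE; the function name and parameter values are hypothetical, and the sign of TE-driven learning is arbitrary here.

```python
def predict_zero_clamp(n_cycles, miss, a_te=0.97, u_te=0.3):
    """Return (dual_error, adaptation_modulation) hand-angle predictions in degrees."""
    te = 1.0 if miss else 0.0
    x_te = 0.0
    for _ in range(n_cycles):
        x_te = a_te * x_te + u_te * te   # TE-sensitive process only
    dual_error = x_te                    # SPE process contributes nothing at 0 deg
    adaptation_modulation = 0.0          # a gain applied to a zero SPE update stays zero
    return dual_error, adaptation_modulation

de_miss, am_miss = predict_zero_clamp(100, miss=True)
de_hit, am_hit = predict_zero_clamp(100, miss=False)
```

Only the Dual Error sketch shows any change in hand angle, and only when the target is missed, which is what would make such an experiment diagnostic.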

In terms of neural mechanisms, converging evidence points to a critical role for the cerebellum in SPE-driven sensorimotor adaptation (Tseng et al., 2007; Taylor et al., 2010; Izawa et al., 2012; Schlerf et al., 2012; Butcher et al., 2017), including the observation that patients with cerebellar degeneration show a reduced response to visual error clamps (Morehead et al., 2017). An important question for future research is whether the cerebellum is also essential for learning driven by information concerning task outcome. A recent behavioral study showed that individuals with cerebellar degeneration were unimpaired in learning from binary, reward-based feedback, once the motor variability associated with their ataxia was taken into consideration (Therrien et al., 2016). This finding provides one instance in which the cerebellum is not essential for learning from task outcome. However, the complete retention observed in that study would indicate that learning was of a different form than adaptation, perhaps related to the use of an explicit strategy (Holland et al., 2018). Evidence that the cerebellum may be integral to processing task outcome signals that could support implicit processes comes from research with animal models indicating that both simple (Wagner et al., 2017) and complex (Ohmae and Medina, 2015) spike activity in the cerebellum may signal information about task outcome and reward prediction errors. By testing individuals with cerebellar impairment on a clamp design in which SPE is held constant and TE is manipulated, one can simultaneously assess the role of the cerebellum in learning from these two error signals.

Conclusions

By using non-contingent feedback, we were able to re-examine the effect of task outcome on sensorimotor learning. The results clearly show that 1) implicit learning processes are influenced by information concerning task outcome, either through the generation of an intrinsic reward or task error signal, and 2) the effect cannot be accounted for by the engagement of a model-based adaptation process operating in tandem with an independent model-free operant reinforcement process. The behavioral results and our modeling work indicate the need for a more nuanced view of sensorimotor adaptation. We outline two directions to consider. In the Adaptation Modulation model, task outcome signals are proposed to serve as a gain on adaptation, contrary to previous views of a modular system that is immune to information about task success. The Dual Error model suggests the need for a more expansive definition of adaptation in which multiple implicit learning processes operate to keep the sensorimotor system well-calibrated. These models can serve as a springboard for future research designed to further delineate how information about motor execution and task outcome influences implicit sensorimotor learning.

Materials and methods

Participants

Healthy, young adults (N = 116, 69 females; average age = 20.9 years old, range: 18.2–27.8) were recruited from the University of California, Berkeley, community. Each participant was tested in only one experiment and was right-handed, as verified with the Edinburgh Handedness Inventory (Oldfield, 1971). All participants provided written informed consent to participate in the study and to allow publication of their data, and received financial compensation for their participation. The Institutional Review Board at UC Berkeley approved all experimental procedures under ID number 2016-02-8439.

Experimental apparatus

The participant was seated at a custom-made tabletop housing an LCD screen (53.2 cm by 30 cm, ASUS), mounted 27 cm above a digitizing tablet (49.3 cm by 32.7 cm, Intuos 4XL; Wacom, Vancouver, WA). The participant made reaching movements by sliding a modified air hockey ‘paddle’ containing an embedded stylus. The position of the stylus was recorded by the tablet at 200 Hz. The experimental software was custom written in Matlab, using the Psychtoolbox extensions (Pelli, 1997).

Reaching task

Center-out planar reaching movements were performed from the center of the workspace to targets positioned at a radial distance of 8 cm. Direct vision of the hand was occluded by the monitor, and the lights were extinguished in the room to minimize peripheral vision of the arm. The starting and target locations were indicated by white and blue circles, respectively (start circle: 6 mm in diameter; target: either 6, 9.8 or 16 mm depending on condition).

To initiate each trial, the participant moved the digitizing stylus into the start location. The position of the stylus was indicated by a white feedback cursor (3.5 mm diameter). Once the start location was maintained for 500 ms, the target appeared. For Experiments 1 and 3, the target could appear at one of eight locations, placed in 45° increments around a virtual circle (0°, 45°, 90°, 135°, 180°, 225°, 270°, 315°). For Experiment 2, the target could appear at one of four locations placed in 90° increments around a virtual circle (45°, 135°, 225°, 315°). We reduced the number of targets from 8 to 4 in Experiment 2 in order to increase the overall number of training cycles with the clamp to ensure that participants reached a stable asymptote, while keeping the experiment under 1.5 hr. Participants were instructed to accurately and rapidly ‘slice’ through the target, without needing to stop at the target location. Visual feedback, when presented, was provided during the reach until the movement amplitude exceeded 8 cm. As described below, the feedback either matched the position of the stylus (veridical) or followed a fixed path (clamped). If the movement was not completed within 300 ms (excluding RT), the words ‘too slow’ were generated by the sound system of the computer.

After the hand crossed the target ring, endpoint cursor feedback was provided for 50 ms either at the position in which the hand crossed the virtual target ring (veridical feedback) or at a fixed distance determined by the size of the clamp. During the return movement, the feedback cursor reappeared when the participant’s hand was within 1 cm of the start position.

Experimental feedback conditions

Across the experimental session, there were three types of visual feedback. On no-feedback trials, the cursor disappeared when the participant's hand left the start circle and only reappeared at the end of the return movement. On veridical feedback trials, the cursor matched the position of the stylus during the 8 cm outbound segment of the reach. On clamped feedback trials, the feedback followed a path that was fixed along a specific hand angle. The radial distance of the cursor from the start location was still based on the radial extent of the participant's hand during the 8 cm outbound segment, but the angular position was fixed relative to the target (i.e. independent of the angular position of the hand).
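The clamped feedback rule can be illustrated with a minimal sketch. This is not the authors' Matlab code: the function name, the counterclockwise-positive sign convention, and the capping of the cursor radius at 8 cm are all assumptions made for illustration.

```python
import math


def clamped_cursor_position(hand_xy_cm, target_angle_deg, clamp_offset_deg,
                            max_radius_cm=8.0):
    """Return the (x, y) cursor position in cm, relative to the start location.

    Assumptions (not from the source): angles are in degrees with
    counterclockwise positive, and the radius is capped at the target distance.
    """
    # Radial distance comes from the hand, so the cursor moves outward with the reach.
    radius = min(math.hypot(hand_xy_cm[0], hand_xy_cm[1]), max_radius_cm)
    # Angular position depends only on the target and the clamp offset,
    # never on the direction in which the hand actually moved.
    cursor_angle = math.radians(target_angle_deg + clamp_offset_deg)
    return (radius * math.cos(cursor_angle), radius * math.sin(cursor_angle))
```

Two reaches of equal extent but different direction yield the same cursor position, which is the sense in which the feedback is not contingent on the angular position of the hand.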

The primary instructions to the participant (experiment script included) remained the same across the experimental session: Specifically, that they were to reach directly toward the visual target. Prior to the introduction of the clamped feedback trials, participants were briefed about the feedback manipulation. They were informed that the position of the cursor would now follow a fixed trajectory and that the angular position would be independent of their movement. They were explicitly instructed to ignore the cursor and continue to reach directly to the target. Participants also performed three instructed trials with the clamp perturbation on. During these practice trials, a target appeared at the 90° location (straight ahead), and the experimenter instructed the participant to first ‘reach straight to the left’ (i.e. 180°). For the second practice trial, the participant was instructed to ‘reach straight to the right’ (i.e. 0°). For the last trial, the participant was instructed to ‘reach straight down (towards your torso)’ (i.e. 270°). The purpose of these trials was to familiarize the participant with the exact clamp condition they were about to experience. Following these three practice trials, the experimenter confirmed that the participant understood what was meant by clamped visual feedback. These practice trials were excluded from all analyses.

The same instructions in abbreviated form (‘Ignore the cursor and move your hand directly to the target location’) were repeated verbally and with onscreen text at every block break during the clamp perturbation. Participants were debriefed at the end of the experiment and asked whether they ever intentionally tried to reach to locations other than the target. All subjects reported aiming to the target throughout the experiment.

We counterbalanced clockwise and counterclockwise clamps within each group for all three experiments.

Experiment 1

Participants (n = 48, 16/group) were randomly assigned to one of three groups, all training with a 3.5° clamp and differing only in the size of the target: 6, 9.8, or 16 mm diameter. These sizes were chosen so that at an 8 cm radial distance the clamped cursor would be adjacent to the target without making any contact (Target Miss group), straddling the target by being roughly half inside and half outside the target (Straddle Target group), or fully embedded within the target (Hit Target group). The Euclidean distance for this clamp size, measured from the centers of cursor and target, was 4.9 mm.
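The relationship between clamp angle, target size, and task outcome can be checked with a short geometric sketch. It uses the 8 cm target distance and 3.5 mm cursor diameter stated in the Methods; the function names and the three-way classification rule (no overlap, partial overlap, full containment) are illustrative assumptions rather than the authors' code.

```python
import math


def chord_distance_mm(clamp_deg, radius_mm=80.0):
    """Straight-line distance between target and cursor centers on the target ring."""
    return 2.0 * radius_mm * math.sin(math.radians(clamp_deg) / 2.0)


def classify_outcome(clamp_deg, target_diam_mm, cursor_diam_mm=3.5, radius_mm=80.0):
    """Label the cursor as a 'miss', 'straddle', or 'hit' relative to the target."""
    d = chord_distance_mm(clamp_deg, radius_mm)
    r_target = target_diam_mm / 2.0
    r_cursor = cursor_diam_mm / 2.0
    if d >= r_target + r_cursor:
        return "miss"      # the two circles do not overlap
    if d + r_cursor <= r_target:
        return "hit"       # the cursor lies entirely inside the target
    return "straddle"      # the cursor is partly inside and partly outside
```

With a 3.5° clamp the center-to-center distance is about 4.9 mm, so the 6, 9.8, and 16 mm targets come out as miss, straddle, and hit, respectively. A 1.75° clamp gives about 2.4 mm.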

The session began with two baseline blocks, the first comprised of five movement cycles (40 total reaches to eight targets) without visual feedback and the second comprised of 10 cycles with a veridical cursor displaying hand position. The experimenter then informed the participant that the visual feedback would no longer be veridical and would now be clamped at a fixed angle from the target location. Immediately following these general instructions, the experimenter continued providing instructions for the three practice trials which immediately followed (see Experimental Feedback Conditions). After the practice trials and confirming the participant’s understanding of the task, the clamp block ensued for a total of 80 cycles. A short break (<1 min), as well as a reminder of the task instructions, was provided after 40 cycles (i.e. at the halfway point of this block). Immediately following the perturbation block, there were two washout blocks, first a five cycle block in which there was no visual feedback, followed by 10 cycles with veridical visual feedback. These blocks were preceded by instructions regarding the change in experimental condition and participants were reminded to always aim for the target and to attempt to slice through it with their hand.

Experiment 2

In Experiment 2, we assessed adaptation over an extended number of clamped visual feedback trials. The purpose of extending the perturbation block was to ensure that participants reached asymptotic levels of learning. In order to achieve a greater number of training cycles, we reduced the number of target locations within the set from 8 to 4.

Participants (n = 32, 16/group) trained with a 1.75° clamp (2.4 mm distance between target and cursor centers) and were assigned to either a small (Straddle) or large (Hit) target condition. The session started with two baseline blocks, 10 cycles (40 reaches) without visual feedback and then 10 cycles with veridical feedback. Following three practice trials with the clamp, the number of cycles in the clamped visual feedback block was nearly tripled from that of Experiment 1 to 220 cycles, with breaks provided after every 70 cycles. Following 220 cycles of training with a 1.75° clamp, there were two washout blocks, first a 10 cycle block in which there was a 0° clamp, followed by 10 cycles with veridical visual feedback. Prior to washout, participants were again instructed to always aim directly to the target.

Experiment 3

Experiment 3 used a transfer design to evaluate different hypotheses concerning the role of task outcome on implicit sensorimotor learning. Our main predictions focused on the transfer phase, comparing the participants’ behavior to the predictions of three models (see section, Theoretical analysis of the effect of task outcome on implicit learning). We tested two main groups (n = 12/group) in Experiment 3, using a 1.75° clamp in both the acquisition and transfer phases. The session started with two baseline blocks, five cycles (40 reaches) without visual feedback and then five cycles with veridical feedback. After the baseline blocks, clamp instructions and three practice trials were provided to all participants. The first clamp block (acquisition phase) lasted 120 cycles, with participants training with either a small or large target. Following the first 120 cycles, the target sizes were reversed for the next 80 cycles (transfer phase: Straddle-to-Hit or Hit-to-Straddle conditions). Breaks of <1 min were provided after every 35 cycles of training. On the break preceding the transfer (15 cycles before target switch), participants were told that everything would continue on as before, except that the target size would change at some point during the block. The purpose of staggering the break with the transfer was to mitigate any change in adaptation due to temporal decay that could result from a break in training (Hadjiosif and Smith, 2013).

Control group

A third group (n = 12) was added to test whether the attenuation of adaptation in the large target condition was due to perceptual uncertainty. Here, the block structure was identical to the first two groups. We used a modified large target (16 mm), one which had a bright green bisecting line through the middle, aligned with the target direction. The clamped cursor always fell within one half of the target (either clockwise or counter-clockwise depending on the condition), thus providing a clear indication that the cursor was off center. At the transfer, the bisecting line was removed and participants trained for 80 cycles with the standard large target.

Data analysis

All statistical analyses and modeling were performed using MATLAB 2015b and the Statistics Toolbox. Data and code are available on GitHub at: https://github.com/hyosubkim/Influence-of-task-outcome-on-implicit-motor-learning (Kim, 2019; copy archived at https://github.com/elifesciences-publications/Influence-of-task-outcome-on-implicit-motor-learning). The primary dependent variable in all experiments was hand angle at peak radial velocity, defined by the angle of the hand relative to the target at the time of peak radial velocity (i.e., angle between lines connecting start position to target and start position to hand). Throughout the text, we refer to this variable as hand angle. Additional analyses were performed using hand angle at ‘endpoint’ (angle of the hand as it crossed the invisible target ring) rather than peak radial velocity. The results were essentially identical for the two dependent variables; as such, we only report the results of the analyses using peak radial velocity.
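As an illustration of the primary dependent variable, the following sketch computes hand angle at peak radial velocity from a sampled trajectory. It is not taken from the authors' repository: the function name, the first-difference estimate of radial velocity, and the wrapping of the angle to [-180°, 180°) are assumptions. The 200 Hz default matches the tablet sampling rate reported above.

```python
import math


def hand_angle_at_peak_radial_velocity(samples, target_angle_deg, dt=1.0 / 200.0):
    """Hand angle (deg) relative to the target at the time of peak radial velocity.

    samples: list of (x, y) hand positions relative to the start position,
    recorded every dt seconds.
    """
    radii = [math.hypot(x, y) for x, y in samples]
    # Radial velocity estimated by first differences between successive samples.
    velocities = [(radii[i + 1] - radii[i]) / dt for i in range(len(radii) - 1)]
    peak = max(range(len(velocities)), key=lambda i: velocities[i])
    x, y = samples[peak + 1]
    # Angle between the start-to-target line and the start-to-hand line.
    angle = math.degrees(math.atan2(y, x)) - target_angle_deg
    return (angle + 180.0) % 360.0 - 180.0
```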

Data used in statistical analyses were tested for normality and homogeneity of variance using Shapiro-Wilk and Levene’s tests, respectively. When normality or homogeneity of variance was violated, we performed non-parametric permutation tests in addition to standard parametric tests (i.e. t-tests and ANOVAs) and report results from both. For comparisons between two groups, we used the difference between group means as our test statistic. This value was compared to a null distribution, created by random shuffling of group assignment in 10,000 Monte Carlo simulations (resampling with replacement), to obtain an exact p-value. When a comparison involved more than two groups, we used a similar approach, but used the F-value obtained from a one-way ANOVA as our test statistic.
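A two-group permutation test of this kind might be sketched as follows. This is a generic illustration, not the authors' implementation: the function names are hypothetical, the test is assumed to be two-sided, and group labels are reassigned by shuffling the pooled data, which is one common way to build the null distribution.

```python
import random


def _mean(values):
    return sum(values) / len(values)


def permutation_test_two_groups(group_a, group_b, n_perm=10000, seed=0):
    """Return (observed mean difference, two-sided Monte Carlo p-value)."""
    rng = random.Random(seed)
    observed = _mean(group_a) - _mean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    n_extreme = 0
    for _ in range(n_perm):
        # Shuffling the pooled data reassigns participants to groups at random.
        rng.shuffle(pooled)
        diff = _mean(pooled[:n_a]) - _mean(pooled[n_a:])
        if abs(diff) >= abs(observed):
            n_extreme += 1
    return observed, n_extreme / n_perm
```

For more than two groups, the same loop applies with the mean difference replaced by the F-value from a one-way ANOVA.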

Outlier responses were removed from the analyses. For the sole purpose of identifying outliers, the Matlab ‘smooth’ function was used to calculate a moving average (using a five-trial window) of the hand angle data for each target location. Outliers were trials in which the observed hand angle was greater than 90° or deviated by more than three standard deviations from the moving average. In total, less than 0.8% of trials overall were removed, and the most trials removed for any individual across all three experiments was 2%.
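The outlier rule can be sketched as below, applied to the hand angles from one target location. Several details are assumptions, since the text does not specify them: the moving-average window is taken to be centered and to shrink symmetrically at the edges, and the standard deviation is taken over the residuals from the moving average. Function names are hypothetical.

```python
import statistics


def moving_average(values, window=5):
    """Centered moving average whose window shrinks symmetrically at the edges."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        k = min(half, i, len(values) - 1 - i)
        segment = values[i - k:i + k + 1]
        smoothed.append(sum(segment) / len(segment))
    return smoothed


def find_outliers(hand_angles, window=5, max_abs_deg=90.0, n_sd=3.0):
    """Indices of trials with |angle| > 90 deg or > 3 SD from the moving average."""
    smoothed = moving_average(hand_angles, window)
    residuals = [a - m for a, m in zip(hand_angles, smoothed)]
    # Assumption: the SD is computed over residuals from the moving average.
    sd = statistics.stdev(residuals)
    return [i for i, (a, res) in enumerate(zip(hand_angles, residuals))
            if abs(a) > max_abs_deg or abs(res) > n_sd * sd]
```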

Individual baseline biases for each target location were subtracted from all data. Biases were defined as the average hand angles across cycles 2–10 (Experiments 1 and 2) or 2–5 (Experiment 3) of the feedback baseline block. These same cycles were used to calculate mean baseline RTs, MTs, and movement variability (SD). To calculate each participant’s baseline RT or MT, we took the average of median values at each target location. To calculate each participant’s movement variability, we took the average of the standard deviations of hand angles at each target location.

In order to pool all the data and to aid visualization, we flipped the hand angles for all participants clamped in the counterclockwise direction.
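The two preprocessing steps above, baseline subtraction per target and sign flipping for pooling, amount to the following minimal sketch. The data layout (dictionaries keyed by target location) and the function names are assumptions made for illustration.

```python
def baseline_correct(hand_angles_by_target, baseline_by_target):
    """Subtract each target's mean baseline hand angle from that target's data."""
    corrected = {}
    for target, angles in hand_angles_by_target.items():
        baseline = baseline_by_target[target]
        bias = sum(baseline) / len(baseline)
        corrected[target] = [a - bias for a in angles]
    return corrected


def flip_for_pooling(hand_angles, clamp_direction):
    """Flip the sign for counterclockwise clamps so all participants share one direction."""
    sign = -1.0 if clamp_direction == "counterclockwise" else 1.0
    return [sign * a for a in hand_angles]
```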

For Experiments 1 and 3, movement cycles consisted of 8 consecutive reaches (one reach/target); for Experiment 2, we only used four targets, thus a movement cycle consisted of four consecutive reaches (one reach/target). To estimate the rate of early adaptation, we calculated the mean change in hand angle per cycle over the first five cycles. To provide a more stable estimate of hand angle at cycle 5, we averaged over cycles 3–7 of the clamp block. We opted to use this measure of early adaptation rather than obtain parameter estimates from exponential fits since the latter approach gives considerable weight to the asymptotic phase of performance, and, therefore, would be less sensitive to early differences in rate. This would be especially problematic in Experiment 2, which utilized 220 clamp cycles. We also performed a secondary analysis of early adaptation rates using a larger window, cycles 2–11 (Krakauer et al., 2005). Results from using this alternate metric were consistent with the reported analyses (i.e. slower rates for Hit Target groups), only they resulted in larger effect sizes due to the gradually increasing divergence of learning functions. Asymptotic adaptation (i.e. late learning) was defined as the average hand angle over the last 10 cycles within a clamp block. In Experiment 1, the aftereffect was quantified by using the data from the first no-feedback cycle following the last clamp cycle. This measure yielded similar statistical results as that based on the analysis of asymptotic adaptation.
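The summary measures described above can be expressed compactly. The sketch below reflects one reading of the text: the early rate is the mean hand angle over clamp cycles 3–7 (the estimate of hand angle at cycle 5) divided by five cycles, which assumes baseline-corrected hand angle starts at zero. Function names are hypothetical.

```python
def cycle_means(trial_angles, reaches_per_cycle):
    """Average consecutive reaches into cycles (8 per cycle in Exp. 1 and 3, 4 in Exp. 2)."""
    n_cycles = len(trial_angles) // reaches_per_cycle
    return [sum(trial_angles[c * reaches_per_cycle:(c + 1) * reaches_per_cycle])
            / reaches_per_cycle for c in range(n_cycles)]


def early_adaptation_rate(clamp_cycle_angles):
    """Mean change in hand angle per cycle over the first five clamp cycles."""
    # Cycles 3-7 (1-indexed) give a more stable estimate of hand angle at cycle 5.
    angle_at_cycle_5 = sum(clamp_cycle_angles[2:7]) / 5.0
    return angle_at_cycle_5 / 5.0


def late_learning(clamp_cycle_angles, n_last=10):
    """Asymptotic adaptation: mean hand angle over the last n_last clamp cycles."""
    tail = clamp_cycle_angles[-n_last:]
    return sum(tail) / len(tail)
```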

All t-tests were two-tailed. Posthoc pairwise comparisons following significant ANOVAs were performed using two-tailed t-tests, with a corrected α of .017 due to multiple comparisons. Cohen’s d, eta squared (η2), partial eta squared (for mixed model ANOVA), and dz (for within-subjects design) values are provided as standardized measures of effect size (Lakens, 2013). Values in main text are reported as 95% CIs in brackets and mean ± SEM.

No statistical methods were used to predetermine sample sizes. The chosen sample sizes were based on our previous studies using the clamp method (Morehead et al., 2017; Kim et al., 2018), as well as prior psychophysical studies of human sensorimotor learning (Huang et al., 2011; Galea et al., 2015; Vaswani et al., 2015; Gallivan et al., 2016).

Modeling

For the Movement Reinforcement model, a population vector (Georgopoulos et al., 1986), V, indicates the current bias of motor representations within the reinforcement system. In this model, the vector is composed of directionally-tuned units, with the strength of each unit reflective of its reward history. The direction of this vector (Vd) was calculated for each trial in the following manner:

$$V_x(n) = r(n) \cdot u_x$$
$$V_y(n) = r(n) \cdot u_y$$
$$V_d(n) = \tan^{-1}\left(V_y(n) / V_x(n)\right)$$

Here, r represents the weights on every unit in u, a vector containing 36,000 total unit vectors pointing in every direction around the circle, representing a resolution of .01° (x and y subscripts represent the x- and y-components for both V and u). The update rule for r takes into account the task outcome on each trial:

$$r_{\theta}(n+1) = A' r_{\theta}(n) + s$$
$$r_{\sim\theta}(n+1) = A' r_{\sim\theta}(n)$$

where θ indexes the unit corresponding to the direction of the movement, y(n), on hit trials, and ~θ indexes all the other units on hit trials and all units on miss trials. In this simplified reward scheme, the weight to the unit corresponding to the rewarded movement direction is increased by magnitude s on a trial-by-trial basis, and all weights are decremented due to a retention factor, A', on every trial. The latter ensures that these reward-dependent weights revert back to zero in the absence of reward. The mean preferred direction, Vd, was converted from radians into degrees. The strength of the biasing signal, Vl, is equal to the population vector length, $\sqrt{V_x^2 + V_y^2}$, with the constraint that 0 ≤ Vl ≤ 1.
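A minimal sketch of the Movement Reinforcement update and population vector readout is given below. It is an illustration, not the authors' Matlab implementation. The class and method names are hypothetical, `atan2` is used in place of the inverse tangent to resolve the quadrant, and the constraint on vector length is assumed to be enforced by clipping.

```python
import math


class MovementReinforcement:
    """Directionally tuned units whose weights track reward history."""

    def __init__(self, retention, reward_increment, n_units=36000):
        self.retention = retention          # A' in the text
        self.increment = reward_increment   # s in the text
        self.n_units = n_units
        self.step_deg = 360.0 / n_units     # 0.01 deg resolution for 36,000 units
        self.r = [0.0] * n_units
        self.ux = [math.cos(math.radians(i * self.step_deg)) for i in range(n_units)]
        self.uy = [math.sin(math.radians(i * self.step_deg)) for i in range(n_units)]

    def update(self, movement_dir_deg, hit):
        # All weights decay by the retention factor on every trial.
        self.r = [self.retention * w for w in self.r]
        if hit:
            # Only the unit matching the rewarded movement direction gains s.
            idx = round((movement_dir_deg % 360.0) / self.step_deg) % self.n_units
            self.r[idx] += self.increment

    def population_vector(self):
        """Return (direction in degrees, length clipped to [0, 1])."""
        vx = sum(w * x for w, x in zip(self.r, self.ux))
        vy = sum(w * y for w, y in zip(self.r, self.uy))
        direction = math.degrees(math.atan2(vy, vx))  # assumption: atan2, not tan^-1
        length = min(max(math.hypot(vx, vy), 0.0), 1.0)
        return direction, length
```

After a single rewarded movement the vector points in the rewarded direction with length s, and on subsequent unrewarded trials the length decays geometrically by A'.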

In order to calculate confidence intervals for the parameter estimates, we applied standard bootstrapping techniques, constructing group-averaged hand angle data 1000 times by randomly resampling with replacement from the pool of participants within each group. Using Matlab’s fmincon function, we started with 10 different initial sets of parameter values and estimated the retention and learning parameters that minimized the squared error between the bootstrapped data and model output (xn). Parameter estimates were bounded such that 0 < A < 1 and 0 < U(e) < e, where e is equal to the clamp size in degrees.
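The resampling step can be sketched as follows. The model fitting itself (done with fmincon in the original analysis) is omitted. The function names are hypothetical, and the percentile method for turning bootstrapped estimates into an interval is an assumption, as the text does not state how the intervals were derived.

```python
import random


def bootstrap_group_means(participant_curves, n_boot=1000, seed=0):
    """Group-averaged hand angle curves from resampling participants with replacement."""
    rng = random.Random(seed)
    n = len(participant_curves)
    n_cycles = len(participant_curves[0])
    boots = []
    for _ in range(n_boot):
        sample = [participant_curves[rng.randrange(n)] for _ in range(n)]
        boots.append([sum(curve[i] for curve in sample) / n for i in range(n_cycles)])
    return boots


def percentile_interval(values, lower=2.5, upper=97.5):
    """Percentile interval with linear interpolation (assumed method)."""
    s = sorted(values)

    def pick(p):
        pos = p / 100.0 * (len(s) - 1)
        lo = int(pos)
        hi = min(lo + 1, len(s) - 1)
        frac = pos - lo
        return s[lo] * (1.0 - frac) + s[hi] * frac

    return pick(lower), pick(upper)
```

In use, each bootstrapped curve would be fit by the model, and `percentile_interval` would then be applied to the resulting 1000 estimates of each parameter.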

The hybrid models combined the Movement Reinforcement with either the Adaptation Modulation or Dual Error model. Each hybrid incorporated the equations for the Movement Reinforcement model. However, when movement reinforcement was combined with the Adaptation Modulation model, the contribution of the adaptation system, x, to the motor output, y, was derived from the gain modulation equation (Equation (3)). When movement reinforcement was combined with the Dual Error model, Equations (4-6) were used, with xtotal now substituting for x in Equation (2).

Appendix 1

Experiment 3 – kinematic variables

Baseline movement variability did not differ across the three groups, including the control group trained with the bisected target (F(2,33)=1.38, p=0.267, η2 = 0.077). Similarly, no differences across groups were observed for either RTs (F(2,33)=1.51, p=0.236, η2 = 0.084) or MTs (F(2,33)=.46, p=0.634, η2 = 0.027).

Appendix 1—table 1
Average Reaction Times (RTs) in ms.

Values represent mean ± SEM.

https://doi.org/10.7554/eLife.39882.017

| Experiment 1 | Baseline | Early clamp | Late clamp | No feedback |
| --- | --- | --- | --- | --- |
| Hit | 325 ± 7 | 327 ± 7 | 347 ± 11 | 344 ± 12 |
| Straddle | 362 ± 12 | 359 ± 14 | 397 ± 32 | 407 ± 33 |
| Miss | 386 ± 22 | 383 ± 19 | 378 ± 15 | 385 ± 15 |
| Experiment 2 | | | | 0° clamp |
| Hit | 378 ± 22 | 376 ± 27 | 354 ± 9 | 351 ± 9 |
| Straddle | 373 ± 12 | 366 ± 13 | 368 ± 15 | 373 ± 16 |
| Experiment 3 | | | | |
| Hit-to-Straddle | 356 ± 19 | 350 ± 15 | 326 ± 9 | N/A |
| Straddle-to-Hit | 360 ± 8 | 360 ± 7 | 355 ± 7 | N/A |
| Bisected-to-Normal | 400 ± 28 | 395 ± 27 | 400 ± 25 | N/A |

Appendix 1—table 2
Average Movement Times (MTs) in ms.

Values represent mean ± SEM.

https://doi.org/10.7554/eLife.39882.018

| Experiment 1 | Baseline | Early clamp | Late clamp | No feedback |
| --- | --- | --- | --- | --- |
| Hit | 153 ± 11 | 150 ± 10 | 137 ± 8 | 133 ± 9 |
| Straddle | 162 ± 8 | 149 ± 8 | 139 ± 7 | 131 ± 7 |
| Miss | 137 ± 7 | 134 ± 7 | 124 ± 6 | 118 ± 6 |
| Experiment 2 | | | | 0° clamp |
| Hit | 149 ± 8 | 159 ± 20 | 155 ± 11 | 127 ± 7 |
| Straddle | 157 ± 8 | 161 ± 15 | 170 ± 18 | 130 ± 8 |
| Experiment 3 | | | | |
| Hit-to-Straddle | 158 ± 7 | 189 ± 12 | 168 ± 12 | N/A |
| Straddle-to-Hit | 164 ± 11 | 207 ± 28 | 169 ± 13 | N/A |
| Bisected-to-Normal | 151 ± 11 | 165 ± 14 | 166 ± 15 | N/A |

Appendix 1—table 3
Movement variability during baseline block.

Values represent mean ± SEM.

https://doi.org/10.7554/eLife.39882.019

| Experiment 1 | Baseline SD |
| --- | --- |
| Hit | 4.19 ± 0.26° |
| Straddle | 3.61 ± 0.16° |
| Miss | 3.80 ± 0.15° |
| Experiment 2 | |
| Hit | 3.09 ± 0.18° |
| Straddle | 3.57 ± 0.16° |
| Experiment 3 | |
| Hit-to-Straddle | 3.30 ± 0.22° |
| Straddle-to-Hit | 3.85 ± 0.37° |
| Bisected-to-Normal | 3.97 ± 0.31° |

References

  19. Hadjiosif A, Smith M (2013). Savings is restricted to the temporally labile component of motor adaptation. Translational and Computational Motor Control (TCMC).

Decision letter

  1. Tamar R Makin
    Reviewing Editor; University College London, United Kingdom
  2. Timothy E Behrens
    Senior Editor; University of Oxford, United Kingdom
  3. Tamar R Makin
    Reviewer; University College London, United Kingdom

In the interests of transparency, eLife includes the editorial decision letter, peer reviews, and accompanying author responses.

[Editorial note: This article has been through an editorial process in which the authors decide how to respond to the issues raised during peer review. The Reviewing Editor's assessment is that all the issues have been addressed.]

Thank you for submitting your article "Intrinsic rewards modulate sensorimotor adaptation" for consideration by eLife. Your article has been reviewed by three peer reviewers, including Tamar R Makin as the Reviewing Editor and Reviewer #1, and the evaluation has been overseen by Timothy Behrens as the Senior Editor.

The Reviewing Editor has highlighted the concerns that require revision and/or responses, and we have included the separate reviews below for your consideration. If you have any questions, please do not hesitate to contact us.

In this paper the authors demonstrate how visual cues, uninformative of participants' performance, attenuate intrinsic sensorimotor learning (i.e. adaptation). For this purpose, the authors employ an “error clamp” paradigm, which they previously showed to induce implicit adaptation of reaching movements. By varying the size of the target with respect to the cursor's final position, they demonstrate substantial shrinkage of the implicit adaptation effect, which they suggest occurs due to a reward which is experienced when the cursor hits the target. Since previous research suggested that sensorimotor adaptation is insensitive to reward, the proposed study offers an interesting new piece of evidence. In particular, the authors suggest that their findings demonstrate that reward directly acts on the adaptation process, and run an experiment and a computational model to rule out an alternative account, by which reward impacts the movement itself.

While all reviewers agreed that the paper elegantly portrayed an interesting phenomenon, they were unsatisfied with how this phenomenon has been interpreted. In particular, the authors make, but don't substantiate, a series of assumptions that require further validation. Unless these assumptions could be substantiated with more data, alternative interpretations should be considered more thoroughly. The reviewers therefore suggest that the authors produce stronger evidence to pin-point the learning process being affected by the visual feedback, give more consideration to the alternative mechanism and provide quantitative evidence for the superiority of the adaptation modulation account. Unless these suggestions have been implemented, there was a consensus that the authors will need to consider substantially moderating some of their most central interpretations and conclusions, as currently reflected in the title and Abstract.

Major comments:

1) The authors focus their design and interpretation on SPE-driven learning, under the assumption that the error clamp isolates a single learning process, implicit recalibration driven by SPE. The possible alternatives to the "Adaptation Modulation" model are collapsed onto a single model that represents only one very specific alternative proposal about how reward might influence learning under the clamp. Therefore, rejecting this one particular model is not grounds to reject every possible architecture in which reward does not act by modulating SPE-driven adaptation. The reviewers point to several possible interpretations of what is happening and the paper needs to be thorough in considering these. Couldn't there be an implicit contribution that is not SPE-driven but is instead driven by reward/task error? How can the authors be certain that the participants were not engaging in some kind of re-aiming that may have been implicit or involuntary? The “hit” condition might increase salience, rather than impact SPE-related mechanisms, etc. Alternatively, there could be unexplored processes underlying the “miss” group, and the target size effect may be driven by alternative processes.

2) While focusing on the SPE-driven learning, an equally valid alternative to the authors' account is that not hitting the target could be conceived as punishment, or in other words, failure to earn a reward amplifies sensitivity to sensory prediction error. Based on this logic, punishment (or lack of reward) will increase cerebellar learning, as the key mechanism underlying the results [see more specific comments in the individual reviews]. Relatedly, the "reward reduces learning" pitch requires some further conceptual qualification, e.g. from other related literature.

3) The reviewers were particularly dissatisfied with how the results of Experiment 3 were interpreted, based on the “adaptation” and “reinforcement” models. As this is perhaps the most crucial result in the paper, this requires more consideration. The current model assumes that reward has an immediate effect on learning rate parameters, when it seems likely it would have a more incremental onset. The model also needs to explain the differences in no vision/error clamp retention observed in Experiment 1 and 2. Furthermore, the model put forward for “reinforcement” was in no way exhaustive, and therefore doesn’t adequately represent the alternative process [see individual reviews for more specific suggestions]. Crucially, the comparison across the two competing "models" needs to be formally quantified. Currently, the data seems to sit in the middle of the two model predictions, rather than providing robust evidence for either.

4) The reviewers felt that more data/analysis/discussion of no visual feedback and zero error clamp is needed, to qualify some of the assumptions of SPE-driven learning. This will help clarify to what extent the known properties of reward in adaptation tasks (often assumed to be explicit) might be attributed to an implicit process. This should also allow the authors to deal with the different rates of washout observed (but not discussed) in Experiments 1 and 2, indicating a role of reward in retention, with important potential implications on different models for learning. [see some specific suggestions in the individual reviews]

5) Inherent to the study is the key assumption that hitting the target would be intrinsically rewarding to the participants, despite explicit information that this is irrelevant to the task (and presumably the monetary reward?). The assumption is unsubstantiated and needs better evidence (e.g. pupillometry, control conditions with monetary reward).

Separate reviews (please respond to each point):

Reviewer #1:

This paper elegantly dissects a curious phenomenon, where small variations in task-irrelevant visual information dramatically and systematically modulate participants' performance on a basic motor reaching task. The study design is elegant, the results seem robust, the statistical analysis and reporting is appropriate, and the manuscript is clearly written. While I take no issue with the design, analysis and results, I have to question the interpretation the authors provide for their own findings. I wish to raise a few alternative accounts that I don't think have been sufficiently addressed in the manuscript, and I also ask for some clarification on the key interpretation the authors are offering.

Major comments:

1) If the visual feedback determines the SPE, then in the Hit condition there's no error, you correctly hit the target, and therefore there is no "need" for corrective movement (i.e. adaptation). According to this account, the target size ("reward") impacts the SPE by modulating the cursor feedback (Figure 4), not the adaptation or the movement. In other words, the subjects interpreted the visual feedback as correct task performance. This should not be translated to kinematic changes during baseline, because the visual cursor feedback is missing from the baseline. The authors suggest a control in their third experiment (the condition with a bisected target), but I didn't really understand how this eliminates the SPE interpretation.

2) The kinematics analysis could be potentially useful for ruling out that the large target was more salient. But the fact that participants tended to show lower variability (Experiment 2) and faster RT (Experiment 1) in the Hit target condition is concerning [*please note that upon discussion it has been agreed by the reviewers that this response profile is also compatible with reward, consistent with the authors' main interpretation]. They need to perhaps put the groups from Experiments 1 and 2 together to compare these baseline differences? Regardless, they should definitely report the results for Experiment 1 in the main results.

3) I don't really get the authors' conceptual pitch (that reward reduces learning). Assuming that adaptation (perturbed hand position) is learning, why would a reward for a perturbed trial reduce that learning? Are there any precedents to this concept from other research? Surely, animal work on reinforcement learning would have noticed such an effect before, if it exists? Their discussion didn't help me understand their idea any better.

4) Inherent to the study is the key assumption that hitting the target would be intrinsically rewarding to the participants, despite explicit information that this is irrelevant to the task (and presumably the monetary reward?). I'm not sure what they mean here by reward, and how this could be generalised to other fields of research where reward is a key concept (e.g. decision making, reinforcement learning).

5) How much of the late effect shown across studies is actual adaptation/learning (i.e. due to the visual feedback), as opposed to drift? What happens when you ask participants to just repeatedly perform a movement to a target with no visual feedback? It seems necessary to quantify these drifts, in order to accurately account for the "learning".

Minor Comments:

1) "Interestingly, these results appear qualitatively different to those observed when manipulating the clamp offset. Our previous study using clamped visual feedback demonstrated that varying clamp offset alone results in different early learning rates, but produces the same magnitude of late learning (Kim et al., 2018)." Please spare us having to read this paper: what do you mean by clamp offset? Please provide a better description of the study (perhaps in the Discussion?).

2) It would be very beneficial if the authors could provide the exact instructions given to the participants (i.e. the study script).

3) "This was supported by our control analyses, perceptual control experiment, and our finding that the Straddle group in Experiment 1 was similar to the Hit group, suggesting that the effect of target size was categorical." In that experiment Hit was different from Straddle, which was similar to Miss.

Reviewer #2:

The work by Kim et al. investigates a very interesting and well-investigated topic: the effect of reward on sensorimotor adaptation. Through three well-designed experiments, the authors show that task-based reward attenuates sensorimotor adaptation. The authors suggest that these results can only be explained by a model in which reward directly acts on the adaptation system (learning parameter in a state-space model). This is a well-written manuscript, with clear results and interesting conclusions. My biggest issue is the lack of analysis and discussion of their after-effect (no vision/zero error clamp) data from Experiments 1 and 2. I cannot explain the differences between groups during these blocks (see below) by reward only acting on the learning component of the state-space model. To me, this data indicates that reward (Hit group) is having an effect on retention. I also have issues with the implementation of the model in Experiment 3 and the possibility that the task could also involve implicit punishment. As a result, I feel the conclusions made in this manuscript create a clear story, in contrast to the data, which suggest a more complicated narrative.

Specifics:

No visual feedback vs. 0° error clamp following adaptation: Why did the authors switch from “no feedback” to “0° error clamp” from Experiment 1 to Experiment 2? No explanation is given for this switch.

After-effects: Following this, I find it odd how little the “after-effect” is analysed/discussed considering this was a major component of the analysis in the seminal work by Morehead et al., 2017, and is also analysed in Kim et al., 2018. The authors describe in the Materials and methods that the after-effect was defined by the first no-feedback cycle but then it is never further discussed… Although it is difficult to see in Figure 1, the data provided by the authors reveals that performance is substantially different across groups during no feedback. Whereas Miss and Straddle decline across trials, Hit maintains a plateau (by the end of No FB, groups are near the same level of performance). This plateau is strikingly different from the behaviour observed in Kim et al., 2018. This is also seen in Experiment 2, where the Straddle group declines across trials and the Hit group maintains a plateau (annoyingly I am unable to submit figures in my review, but this is clear in the data provided by the authors). If the authors believe that reward is having a specific effect on learning, and not retention, how would a state-space model explain these results? During both these trial types, error has to be fixed to zero (there is no error to adjust behaviour with). Therefore, the only parameter which influences behaviour is the retention parameter. Thus, Straddle and Miss show the normal decay seen with no-vision/error clamp trials with a retention parameter of approx. 0.8-0.9. But the Hit group must have a retention parameter of 1. To me, this would suggest reward is having a direct effect on retention, causing enhanced retention (relative to where they started). Can the authors explain this behaviour in the context of the adaptation modulation model? In other words, can you model the plateau performance in no vision/zero clamp trials of the Hit group when reward only affects the learning parameter?
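The argument in this comment can be made explicit with a single-rate state-space model. The form below is an assumption chosen for illustration (x_n is the adapted hand angle on cycle n, A the retention parameter, U the learning rate, and e_n the error); the manuscript's own parameterization may differ.

```latex
\begin{align*}
x_{n+1} &= A\,x_n + U\,e_n \\
e_n = 0 \;\Rightarrow\; x_{n+k} &= A^{k}\,x_n \\
A = 0.9,\; k = 5:\quad x_{n+5} &= 0.9^{5}\,x_n \approx 0.59\,x_n \\
A = 1:\quad x_{n+k} &= x_n
\end{align*}
```

Under these assumptions, a plateau during no-feedback or 0° clamp cycles requires A close to 1, whatever the value of U.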

Interpretation of results: Throughout the paper there is an assumption (first stated in paragraph five of the Introduction) that hitting the target is intrinsically rewarding. Is it not an equally valid assumption that missing the target is intrinsically punishing? Or in fact both options may be true and have differing effects. How do we know what the “null” cerebellar behaviour is (this is also relevant for the Morehead et al. and Kim et al. papers)? As the participants move from a baseline situation where they are being successful to a situation where they are being unsuccessful, couldn't the error clamp behaviour be regarded as intrinsically punishing? Therefore, across all these studies we could be observing a cerebellar process which is augmented by intrinsic punishment (aversive signals in the cerebellum are well known). Within this context, could the results not be viewed as punishment (moving from hitting the target to missing) increasing cerebellar learning (U) and reward increasing retention (A)? This is relevant for paragraph six of the Discussion: modification of the A parameter alone would indeed lead to a higher asymptote, in opposition to the current results; however, the retention findings in Experiments 1 and 2 suggest that A is affected by reward (see above). If both A is increased and U decreased (by a lack of intrinsic punishment) when hitting the target, then lower asymptotes can be observed and would explain the findings (including the retention results, which are not possible with a learning-only effect).
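The asymptote argument above can be checked with a short worked equation. This is a sketch under the assumption that, with a constant clamp error, the update reduces to x_{n+1} = A x_n + U (the constant error is absorbed into U); the numerical values are hypothetical and chosen only to show the direction of each effect.

```latex
\begin{align*}
x_\infty = A\,x_\infty + U \;&\Rightarrow\; x_\infty = \frac{U}{1 - A} \\
A = 0.90,\; U = 1.0:\quad x_\infty &= 10 \\
A = 0.95,\; U = 1.0:\quad x_\infty &= 20 \quad (\text{raising } A \text{ alone raises the asymptote}) \\
A = 0.95,\; U = 0.4:\quad x_\infty &= 8 \quad (\text{raising } A \text{ while lowering } U \text{ can lower it})
\end{align*}
```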

Model comparison: Another issue I have is the use of the model in the transfer phase. Figure 5B clearly shows the predictions of the transfer phase for the straddle-hit group based on the two opposing theories. Considering this is the key result to the entire paper, surely the authors need to compare which model predicts the data better? To me the data seems to sit in the middle of these model predictions, rather than providing robust evidence for either. The authors could compare their initial adaptation modulation prediction as outlined in Figure 5F with a model that reflects what would happen with a movement reinforcement model (no change/flat line). They could use BIC or AIC model selection to determine which model explains the individual participant data (or the bootstrapped data) better. At the moment, there is no comparison between the opposing models. The authors have simply provided a prediction for the adaptation modulation model and said it approximately looks similar to the data (without even providing any form of model fit i.e. r-squared for transfer phase).
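As a hedged illustration of the comparison requested here, the sketch below computes AIC and BIC for two candidate transfer-phase predictions under a least-squares (Gaussian residual) assumption. The data vector, both prediction vectors, and the parameter counts are hypothetical placeholders, not values from the manuscript.

```matlab
% Hypothetical transfer-phase hand angles (deg); NOT data from the study.
y = [20 19 18 17.5 17 16.5 16 15.5 15 15];
n = numel(y);

% Two hypothetical model predictions for the same cycles.
pred_flat  = 20 * ones(1, n);            % "no change" (flat line) prediction
pred_decay = 15 + 5 * 0.8 .^ (0:n-1);    % gradual decline toward a lower asymptote

k_flat  = 1;   % assumed number of free parameters in the flat model
k_decay = 3;   % assumed number of free parameters in the decay model

rss = @(pred) sum((y - pred).^2);        % residual sum of squares

% Least-squares forms of AIC and BIC (additive constants dropped).
aic = @(pred, k) n * log(rss(pred) / n) + 2 * k;
bic = @(pred, k) n * log(rss(pred) / n) + k * log(n);

aic_flat  = aic(pred_flat,  k_flat);
aic_decay = aic(pred_decay, k_decay);
bic_flat  = bic(pred_flat,  k_flat);
bic_decay = bic(pred_decay, k_decay);

fprintf('AIC: flat = %.2f, decay = %.2f\n', aic_flat, aic_decay);
fprintf('BIC: flat = %.2f, decay = %.2f\n', bic_flat, bic_decay);
```

The model with the lower score is preferred; applying the same calculation to each participant (or each bootstrap sample) would give a distribution of score differences rather than a single visual judgment.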

Similar conclusions to Leow et al., 2018: I am unsure what the rules are regarding a preprint but the conclusions for the current manuscript are very similar to Leow et al., 2018; implicit sensorimotor adaptation is reduced by task-based reward (modulating the amount of task error/reward). So how novel are these conclusions?

No normality or homogeneity of variance tests: As ANOVAs are used, tests for normality (Shapiro-Wilk) and homogeneity of variance (Levene) are required. If these have been performed, then this needs to be made clear.

Reaction and movement time across experiments: the authors state in the Materials and methods that there was a 300ms movement time cutoff. Did this include reaction time? Did the authors examine reaction time during adaptation and no-vision/0 error clamp blocks, rather than at baseline? This analysis should be included (i.e. reaction and movement time for the trials used in Figure 1E, F and similar for Experiment 2).

Early adaptation rate: I take issue with calling this a measurement of “rate” (subsection “Data Analysis”)? This value is a measurement of average early performance across cycles, not the change in performance which would be measured by calculating the difference between cycles. Or is this what the authors did and the methods are incorrect?

Experiment 1 post hoc t-tests: Are these corrected for multiple comparisons (p=0.16)? Although both would survive, this is simply good practise.

Experiment 3 statistics: The order of statistical analysis is odd. Why not start with the omnibus ANOVA and then describe the comparisons between the 3 groups? In addition, I do not like the use of planned comparisons unless these are pre-registered somewhere. I am guessing the t-test between the control group and straddle-to-hit does not survive a correction for multiple comparisons? However, as this is a control experiment, planned comparisons are begrudgingly acceptable.

Wrong group described (Discussion paragraph three): The straddle group is similar to the Miss group.

Reviewer #3:

Implicit recalibration is a specific learning process that is known to be implicit, cerebellum-dependent, and driven by sensory prediction errors. Previous work has suggested that this process is insensitive to reward. The core claim in this paper is that, in fact, implicit recalibration can be modulated by reward in that failure to earn a reward amplifies sensitivity to sensory prediction error. If true, this is certainly an interesting and important finding. I do not believe, however, that the experiments presented here are capable of supporting this conclusion.

The experiments employ a clever “error clamp” paradigm which the authors have previously shown induces implicit adaptation of reaching movements. The data in the current paper demonstrate clearly that the extent of adaptation under this clamp depends on the size of the target. If the target is big enough that the cursor lands inside it, less is learned from the error. I think it is fair to conclude from these results that there is an implicit learning process at play which is sensitive to reward. My major concern, however, is with the attempt to distinguish between the "Adaptation modulation" theory and the "Movement reinforcement" theory, which really is the heart of the paper.

The authors make the very strong assumption that their clamp manipulation isolates a single learning process – implicit recalibration driven by SPE. The results state quite clearly: "This [clamp] method allows us to isolate implicit learning from an invariant SPE, eliminating potential contributions that might be used to reduce task performance error." I agree that the clamp isolates implicit learning, but couldn't there be an implicit contribution that is not SPE-driven but is instead driven by reward/task error? How can the authors be certain that the participants were not engaging in some kind of re-aiming that may have been implicit or involuntary?

The authors do seem to entertain the possibility of distinct SPE-based and reward-based processes, alluding in the Abstract to a model "in which reward and adaptation systems operate in parallel." Figure 4 even quite clearly illustrates such a parallel architecture. But by later in the paper the myriad possible alternatives to the "Adaptation Modulation" model are collapsed onto a single idea in which reward reinforces particular movements, causing them to act as an “attractor”. Ultimately, the conclusion rests on rejecting this “attractor” / "Movement Reinforcement" model because it incorrectly predicts that asymptote will be maintained in Experiment 3 when the target becomes larger. The Adaptation Modulation model is the only model left and so must be correct.

Of course the “reinforcement” model represents only one very specific alternative proposal about how reward might influence learning under the clamp (other than by modulating SPE-driven recalibration). Rejecting this one particular model is not grounds to reject EVERY possible architecture in which reward does not act by modulating SPE-driven adaptation. For this logic to work, it would have to be the case that ANY model in which reward does not modulate SPE-driven learning would necessarily predict a consistent asymptote after the straddle->hit transition.

It is not too difficult to formulate a model in which reward does NOT modulate SPE-driven learning, but which doesn't predict a consistent asymptote after the straddle->hit transition. Suppose behavior in the clamp is the sum of two distinct processes acting in parallel, one feeding on SPE, and one feeding on task error, both with SSM-like learning curves. The latter process might correspond to a kind of implicit re-aiming and might plausibly only be updated when a movement fails to earn reward. Then disengaging this system by suddenly providing reward (at the transition from straddle to hit) might lead to a gradual decline in that component of compensation. This would lead to behavior quite similar to what the authors observe. I have appended MATLAB code at the end of this review which implements a version of this model.

Finally, I would add that I'm not even convinced that it is reasonable to assume that the "Movement Reinforcement" “attractor” model, in which the authors suggest that reinforcing movement attenuates decay, would necessarily lead to a persistent asymptote. What if the reinforcement only partially attenuates decay, or requires repetition to do so (during which time it may decay slightly)?

For these reasons, I don't believe that the primary conclusion is tenable.

Putting aside the question of adaptation modulation, I do like the experiments in this paper and I think phenomena that have been shown are interesting. I think the results convincingly establish the existence of an implicit learning process that is sensitive to reward, and the paradigm provides a platform for exploring the properties of this process. I would suggest the authors try to develop the manuscript along these lines, though it is not particularly satisfying if the results cannot distinguish between modulation of SPE-driven learning by reward and a parallel, reward-driven process as I feel is the case at present.

As a constructive suggestion, it may be enlightening to examine behavior in the 0deg clamp and no-feedback conditions more closely. The authors draw no attention to it in the paper (nor really explain why these blocks were included) but the data seem to suggest a difference in rate of decay between the “hit” and “miss” conditions in Experiments 1 and 2. Different theories about the nature of the implicit reward-based learning would predict quite different behavior in these conditions, but it's impossible to judge from such short periods whether the decay curves might converge, stay separate, or even cross over. I wonder if an additional experiment exploring this phenomenology in more detail, with a decay block that is longer than 5-10 cycles, might be enlightening, though I admit I'm unsure if this would necessarily help to disambiguate the SPE-modulation model from the alternative “parallel” model. Perhaps comparing a NoFB condition to a 0-deg clamp condition might be useful. A more concrete (and exhaustive) set of theories/models would help to formulate clearer predictions about what to expect here.

Matlab Code:

clear all

Ntrials = 220; % number of trials

x_reward = zeros(2,Ntrials); % reward component of compensation (one row for each group)
x_spe = zeros(2,Ntrials); % SPE-driven component of compensation (one row for each group)

U_reward(1) = 0.3; % initial sensitivity of reward component to task error for first group (straddle->hit)
U_reward(2) = 0; % initial sensitivity of reward component to task error for second group (hit->straddle)
U_spe = 0.25; % sensitivity to sensory prediction error throughout

A_reward = 0.9; % retention of reward component
A_spe = 0.98; % retention of SPE-driven component

for i=2:Ntrials
    if(i==131) % swap U_reward values for the two groups on trial 131
        U_reward = fliplr(U_reward);
    end

    % update components
    for j=1:2 % iterate through two groups
        x_reward(j,i) = A_reward*x_reward(j,i-1) + U_reward(j); % reward component update
        x_spe(j,i) = A_spe*x_spe(j,i-1) + U_spe; % SPE component update
    end
end

% sum components to get total compensation
x_total = x_reward+x_spe;

% plot results
figure(2); clf; hold on
plot(x_total(1,:),'b','linewidth',2)
plot(x_total(2,:),'g','linewidth',2)
xlabel('Trial Number')
ylabel('Reach Angle')
legend('Straddle->Hit','Hit->Straddle')
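As a brief supplement to the simulation above, the following sketch computes the closed-form asymptotes implied by its parameter values and compares them with the simulated state just before the transfer. The parameter values are copied from the listing; the variable names asym_* and pre_transfer are introduced here for illustration only.

```matlab
% Parameter values copied from the reviewer's simulation.
A_reward = 0.9;   U_reward = 0.3;    % task-error (reward) component
A_spe    = 0.98;  U_spe    = 0.25;   % SPE-driven component

% The fixed point of x = A*x + U is U / (1 - A).
asym_reward = U_reward / (1 - A_reward);  % reward component while the target is missed
asym_spe    = U_spe    / (1 - A_spe);     % SPE component (same for hit and miss)

asym_miss = asym_reward + asym_spe;  % total asymptote with the reward component engaged
asym_hit  = asym_spe;                % total asymptote once U_reward is set to 0

% Numerical check: iterate the straddle->hit group up to trial 130,
% i.e. the last trial before the U_reward values are swapped.
x_r = 0;
x_s = 0;
for i = 2:130
    x_r = A_reward * x_r + U_reward;
    x_s = A_spe    * x_s + U_spe;
end
pre_transfer = x_r + x_s;

fprintf('Closed-form asymptotes: miss = %.2f, hit = %.2f\n', asym_miss, asym_hit);
fprintf('Simulated total at trial 130: %.2f\n', pre_transfer);
```

Because A_spe is close to 1, the SPE component is still rising at trial 130, so the simulated total sits below the closed-form "miss" asymptote. After the swap, the total relaxes toward the lower "hit" asymptote as the reward component decays.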

Minor Comments:

Results third paragraph: “angle” or “deviation” may be a better word here than “offset” which could easily be confused with switching off the clamp.

The "models" are never formulated in concrete terms. I would be fine with a verbal argument as to why the Movement Reinforcement theory should predict a consistent asymptote in Experiment 3, but it feels a stretch to call them “models” when they are never concretely fleshed out. Fleshing out the models mathematically may also help to bring to the surface some of the tacit assumptions that have led to the headline conclusion of the paper.

The "model" that generates the "model prediction" in Figure 5F is rather ad-hoc. It’s odd to suggest that the parameters of the underlying process would instantly flip. Also why are the predictions discontinuous across the dashed “transfer” line?

Results subsection “Experiment 3”: "the predictions of the model slightly underestimated the observed rates of change for both groups". Do you mean OVERestimated?

[Editors' note: further revisions were suggested, as described below.]

Thank you for submitting your article "The influence of task outcome on implicit motor learning" for consideration by eLife. Your article has been reviewed by three peer reviewers, including Tamar R Makin as the Reviewing Editor and Reviewer #1, and the evaluation has been overseen by Timothy Behrens as the Senior Editor. The following individual involved in review of your submission has agreed to reveal their identity: Adrian M Haith (Reviewer #3).

All three reviewers were fully satisfied with the amendments made to the manuscript, and agreed the manuscript was much improved. Reviewers 2 and 3 provided some valuable follow-up suggestions, which I expect you would be happy to consider. I'm pasting these below for your benefit.

Please incorporate your changes in a final version of your manuscript (no need for a detailed rebuttal).

Reviewer #1:

I find the revised manuscript much improved. The analysis and discussion are more comprehensive and provide a more nuanced, and as such more transparent and balanced, account of the very interesting behavioural effects observed in the study. I'm also satisfied with the responses provided to the reviewers' comments. On a personal level, I found the manuscript easier to read (though this might be due to a repetition effect!). I have no further comments.

Reviewer #2:

The authors have done a great job at addressing the issues highlighted by all reviewers, and the manuscript is much improved. Although the overall story is less eye-catching than “reward modulates SPE learning”, I believe it is now a far more informative piece of work which is still novel. I have a few fairly small outstanding issues:

Collinearity of parameters: Could the authors test the collinearity of the parameters in each model? In other words, is the data rich enough to have 4/6 independent parameters (I know this is an issue when using the 2-state Smith model with certain paradigms)? For example, when you use the bootstrapping procedure how correlated are the parameters? A heat map maybe for each model would be a good way to show this. This will ensure the results from paragraph three of subsection “Experiment 3 – Modeling Results” are valid. Also parameter recovery (Palminteri et al., Plos CB, 2017) might be of interest.
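One way to run the check suggested here is to correlate the parameter estimates across bootstrap resamples. The sketch below uses synthetic estimates, deliberately constructed so that two parameters trade off; the parameter names (A, U, B) and all values are hypothetical and do not come from the models in the manuscript.

```matlab
% Synthetic bootstrap output: one row per resample, one column per parameter.
% These values only illustrate the check; they are NOT model fits.
rng(1);                                   % fixed seed for reproducibility
n_boot = 1000;
A_est = 0.95 + 0.01 * randn(n_boot, 1);   % retention estimates
U_est = 0.50 - 20 * (A_est - 0.95) + 0.05 * randn(n_boot, 1);  % built to trade off with A
B_est = 0.20 + 0.02 * randn(n_boot, 1);   % a third, independent parameter

params = [A_est, U_est, B_est];

% Pairwise Pearson correlations across resamples (3-by-3 matrix).
R = corrcoef(params);
disp(R);

% imagesc(R, [-1 1]); colorbar   % would render R as a heat map
```

Off-diagonal entries near +1 or -1 would indicate that the data cannot separate the two parameters; entries near 0 suggest they are independently constrained.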

Control group figure: I would add Figure 5—figure supplement 1 to the main document.

Discussion (paragraph nine of subsection “Modeling the Influence of Task Outcome on Implicit Changes in Performance”): Aren't the target jump predictions similar to what Leow et al., 2018, have already shown to be true?

Reviewer #3:

The authors have revised the paper very thoroughly. For the most part, I find that the paper hits the mark well in terms of articulating theories that can versus can't be disambiguated based on the current experiments. I do, however, have a few further suggestions to help clarify the paper further.

Use of the term “target error” is a bit ambiguous. At times it is characterized as a binary hit/miss signal (e.g. Introduction paragraph five; Discussion paragraph five). At other times it seems to refer to a vector error (i.e. having magnitude and direction) (e.g. Introduction paragraph six distinguishes it from “hitting the target”; penultimate paragraph of subsection “Experiment 2”). This is liable to cause considerable confusion. It confused me until I read the description of the dual adaptation model; in practical terms it is tantamount to a binary signal, given the constant error size and direction. But I think a little more conceptual clarity is needed, e.g. does TE already take reward into account? Or are TE and SPE equivalent (in this task) but learning from TE is subject to modulation by reward, while learning from SPE is not?

The term “model-based” is used throughout, sometimes (e.g. paragraph four of subsection “Modeling the Influence of Task Outcome on Implicit Changes in Performance”) in the sense of a model-based analysis of the effects of reward, i.e. examining how presence/absence of reward affects fitted parameters of a model (this usage is fine). At other times (e.g. Conclusions section) it's used essentially as a synonym for error-based. I'm a bit skeptical as to whether this latter usage adds much value or clarity. I appreciate it derives from Huang et al. and Haith and Krakauer, who made a case that implicit adaptation reflected updating an internal forward model. However, error-driven learning need not necessarily be model-based. For instance, if learning is driven by target error, it's not clear that this has anything to do with updating any kind of model of the environment (and "inverse model" doesn't really count as a proper model in this sense). So I would caution against using the term “model-based” when the idea doesn't inherently involve a forward model. Particularly given the two distinct usages of “model-based” in this paper.

The Movement Reinforcement model is reasonable, although it is fairly ad hoc. The Discussion argues convincingly that this model is largely illustrative, and that its behavior can be taken as representative of a broad class of “operant reinforcement” models. I think articulating something to this effect earlier, when the model is first introduced, would be helpful. At the moment, it comes from nowhere and it's a little perplexing to follow exactly why this is the model and not, say, a more principled value-function-approximation approach. With that in mind, some of the finer details of the model might be better placed in the Materials and methods section so as not to bog down the results with specifics that aren't all that pertinent to the overall argument.

Minor Comments:

I appreciate the discussion of implicit TE-driven learning in motivating the Dual Error model (subsection “Theoretical analysis of the effect of task outcome on implicit learning”). But I was surprised the authors didn't mention this again in the discussion, instead only speculating that TE-based learning might be re-aiming that has become implicit through automatization/caching, and consequently making the dual error model seem implausible. But it seems perfectly plausible that TE-based learning is just another implicit, error-based learning system, separate from SPE-driven implicit learning, that never has anything to do with re-aiming.

Subsection “Modeling the Influence of Task Outcome on Implicit Changes in Performance”: it doesn't seem necessary to invoke “SPE-driven” here. It could in principle be error-based learning driven by something like "target error" (i.e. just the distance between the center of the cursor and the center of the target). Ditto in the Conclusion section.

Introduction section: "We recently introduced a new method … designed to isolate learning from implicit adaptation": this sentence is slightly ambiguous. I first read it as though learning and implicit adaptation are separate things being dissociated. Maybe just drop "learning from"?

Introduction section: "Given that participants have no control over the feedback cursor, the effect of this task outcome would presumably operate in an implicit, automatic manner." It's not having no control that makes it implicit… Might be better rephrased to something like "Given that participants are aware that they have no control over the feedback cursor…"?

Second paragraph of subsection “Theoretical analysis of the effect of task outcome on implicit learning”: this paragraph misses a key detail, that “reinforcing” or “strengthening the representation of” rewarded actions really means that it makes those actions more likely to be selected in the future.

Third paragraph of subsection “Theoretical analysis of the effect of task outcome on implicit learning”: “composite” is somewhat vague. Would “sum” or “average” be accurate?

Third paragraph of subsection “Experiment 3 – Modeling Results”: something is up with the brackets here.

https://doi.org/10.7554/eLife.39882.021

Author response

Major comments:

1) The authors focus their design and interpretation on SPE-driven learning, under the assumption that the error clamp isolates a single learning process, implicit recalibration driven by SPE. The possible alternatives to the "Adaptation Modulation" model are collapsed onto a single model that represents only one very specific alternative proposal about how reward might influence learning under the clamp. Therefore, rejecting this one particular model is not grounds to reject every possible architecture in which reward does not act by modulating SPE-driven adaptation. The reviewers point at several possible interpretations of what is happening and the paper needs to be thorough in considering these. Couldn't there be an implicit contribution that is not SPE-driven but is instead driven by reward/task error? How can the authors be certain that the participants were not engaging in some kind of re-aiming that may have been implicit or involuntary? The “hit” condition might increase salience, rather than impact SPE-related mechanisms, etc. Alternatively, there could be unexplored processes underlying the “miss” group, and the target size effect may be driven by alternative processes.

This comment was at the center of our thinking throughout the revision process and has inspired major changes in the manuscript. The reviewers are correct: rejecting one model is not “grounds to reject every possible architecture…”. The revision now includes a substantially expanded section on the modeling front, including the addition of an alternative model in which we consider how a second error signal, based on the task outcome, could be used to recalibrate an internal model. While obviously not exhaustive, we now formally present three distinct models (Movement Reinforcement, Adaptation Modulation, and Dual Error) and briefly consider possible combinations of these models. The Dual Error model has been adapted from reviewer 3’s comments and postulates that implicit motor learning could result from target error as well as from SPE. To preview, this model provides the best fit to the data and figures prominently in Experiment 3 and the Discussion.

We gave extensive consideration to whether we should consider the effect of target error as a form of re-aiming. Indeed, this led us to submit a query to the reviewers on this point given our concerns that an aiming-based account didn’t seem congruent with traditional views on the role of aiming in sensorimotor adaptation. This query led to an extended Skype session with reviewer 3, Adrian Haith, and this discussion helped sharpen our thinking on this front, as well as inspired some ideas for future experiments. The Discussion section includes a few paragraphs on this issue, both in terms of explicit and implicit forms of an aiming hypothesis. There we highlight how an implicit process sensitive to target error does not share key features of explicit aiming (error size-dependency, flexibility, fast learning), or the idea of a cached (and thus implicit) aiming process. In the end, we think it suffices at this stage to highlight that the results can be accounted for by the operation of two internal models, one operating on SPE and the other on TE. We sketch out ways in which future work can directly test this model and also explore whether these different learning systems engage similar or dissimilar neural systems.

With regard to the issue of unexplored processes underlying the Miss group, we take that up in our response to the following comment.

2) While focusing on the SPE-driven learning, an equally valid alternative to the authors' account is that not hitting the target could be conceived as punishment, or in other words, failure to earn a reward amplifies sensitivity to sensory prediction error. Based on this logic, punishment (or lack of reward) will increase cerebellar learning, as the key mechanism underlying the results [see more specific comments in the individual reviews]. Relatedly, the "reward reduces learning" pitch requires some further conceptual qualification, e.g. from other related literature.

The point here is well-taken: Rather than think of hitting the target as providing a “reward” that attenuates the operation of this process, it may well be that missing the target provides a “punishment” that amplifies the operation of this process. We now make this point explicitly in our formalization of the Adaptation Modulation model and the new Dual Error model, as well as in the Discussion.

In the Dual Error model, misses can be considered to add to learning by providing a second error-based learning process, one sensitive to target error. In contrast, the Movement Reinforcement model incorporates a process that only learns a bias when hitting the target (i.e., when rewarded), and thus also contrasts with the other two models in that it cannot be interpreted from a “failure sensitizes learning” perspective.

We have expanded our discussion of previous studies on the impact of task outcome on adaptation in a number of places. Here, too, we are grateful to reviewer 3 for bringing to our attention the relevance of some of the earlier work in which the target is displaced, sometimes bringing it into alignment with a perturbed cursor (and thus manipulating target error independent of SPE).

We recognize that, in a number of studies using more traditional adaptation tasks (e.g., standard visuomotor rotation), reward is viewed as improving learning (Galea et al., 2015; Nikooyan and Ahmed, 2015), and indeed, a boost would seem more intuitive. The clamp task differs in that “learning” is arguably detrimental to performance, since the instructions to the participants are to ignore the cursor and reach straight toward the target. More generally, the revision includes many changes to clarify the assumptions of the models (including their formalization), as well as to note limitations of our particular implementations.

3) The reviewers were particularly dissatisfied with how the results of Experiment 3 were interpreted, based on the “adaptation” and “reinforcement” models. As this is perhaps the most crucial result in the paper, this requires more consideration. The current model assumes that reward has an immediate effect on learning rate parameters, when it seems likely it would have a more incremental onset. The model also needs to explain the differences in no vision/error clamp retention observed in Experiment 1 and 2. Furthermore, the model put forward for “reinforcement” was in no way exhaustive, and therefore doesn’t adequately represent the alternative process [see individual reviews for more specific suggestions]. Crucially, the comparison across the two competing "models" needs to be formally quantified. Currently, the data seems to sit in the middle of the two model predictions, rather than providing robust evidence for either.

We have overhauled the presentation of Experiment 3, formalizing all three of the models including the new Dual Error model, opting for a different method to fit the data, and quantifying the model comparisons using goodness-of-fit measures and AIC scores for model selection. We emphasize that the results do not support the Movement Reinforcement model in its current implementation, and more generally, any model that incorporates independent activity of model-based adaptation and model-free reinforcement learning processes. The new fitting method, by which we model the data for both groups simultaneously, shows that the Adaptation Modulation and Dual Error models do a good job of fitting the data and accounting for the significant change in asymptote for the Straddle-to-Hit group (see new Figure 6).
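As a rough illustration of the model-comparison step described above, the sketch below computes R-squared and a least-squares form of AIC for fitted curves. It assumes Gaussian residuals, for which AIC can be written (up to an additive constant) as n·ln(RSS/n) + 2k. The function names and toy numbers are hypothetical and are not taken from the analysis code used in the paper.

```python
import math


def r_squared(observed, predicted):
    """Proportion of variance in `observed` accounted for by `predicted`."""
    mean_obs = sum(observed) / len(observed)
    ss_tot = sum((y - mean_obs) ** 2 for y in observed)
    ss_res = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    return 1.0 - ss_res / ss_tot


def aic_least_squares(residuals, n_params):
    """AIC under Gaussian residuals, up to an additive constant.

    Assumes the residual sum of squares is non-zero.
    """
    n = len(residuals)
    rss = sum(r * r for r in residuals)
    return n * math.log(rss / n) + 2 * n_params


# Toy data (hypothetical): two candidate models with the same number of
# free parameters, one of which tracks the data more closely.
observed = [2.0, 4.1, 5.9, 8.2]
pred_a = [2.0, 4.0, 6.0, 8.0]
pred_b = [3.0, 3.0, 7.0, 7.0]

aic_a = aic_least_squares([y - p for y, p in zip(observed, pred_a)], n_params=2)
aic_b = aic_least_squares([y - p for y, p in zip(observed, pred_b)], n_params=2)
print("Model A preferred" if aic_a < aic_b else "Model B preferred")
```

Lower AIC indicates the preferred model. With equal parameter counts the comparison reduces to goodness of fit, while the 2k term penalizes models that gain fit only by adding parameters.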

4) The reviewers felt that more data/analysis/discussion of no visual feedback and zero error clamp is needed, to qualify some of the assumptions of SPE-driven learning. This will help clarify to what extent the known properties of reward in adaptation tasks (often assumed to be explicit) might be attributed to an implicit process. This should also allow the authors to deal with the different rates of washout observed (but not discussed) in Experiment 1 and 2, indicating a role of reward in retention, with important potential implications on different models for learning. [see some specific suggestions in the individual reviews]

We now include an analysis and discussion of the no feedback and 0° clamp data. The Hit Target groups do show less decay of the adapted state when analyzed in terms of the absolute change in hand angle. However, retention is normally viewed in proportional terms (and formalized this way in all of our models). When the analysis is performed in terms of proportional changes, there were no reliable differences between the conditions in either Experiment 1 or Experiment 2. Similarly, although the bootstrapped estimates of the retention parameter were higher in the Hit condition relative to the Miss (or Straddle) condition, the confidence intervals indicate that the modulation of retention due to task outcome was much less robust than the changes in learning rate. We provide all of this information in the revision, and also acknowledge that hitting the target may enhance retention.
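The distinction between absolute and proportional change can be made concrete with a minimal sketch. It assumes that, with no error signal, the adapted state simply decays as x(n+1) = A·x(n). The starting hand angles and the value of A below are illustrative only, not estimates from the data.

```python
def washout(x0, A, n_trials):
    """Hand angle across trials with no error signal: x(n+1) = A * x(n)."""
    traj = [x0]
    for _ in range(n_trials):
        traj.append(A * traj[-1])
    return traj


def absolute_change(traj):
    """Change in hand angle (degrees) from first to last trial."""
    return traj[-1] - traj[0]


def proportional_retention(traj):
    """Fraction of the initial hand angle still present on the last trial."""
    return traj[-1] / traj[0]


# Hypothetical groups: same retention factor A, different adapted states
# at the start of washout.
low_start = washout(10.0, A=0.95, n_trials=10)
high_start = washout(25.0, A=0.95, n_trials=10)

print(absolute_change(low_start), absolute_change(high_start))
print(proportional_retention(low_start), proportional_retention(high_start))
```

Under this assumption, a group that starts washout at a lower hand angle loses fewer degrees in absolute terms even when its retention factor is identical, which is why the two analyses can lead to different conclusions.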

5) Inherent to the study is the key assumption that hitting the target would be intrinsically rewarding to the participants, despite explicit information that this is irrelevant to the task (and presumably the monetary reward?). The assumption is unsubstantiated and needs better evidence (e.g. pupillometry, control conditions with monetary reward).

Inspired by the reviewers’ comments, our interpretations no longer rely on the assumption that hitting the target is intrinsically rewarding. Rather, we focus more generally on how task outcome can affect overall implicit learning. We note that there is precedent for the intrinsic reward idea (Leow et al., 2018; Huang et al., 2011). However, as developed in the revision, this is just one of the directions we take in the modeling. We now also consider the possibility that missing the target creates a separate target error signal for learning (as in the Dual Error model), or that the task outcome might act as a gain controller on adaptation (as in the Adaptation Modulation model).

Separate reviews (please respond to each point):

Reviewer #1:

This paper elegantly dissects a curious phenomenon, where small variations in task-irrelevant visual information dramatically and systematically modulate participants' performance on a basic motor reaching task. The study design is elegant, the results seem robust, the statistical analysis and reporting is appropriate, and the manuscript is clearly written. While I take no issue with the design, analysis and results, I have to question the interpretation the authors provide for their own findings. I wish to raise a few alternative accounts that I don't think have been sufficiently addressed in the manuscript, and I also ask for some clarification on the key interpretation the authors are offering.

Major comments:

1) If the visual feedback determines the SPE, then in the Hit condition there's no error, you correctly hit the target, and therefore there is no "need" for corrective movement (i.e. adaptation). According to this account, the target size ("reward") impacts the SPE by modulating the cursor feedback (Figure 4), not the adaptation or the movement. In other words, the subjects interpreted the visual feedback as correct task performance.

We have revised the manuscript to clarify our definitions of sensory prediction error and target error. The SPE is defined in terms of the angle between the centers of the cursor and target. As such, one can still have an SPE, even when hitting the target. So, even if the participant thinks they are performing the task as instructed (“reach directly to the target and ignore the feedback”), the motor system is modified. In the current work, target error is defined by whether the cursor hits or misses the target. We do not know if this need be binary; the fact that the Miss and Straddle groups lead to similar behavior suggests there may be some categorical aspect to this. Nonetheless, future work is required to determine if the learning from TE scales with the size of the target error.

This should not be translated to kinematic changes during baseline, because the visual cursor feedback is missing from the baseline.

In our alternative hypotheses section (subsection “Attenuated behavioral changes are not due to differences in motor planning”), we consider other ways in which the manipulation of target size might affect behavior. The kinematic analyses were performed to assess the extent to which differences in planning reaches to small or large targets might influence learning. On the comment about the lack of cursor feedback during baseline, we note that in all experiments there is a baseline period with veridical cursor feedback and that these are the data used in our kinematic and temporal analyses.

The authors suggest a control in their third experiment (the condition with a bisected target), but I didn't really understand how this eliminates the SPE interpretation.

The bisected target group in Experiment 3 was to test a perceptual uncertainty hypothesis: Perhaps there is less certainty about the position of the cursor in the large target context and this weakens the SPE signal, resulting in attenuated adaptation. We added the bisecting line to provide a salient reference point, thus making it clear that the clamped cursor was off center. As our analyses show, the target size effect was not influenced by the bisecting line, arguing against the hypothesis that the effect of the large target is due to perceptual uncertainty.

2) The kinematics analysis could be potentially useful for ruling out that the large target was more salient. But the fact that participants tended to show lower variability (experiment 2) and faster RT (Experiment 1) in the Hit target condition is concerning [*please note that upon discussion it has been agreed by the reviewers that this response profile is also compatible with reward, consistent with the authors' main interpretation]. They need to perhaps put the groups from Experiments 1 and 2 together to compare these baseline differences? Regardless, they should definitely report the results for Experiment 1 in the main results.

We now include the analyses of temporal and kinematic variables from Experiment 1 in the main text, as well as a table of RTs, MTs, and movement variability from all three experiments in the Appendix. Baseline RTs were different in Experiment 1, with faster RTs observed in the Hit group compared to the other two conditions. However, there were no differences between the Hit and Straddle groups in Experiment 2, and in fact, numerically, RTs were slower for the Hit group. There were no differences nor trends in the movement variability measures between groups in Experiments 1 and 2. Given the absence of systematic differences in the temporal and kinematic measures (see Tables 1-3 in the Appendix), we infer that there were no substantial differences in motor planning between the target size conditions.

3) I don't really get the authors' conceptual pitch (that reward reduces learning), assuming that adaptation (perturbed hand position) is learning, why would a reward for a perturbed trial reduce that learning? Are there any precedents to this concept from other research? Surely, animal work on reinforcement learning would have noticed such an effect before, if it exists? Their discussion didn't help me understand their idea any better.

Please refer to our response to the latter part of Summary Comment 2.

4) Inherent to the study is the key assumption that hitting the target would be intrinsically rewarding to the participants, despite explicit information that this is irrelevant to the task (and presumably the monetary reward?). I'm not sure what they mean here by reward, and how this could be generalised to other fields of research where reward is a key concept (e.g. decision making, reinforcement learning).

Please see our response to Summary Comment 5.

5) How much of the late effect shown across studies is actual adaptation/learning (i.e. due to the visual feedback), as opposed to drift? What happens when you ask participants to just repeatedly perform a movement to a target with no visual feedback? It seems necessary to quantify these drifts, in order to accurately account for the "learning".

All hand angles were baseline-corrected to account for intrinsic biases. We counterbalanced clockwise and counter-clockwise clamp directions within each group; the data for the clockwise group were flipped for the analyses and visualizations of the data. The counterbalancing across subjects would cancel out any systematic biases which were not due to the error clamp. Hand angles going in the positive direction are indicative of a sign-dependent correction in response to error—that is, in the opposite direction of the clamp. Finally, we note that we have included a 0° visual clamp in two previous studies as a baseline condition, and found no evidence of systematic drift (Morehead et al., 2017; Kim et al., 2018).
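The preprocessing described above can be sketched as follows. The sketch assumes hand angles in degrees, the mean of each participant's baseline trials as the bias estimate, and a sign flip for the clockwise group so that compensation is positive for everyone. The function name and the "cw"/"ccw" labels are hypothetical.

```python
def preprocess_hand_angles(hand_angles, baseline_angles, clamp_direction):
    """Baseline-correct hand angles and align the sign across clamp directions.

    clamp_direction: "cw" or "ccw" (hypothetical labels). Data from the
    clockwise group are flipped, as described in the text.
    """
    if clamp_direction not in ("cw", "ccw"):
        raise ValueError("clamp_direction must be 'cw' or 'ccw'")
    bias = sum(baseline_angles) / len(baseline_angles)
    corrected = [angle - bias for angle in hand_angles]
    if clamp_direction == "cw":
        corrected = [-angle for angle in corrected]
    return corrected


# Toy example: a participant with a +1 degree baseline bias.
print(preprocess_hand_angles([-3.0, -5.0], [1.0, 1.0], "cw"))
print(preprocess_hand_angles([5.0, 7.0], [1.0, 1.0], "ccw"))
```

Because any drift unrelated to the clamp should not depend on clamp direction, flipping one direction and averaging across a counterbalanced group would tend to cancel it, whereas error-driven compensation adds up.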

Minor Comments:

1) "Interestingly, these results appear qualitatively different to those observed when manipulating the clamp offset. Our previous study using clamped visual feedback demonstrated that varying clamp offset alone results in different early learning rates, but produces the same magnitude of late learning (Kim et al., 2018)." please spare us having to read this paper, what do you mean by clamp offset? please provide better description of the study (perhaps in the Discussion?).

What we meant by “clamp offset” was the angular deviation of the clamp relative to the center of the target. We have removed all usage of the phrase “clamp offset” to avoid confusion. We have now provided a clearer description of our previous work (Results paragraph five).

2) It would be very beneficial if the authors could provide the exact instructions given to the participants (i.e. the study script).

We now provide the study script in the Supplementary file 1.

3) "This was supported by our control analyses, perceptual control experiment, and our finding that the Straddle group in Experiment 1 was similar to the Hit group, suggesting that the effect of target size was categorical." In that experiment Hit was different from Straddle, which was similar to miss.

This statement has been corrected.

Reviewer #2:

The work by Kim et al. investigates a very interesting and well-investigated topic: the effect of reward on sensorimotor adaptation. Through three well-designed experiments, the authors show that task-based reward attenuates sensorimotor adaptation. The authors suggest that these results can only be explained by a model in which reward directly acts on the adaptation system (learning parameter in a state-space model). This is a well written manuscript, with clear results and interesting conclusions. My biggest issue is the lack of analysis and discussion of their after-effect (no vision/zero error clamp) data from Experiment 1 and 2. I cannot explain the differences between groups during these blocks (see below) by reward only acting on the learning component of the state-space model. To me, this data indicates that reward (Hit group) is having an effect on retention. I also have issues with the implementation of the model in Experiment 3 and the possibility that the task could also involve implicit punishment. As a result, I feel the conclusions made in this manuscript create a clear story in contrast to the data which suggests a more complicated narrative.

We now provide two analyses of the no feedback and 0° clamp data: comparing absolute change as well as the proportional change during these trials. In terms of the statistical analyses, there is actually no difference between the groups during the washout phase when analyzed as proportional change, and we report this while noting that, numerically, the data do indicate stronger retention in the Hit groups in Experiments 1 and 2. We also now provide the parameter values for the different models (subsection “Experiment 3 - Modeling Results”) and here we note that, although the biggest (and only reliable) effect is on the learning rates, the estimate of the retention parameter is larger for the Hit groups.

For a more extensive discussion of the changes in the modeling work, please see our response to Summary Comment 1. For interpretation of whether the task involves punishment, please see our response to Summary Comment 2.

Specifics:

No visual feedback vs. 0° error clamp following adaptation: Why did the authors switch from “no feedback” to “0° error clamp” from Experiment 1 to Experiment 2? No explanation is given for this switch.

We now provide an explanation in Results section, Experiment 2. In brief, we opted to use the 0° clamp in Experiment 2 as an alternative way to look at retention differences, thinking that retaining the clamp (but changing its size) might be a better way to reduce contextual effects from a change in the stimulus conditions. We followed the lead of Shmuelof and colleagues (2012) here.

After-effects: Following this, I find it odd how little the “after-effect” is analysed/discussed considering this was a major component of the analysis in the seminal work by Morehead et al., 2017, and is also analysed in Kim et al., 2018. The authors describe in the Materials and methods that the after-effect was defined by the first no-feedback cycle but then it is never further discussed.

We now point out in the Results and Materials and methods that the results here, in terms of the statistical comparison between groups, mirror what was observed in the late learning analyses. We chose to focus on the late learning analysis in Experiment 1 since this allowed us to be consistent with the analyses performed in Experiments 2 and 3, neither of which had no-feedback aftereffect trials. In addition, as noted in our response to the reviewer’s major comment, we now provide a more extensive analysis of the no-feedback trials.

Although it is difficult to see in Figure 1, the data provided by the authors reveals that performance is substantially different across groups during no feedback. Whereas, Miss and Straddle decline across trials, Hit maintains a plateau (by the end of No FB, groups are near the same level of performance). This plateau is strikingly different from the behaviour observed in Kim et al., 2018. This is also seen in Experiment 2, where the Straddle group decline across trials and the Hit group maintain a plateau (annoyingly I am unable to submit figures in my review, but this is clear in the data provided by the authors).

Please see our response to Summary Comment 4.

If the authors believe that reward is having a specific effect on learning, and not retention, how would a state-space model explain these results? During both these trial types, error has to be fixed to zero (there is no error to adjust behaviour with). Therefore, the only parameter which influences behaviour is the retention parameter. Thus, Straddle and Miss show the normal decay seen with no-vision/error clamp trials with a retention parameter of approx. 0.8-0.9. But the Hit group must have a retention parameter of 1. To me, this would suggest reward is having a direct effect on retention causing enhanced retention (relative to where they started at). Can the authors explain this behaviour in the context of the adaptation modulation model? In other words, can you model the plateau performance in no vision/zero clamp trials of the Hit group when reward only affects the learning parameter?

The revision provides a much more extensive discussion of the modeling work, including a presentation of the parameter estimates. These estimates show that, for the two viable models (Adaptation Modulation and Dual Error), there is a large difference in learning rates between the Hit and Straddle conditions. However, there is also a difference in retention rates (although there is considerable overlap in the confidence intervals). Thus, the modeling work does indicate that reward, or the lack of target error, may boost retention, in addition to the substantial effect on learning rate.

Interpretation of results: Throughout the paper there is an assumption (first stated in paragraph five of the Introduction) that hitting the target is intrinsically rewarding, is it not an equally valid assumption that missing the target is intrinsically punishing? Or in fact both options may be true and have differing effects. How do we know what the “null” cerebellar behaviour is (this is also relevant for the Morehead et al., and Kim et al., papers)? As the participants move from a baseline situation where they are being successful to a situation where they are being unsuccessful, couldn't the error clamp behaviour be regarded as intrinsically punishing? Therefore, across all these studies we could be observing a cerebellar process which is augmented by intrinsic punishment (aversive signals in the cerebellum are well known). Within this context, could the results not be viewed as punishment (moving from hitting the target to missing) increasing cerebellar learning (U) and reward increasing retention (A)?

Please see our response to Summary Comment 2.

This is relevant for paragraph six of the Discussion, modification of the A parameter alone would indeed lead to higher asymptote in opposition to the current results, however the retention findings in Experiments 1 and 2 suggest that A is affected by reward (see above). If both A is increased and U decreased (by a lack of intrinsic punishment) when hitting the target then lower asymptotes can be observed and would explain the findings (including the retention results which are not possible with a learning-only effect).

As noted above, the modeling work is substantially overhauled in the revision. As part of this revision, we have shifted the focus on what can be gleaned from the modeling work (e.g., that the Movement Reinforcement model is not viable). As the reviewer notes, in the two models we consider viable, the task outcome manipulation appears to have an effect on both the learning rate and retention parameters (albeit with a stronger effect on learning rate, based on bootstrapped parameter estimates).
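The interplay between the retention (A) and learning (U) parameters discussed in this exchange can be illustrated with a minimal sketch. It assumes a single-process state-space form, x(n+1) = A·x(n) + U, in which U is the fixed update produced by the invariant clamp on every trial, so that the asymptote is U / (1 − A) for A < 1. The parameter values are hypothetical and chosen only to show the direction of the effects.

```python
def simulate_clamp(A, U, n_trials, x0=0.0):
    """Hand angle across clamp trials under x(n+1) = A * x(n) + U."""
    x = x0
    traj = []
    for _ in range(n_trials):
        traj.append(x)
        x = A * x + U
    return traj


def asymptote(A, U):
    """Fixed point of x = A * x + U; requires A < 1."""
    if A >= 1.0:
        raise ValueError("asymptote is undefined for A >= 1")
    return U / (1.0 - A)


# Hypothetical parameter sets.
reference = asymptote(A=0.90, U=1.0)              # reference condition
higher_A_only = asymptote(A=0.95, U=1.0)          # stronger retention alone
higher_A_lower_U = asymptote(A=0.95, U=0.3)       # stronger retention, weaker learning

print(reference, higher_A_only, higher_A_lower_U)
```

Under these assumptions, raising A alone raises the asymptote, whereas raising A while lowering U sufficiently yields a lower asymptote, which is the combination the reviewer describes.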

Model comparison: Another issue I have is the use of the model in the transfer phase. Figure 5B clearly shows the predictions of the transfer phase for the straddle-hit group based on the two opposing theories. Considering this is the key result to the entire paper, surely the authors need to compare which model predicts the data better? To me the data seems to sit in the middle of these model predictions, rather than providing robust evidence for either. The authors could compare their initial adaptation modulation prediction as outlined in Figure 5F with a model that reflects what would happen with a movement reinforcement model (no change/flat line). They could use BIC or AIC model selection to determine which model explains the individual participant data (or the bootstrapped data) better. At the moment, there is no comparison between the opposing models. The authors have simply provided a prediction for the adaptation modulation model and said it approximately looks similar to the data (without even providing any form of model fit i.e. r-squared for transfer phase).

We now present three models that we see as qualitatively distinct to capture different ways in which task outcome might influence performance. We formalize our presentation of all three models, provide a new way to fit the models that analyzes all of the data simultaneously, and then report R-squared and AIC values as a way to evaluate and compare the models.

Similar conclusions to Leow et al., 2018: I am unsure what the rules are regarding a preprint but the conclusions for the current manuscript are very similar to Leow et al., 2018; implicit sensorimotor adaptation is reduced by task-based reward (modulating the amount of task error/reward). So how novel are these conclusions?

We certainly see the Leow et al. paper as relevant to the present work and discuss that paper at a few places in the manuscript. In some respects, the results of our study are consistent with the claims of Leow—we see this as a good thing, providing convergence. We do believe there are marked differences between the studies and our work allows us to reach some novel conclusions. We believe that our non-contingent feedback method, coupled with the explicit instructions about this manipulation, puts us in a strong position to argue that all of the learning effects observed here are implicit. Moreover, we provide a much more extensive computational analysis of the impact of task outcome and Experiment 3 provides a way to directly assess key predictions of these models.

No normality or homogeneity of variance tests: As ANOVAs are used, tests for normality (Shapiro-Wilk) and homogeneity are required (Levene). If these have been performed, then this needs to be made clear.

We now make explicit in the Materials and methods (Data Analysis) that we do test for both normality and homogeneity of variance. In cases where either assumption is violated, we supplemented the main analysis with non-parametric permutation tests and report the results of both the parametric and non-parametric tests. In terms of significance, the statistical outcome results were consistent in all of these cases.
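For readers unfamiliar with the non-parametric check, a two-group permutation test on the difference in means might look like the sketch below. This is a generic version written with the standard library only. The number of permutations, the test statistic, and the function name are assumptions, not a description of the exact procedure used in the paper.

```python
import random


def permutation_test(group_a, group_b, n_perm=2000, seed=0):
    """Two-sided permutation test on the absolute difference in group means.

    Returns an estimated p-value: the proportion of label shufflings whose
    mean difference is at least as large as the observed one.
    """
    rng = random.Random(seed)
    n_a = len(group_a)
    observed = abs(sum(group_a) / n_a - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    n_b = len(pooled) - n_a
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        if diff >= observed:
            count += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (count + 1) / (n_perm + 1)


# Toy data (hypothetical hand angles in degrees).
p_separated = permutation_test([1, 2, 3, 4, 5], [11, 12, 13, 14, 15])
p_identical = permutation_test([1, 2, 3, 4], [1, 2, 3, 4])
print(p_separated, p_identical)
```

Because the test builds its null distribution by shuffling group labels, it does not rely on the normality or equal-variance assumptions of the ANOVA.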

Reaction and movement time across experiments: the authors state in the Materials and methods that there was a 300ms movement time cutoff. Did this include reaction time? Did the authors examine reaction time during adaptation and no-vision/0 error clamp blocks, rather than at baseline? This analysis should be included (i.e. reaction and movement time for the trials used in Figure 1E, F and similar for Experiment 2).

We now state that the 300 ms criterion applies only to movement time and does not include RT (L954). We have added tables to the Appendix that present the RT and MT data for four different phases of the experiment (baseline, early adaptation, late learning, no feedback/0° clamp).

Early adaptation rate: I take issue with calling this a measurement of “rate” (subsection “Data Analysis”)? This value is a measurement of average early performance across cycles, not the change in performance which would be measured by calculating the difference between cycles. Or is this what the authors did and the methods are incorrect?

We have edited the Materials and methods to make clear that this measure represents the mean change in hand angle per cycle over the first five cycles. To provide a more stable estimate of hand angle at cycle 5, we averaged over cycles 3-7 of the clamp block.
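A minimal sketch of this measure follows. It assumes baseline-corrected hand angles indexed by cycle (cycle 1 first), estimates the hand angle at cycle 5 as the mean of cycles 3-7, and divides by five to obtain the mean change per cycle. The function name is hypothetical.

```python
def early_adaptation_rate(hand_angles):
    """Mean change in hand angle per cycle over the first five clamp cycles.

    hand_angles: baseline-corrected hand angle (degrees) for each clamp
    cycle, with cycle 1 at index 0.
    """
    if len(hand_angles) < 7:
        raise ValueError("need at least 7 clamp cycles")
    window = hand_angles[2:7]                 # cycles 3-7
    angle_at_cycle_5 = sum(window) / len(window)
    return angle_at_cycle_5 / 5.0


# Toy example: hand angle grows by 1 degree per cycle.
print(early_adaptation_rate([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]))
```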

Experiment 1 post hoc t-tests: Are these corrected for multiple comparisons (p=0.16). Although both would survive, this is simply good practise.

These are corrected for multiple comparisons. We report the corrected α value (.0167) in the Materials and methods.

Experiment 3 statistics: The order of statistical analysis is odd. Why not start with the omnibus ANOVA and then describe the comparisons between the 3 groups? In addition, I do not like the use of planned comparisons unless these are pre-registered somewhere. I am guessing the t-test between the control group and straddle-to-hit does not survive a correction for multiple comparisons? However, as this is a control experiment, planned comparisons are begrudgingly acceptable.

We have revised the order of the statistical analyses, following the reviewer’s guidelines here. We do think the planned comparisons are appropriate here, given the specific hypothesis we were testing in this control group. Nonetheless, as the reviewer suggests, the p-values for comparisons between the control group and the Straddle-to-Hit group are just above the corrected p-value for multiple comparisons of .0167 (asymptote p-value was .017). However, if we had not used the planned comparisons, we would have performed post-hoc Scheffe contrasts, as this contrast most accurately reflects our hypothesis that the control group should behave similarly to the Hit-to-Straddle group in the first phase, but differently from the Straddle-to-Hit group. In this case, the contrast would have been significant.
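The arithmetic behind the threshold can be checked with a short sketch. The value .0167 is consistent with dividing α = .05 equally across three comparisons (a Bonferroni-style correction); the correction method is not named in this passage, so that reading is an assumption, as are the function names.

```python
def corrected_alpha(alpha, n_comparisons):
    """Per-comparison threshold when alpha is split equally across tests."""
    return alpha / n_comparisons


def survives_correction(p_value, alpha, n_comparisons):
    """True if p_value falls below the corrected threshold."""
    return p_value < corrected_alpha(alpha, n_comparisons)


threshold = corrected_alpha(0.05, 3)
print(round(threshold, 4))
print(survives_correction(0.017, 0.05, 3))
```

With these numbers the reported asymptote p-value of .017 falls just short of the corrected threshold, matching the description in the response.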

Wrong group described (Discussion paragraph three): The straddle group is similar to the Miss group.

This has been corrected.

Reviewer #3:

Implicit recalibration is a specific learning process that is known to be implicit, cerebellum-dependent, and driven by sensory prediction errors. Previous work has suggested that this process is insensitive to reward. The core claim in this paper is that, in fact, implicit recalibration can be modulated by reward in that failure to earn a reward amplifies sensitivity to sensory prediction error. If true, this is certainly an interesting and important finding. I do not believe, however, that the experiments presented here are capable of supporting this conclusion.

The experiments employ a clever “error clamp” paradigm which the authors have previously shown induces implicit adaptation of reaching movements. The data in the current paper demonstrate clearly that the extent of adaptation under this clamp depends on the size of the target. If the target is big enough that the cursor lands inside it, less is learned from the error. I think it is fair to conclude from these results that there is an implicit learning process at play which is sensitive to reward. My major concern, however, is with the attempt to distinguish between the "Adaptation modulation" theory and the "Movement reinforcement" theory, which really is the heart of the paper.

We have overhauled the modeling work in this paper, with this part of the revision summarized in our response to Summary Comment 1.

The authors make the very strong assumption that their clamp manipulation isolates a single learning process – implicit recalibration driven by SPE. The results state quite clearly; "This [clamp] method allows us to isolate implicit learning from an invariant SPE, eliminating potential contributions that might be used to reduce task performance error." I agree that the clamp isolates implicit learning, but couldn't there be an implicit contribution that is not SPE-driven but is instead driven by reward/task error? How can the authors be certain that the participants were not engaging in some kind of re-aiming that may have been implicit or involuntary?

The reviewer brings up an excellent point, namely that we cannot be sure that the clamp manipulation isolates a single learning process, implicit recalibration driven by SPE. The Movement Reinforcement model did postulate a second implicit learning process, namely a model-free bias reflecting reward history. But we had not considered that target error might also drive implicit recalibration. This led to the development of the Dual Error model (the idea of which was initially presented to us by the reviewer), a two-process model along the lines of that introduced by Smith et al., 2006, but in the present context, one in which the two state estimates are sensitive to different error signals rather than different time scales. Further explanation of changes related to this comment can be found in our response to Summary Comment 1, including our thoughts on why we are hesitant to psychologically characterize the target-error sensitive process as one of implicit aiming.

The authors do seem to entertain the possibility of distinct SPE-based and reward-based processes, alluding in the Abstract to a model "in which reward and adaptation systems operate in parallel." Figure 4 even quite clearly illustrates such a parallel architecture. But by later in the paper the myriad possible alternatives to the "Adaptation Modulation" model are collapsed onto a single idea in which reward reinforces particular movements, causing them to act as an “attractor”. Ultimately, the conclusion rests on rejecting this “attractor” / "Movement Reinforcement" model because it incorrectly predicts that asymptote will be maintained in Experiment 3 when the target becomes larger. The Adaptation Modulation model is the only model left and so must be correct.

Of course the “reinforcement” model represents only one very specific alternative proposal about how reward might influence learning under the clamp (other than by modulating SPE-driven recalibration). Rejecting this one particular model is not grounds to reject EVERY possible architecture in which reward does not act by modulating SPE-driven adaptation. For this logic to work, it would have to be the case that ANY model in which reward does not modulate SPE-driven learning would necessarily predict a consistent asymptote after the straddle->hit transition.

We agree with the reviewer: we were guilty in the original submission of advocating for the Adaptation Modulation model based on the shortcomings of the Movement Reinforcement model. As noted above, we now introduce the Dual Error model as another account for how task outcome information might influence implicit learning (and this model ends up providing the best account of the results). While we recognize that these three models are not exhaustive, we do see them as representative of qualitatively different conceptualizations (e.g., model-free vs model-based; single vs multiple state estimates; independent vs interactive processes) and expect the expanded formalizations will help make this point. For example, the formalizations of the Movement Reinforcement model should make clear why any model entailing independent operation of model-based adaptation (insensitive to task outcome) and model-free reinforcement learning processes will not be able to account for the Miss (Straddle) -> Hit conditions.

It is not too difficult to formulate a model in which reward does NOT modulate SPE-driven learning, but which doesn't predict a consistent asymptote after the straddle->hit transition. Suppose behavior in the clamp is the sum of two distinct processes acting in parallel, one feeding on SPE, and one feeding on task error, both with SSM-like learning curves. The latter process might correspond to a kind of implicit re-aiming and might plausibly only be updated when a movement fails to earn reward. Then disengaging this system by suddenly providing reward (at the transition from straddle to hit) might lead to a gradual decline in that component of compensation. This would lead to behavior quite similar to what the authors observe. I have appended MATLAB code at the end of this review which implements a version of this model.

We thank the reviewer for suggesting this model and providing the code to reinforce the idea. In the revision, we present this as the Dual Error model, with SPE and TE being used to update two independent state estimates. We address the interpretation of this process, and in particular, whether it constitutes a form of implicit aiming in the Discussion. We also outline some future experiments for further evaluating this model.
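
As a rough illustration of the parallel-process idea described above, the following Python sketch sums two state-space processes, one updated by the constant clamped SPE on every trial and one updated by a binary TE only on misses. It is not the reviewer's MATLAB code (which is not reproduced here) and not the fitted model from the paper; the function name, the retention and learning-rate values, and the trial counts are all assumptions chosen only to make the qualitative behavior visible.

```python
# Illustrative Dual Error sketch. All parameter values are assumptions,
# not estimates from the paper.

def simulate_dual_error(outcomes, a_spe=0.98, u_spe=0.4, a_te=0.98, u_te=0.3):
    """Return the hand angle on each trial for a list of 'hit'/'miss' outcomes.

    The clamp keeps the SPE constant, so the SPE-driven state receives the
    same update on every trial. The TE-driven state is updated only on
    misses; on hits it simply decays with its retention factor.
    """
    x_spe, x_te = 0.0, 0.0
    hand = []
    for outcome in outcomes:
        hand.append(x_spe + x_te)          # behavior is the sum of both states
        te = 1.0 if outcome == "miss" else 0.0
        x_spe = a_spe * x_spe + u_spe      # SPE-driven process, outcome-blind
        x_te = a_te * x_te + u_te * te     # TE-driven process, misses only
    return hand


n = 400
# Straddle trials are treated as misses, followed by a Hit phase.
straddle_to_hit = ["miss"] * n + ["hit"] * n
hand = simulate_dual_error(straddle_to_hit)

print("end of Straddle phase:", round(hand[n - 1], 2))
print("end of Hit phase:     ", round(hand[-1], 2))
```

Under these assumed values, the hand angle at the end of the Hit phase settles below its level at the end of the Straddle phase, because the TE-driven state decays once misses stop while the SPE-driven state is unaffected.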

Finally, I would add that I'm not even convinced that it is reasonable to assume that the "Movement Reinforcement" “attractor” model, in which the authors suggest that reinforcing movement attenuates decay, would necessarily lead to a persistent asymptote. What if the reinforcement only partially attenuates decay, or requires repetition to do so (during which time it may decay slightly)?

We have revised our presentation of the Movement Reinforcement model, and the formalization makes clear that the independent operation of model-based adaptation and model-free reinforcement learning processes will predict a persistent asymptote in the Straddle-to-Hit condition. We recognize that a key assumption here is that the parameters in the model-based process remain the same in the Hit and Straddle conditions (which seems justified in a model in which the two processes are assumed to be independent). As such, the only change that comes about at transfer is that which arises from the model-free reinforcement learning process. In the Straddle-to-Hit condition, adaptation will stay at asymptote since the SPE does not change, and reinforcement will only reward movements about the asymptotic hand angle. There is no process that would result in a decay of hand angle. The rate at which the relative weighting between adaptation and reinforcement changes will not affect the hand angle in this condition (but will in the Hit-to-Straddle condition).
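
The adaptation half of this argument can be sketched in a few lines of Python. The sketch below covers only a single SPE-driven state-space process under a constant clamped error; the reinforcement process is deliberately left out, since its formalization is given in the manuscript rather than here. The parameter values and the function name are assumptions for illustration.

```python
# Sketch of the SPE-driven adaptation component only.
# Retention (a) and update (u) values are assumptions.

def adapt(n_trials, a=0.98, u=0.4, x0=0.0):
    """Single-state update x <- a*x + u under a constant clamped error."""
    x = x0
    trace = []
    for _ in range(n_trials):
        trace.append(x)
        x = a * x + u
    return trace


a, u = 0.98, 0.4
asymptote = u / (1 - a)                  # fixed point of x = a*x + u

acquisition = adapt(400, a, u)
# Transfer phase with the same a and u: continue from the state that
# follows the last recorded acquisition trial.
transfer = adapt(400, a, u, x0=a * acquisition[-1] + u)

drift = max(transfer) - min(transfer)    # how far hand angle moves at transfer
print("asymptote:", asymptote)
print("drift during transfer:", drift)
```

Because neither the error nor the parameters change at transfer, the state stays at its fixed point; any decay in hand angle would have to come from a change in parameters or from a separate process.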

For these reasons, I don't believe that the primary conclusion is tenable.

Putting aside the question of adaptation modulation, I do like the experiments in this paper and I think phenomena that have been shown are interesting. I think the results convincingly establish the existence of an implicit learning process that is sensitive to reward, and the paradigm provides a platform for exploring the properties of this process. I would suggest the authors try to develop the manuscript along these lines, though it is not particularly satisfying if the results cannot distinguish between modulation of SPE-driven learning by reward and a parallel, reward-driven process as I feel is the case at present.

As noted in our response to the previous comment, we do think the results can rule out the Movement Reinforcement model. But we do not think they allow us to distinguish between the Adaptation Modulation and Dual Error models. We note that the Dual Error model outperforms the Adaptation Modulation model, but we give consideration to both in the Discussion and describe future work that can provide stronger comparisons between the two models. We do think there is considerable value in the paper, providing new insight into the impact of task outcome on implicit learning in sensorimotor adaptation tasks.

As a constructive suggestion, it may be enlightening to examine behavior in the 0deg clamp and no-feedback conditions more closely. The authors draw no attention to it in the paper (nor really explain why these blocks were included) but the data seem to suggest a difference in rate of decay between the “hit” and “miss” conditions in Experiments 1 and 2. Different theories about the nature of the implicit reward-based learning would predict quite different behavior in these conditions, but it's impossible to judge from such short periods whether the decay curves might converge, stay separate, or even cross over. I wonder if an additional experiment exploring this phenomenology in more detail, with a decay block that is longer than 5-10 cycles, might be enlightening, though I admit I'm unsure if this would necessarily help to disambiguate the SPE-modulation model from the alternative “parallel” model. Perhaps comparing a NoFB condition to a 0-deg clamp condition might be useful. A more concrete (and exhaustive) set of theories/models would help to formulate clearer predictions about what to expect here.

We have added analyses of the no-feedback condition (Experiment 1) and 0 deg clamp (Experiment 2) to look at the retention issue (and supplement the modeling analyses). Please see our response to Summary Comment 4. An experiment in which we extend the washout period seems like a way to test retention directly, but is problematic when the level of learning differs between groups prior to the washout period (see comments above on absolute vs relative changes). Moreover, even with a long washout period, hand angle does not return to zero (Brennan and Smith, MLMC, 2015), which would make it difficult to compare forgetting rates with a one-parameter model (i.e., the A term in the state space model). This latter problem is not as acute in the initial washout cycles, although the absolute vs relative issue still holds.

Minor Comments:

Results third paragraph: “angle” or “deviation” may be a better word here than “offset” which could easily be confused with switching off the clamp.

We have made the recommended changes.

The "models" are never formulated in concrete terms. I would be fine with a verbal argument as to why the Movement Reinforcement theory should predict a consistent asymptote in Experiment 3, but it feels a stretch to call them “models” when they are never concretely fleshed out. Fleshing out the models mathematically may also help to bring to the surface some of the tacit assumptions that have led to the headline conclusion of the paper.

The revision includes formalizations of all three models, including the Movement Reinforcement model. We do believe this sharpens the presentation of the ideas, and allows us to objectively compare the models.

The "model" that generates the "model prediction" in Figure 5f is rather ad-hoc. It’s odd to suggest that the parameters of the underlying process would instantly flip. Also why are the predictions discontinuous across the dashed “transfer” line?

This comment, coupled with our overhaul of the modeling section, led us to take a different tack in modeling the data. In the initial submission, we fit the first half of the data and used the parameter estimates to predict transfer performance, evaluating this against the group-averaged data from participants in the other group. We now simultaneously fit the data from both halves of the experiment for both groups (again using group-averaged data). This was essential to get stable parameter estimates to generate the best-fitting function for the Movement Reinforcement model since, with the Straddle-to-Hit group, any values for the reinforcement parameters during the Hit phase will predict a persistent asymptote.

Results subsection “Experiment 3”: "the predictions of the model slightly underestimated the observed rates of change for both groups". Do you mean OVERestimated?

The reviewer is correct—we mean the model overestimated the observed rate of change (we were thinking about the behavior in the transfer phase falling short of the predicted rate of change). However, given our new strategy to use the acquisition and transfer data in fitting the models, this point is now moot.

[Editors' note: further revisions were suggested, as described below.]

All three reviewers were fully satisfied with the amendments made to the manuscript, and agreed the manuscript was much improved. Reviewers #2 and 3 provided some valuable follow-up suggestions, which I expect you would be happy to consider. I'm pasting these below for your benefit.

Please incorporate your changes in a final version of your manuscript (no need for a detailed rebuttal).

We are very pleased that the reviewers were satisfied with the revised manuscript. Their input has certainly strengthened the paper. We also appreciate the follow-up suggestions and have addressed them in this new revision. Please see our notes below for how we have addressed each comment.

Reviewer #1:

I find the revised manuscript much improved. The analysis and discussion are more comprehensive and provide a more nuanced, and as such more transparent and balanced, account of the very interesting behavioural effects observed in the study. I'm also satisfied with the responses provided to the reviewers' comments. On a personal level, I found the manuscript easier to read (though this might be due to a repetition effect!). I have no further comments.

Thank you! We do hope the improved clarity of the manuscript is not a repetition effect, but reflects the changes made over the course of revision.

Reviewer #2:

The authors have done a great job at addressing the issues highlighted by all reviewers, and the manuscript is much improved. Although the overall story is less eye-catching than “reward modulates SPE learning”, I believe it is now a far more informative piece of work which is still novel. I have a few fairly small outstanding issues:

Collinearity of parameters: Could the authors test the collinearity of the parameters in each model? In other words, is the data rich enough to have 4/6 independent parameters (I know this is an issue when using the 2-state Smith model with certain paradigms)? For example, when you use the bootstrapping procedure how correlated are the parameters? A heat map maybe for each model would be a good way to show this. This will ensure the results from paragraph three of subsection “Experiment 3 – Modeling Results” are valid. Also parameter recovery (Palminteri et al., Plos CB, 2017) might be of interest.

We appreciate the reviewer’s comments. We now include a supplemental figure (Figure 6—figure supplement 1) with heat maps of the parameters. As can be seen in this new figure, there is a trade-off between the ‘A’ and ‘U’ terms, something that is frequently observed with these models when the perturbation is relatively simple. We address this issue in the third paragraph. Note that the tests of collinearity are only performed for the models in which we used a bootstrapping approach to estimate the parameter values.
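
To illustrate the kind of trade-off between the 'A' and 'U' terms mentioned above, here is a minimal Python sketch on synthetic data. It is not the fitting procedure used in the paper: the data are simulated, the single-state model x[n+1] = A*x[n] + U is a simplification of the four- and six-parameter models, and an ordinary least-squares fit stands in for whatever optimizer was actually used. All names and values are assumptions.

```python
import random

def fit_a_u(curve):
    """Least-squares fit of x[n+1] = A*x[n] + U to one learning curve."""
    xs, ys = curve[:-1], curve[1:]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a_hat = sxy / sxx
    return a_hat, my - a_hat * mx

def pearson(p, q):
    """Pearson correlation between two equal-length lists."""
    mp, mq = sum(p) / len(p), sum(q) / len(q)
    cov = sum((i - mp) * (j - mq) for i, j in zip(p, q))
    vp = sum((i - mp) ** 2 for i in p)
    vq = sum((j - mq) ** 2 for j in q)
    return cov / (vp * vq) ** 0.5

rng = random.Random(0)
n_subj, n_trials, true_a, true_u = 20, 80, 0.9, 2.0   # assumed values

# Synthetic participants: shared dynamics plus independent measurement noise.
subjects = []
for _ in range(n_subj):
    x, curve = 0.0, []
    for _ in range(n_trials):
        curve.append(x + rng.gauss(0.0, 2.0))
        x = true_a * x + true_u
    subjects.append(curve)

# Bootstrap: resample participants with replacement, average, refit.
boot_a, boot_u = [], []
for _ in range(500):
    sample = [rng.choice(subjects) for _ in range(n_subj)]
    mean_curve = [sum(vals) / n_subj for vals in zip(*sample)]
    a_hat, u_hat = fit_a_u(mean_curve)
    boot_a.append(a_hat)
    boot_u.append(u_hat)

r_au = pearson(boot_a, boot_u)   # one cell of a parameter-correlation heat map
print("correlation between bootstrapped A and U:", round(r_au, 2))
```

A strongly negative correlation here reflects the fact that, with a constant perturbation, the data mostly constrain the asymptote U/(1 - A) rather than A and U separately. Repeating the pairwise correlation for every parameter pair would give the full matrix behind a heat map.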

Control group figure: I would add Figure 5—figure supplement 1 to the main document.

We have made this change. The control group data are now shown in Figure 7 of the main text.

Discussion (paragraph nine of subsection “Modeling the Influence of Task Outcome on Implicit Changes in Performance”): Aren't the target jump predictions similar to what Leow et al., 2018, have already shown to be true?

The Leow study is certainly relevant to our predictions here. However, we propose to use the target jump method to manipulate TE in the absence of SPE (by using a 0° clamp). This is what leads to the differential predictions of the Dual Error and Adaptation Modulation models. We have revised the text to clarify this issue and how our proposed experiment differs from the Leow study.

Reviewer #3:

The authors have revised the paper very thoroughly. For the most part, I find that the paper hits the mark well in terms of articulating theories that can versus can't be disambiguated based on the current experiments. I do, however, have a few further suggestions to help clarify the paper further.

Use of the term “target error” is a bit ambiguous. At times it is characterized as a binary hit/miss signal (e.g. Introduction paragraph five; Discussion paragraph five). At other times it seems to refer to a vector error (i.e. having magnitude and direction) (e.g. Introduction paragraph six distinguishes it from “hitting the target”; penultimate paragraph of subsection “Experiment 2”). This is liable to cause considerable confusion. It confused me until I read the description of the dual adaptation model; in practical terms it is tantamount to a binary signal, given the constant error size and direction. But I think a little more conceptual clarity is needed, e.g. does TE already take reward into account? Or are TE and SPE equivalent (in this task) but learning from TE is subject to modulation by reward, while learning from SPE is not?

We now make clear early on that we operationalize TE as a binary signal. These changes are in the fifth paragraph and reinforced in the final paragraph of the Introduction section.

The term “model-based” is used throughout, sometimes (e.g. paragraph four of subsection “Modeling the Influence of Task Outcome on Implicit Changes in Performance”) in the sense of a model-based analysis of the effects of reward, i.e. examining how presence/absence of reward affects fitted parameters of a model (this usage is fine). At other times (e.g. Conclusions section) it's used essentially as a synonym for error-based. I'm a bit skeptical as to whether this latter usage adds much value or clarity. I appreciate it derives from Huang et al. and Haith and Krakauer, who made a case that implicit adaptation reflected updating an internal forward model. However, error-driven learning need not necessarily be model-based. For instance, if learning is driven by target error, it's not clear that this has anything to do with updating any kind of model of the environment (and "inverse model" doesn't really count as a proper model in this sense). So I would caution against using the term “model-based” when the idea doesn't inherently involve a forward model. Particularly given the two distinct usages of “model-based” in this paper.

We agree with the reviewer that the multiple senses of “model-based” can be confusing. In the revision, we now limit our use of the term model-based to places in which the term is used to refer to learning processes driven by SPE. We do not use this term when referring to learning processes driven/modulated by TE.

The Movement Reinforcement model is reasonable, although it is fairly ad hoc. The Discussion argues convincingly that this model is largely illustrative, and that its behavior can be taken as representative of a broad class of “operant reinforcement” models. I think articulating something to this effect earlier, when the model is first introduced, would be helpful. At the moment, it comes from nowhere and it's a little perplexing to follow exactly why this is the model and not, say, a more principled value-function-approximation approach. With that in mind, some of the finer details of the model might be better placed in the Materials and methods section so as not to bog down the results with specifics that aren't all that pertinent to the overall argument.

Based on the reviewer’s suggestions, the discussion of the MR model has been revised. We have simplified our motivation for the model in the section where we introduce the different models, and as the reviewer has suggested, we make clear early on that this model’s behavior is representative of a broad class of operant reinforcement models. We have also moved some of the details into the Materials and methods section.

Minor Comments:

I appreciate the discussion of implicit TE-driven learning in motivating the Dual Error model (subsection “Theoretical analysis of the effect of task outcome on implicit learning”). But I was surprised the authors didn't mention this again in the discussion, instead only speculating that TE-based learning might be re-aiming that has become implicit through automatization/caching, and consequently making the dual error model seem implausible. But it seems perfectly plausible that TE-based learning is just another implicit, error-based learning system, separate from SPE-driven implicit learning, that never has anything to do with re-aiming.

We have modified the Discussion, adding a line to make clear that it remains possible that there is a TE-sensitive implicit learning process, distinct from SPE-driven adaptation. This point is also reinforced in our discussion of future experiments that might distinguish between the Dual Error and Adaptation Modulation models.

Subsection “Modeling the Influence of Task Outcome on Implicit Changes in Performance”: it doesn't seem necessary to invoke “SPE-driven” here. Could in principle be error-based learning driven by something like "target error" (i.e. just the distance between the center of the cursor and the center of the target). Ditto in the Conclusion section.

We have modified the text in both places in recognition of this point.

Introduction section: "We recently introduced a new method.… designed to isolate learning from implicit adaptation" slightly ambiguous sentence, I first read it as though learning and implicit adaptation are separate things being dissociated. Maybe just drop "learning from"?

We have made this change.

Introduction section: "Given that participants have no control over the feedback cursor, the effect of this task outcome would presumably operate in an implicit, automatic manner." It's not having no control that makes it implicit… Might be better rephrased to something like "Given that participants are aware that they have no control over the feedback cursor…"?

We have made this change.

Second paragraph of subsection “Theoretical analysis of the effect of task outcome on implicit learning”: this paragraph misses a key detail, that “reinforcing” or “strengthening the representation of” rewarded actions really means that it makes those actions more likely to be selected in the future.

We have now added: “…and increase the likelihood that future movements will be biased in a similar direction.”

Third paragraph of subsection “Theoretical analysis of the effect of task outcome on implicit learning”: “composite” is somewhat vague. Would “sum” or “average” be accurate?

As reviewer 3 has suggested, we have changed the wording to “sum”.

Third paragraph of subsection “Experiment 3 – Modeling Results”: something is up with the brackets here.

We have now fixed things.

https://doi.org/10.7554/eLife.39882.022

Article and author information

Author details

  1. Hyosub E Kim

    1. Department of Psychology, University of California, Berkeley, Berkeley, United States
    2. Helen Wills Neuroscience Institute, University of California, Berkeley, Berkeley, United States
    3. Department of Physical Therapy, University of Delaware, Newark, United States
    4. Department of Psychological and Brain Sciences, University of Delaware, Newark, United States
    Contribution
    Conceptualization, Data curation, Software, Formal analysis, Visualization, Writing—original draft, Writing—review and editing
    For correspondence
    hyosub@udel.edu
    Competing interests
    No competing interests declared
    ORCID iD: 0000-0003-0109-593X
  2. Darius E Parvin

    1. Department of Psychology, University of California, Berkeley, Berkeley, United States
    2. Helen Wills Neuroscience Institute, University of California, Berkeley, Berkeley, United States
    Contribution
    Conceptualization, Formal analysis, Writing—review and editing
    Competing interests
    No competing interests declared
    ORCID iD: 0000-0001-5278-2970
  3. Richard B Ivry

    1. Department of Psychology, University of California, Berkeley, Berkeley, United States
    2. Helen Wills Neuroscience Institute, University of California, Berkeley, Berkeley, United States
    Contribution
    Conceptualization, Resources, Formal analysis, Supervision, Funding acquisition, Writing—review and editing
    Competing interests
    Senior editor, eLife
    ORCID iD: 0000-0003-4728-5130

Funding

National Institutes of Health (NS092079)

  • Richard B Ivry

National Institutes of Health (NS105839)

  • Richard B Ivry

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

We thank Matthew Hernandez and Wendy Shwe for assistance with data collection. We are also grateful to Maurice Smith, Ryan Morehead, Guy Avraham, and Ian Greenhouse for helpful discussions regarding this work.

Ethics

Human subjects: All participants provided written informed consent to participate in the study and to allow publication of their data, and received financial compensation for their participation. The Institutional Review Board at UC Berkeley approved all experimental procedures under ID number 2016-02-8439.

Senior Editor

  1. Timothy E Behrens, University of Oxford, United Kingdom

Reviewing Editor

  1. Tamar R Makin, University College London, United Kingdom

Reviewer

  1. Tamar R Makin, University College London, United Kingdom

Publication history

  1. Received: July 6, 2018
  2. Accepted: April 5, 2019
  3. Version of Record published: April 29, 2019 (version 1)

Copyright

© 2019, Kim et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.


