Abstract
Humans and animals can integrate sensory evidence from various sources to make decisions in a statistically near-optimal manner, provided that the stimulus presentation time is fixed across trials. Little is known about whether optimality is preserved when subjects can choose when to make a decision (reaction-time task), nor when sensory inputs have time-varying reliability. Using a reaction-time version of a visual/vestibular heading discrimination task, we show that behavior is clearly suboptimal when quantified with traditional optimality metrics that ignore reaction times. We created a computational model that accumulates evidence optimally across both cues and time, and trades off accuracy with decision speed. This model quantitatively explains subjects' choices and reaction times, supporting the hypothesis that subjects do, in fact, accumulate evidence optimally over time and across sensory modalities, even when the reaction time is under the subject's control.
https://doi.org/10.7554/eLife.03005.001
eLife digest
Imagine trying out a new rollercoaster ride and doing your best to figure out if you are being hurled to the left or to the right. You might think that this task would be easier if your eyes were open because you could rely on information from your eyes and also from the vestibular system in your ears. This is also what cue combination theory says—our ability to discriminate between two potential outcomes is enhanced when we can draw on more than one of the senses.
However, previous tests of cue combination theory have been limited in that test subjects have been asked to respond after receiving information for a fixed period of time whereas, in real life, we tend to make a decision as soon as we have gathered sufficient information. Now, using data collected from seven human subjects in a simulator, Drugowitsch et al. have confirmed that test subjects do indeed give more correct answers in more realistic conditions when they have two sources of information to rely on, rather than only one.
What makes this result surprising? Traditional cue combination theories do not consider that slower decisions allow us to process more information and therefore tend to be more accurate. Drugowitsch et al. show that this shortcoming causes such theories to conclude that multiple information sources might lead to worse decisions. For example, some of their test subjects made less accurate choices when they were presented with both visual and vestibular information, compared to when only visual information was available, because they made these choices very rapidly.
By developing a theory that takes into account both reaction times and choice accuracy, Drugowitsch et al. were able to show that, despite different tradeoffs between speed and accuracy, test subjects still combined the information from their eyes and ears in a way that was close to ideal. As such the work offers a more thorough account of human decision making.
https://doi.org/10.7554/eLife.03005.002
Introduction
Effective decision making in an uncertain, rapidly changing environment requires optimal use of all information available to the decision-maker. Numerous previous studies have examined how integrating multiple sensory cues, either within or across sensory modalities, alters perceptual sensitivity (van Beers et al., 1996; Ernst and Banks, 2002; Battaglia et al., 2003; Fetsch et al., 2009). These studies generally reveal that subjects' ability to discriminate among stimuli improves when multiple sensory cues are available, such as visual and tactile (van Beers et al., 1996; Ernst and Banks, 2002), visual and auditory (Battaglia et al., 2003), or visual and vestibular (Fetsch et al., 2009) cues. The performance gains associated with cue integration are generally well predicted by models that combine information across senses in a statistically optimal manner (Clark and Yuille, 1990). Specifically, we consider cue integration to be optimal if the information in the combined, multisensory condition is the sum of that available from the separate cues (see Supplementary file 1 for formal statement) (Clark and Yuille, 1990).
Previous studies and models share a common fundamental limitation: they only consider situations in which the stimulus duration is fixed and subjects are required to withhold their response until the stimulus epoch expires. In natural settings, by contrast, subjects usually choose for themselves when they have gathered enough information to make a decision. In such contexts, it is possible that subjects integrate multiple cues to gain speed or to increase their proportion of correct responses (or some combination of effects), and it is unknown whether standard criteria for optimal cue integration apply. Indeed, using a reaction-time version of a multimodal heading discrimination task, we demonstrate here that human performance is markedly suboptimal when evaluated with standard criteria that ignore reaction times. Thus, the conventional framework for optimal cue integration is not applicable to behaviors in which decision times are under subjects' control.
On the other hand, there is a large body of empirical studies that has focused on how multisensory integration affects reaction times, but these studies have generally ignored effects on perceptual sensitivity (Colonius and Arndt, 2001; Otto and Mamassian, 2012). Some of these studies have reported that reaction times for multisensory stimuli are faster than predicted by ‘parallel race’ models (Raab, 1962; Miller, 1982), suggesting that multisensory inputs are combined into a common representation. However, other groups have failed to replicate these findings (Corneil et al., 2002; Whitchurch and Takahashi, 2006) and it is unclear whether the sensory inputs are combined optimally. Thus, multisensory integration in reaction time experiments remains poorly understood, and there is no coherent framework for evaluating optimal decision making that incorporates both perceptual sensitivity and reaction times. We address this substantial gap in knowledge both theoretically and experimentally.
For tasks based on information from a single sensory modality, diffusion models (DMs) have proven to be very effective at characterizing both the speed and accuracy of perceptual decisions, as well as speed/accuracy tradeoffs (Ratcliff, 1978; Ratcliff and Smith, 2004; Palmer et al., 2005) (where accuracy is used in the sense of percentage of correct responses). Here, we develop a novel form of DM that not only integrates evidence optimally over time but also across different sensory cues, providing an optimal decision model for multisensory integration in a reaction-time context. The model is capable of combining cues optimally even when the reliability of each sensory input varies as a function of time. We show that this model reproduces human subjects' behavior very well, thus demonstrating that subjects near-optimally combine momentary evidence across sensory modalities. The model also predicts the counterintuitive finding that discrimination thresholds are often increased during cue combination, and demonstrates that this departure from standard cue-integration theory is due to a speed-accuracy tradeoff.
Overall, our findings provide a framework for extending cue-integration research to more natural contexts in which decision times are unconstrained and sensory cues vary substantially over time.
Results
We collected behavioral data from seven human subjects, A–G, performing a reaction-time version of a heading discrimination task (Gu et al., 2007, 2008, 2010; Fetsch et al., 2009) based on optic flow alone (visual condition), inertial motion alone (vestibular condition), or a combination of both cues (combined condition, Figure 1A). In each stimulus condition, the subjects experienced forward translation with a small leftward or rightward deviation, and their task was to report whether they moved leftward or rightward relative to (an internal standard of) straight ahead (Figure 1B). In the combined condition, visual and vestibular cues were always spatially congruent, and followed temporally synchronized Gaussian velocity profiles (Figure 1C). Reliability of the visual cue was varied randomly across trials by changing the motion coherence of the optic flow stimulus (three coherence levels). For subjects B, D, and F, an additional experiment with six coherence levels was performed (denoted as B2, D2, F2). In contrast to previous tasks conducted with the same apparatus (Fetsch et al., 2009; Gu et al., 2010), subjects did not have to wait until the end of the stimulus presentation, but were allowed to respond at any time throughout the trial, which lasted up to 2 s.
For all conditions and all subjects, heading discrimination performance improved with an increase in heading direction away from straight ahead and with increased visual motion coherence. Let h denote the heading angle relative to straight ahead (h > 0 for right, h < 0 for left), and $\left|h\right|$ its magnitude. Larger values of $\left|h\right|$ simplified the discrimination task, as reflected by a larger fraction of correct choices (Figure 2A for subject D2, Figure 3—figure supplement 1 for other subjects). To quantify discrimination performance, we fitted a cumulative Gaussian function to the psychometric curve for each stimulus condition and coherence. A lower discrimination threshold, given by the standard deviation of the fitted Gaussian, indicates a steeper psychometric curve and thus better performance. For both the visual and combined conditions, discrimination thresholds consistently decreased with an increase in motion coherence (Figure 2B for subject D2, Figure 2—figure supplement 1 for other subjects), indicating that increasing coherence improves heading discrimination.
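The threshold estimate described above can be sketched as follows. This is a minimal illustration, assuming a binomial likelihood and a coarse grid search; the function names, the grid ranges, and the choice counts are hypothetical and are not taken from the study.

```python
import math

def p_rightward(h_deg, bias_deg, sigma_deg):
    """Cumulative Gaussian: probability of a rightward choice at heading h."""
    z = (h_deg - bias_deg) / (sigma_deg * math.sqrt(2.0))
    return 0.5 * (1.0 + math.erf(z))

def fit_threshold(headings, n_right, n_total, sigmas, biases):
    """Grid-search maximum-likelihood fit; returns (bias, sigma)."""
    best, best_ll = None, -math.inf
    for mu in biases:
        for s in sigmas:
            ll = 0.0
            for h, r, n in zip(headings, n_right, n_total):
                # Clip to avoid log(0) at extreme headings.
                p = min(max(p_rightward(h, mu, s), 1e-9), 1.0 - 1e-9)
                ll += r * math.log(p) + (n - r) * math.log(1.0 - p)
            if ll > best_ll:
                best, best_ll = (mu, s), ll
    return best

# Made-up counts for illustration only (generated from sigma = 4 deg, no bias).
headings = [-16, -8, -4, -2, 2, 4, 8, 16]
n_total = [100] * len(headings)
n_right = [round(100 * p_rightward(h, 0.0, 4.0)) for h in headings]

sigmas = [0.5 * i for i in range(1, 41)]   # candidate thresholds, 0.5 to 20 deg
biases = [0.5 * i for i in range(-8, 9)]   # candidate biases, -4 to 4 deg
bias_hat, sigma_hat = fit_threshold(headings, n_right, n_total, sigmas, biases)
```

The fitted `sigma_hat` plays the role of the discrimination threshold: a smaller value corresponds to a steeper psychometric curve.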
Suboptimal cue combination?
Traditional cue combination models predict that the discrimination threshold in the combined condition should be smaller than that of either unimodal condition (Clark and Yuille, 1990). With a fixed stimulus duration, this prediction has been shown to hold for visual/vestibular heading discrimination in both human and animal subjects (Fetsch et al., 2009, 2011), consistent with optimal cue combination. In contrast, the discrimination thresholds of subjects in our reaction-time task appear to be substantially suboptimal. For the example subject of Figure 2A, psychometric functions in the combined condition lie between the visual and vestibular functions. Correspondingly, discrimination thresholds for the combined condition are intermediate between visual and vestibular thresholds for this subject, and for high coherences, are substantially greater than the optimal predictions (Figure 2B).
This pattern of results was consistent across subjects (Figure 2C, Figure 2—figure supplement 1). In no case did subjects feature a significantly lower discrimination threshold in the combined condition than the better of the two unimodal conditions (p>0.57, one-tailed, Supplementary file 2A). For the largest visual motion coherence (70%), all subjects except one showed thresholds in the combined condition that were significantly greater than visual thresholds and significantly greater than optimal predictions of a conventional cue-integration scheme (p<0.05, Supplementary file 2A). These data lie in stark contrast to previous reports using fixed-duration stimuli (Fetsch et al., 2009, 2011) in which combined thresholds were generally found to improve compared to the unimodal conditions, as expected by standard optimal multisensory integration models. To summarize this contrast, we compare the ratio of observed to predicted thresholds in the combined condition for our subjects to human and monkey subjects performing a similar task in a fixed-duration setting (Fetsch et al., 2009). We found this ratio to be significantly greater for our subjects (Figure 2C; two-sample t test, t(77) = 3.245, p=0.0017). This indicates that, with respect to predictions of standard multisensory integration models, our subjects performed significantly worse than those engaged in a similar fixed-duration task.
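For reference, the conventional prediction can be written down directly. The sketch below assumes the standard fixed-duration rule in which the information from the two cues (inverse squared threshold) adds; the function name is hypothetical.

```python
import math

def predicted_combined_threshold(sigma_vis, sigma_vest):
    """Conventional optimal prediction: 1/sigma_comb^2 = 1/sigma_vis^2 + 1/sigma_vest^2."""
    return math.sqrt((sigma_vis**2 * sigma_vest**2) / (sigma_vis**2 + sigma_vest**2))

# Illustrative thresholds in degrees (not measured values).
sigma_pred = predicted_combined_threshold(3.0, 4.0)
```

Under this rule the predicted combined threshold is never larger than the smaller unimodal threshold, which is the benchmark against which the observed combined thresholds appear suboptimal.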
A different picture emerges if we take not only discrimination thresholds but also reaction times into account. Short reaction times imply that subjects gather less information to make a decision, yielding greater discrimination thresholds. Longer reaction times may decrease thresholds, but at the cost of time. Consequently, if subjects decide more rapidly in the combined condition than the visual condition, they might feature higher discrimination thresholds in the combined condition even if they make optimal use of all available information. Thus, to assess if subjects perform optimal cue combination, we need to account for the timing of their decisions.
Average reaction times depended on stimulus condition, motion coherence, and heading direction. In general, reaction times were faster for larger heading magnitudes, and reaction times in the vestibular condition were faster than those in the visual condition (Figure 3 for subject D2, Figure 3—figure supplement 1 for other subjects). In the combined condition, however, reaction times were much shorter than those seen for the visual condition and were comparable to those of the vestibular condition (Figure 3). Thus, subjects spent substantially more time integrating evidence in the visual condition, which boosted their discrimination performance when compared to the combined condition. Note also that discrimination thresholds in the combined condition were substantially smaller than vestibular thresholds, especially at 70% coherence (Figures 2 and 3). Thus, adding optic flow to a vestibular stimulus decreased the discrimination threshold with essentially no loss of speed. A similar overall pattern of results was observed for the other subjects (Figure 3—figure supplement 1). These data provide clear evidence that subjects made use of both visual and vestibular information to perform the reaction-time task, but the benefits of cue integration could not be appreciated by considering discrimination thresholds alone.
Modeling cue combination with a novel diffusion model
To investigate whether subjects accumulate evidence optimally across both time and sensory modalities, we built a model that integrates visual and vestibular cues optimally to perform the heading discrimination task, and we compare predictions of the model to data from our human subjects. The model builds upon the structure of diffusion models (DMs), which have previously been shown to account nicely for the tradeoff between speed and accuracy of decisions (Ratcliff, 1978; Ratcliff and Smith, 2004; Palmer et al., 2005). Additionally, DMs are known to optimally integrate evidence over time (Laming, 1968; Bogacz et al., 2006), given that the reliability of the evidence is time-invariant (such that, at any point in time from stimulus onset, the stimulus provides the same amount of information about the task variable). However, DMs have neither been used to integrate evidence from several sources, nor to handle evidence whose reliability changes over time, both of which are required for our purposes.
In the context of heading discrimination, a standard DM would operate as follows (Figure 4A): consider a diffusing particle with dynamics given by $\dot{x}=k\mathrm{sin}\left(h\right)+\eta \left(t\right)$, where h is the heading direction, k is a positive constant relating particle drift to heading direction, and η(t) is unit variance Gaussian white noise. The particle starts at x(0) = 0, drifts with an average slope given by k sin(h), and diffuses until it hits either the upper bound θ or the lower bound −θ, corresponding to rightward and leftward choices, respectively. The decision time is determined by when the particle hits a bound. Larger values of $\left|h\right|$ lead to shorter decision times and more correct decisions because the drift rate is greater. Lower bound levels, $\left|\theta \right|$, also lead to shorter decision times but more incorrect decisions. Errors (hitting bound θ when h < 0, or hitting bound −θ when h > 0) can occur due to the stochasticity of particle motion, which is meant to capture the variability of the momentary sensory evidence. The Fisher information in x(t) regarding h, a measure of how much information x(t) provides for discriminating heading (Papoulis, 1991), is I_{x}(sin(h)) = k^{2} per second, showing that k is a measure of the subject's sensitivity to changes in heading direction. This sensitivity depends on the subject's effectiveness in estimating heading from the cue, which in turn is influenced by the reliability of the cue itself (e.g., coherence).
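The qualitative behavior of this standard DM can be checked with its well-known closed-form solutions. The sketch below assumes constant drift, unit diffusion variance, and symmetric bounds at ±θ, for which the probability of hitting the upper bound is 1/(1 + exp(−2μθ)) and the mean decision time is (θ/μ) tanh(μθ), with μ = k sin(h). The function name and parameter values are hypothetical.

```python
import math

def dm_choice_and_time(h_deg, k, theta):
    """Return (P(rightward choice), mean decision time) for a constant-drift DM."""
    mu = k * math.sin(math.radians(h_deg))          # drift rate k*sin(h)
    p_right = 1.0 / (1.0 + math.exp(-2.0 * mu * theta))
    if abs(mu) < 1e-12:
        mean_dt = theta ** 2                        # limit of (theta/mu)*tanh(mu*theta) as mu -> 0
    else:
        mean_dt = (theta / mu) * math.tanh(mu * theta)
    return p_right, mean_dt

# Illustrative values: sensitivity k = 10, bound theta = 1.
p_small, dt_small = dm_choice_and_time(2.0, k=10.0, theta=1.0)
p_large, dt_large = dm_choice_and_time(8.0, k=10.0, theta=1.0)
p_lowbound, dt_lowbound = dm_choice_and_time(8.0, k=10.0, theta=0.5)
```

Larger headings give faster and more accurate decisions, whereas lowering the bound gives faster but less accurate ones, which is the speed-accuracy tradeoff described in the text.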
Now consider both a visual (vis) and a vestibular (vest) source of evidence regarding h, ${\dot{x}}_{vis}={k}_{vis}\left(c\right)\mathrm{sin}\left(h\right)+{\eta}_{vis}\left(t\right)$ and ${\dot{x}}_{vest}={k}_{vest}\mathrm{sin}\left(h\right)+{\eta}_{vest}\left(t\right)$, where k_{vis}(c) indicates that the sensitivity to the cue in the visual modality depends on motion coherence, c. Combining these two sources of evidence by a simple sum, ${\dot{x}}_{vis}+{\dot{x}}_{vest}$, would amount to adding noise to ${\dot{x}}_{vest}$ for low coherences (k_{vis}(c) ≈ 0), which is clearly suboptimal. Rather, it can be shown that the two particle trajectories are combined optimally by weighting their rates of change in proportion to their relative sensitivities (see Supplementary file 1 for derivation):
$${\dot{x}}_{comb}=\sqrt{\frac{{k}_{vis}^{2}\left(c\right)}{{k}_{vis}^{2}\left(c\right)+{k}_{vest}^{2}}}\,{\dot{x}}_{vis}+\sqrt{\frac{{k}_{vest}^{2}}{{k}_{vis}^{2}\left(c\right)+{k}_{vest}^{2}}}\,{\dot{x}}_{vest}\qquad (1)$$
This allows us to model the combined condition by a single new DM, ${\dot{x}}_{comb}={k}_{comb}\left(c\right)\mathrm{sin}\left(h\right)+{\eta}_{comb}\left(t\right)$, which is optimal because it preserves all information contained in both x_{vis} and x_{vest} (Figure 4B; see ‘Materials and methods’ and Supplementary file 1 for a formal treatment). The sensitivity (drift rate coefficient) in the combined condition,
$${k}_{comb}\left(c\right)=\sqrt{{k}_{vis}^{2}\left(c\right)+{k}_{vest}^{2}},\qquad (2)$$
is a combination of the sensitivities of the unimodal conditions and is therefore always greater than the largest unimodal sensitivity.
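A small numerical check makes the difference between the simple sum and sensitivity-proportional weighting concrete. The sketch assumes independent unit-variance noise in the two unimodal DMs and the combined sensitivity $\sqrt{{k}_{vis}^{2}+{k}_{vest}^{2}}$; the function names and the example sensitivities are hypothetical.

```python
import math

def combined_sensitivity(k_vis, k_vest):
    """Sensitivity of the optimally combined DM."""
    return math.hypot(k_vis, k_vest)

def optimal_weights(k_vis, k_vest):
    """Weights proportional to the relative sensitivities (at least one k must be nonzero)."""
    k_comb = combined_sensitivity(k_vis, k_vest)
    return k_vis / k_comb, k_vest / k_comb

def effective_sensitivity(w_vis, w_vest, k_vis, k_vest):
    """Drift per unit sin(h), divided by the noise SD, of w_vis*dx_vis + w_vest*dx_vest."""
    drift = w_vis * k_vis + w_vest * k_vest
    noise_sd = math.hypot(w_vis, w_vest)   # independent unit-variance noise sources
    return drift / noise_sd

k_vis, k_vest = 3.0, 4.0
w_vis, w_vest = optimal_weights(k_vis, k_vest)
k_opt = effective_sensitivity(w_vis, w_vest, k_vis, k_vest)   # optimal weighting
k_sum = effective_sensitivity(1.0, 1.0, k_vis, k_vest)        # simple sum
k_sum_lowcoh = effective_sensitivity(1.0, 1.0, 0.0, k_vest)   # simple sum, k_vis ~ 0
```

With optimal weights the noise variance stays at one and the effective sensitivity equals the combined sensitivity. The simple sum falls short of it, and for an uninformative visual cue it even drops below the vestibular sensitivity alone.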
So far we have assumed that the reliability of each cue is time-invariant. However, as the motion velocity changes over time, so does the amount of information about h provided by each cue, and with it the subject's sensitivity to changes in h. For the vestibular and visual conditions, motion acceleration a(t) and motion velocity v(t), respectively, are assumed to be the physical quantities that modulate cue sensitivity (‘Materials and methods’ and ‘Discussion’). To account for these dynamics, the DMs are modified to ${\dot{x}}_{vest}=a\left(t\right){k}_{vest}\mathrm{sin}\left(h\right)+{\eta}_{vest}\left(t\right)$ and ${\dot{x}}_{vis}=v\left(t\right){k}_{vis}\left(c\right)\mathrm{sin}\left(h\right)+{\eta}_{vis}\left(t\right)$. Note that once the drift rate in a DM changes with time, it generally loses its property of integrating evidence optimally over time. For example, at the beginning of each trial when motion velocity is low, ${\dot{x}}_{vis}$ is dominated by noise and integrating ${\dot{x}}_{vis}$ is fruitless. Fortunately, weighting the momentary visual evidence, ${\dot{x}}_{vis}$, by the velocity profile recovers optimality of the DM (‘Materials and methods’). This temporal weighting causes the visual evidence to contribute more at high velocities, while the noise is down-weighted at low velocities. Similarly, vestibular evidence is weighted by the time course of acceleration. The new, weighted particle trajectories are described by the DMs ${\dot{X}}_{vis}=v\left(t\right){\dot{x}}_{vis}$ and ${\dot{X}}_{vest}=a\left(t\right){\dot{x}}_{vest}$. The two unimodal DMs are combined as before, resulting in the combined DM given by ${\dot{X}}_{comb}=d\left(t\right){\dot{x}}_{comb}$, where the sensitivity profile d(t) is a weighted combination of the unimodal sensitivity profiles,
$$d\left(t\right)=\sqrt{\frac{{k}_{vis}^{2}\left(c\right)}{{k}_{comb}^{2}\left(c\right)}{v}^{2}\left(t\right)+\frac{{k}_{vest}^{2}}{{k}_{comb}^{2}\left(c\right)}{a}^{2}\left(t\right)}\qquad (3)$$
(Figure 4B; see Supplementary file 1 for derivation). These modifications to the standard DM are sufficient to integrate evidence optimally across time and sensory modalities, even as the sensitivity to the evidence changes over time.
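The combined sensitivity profile can be illustrated numerically. The sketch below assumes that momentary information adds across cues, so that $d^{2}(t){k}_{comb}^{2}={v}^{2}(t){k}_{vis}^{2}+{a}^{2}(t){k}_{vest}^{2}$. The Gaussian profile parameters (`t_peak`, `width`) and the unnormalized units of v(t) and a(t) are assumptions made for illustration, not the stimulus values used in the experiment.

```python
import math

def velocity(t, t_peak=1.0, width=0.3):
    """Gaussian velocity profile with unit peak (hypothetical timing parameters)."""
    return math.exp(-0.5 * ((t - t_peak) / width) ** 2)

def acceleration(t, t_peak=1.0, width=0.3):
    """Time derivative of the Gaussian velocity profile."""
    return -(t - t_peak) / width ** 2 * velocity(t, t_peak, width)

def combined_profile(t, k_vis, k_vest):
    """d(t): sensitivity-weighted combination of the unimodal profiles v(t) and a(t)."""
    k_comb = math.hypot(k_vis, k_vest)
    v, a = velocity(t), acceleration(t)
    return math.sqrt((k_vis * v) ** 2 + (k_vest * a) ** 2) / k_comb

# At peak velocity the acceleration is zero, so only the visual cue contributes.
d_at_peak = combined_profile(1.0, k_vis=3.0, k_vest=4.0)
# With an uninformative visual cue, d(t) reduces to |a(t)|.
d_vest_only = combined_profile(1.3, k_vis=0.0, k_vest=4.0)
```

In this sketch the visual cue dominates d(t) around peak velocity, and the vestibular cue dominates on the rising and falling flanks where acceleration is largest.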
The model assumes that subjects know their cue sensitivities, k_{vis}(c) and k_{vest}, as well as the temporal sensitivity profiles, a(t) and v(t), of each stimulus. In this respect, our model provides an upper bound on performance, since subjects may not have perfect knowledge of these variables, especially since stimulus modalities and visual motion coherence values are randomized across trials (‘Discussion’).
Quantitative assessment of cue combination performance
We tested whether subjects combined evidence optimally across both time and cues by evaluating how well the model outlined above could explain the observed behavior. The bounds, θ, of the modified DM, and the sensitivity parameters (k_{vis}, k_{vest} and k_{comb}), were allowed to vary between the visual, vestibular, and combined conditions. Varying the bound was essential to capture the deviation of the discrimination threshold in the combined condition from that predicted by traditional cue combination models (Figure 2). Indeed, this discrimination threshold is inversely proportional to bound and sensitivity (see Supplementary file 1). Since the sensitivity in the bimodal condition is not a free parameter (it is determined by Equation 2), the height of the bound is the only parameter that could modulate the discrimination thresholds.
The noise terms η_{vis} and η_{vest} play crucial roles in the model, as they relate to the reliability of the momentary sensory evidence. To specify the manner in which such noise may depend on motion coherence, we relied on fundamental assumptions about how optic flow stimuli are represented by the brain. We assumed that heading is represented by a neural population code in which neurons have heading tuning curves that, within the range of heading tested in this experiment (±16°, Figure 5A), differ in their heading preferences but have similar shapes. This is broadly consistent with data from area MSTd (Fetsch et al., 2011), but the exact location of such a code is not important for our argument. For low coherence, motion energy in the stimulus is almost uniform for all heading directions, such that all neurons in the population fire at approximately the same rate (Figure 5A, dark blue curve). For high coherence, population neural activity is strongly peaked around the actual heading direction (Figure 5A, cyan curve) (Morgan et al., 2008; Fetsch et al., 2011).
Based on this representation, and assuming that the response variability of the neurons belongs to the exponential family with linear sufficient statistics (Ma et al., 2006) (an assumption consistent with in vivo data [Graf et al., 2011]), heading discrimination can be performed optimally by a weighted sum of the activity of all neurons, with weights monotonically related to the preferred heading of each neuron. For a straight forward heading, h = 0, this sum should be 0, and for h > 0 (or h < 0) it should be positive (or negative), thus sharing the basic properties of the momentary evidence, $\dot{x}$, in our DM. This allowed us to deduce the mean and variance of the momentary evidence driving $\dot{x}$, based on what we know about the neural responses. First, the sensitivity, k_{vis}(c), which determines how optic flow modulates the mean drift rate of $\dot{x}$, scales in proportion with the ‘peakedness’ of the neural activity, which in turn is proportional to coherence. We assumed a functional form of k_{vis}(c) given by ${a}_{vis}{c}^{{\gamma}_{vis}}$, where a_{vis} and γ_{vis} are positive parameters. Second, the variance of $\dot{x}$ is assumed to be the sum of the variances of the neural responses. Since experimental data suggest that the variance of these responses is proportional to their firing rate (Tolhurst et al., 1983), the sum of the variances is proportional to the area underneath the population activity profile (Figure 5A). Based on the experimental data of Britten et al. (Heuer and Britten, 2007), this area was assumed to scale roughly linearly with coherence, such that the variance of $\dot{x}$ is proportional to $1+{b}_{vis}{c}^{{\gamma}_{vis}}$ with free parameters b_{vis} and γ_{vis}, the latter of which captures possible deviations from linearity. We further assumed the DM bound to be independent of coherence, and given by θ_{σ,vis}. 
Thus, the effect of motion coherence on the momentary evidence in the DM was modeled by four parameters: a_{vis}, γ_{vis}, b_{vis}, and θ_{σ,vis}.
The above scaling of the diffusion variance by coherence, which is a consequence of the neural code for heading, makes an interesting prediction: reaction times for headings near straight ahead should decrease with coherence in the visual condition, even though the mean drift rate, k_{vis}(c)sin(h), is very close to 0. This is indeed what we observed: subjects tended to decide faster for higher coherences even when h ≈ 0 (Figure 3, Figure 3—figure supplement 1). This aspect of the data can only be captured by the model if the DM variance is allowed to change with coherence (Figure 5B,C).
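This prediction follows from a standard property of driftless diffusion: with bounds at ±θ and diffusion variance σ² per unit time, the mean time to hit a bound is θ²/σ². The sketch below assumes a time-invariant DM and sets the proportionality constant of the variance to one; the function names and parameter values (`b_vis`, `gamma_vis`, `theta`) are hypothetical.

```python
def diffusion_variance(c, b_vis, gamma_vis):
    """Diffusion variance as a function of coherence c (proportionality constant set to 1)."""
    return 1.0 + b_vis * c ** gamma_vis

def mean_decision_time_zero_drift(c, theta, b_vis, gamma_vis):
    """Mean first-passage time of a driftless diffusion to bounds at +/- theta."""
    return theta ** 2 / diffusion_variance(c, b_vis, gamma_vis)

# Illustrative parameters only.
theta, b_vis, gamma_vis = 1.0, 2.0, 1.0
dt_low = mean_decision_time_zero_drift(0.25, theta, b_vis, gamma_vis)
dt_high = mean_decision_time_zero_drift(0.70, theta, b_vis, gamma_vis)
```

With a coherence-independent bound, a larger variance at higher coherence shortens the mean decision time even though the drift is zero.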
To summarize, in the combined condition, the diffusion variance was assumed to be proportional to $1+{b}_{comb}{c}^{{\gamma}_{comb}}$, while the bound was fixed at θ_{σ,comb}. By contrast, the diffusion rate (sensitivity) cannot be modeled freely but rather needs to obey ${k}_{comb}\left(c\right)=\sqrt{{k}_{vis}^{2}\left(c\right)+{{k}_{vest}}^{2}}$ in order to ensure optimal cue combination. The sensitivity k_{vest} and bound θ_{σ,vest} in the vestibular condition do not depend on motion coherence and were thus model parameters that were fitted directly.
Observed reaction times were assumed to be composed of the decision time and some non-decision time. The decision time is the time from the start of integrating evidence until a decision is made, as predicted by the diffusion model. The non-decision time includes the motor latency and the time from stimulus onset to the start of integrating evidence. As the latter can vary between different modalities, we allowed it to differ between visual, vestibular, and combined conditions, but not for different coherences, thus introducing the model parameters t_{nd,vis}, t_{nd,vest}, and t_{nd,comb}. Although the fitted non-decision times were similar across stimulus conditions for most subjects (Figure 3—figure supplement 2), a model assuming a single non-decision time resulted in a small but significant decrease in fit quality (Figure 7—figure supplement 2A). Overall, 12 parameters were used to model cue sensitivities, bounds, variances, and non-decision times in all conditions, and these 12 parameters were used to fit 312 data points for subjects that were tested with 6 coherences (168 data points for the three-coherence version). An additional 14 parameters (8 parameters for the three-coherence version; one bias parameter per coherence/condition, one lapse parameter across all conditions) controlled for biases in the motion direction percept and for lapses of attention that were assumed to lead to random choices (‘Materials and methods’). Although these additional parameters were necessary to achieve good model fits (Figure 7—figure supplement 2A), it is critical to note that they could not account for differences in heading thresholds or reaction times across stimulus conditions. As such, the additional parameters play no role in determining whether subjects perform optimal multisensory integration.
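The way these auxiliary parameters enter the predictions can be sketched as follows. This is one plausible reading, assuming that a lapse produces a random choice with probability one half and that the non-decision time simply adds to the decision time; the function names and numbers are hypothetical, and the exact treatment is given in ‘Materials and methods’.

```python
def observed_p_right(p_model, lapse_rate):
    """Mix the DM choice probability with random guesses on lapse trials."""
    return 0.5 * lapse_rate + (1.0 - lapse_rate) * p_model

def predicted_reaction_time(decision_time, t_nd):
    """Reaction time = decision time from the DM + non-decision time."""
    return decision_time + t_nd

# Illustrative numbers only.
p_obs = observed_p_right(p_model=1.0, lapse_rate=0.04)
rt = predicted_reaction_time(decision_time=0.65, t_nd=0.30)
```

A lapse rate caps the asymptotes of the psychometric curve without changing its slope-determining parameters, which is why such terms cannot explain threshold differences between conditions.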
Alternative parameterizations of how drift rates and bounds depend on motion coherence yielded qualitatively similar results, but caused the model fits to worsen decisively (Supplementary file 1; Figure 7—figure supplement 2A).
Critically, our model predicts that the unimodal sensitivities k_{vis}(c) and k_{vest} relate to the combined value by ${k}_{comb}^{predicted}\left(c\right)=\sqrt{{k}_{vis}{\left(c\right)}^{2}+{k}_{vest}^{2}}$, if subjects accumulate evidence optimally across cues. To test this prediction, we fitted separately the unimodal and combined sensitivities, k_{vis}(c), k_{vest} and k_{comb} to the complete data set from each individual subject using maximum likelihood optimization (‘Materials and methods’), and then compared the fitted values of k_{comb} to the predicted values, ${k}_{comb}^{predicted}\left(c\right)$. Predicted and observed sensitivities for the combined condition are virtually identical (Figure 6), providing strong support for near-optimal cue combination across both time and cues. Remarkably, for low coherences at which optic flow provides no useful heading information, the sensitivity in the combined condition was not significantly different from that of the vestibular condition (Figure 6). Thus, subjects were able to completely suppress noisy visual information and rely solely on vestibular input, as predicted by the model.
Having established that cue sensitivities combine according to Equation 2, the model was then fit to data from each individual subject under the assumption of optimal cue combination. Model fits are shown as solid curves for example subject D2 (Figure 3), as well as for all other subjects (Figure 3—figure supplement 1). Sensitivity parameters, bounds, and non-decision times resulting from the fits are also shown for each subject, condition, and coherence (Figure 3—figure supplement 2). For 8 of 10 datasets, the model explains more than 95% of the variance in the data (adjusted R^{2} > 0.95), providing additional evidence for near-optimal cue combination across both time and cues (Figure 7A). The subjects associated with these datasets show a clear decrease in reaction times with larger $\left|h\right|$, and this effect is more pronounced in the visual condition than in the vestibular and combined conditions (Figure 3, Figure 3—figure supplement 1). The remaining two subjects (C and F) feature qualitatively different behavior and lower R^{2} values of approximately 0.80 and 0.90, respectively (Figure 3—figure supplement 1). These subjects showed little decline in reaction times with larger values of $\left|h\right|$, and their mean reaction times were more similar across the visual, vestibular and combined conditions.
Critically, the model nicely captures the observation that the psychophysical threshold in the combined condition is typically greater than that for the visual condition, despite near-optimal combination of momentary evidence from the visual and vestibular modalities (e.g., Figure 3, 70% coherence, Figure 2—figure supplement 1, Figure 3—figure supplement 1). Thus, the model fits confirm quantitatively that apparent suboptimality in psychophysical thresholds can arise even if subjects combine all cues in a statistically optimal manner, emphasizing the need for a computational framework that incorporates both decision accuracy and speed.
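A toy calculation shows how this can happen. The sketch below deliberately simplifies to a time-invariant DM with unit diffusion variance, for which the choice probability is logistic in kθ sin(h), so the threshold shrinks as the product of sensitivity and bound grows. The threshold proxy (the heading yielding about 84% rightward choices) and all parameter values are hypothetical.

```python
import math

# Log-odds corresponding to ~84% rightward choices (one SD of a cumulative Gaussian).
Z84 = math.log(0.8413 / (1.0 - 0.8413))

def threshold_deg(k, theta):
    """Heading (deg) at which a constant-drift DM chooses 'right' about 84% of the time."""
    return math.degrees(math.asin(Z84 / (2.0 * k * theta)))

def mean_dt_straight_ahead(theta):
    """Mean decision time at h = 0 for unit diffusion variance."""
    return theta ** 2

# Illustrative sensitivities and bounds (not fitted values).
k_vis, k_vest = 12.0, 9.0
k_comb = math.hypot(k_vis, k_vest)          # optimal combined sensitivity
theta_vis, theta_vest, theta_comb = 1.0, 0.6, 0.6

thr_vis = threshold_deg(k_vis, theta_vis)
thr_vest = threshold_deg(k_vest, theta_vest)
thr_comb = threshold_deg(k_comb, theta_comb)
```

Even though `k_comb` exceeds both unimodal sensitivities, the lower bound in the combined condition places its threshold between the visual and vestibular thresholds, while decisions are as fast as in the vestibular condition.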
Alternative models
To further assess and validate the critical design features of our modified DM, we evaluated six alternative (mostly suboptimal) versions of the model to see if these variants are able to explain the data equally well. We compared these variants to the optimal model using Bayesian model comparison, which trades off fit quality with model complexity to determine whether additional parameters significantly improve the fit (Goodman, 1999).
With regard to optimality of cue integration across modalities, we examined two model variants. The first variant (also used to generate Figure 6) eliminates the relationship, ${k}_{comb}\left(c\right)=\sqrt{{k}_{vis}^{2}\left(c\right)+{k}_{vest}^{2}}$ (Equation 2), between the sensitivity parameters in the combined and singlecue conditions. Instead, this variant allows independent sensitivity parameters for the combined condition at each coherence, thus introducing one additional parameter per coherence. Since this variant is strictly more general than the optimal model, it must fit the data at least as well. However, if the subjects' behavior is near optimal, the additional degrees of freedom in this variant should not improve the fit enough to justify the addition of these parameters. This is indeed what we found by Bayesian model comparison (Figure 7B, ‘separate k's’), which shows the optimal model to be ∼10^{70} times more likely than the variant with independent values of k_{comb}(c). This is well above the threshold value that is considered to provide ‘decisive’ evidence in favor of the optimal model (we use Fisher's definition of decisive [Jeffreys, 1998] according to which a model is said to be decisively better if it is >100 times more likely to have generated the data). The second model variant had the same number of parameters as the optimal model, but assumed that the cues are always weighted equally. Evidence in the combined condition was given by the simple average, ${\dot{X}}_{comb}=\frac{1}{2}\left({\dot{X}}_{vis}+{\dot{X}}_{vest}\right)$, ignoring cue sensitivities. The resulting fits (Figure 7B, ‘no cue weighting’) are also decisively worse than those of the optimal model. Together, these model variants strongly support the hypothesis that subjects weight cues according to their relative sensitivities, as given by Equation 2. These effects were largely consistent across individual subjects (Figure 7—figure supplement 1A).
To test the other key assumption of our model, namely that subjects temporally weight incoming evidence according to the profile of stimulus information, we tested three model variants that modified how temporal weighting was performed without changing the number of parameters in the model. If we assumed that the temporal weighting of both modalities followed the acceleration profile of the stimulus while leaving the model otherwise unchanged, the model fit worsened decisively (Figure 7B, ‘weighting by acceleration’). Assuming that the weighting of both modalities followed the velocity profile of the stimulus also decisively reduced fit quality (Figure 7B, ‘weighting by velocity’), although this effect was not consistent across subjects (Figure 7—figure supplement 1A). If we completely removed temporal weighting of cues from the model, fits were dramatically worse than the optimal model (Figure 7B, ‘no temporal weighting’). Finally, for completeness, we also tested a model variant that neither performs temporal weighting of cues nor considers the relative sensitivity to the cues. Again, this model variant fit the data decisively worse than the optimal model (Figure 7B, ‘no cue/temporal weighting’). Thus, subjects seem to be able to take into account their sensitivity to the evidence across time as well as across cues. All of these model comparisons received further support from a more conservative random-effects Bayesian model comparison, shown in Figure 7—figure supplement 1B,C.
Finally, we also considered if a parallel race model could account for our data. The parallel race model (Raab, 1962; Miller, 1982; Townsend and Wenger, 2004; Otto and Mamassian, 2012) postulates that the decision in the combined condition emerges from the faster of two independent races toward a bound, one for each sensory modality. Because it does not combine information across modalities, the parallel race model predicts that decisions in the combined condition are caused by the faster modality. Consequently, choices in the combined condition are unlikely to be more correct (on average) than those of the faster unimodal condition. For all but one subject, the vestibular modality is substantially faster, even when compared to the visual modality at high coherence and controlling for the effect of heading direction (2-way ANOVA, p<0.0001 for all subjects except C). Critically, all of these subjects feature significantly lower psychophysical thresholds in the combined condition than in the vestibular condition (p<0.039 for all subjects except subject C, p=0.210, Supplementary file 2A). Furthermore, we performed standard tests (Miller's bound and Grice's bound) that compare the observed distribution of reaction times with that predicted by the parallel race model (Miller, 1982; Grice et al., 1984). These tests revealed that all but two subjects made significantly slower decisions than predicted by the parallel race model for most coherence/heading combinations (p<0.05 for all subjects except subjects F and B2; Supplementary file 2B), and no subject was faster than predicted (p>0.05, all subjects; Supplementary file 2B). Based on these observations, we can reject the parallel race model as a viable hypothesis to explain the observed behavior.
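The two bounds can be sketched on empirical reaction-time distributions as follows. Under a parallel race, the cumulative distribution of combined-condition reaction times, F_comb(t), should lie between Grice's bound, max(F_vis(t), F_vest(t)), and Miller's bound, F_vis(t) + F_vest(t). The sketch below only evaluates the bounds at a single time point; it does not reproduce the statistical tests reported in Supplementary file 2B, and the reaction times are invented for illustration.

```python
from bisect import bisect_right

def ecdf(sorted_rts, t):
    """Empirical probability that a reaction time is <= t."""
    return bisect_right(sorted_rts, t) / len(sorted_rts)

def race_bounds(rt_vis, rt_vest, t):
    """Return (Grice lower bound, Miller upper bound) on F_comb(t)."""
    f_vis = ecdf(sorted(rt_vis), t)
    f_vest = ecdf(sorted(rt_vest), t)
    grice = max(f_vis, f_vest)            # race is at least as fast as the faster cue
    miller = min(1.0, f_vis + f_vest)     # race cannot be faster than this
    return grice, miller

# Hypothetical reaction times in seconds (not data from the study).
rt_vis = [0.9, 1.0, 1.1, 1.2]
rt_vest = [0.7, 0.8, 0.9, 1.0]
rt_comb = [0.95, 1.05, 1.1, 1.2]

t = 0.9
grice, miller = race_bounds(rt_vis, rt_vest, t)
f_comb = ecdf(sorted(rt_comb), t)

# f_comb below Grice's bound means decisions are slower than any race predicts.
print(grice, miller, f_comb, f_comb < grice)
```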
Discussion
We have shown that, when subjects are allowed to choose how long to accumulate evidence in a cue integration task, their behavior no longer follows the standard predictions of optimal cue integration theory that normally apply when stimulus presentation time is controlled by the experimenter. Particularly, they feature worse discrimination performance (higher psychophysical thresholds) in the combined condition than would be predicted from the unimodal conditions, in some cases even worse than the better of the two unimodal conditions. This occurs because subjects tend to decide more quickly in the combined condition than in the more sensitive unimodal condition and thus have less time to accumulate evidence. This indicates that a more general definition of optimal cue integration must incorporate reaction times. Indeed, subjects' behavior could be reproduced by an extended diffusion model that takes into account both speed and accuracy, thus suggesting that subjects accumulate evidence across both time and cues in a statistically near-optimal manner (i.e., with minimal information loss) despite their reduced discrimination performance in the combined condition.
Previous work on optimal cue integration (e.g., Ernst and Banks, 2002; Battaglia et al., 2003; Knill and Saunders, 2003; Fetsch et al., 2009) was based on experiments that employed fixed-duration stimuli and was thus able to ignore how subjects accumulate evidence over time. Moreover, previous work relied on the implicit assumption that subjects make use of all evidence throughout the duration of the stimulus. However, this assumption need not be true and has been shown to be violated even for short presentation durations (Mazurek et al., 2003; Kiani et al., 2008). Therefore, apparent suboptimality in some previous studies of cue integration or in some individual subjects (Battaglia et al., 2003; Fetsch et al., 2009) might be attributable to truly suboptimal cue combination, to subjects halting evidence accumulation before the end of the stimulus presentation period, or to the difficulty in estimating stimulus processing time (Stanford et al., 2010). Unfortunately, these potential causes cannot be distinguished using a fixed-duration task. Allowing subjects to register their decisions at any time during the trial alleviates this potential confound.
We model subjects' decision times by assuming an accumulation-to-bound process. In the multisensory context, this raises the question of whether evidence accumulation is bounded for each modality separately, as assumed by the parallel race model, or whether evidence is combined across modalities before being accumulated toward a single bound, as in coactivation models and our modified diffusion model. Based on our behavioral data, we can rule out parallel race models, as they cannot explain lower psychophysical thresholds (better sensitivity) in the combined condition relative to the faster vestibular condition. Further evidence against such models is provided by neurophysiological studies which demonstrate that visual and vestibular cues to heading converge in various cortical areas, including areas MSTd (Gu et al., 2006), VIP (Schlack et al., 2005; Chen et al., 2011b), and VPS (Chen et al., 2011a). Activity in area MSTd can account for sensitivity-based cue weighting in a fixed-duration task (Fetsch et al., 2011), and MSTd activity is causally related to multimodal heading judgments (Britten and van Wezel, 1998, 2002; Gu et al., 2012). These physiological studies strongly suggest that visual and vestibular signals are integrated in sensory representations prior to decision-making, inconsistent with parallel race models.
Our model makes the assumption that sensory signals are integrated prior to decision-making and is in this sense similar to coactivation models that have been used previously to model reaction times in multimodal settings (Miller, 1982; Corneil et al., 2002; Townsend and Wenger, 2004). However, it differs from these models in important aspects. First, coactivation models have been introduced to explain reaction times that are faster than those predicted by parallel race models (Raab, 1962; Miller, 1982). Our subjects, in contrast, feature reaction times that are slower than those of parallel race models in almost all conditions (Supplementary file 2B). We capture this effect by an elevated effective bound in the combined condition as compared to the faster vestibular condition, such that cue combination remains optimal despite longer reaction times. Second, coactivation models usually combine inputs from the different modalities by a simple sum (e.g., Townsend and Wenger, 2004). This entails adding noise to the combined signal if the sensitivity to one of the modalities is low, which is detrimental to discrimination performance. In contrast, we show that different cues need to be weighted according to their sensitivities to achieve statistically optimal integration of multisensory evidence at each moment in time (Equation 2).
Another alternative to coactivation models are serial race models, which posit that the race corresponding to one cue needs to be completed before the other one starts (e.g., Townsend and Wenger, 2004). These models can be ruled out by observing that they predict reaction times in the combined condition to be longer than those in the slower of the two unimodal conditions. This is clearly violated by the subjects' behavior.
Optimal accumulation of evidence over time requires the momentary evidence to be weighted according to its associated sensitivity. For the vestibular modality, we assume that the temporal profile of sensitivity to the evidence follows acceleration. This may appear to conflict with data from multimodal areas MSTd, VIP, and VPS, where neural activity in response to self-motion reflects a mixture of velocity and acceleration components (Fetsch et al., 2010; Chen et al., 2011a). Note, however, that the vestibular stimulus is initially encoded by otolith afferents in terms of acceleration (Fernandez and Goldberg, 1976). Thus, any neural representation of vestibular stimuli in terms of velocity requires a temporal integration of the acceleration signal, and this integration introduces temporal correlations into the signal. As a consequence, a neural response that is maximal at the time of peak stimulus velocity does not imply a simultaneous peak in the information coded about heading direction. Rather, information still follows the time course of its original encoding, which is in terms of acceleration. In contrast, the time course of the sensitivity to the visual stimulus is less clear. For our model we have intuitively assumed it to follow the velocity profile of the stimulus, as information per unit time about heading certainly increases with the velocity of the optic flow field, even when there is no acceleration. This assumption is supported by a decisively worse model fit if we set the weighting of the visual momentary evidence to follow the acceleration profile (Figure 7B, ‘weighting by acceleration’). Nonetheless, we cannot completely exclude any contribution of acceleration components to visual information (Lisberger and Movshon, 1999; Price et al., 2005). In any case, our model fits make clear that temporal weighting of vestibular and visual inputs is necessary to predict behavior when stimuli are time-varying.
The extended DM described here makes the strong assumption that cue sensitivities are known before combining information from the two modalities, as these sensitivities need to be known in order to weight the cues appropriately. As only the sensitivity to the visual stimulus changes across trials in our experiment, it is possible that subjects can estimate their sensitivity (as influenced by coherence) during the initial low-velocity stimulus period (Figure 1C) in which heading information is minimal but motion coherence is salient. Thus, for our task, it is reasonable to assume that subjects can estimate their sensitivity to cues. We have recently begun to consider how sensitivity estimation and cue integration can be implemented neurally. The neural model (Onken et al., 2012. Near optimal multisensory integration with nonlinear probabilistic population codes using divisive normalization. The Society for Neuroscience annual meeting 2012) estimates the sensitivity to the visual input from motion sensitive neurons and uses this estimate to perform near-optimal multisensory integration with generalized probabilistic population codes (Ma et al., 2006; Beck et al., 2008) using divisive normalization. We intend to extend this model to the integration of evidence over time to predict neural responses (e.g., in area LIP) that should roughly track the temporal evolution of the decision variable (x_{comb}(t), ‘Materials and methods’) in the DM. This will make predictions for activity in decision-making areas that can be tested in future experiments.
In closing, our findings establish that conventional definitions of optimality do not apply to cue integration tasks in which subjects’ decision times are unconstrained. We establish how sensory evidence should be weighted across modalities and time to achieve optimal performance in reaction-time tasks, and we show that human behavior is broadly consistent with these predictions but not with alternative models. These findings, and the extended diffusion model that we have developed, provide the foundation for building a general understanding of perceptual decision-making under more natural conditions in which multiple cues vary dynamically over time and subjects make rapid decisions when they have acquired sufficient evidence.
Materials and methods
Subjects and apparatus
Seven subjects (3 males) aged 23–38 years with normal or corrected-to-normal vision and no history of vestibular deficits participated in the experiments. All subjects but one were informed of the purposes of the study. Informed consent was obtained from all participants and all procedures were reviewed and approved by the Washington University Office of Human Research Protections (OHRP), Institutional Review Board (IRB; IRB ID# 201109183). Consent to publish was not obtained in writing, as it was not required by the IRB, but all subjects were recruited for this purpose and approved verbally. Of these subjects, three (subjects B, D, F; 1 male) participated in a follow-up experiment roughly 2 years after the initial data collection, with six coherence levels instead of the original three. The six-coherence version of their data is referred to as B2, D2, and F2. Procedures for the follow-up experiment were approved by the Institutional Review Board for Human Subject Research for Baylor College of Medicine and Affiliated Hospitals (BCM IRB, ID# H29411) and informed consent and consent to publish were given again by all three subjects.
The apparatus, stimuli, and task design have been described in detail previously (Fetsch et al., 2009; Gu et al., 2010), and are briefly summarized here. Subjects were seated comfortably in a padded racing seat that was firmly attached to a 6-degree-of-freedom motion platform (MOOG, Inc). A 3-chip DLP projector (Galaxy 6; Barco, Kortrijk, Belgium) was mounted on the motion platform behind the subject and front-projected images onto a large (149 × 127 cm) projection screen via a mirror mounted above the subject’s head. The viewing distance to the projection screen was ∼70 cm, thus allowing for a field of view of ∼94° × 84°. Subjects were secured to the seat using a 5-point racing harness, and a custom-fitted plastic mask immobilized the head against a cushioned head mount. Seated subjects were enclosed in a black aluminum superstructure, such that only the display screen was visible in the darkened room. To render stimuli stereoscopically, subjects wore active stereo shutter glasses (CrystalEyes 3; RealD, Beverly Hills, CA) which restricted the field of view to ∼90° × 70°. Subjects were instructed to look at a centrally located, head-fixed target throughout each trial. Sounds from the motion platform were masked by playing white noise through headphones. Behavioral task sequences and data acquisition were controlled by Matlab and responses were collected using a button box.
Visual stimuli were generated by an OpenGL accelerator board (nVidia Quadro FX1400), and were plotted with sub-pixel accuracy using hardware anti-aliasing. In the visual and combined conditions, visual stimuli depicted self-translation through a 3D cloud of stars distributed uniformly within a virtual space 130 cm wide, 150 cm tall, and 75 cm deep. Star density was 0.01/cm^{3}, with each star being a 0.5 cm × 0.5 cm triangle. Motion coherence was manipulated by randomizing the three-dimensional location of a percentage of stars on each display update while the remaining stars moved according to the specified heading. The probability of a single star following the trajectory associated with a particular heading for N video updates is therefore (c/100)^{N}, where c denotes motion coherence (ranging from 0–100%). At the largest coherence used here (70%), there is only a 3% probability that a particular star would follow the same trajectory for 10 display updates (0.17 s). Thus, it was practically not possible for subjects to track the trajectories of individual stars. This manipulation degraded optic flow as a heading cue and was used to manipulate visual cue reliability in the visual and combined conditions. ‘Zero’ coherence stimuli had c set to 0.1, which was practically indistinguishable from c = 0, but allowed us to maintain a precise definition of the correctness of the subject's choice.
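The survival probability quoted above follows directly from (c/100)^{N}. As a minimal check, with a hypothetical helper name:

```python
def p_same_trajectory(coherence_percent, n_updates):
    """Probability that one star keeps its heading-consistent trajectory
    for n_updates consecutive display updates, i.e. (c/100)^N."""
    return (coherence_percent / 100.0) ** n_updates

# Highest coherence used in the experiment, 10 display updates.
p = p_same_trajectory(70, 10)
print(round(p, 4))  # roughly 0.03, matching the 3% stated in the text
```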
Behavioral task
In all stimulus conditions, the task was a single-interval, two-alternative forced choice (2AFC) heading discrimination task. In each trial, human subjects were presented with a translational motion stimulus in the horizontal plane (Gaussian velocity profile; peak velocity, 0.403 m/s; peak acceleration, 0.822 m/s^{2}; total displacement, 0.3 m; maximum duration, 2 s). Heading was varied in small steps around straight ahead (±0.686°, ±1.96°, ±5.6°, ±16°) and subjects were instructed to report (by a button press) their perceived heading (leftward or rightward relative to an internal standard of straight ahead) as quickly and accurately as possible. In the visual and combined conditions, cue reliability was varied across trials by randomly choosing the motion coherence of the visual stimulus from among either a group of three values (25%, 37%, and 70%, subjects A–G) or a group of six values (0%, 12%, 25%, 37%, 51%, and 70%, subjects B2, D2, F2). A coherence of 25% means that 25% of the dots move in a direction consistent with the subject's heading, whereas the remaining 75% of the dots are relocated randomly within the dot cloud. In the combined condition, visual and vestibular stimuli always specified the same heading (there was no cue conflict).
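The stimulus parameters listed above are mutually consistent, which can be checked with a short sketch. It assumes an untruncated Gaussian velocity profile whose peak falls at the midpoint of the 2 s window. The midpoint, the variable names, and the neglect of truncation are assumptions made for illustration and are not stated in the text.

```python
import math

PEAK_VELOCITY = 0.403   # m/s, from the text
DISPLACEMENT = 0.3      # m, from the text
T_PEAK = 1.0            # s, assumed midpoint of the 2 s stimulus

# The area under a Gaussian velocity profile is v_peak * sigma * sqrt(2*pi).
sigma = DISPLACEMENT / (PEAK_VELOCITY * math.sqrt(2.0 * math.pi))

# |dv/dt| is maximal one sigma away from the peak, where it equals
# v_peak / (sigma * sqrt(e)).
peak_acceleration = PEAK_VELOCITY / (sigma * math.sqrt(math.e))

def velocity(t):
    """Gaussian velocity profile v(t) in m/s."""
    return PEAK_VELOCITY * math.exp(-0.5 * ((t - T_PEAK) / sigma) ** 2)

def acceleration(t):
    """Time derivative of velocity(t) in m/s^2."""
    return -(t - T_PEAK) / sigma ** 2 * velocity(t)

print(sigma, peak_acceleration)
```

Under these assumptions the implied peak acceleration is close to the 0.822 m/s^{2} reported for the stimulus. The functions `velocity` and `acceleration` correspond to the temporal profiles that the model uses to weight visual and vestibular evidence, respectively.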
During the main phase of data collection, subjects were not informed about the correctness of their choices (no feedback). In the vestibular and combined conditions, platform motion was halted smoothly but rapidly immediately following registration of the decision, and the platform then returned to its original starting point. In the visual condition, the optic flow stimulus disappeared from the screen when a decision was made. In all conditions, 2.5 s after the decision, a sound informed the subjects that they could initiate the next trial by pushing a third button. Once a trial was initiated, the stimulus onset occurred following a randomized delay period (truncated exponential; mean, 987 ms). Prior to data collection, subjects were introduced to the task for 1–2 week ‘training’ sessions, in which they were informed about the correctness of their choices by either a low-frequency (incorrect) or a high-frequency (correct) sound. The training period was terminated once their behavior stabilized across consecutive training sessions. During training, subjects were able to adjust their speed-accuracy tradeoff based on feedback. During subsequent data collection, we did not observe any clear changes in the speed-accuracy tradeoff exhibited by subjects.
Data analysis
Analyses and statistical tests were performed using MATLAB R2013a (The Mathworks, MA, USA).
For each subject, discrimination thresholds were determined separately for each combination of stimulus modality (visual-only, vestibular-only, combined) and coherence (25%, 37%, and 70% for subjects A–G; 0%, 12%, 25%, 37%, 51%, and 70% for subjects B2, D2, F2) by plotting the proportion of rightward choices as a function of heading direction (Figure 2A). The psychophysical discrimination threshold was taken as the standard deviation of a cumulative Gaussian function, fitted by maximum likelihood methods. We assumed a common lapse rate (proportion of random choices) across all stimulus conditions, but allowed for a separate bias parameter (horizontal shift of the psychometric function) for each modality/coherence. Confidence intervals for threshold estimates were obtained by taking 5000 parametric bootstrap samples (Wichmann and Hill, 2001). These samples also form the basis for statistical comparisons of discrimination thresholds: two thresholds were compared by computing the difference between their associated samples, leading to 5000 threshold difference samples. Subsequently, we determined the fraction of differences that were below or above zero, depending on the directionality of interest. This fraction determined the raw significance level for rejecting the null hypothesis (no difference). The reported significance levels are Bonferroni-corrected for multiple comparisons. All comparisons were one-tailed. Following traditional cue combination analyses (Clark and Yuille, 1990), the optimal threshold σ_{pred,c} in the combined condition for coherence c was predicted from the visual threshold σ_{vis,c} and the vestibular threshold σ_{vest} by ${\sigma}_{pred,c}^{2}={\sigma}_{vis,c}^{2}{\sigma}_{vest}^{2}/({\sigma}_{vis,c}^{2}+{\sigma}_{vest}^{2})$. Confidence intervals and statistical tests were again based on applying this formula to individual bootstrap samples of the unimodal threshold estimates.
Supplementary file 2A reports the p-values for all subjects and all comparisons.
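The threshold prediction and the bootstrap-based comparison can be sketched as follows. This is a minimal illustration, not the analysis code used in the study: the function names are hypothetical, the bootstrap samples are assumed to have been generated already, and the numbers are invented.

```python
import math

def predicted_threshold(sigma_vis, sigma_vest):
    """Optimal combined threshold from the two unimodal thresholds."""
    var = sigma_vis ** 2 * sigma_vest ** 2 / (sigma_vis ** 2 + sigma_vest ** 2)
    return math.sqrt(var)

def one_tailed_p(samples_a, samples_b):
    """Fraction of paired bootstrap differences (a - b) that are <= 0.
    A small value supports the directional hypothesis a > b."""
    diffs = [a - b for a, b in zip(samples_a, samples_b)]
    return sum(d <= 0 for d in diffs) / len(diffs)

def bonferroni(p_raw, n_comparisons):
    """Bonferroni-corrected significance level, capped at 1."""
    return min(1.0, p_raw * n_comparisons)

# Hypothetical unimodal thresholds in degrees.
sigma_pred = predicted_threshold(3.0, 4.0)

# Hypothetical bootstrap samples of two thresholds (4 instead of 5000).
p_raw = one_tailed_p([3.0, 4.0, 5.0, 2.0], [2.0, 3.0, 4.0, 3.0])
p_corrected = bonferroni(p_raw, 3)

print(sigma_pred, p_raw, p_corrected)
```

To obtain confidence intervals on the prediction, `predicted_threshold` would be applied to each pair of unimodal bootstrap samples, as described in the text.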
For each dataset, we evaluated the absolute goodness-of-fit of the optimal model (Figure 7A) by finding the set of model parameters φ that maximized the likelihood of the observed choices and reaction times, and then computing the average coefficient of determination, ${R}^{2}\left(\text{D}\phi \right)=\frac{1}{2}\left({R}_{psych}^{2}\left(\phi \right)+{R}_{chron}^{2}\left(\phi \right)\right)$. Here, ${R}_{psych}^{2}\left(\phi \right)$ and ${R}_{chron}^{2}\left(\phi \right)$ denote the adjusted R^{2} values for the psychometric and chronometric functions, respectively, across all modalities/coherences. The value of ${R}_{psych}^{2}$ for the psychometric function was based on the probability of making a correct choice across all heading angles, coherences, and conditions, weighted by the number of observations, and adjusted for the number of model parameters. The same procedure, based on the mean reaction times, was used to find ${R}_{chron}^{2}$, but we additionally distinguished between mean reaction times for correct and incorrect choices, and fitted both weighted by their corresponding number of observations (see SI for expressions for ${R}_{psych}^{2}\left(\phi \right)$ and ${R}_{chron}^{2}\left(\phi \right)$).
We compared different variants of the full model (Figure 7B) by Bayesian model comparison based on Bayes factors, which were computed as follows. First, we found for each model $\mathcal{M}$ and subject s the set of parameters φ that maximized the likelihood, ${\phi}_{s,\mathcal{M}}^{*}=\mathrm{arg}{\mathrm{max}}_{\phi}p\left(\text{data of subj }s|\phi ,\mathcal{M}\right)$. Second, we approximated the Bayesian model evidence, measuring the model posterior probability while marginalizing over the parameters, up to a constant by the Bayesian information criterion, $\mathrm{ln}p\left(\mathcal{M}|s\right)\approx -\frac{1}{2}\text{BIC}\left(s,\mathcal{M}\right)$ with $\text{BIC}\left(s,\mathcal{M}\right)=-2\mathrm{ln}p\left(s|{\phi}_{s,\mathcal{M}}^{*},\mathcal{M}\right)+{k}_{\mathcal{M}}\mathrm{ln}{N}_{s}$. Here, ${k}_{\mathcal{M}}$ is the number of parameters of model $\mathcal{M}$, and N_{s} is the number of trials for dataset s. Based on this, we computed the Bayes factor of model $\mathcal{M}$ vs the optimal model $\mathcal{M}$_{opt} by pooling the model evidence over datasets, resulting in ${\sum}_{s}\left(\mathrm{ln}p\left(\mathcal{M}|s\right)-\mathrm{ln}p\left({\mathcal{M}}_{opt}|s\right)\right)$. These values, converted to a base-10 logarithm, are shown in Figure 7B. In this case, a log_{10}-difference of −2 implies that the optimal model is 100 times more likely given the data than the alternative model, a difference that is considered decisive in favor of the optimal model (Jeffreys, 1998).
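The BIC-based approximation to the pooled log Bayes factor can be sketched in a few lines. The function names and the fit summaries below are hypothetical and are not values from the study; each dataset is summarized by its maximized log-likelihood, its number of parameters, and its number of trials.

```python
import math

def bic(max_log_lik, n_params, n_trials):
    """Bayesian information criterion for one model and one dataset."""
    return -2.0 * max_log_lik + n_params * math.log(n_trials)

def log10_bayes_factor(fits_alt, fits_opt):
    """Pooled log10 Bayes factor of an alternative model vs the optimal model.
    Each argument is a list with one (max_log_lik, n_params, n_trials)
    tuple per dataset, in the same dataset order."""
    total = 0.0
    for alt, opt in zip(fits_alt, fits_opt):
        # ln p(M | s) is approximated by -BIC/2, up to a shared constant.
        total += -0.5 * bic(*alt) - (-0.5 * bic(*opt))
    return total / math.log(10.0)

# Hypothetical single dataset: the alternative fits slightly better
# but uses three more parameters.
fits_alt = [(-1000.0, 12, 2000)]
fits_opt = [(-1002.0, 9, 2000)]

lbf = log10_bayes_factor(fits_alt, fits_opt)
print(lbf)  # negative values favor the optimal model
```

In this invented example the small gain in likelihood does not offset the complexity penalty, so the pooled value falls below −2, which would count as decisive evidence for the optimal model under the criterion used in the text.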
To determine the faster stimulus modality for each subject, we compared reaction times for the vestibular condition with those for the visual condition at 70% coherence. We tested the difference in the logarithm of these reaction times by a 2-way ANOVA with stimulus modality and heading direction as the two factors, and we report the main effect of stimulus modality on reaction times. Although we performed a log-transform of the reaction times to ensure their normality, a Jarque–Bera test revealed that normality did not hold for some heading directions. Thus, we additionally performed a Friedman test on subsampled data (to have the same number of trials per modality/heading) which supported the ANOVA result at the same significance level. In the main text, we only report the main effect of stimulus modality on reaction time from the 2-way ANOVA. Detailed results of the 2-way ANOVA, the Jarque–Bera test, and the Friedman test are reported for each subject in Supplementary file 2C.
The extended diffusion model
Here we outline the critical extensions to the diffusion model. Detailed derivations and properties of the model are described in the Supplementary file 1.
Discretizing time into small steps of size Δ allows us to describe the particle trajectory x(t) in a DM by a random walk, $x\left(t\right)={\displaystyle {\sum}_{n\in 1:t}\delta {x}_{n}}$, where each of the steps δx_{n} ∼ N(ksin(h)Δ, Δ), called the momentary evidence, is normally distributed with mean ksin(h)Δ and variance Δ (1:t denotes the set of all steps up to time t). This representation is exact in the sense that it recovers the diffusion model, $\dot{x}=k\mathrm{sin}\left(h\right)+\eta \left(t\right)$, in the limit of Δ→0.
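For the standard diffusion model, the posterior probability of sin(h) after observing the stimulus for t seconds, and under the assumption of a uniform prior, is given by Bayes rule
$p\left(\mathrm{sin}\left(h\right)|\delta {x}_{1:t}\right)\propto {\prod}_{n\in 1:t}N\left(\delta {x}_{n}|k\mathrm{sin}\left(h\right)\Delta ,\Delta \right)\propto N\left(\mathrm{sin}\left(h\right)|\frac{x\left(t\right)}{kt},\frac{1}{{k}^{2}t}\right),$
where δx_{1:t} is the momentary evidence up to time t. From this we can derive the belief that heading is rightward, resulting in
$p\left(\mathrm{sin}\left(h\right)>0|\delta {x}_{1:t}\right)=\Phi \left(\frac{x\left(t\right)}{\sqrt{t}}\right),$
where $\Phi (\cdot)$ denotes the standard cumulative Gaussian function. This shows that both the posterior of the actual heading angle, as well as the belief about ‘rightward’ being the correct choice, only depend on x(t) rather than the whole trajectory δx_{1:t}.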
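The discretized random walk and the resulting belief can be sketched as follows. For Gaussian momentary evidence with mean ksin(h)Δ and variance Δ and a uniform prior on sin(h), the posterior over sin(h) is Gaussian with mean x(t)/(kt) and variance 1/(k²t), so the belief in a rightward heading reduces to Φ(x(t)/√t). The sketch omits decision bounds, lapses, and biases, and the parameter values are arbitrary.

```python
import math
import random

def simulate_dm(k, heading_deg, t_total, dt, rng):
    """Accumulate momentary evidence dx_n ~ N(k*sin(h)*dt, dt) up to t_total."""
    mean = k * math.sin(math.radians(heading_deg)) * dt
    sd = math.sqrt(dt)
    x = 0.0
    for _ in range(round(t_total / dt)):
        x += rng.gauss(mean, sd)
    return x

def belief_rightward(x, t):
    """Posterior probability that sin(h) > 0, i.e. Phi(x / sqrt(t))."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0 * t)))

rng = random.Random(0)           # fixed seed for reproducibility
k, heading, t_total = 5.0, 16.0, 1.0   # arbitrary illustrative values

x = simulate_dm(k, heading, t_total, 0.01, rng)
print(x, belief_rightward(x, t_total))
```

Only the accumulated value x and the elapsed time enter `belief_rightward`, which mirrors the statement that the trajectory itself carries no additional information.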
The above formulation assumes that evidence is constant over time, which is not the case for our stimuli. Considering the visual cue and assuming that its associated sensitivity varies with velocity v(t), the momentary evidence $\delta {x}_{vis,n}\sim N\left({v}_{n}{k}_{vis}\left(c\right)\mathrm{sin}\left(h\right)\Delta ,\Delta \right)$ is Gaussian with mean v_{n}k_{vis}(c)sin(h)Δ, where v_{n} is the velocity at time step n, and variance Δ. Using Bayes rule again to find the posterior of sin(h), it is easy to show that x_{vis}(t) is no longer sufficient to determine the posterior distribution. Rather, we need to perform a velocity-weighted accumulation, ${X}_{vis}\left(t\right)={\displaystyle {\sum}_{n\in 1:t}{v}_{n}\delta {x}_{vis,n}}$ to replace x_{vis}(t), and replace time t with $V\left(t\right)={\displaystyle {\sum}_{n\in 1:t}{v}_{n}^{2}\Delta}$, resulting in the following expression for the posterior
$p\left(\mathrm{sin}\left(h\right)|\delta {x}_{vis,1:t}\right)\propto N\left(\mathrm{sin}\left(h\right)|\frac{{X}_{vis}\left(t\right)}{{k}_{vis}\left(c\right)V\left(t\right)},\frac{1}{{k}_{vis}^{2}\left(c\right)V\left(t\right)}\right).$
Consequently, the belief about ‘rightward’ being correct can also be fully expressed by X_{vis}(t) and V(t). This shows that optimal accumulation of evidence with a single-particle diffusion model with time-varying evidence sensitivity requires the momentary evidence to be weighted by its momentary sensitivity. A similar formulation holds for the posterior over heading based on the vestibular cue; however, the vestibular cue is assumed to be weighted by the temporal profile of stimulus acceleration, instead of velocity.
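The weighted accumulation can be sketched for the visual cue as follows. Expanding the Gaussian likelihood of the momentary evidence with mean v_{n}k sin(h)Δ and variance Δ gives a Gaussian posterior over sin(h) with mean X(t)/(kV(t)) and variance 1/(k²V(t)), so the belief in a rightward heading becomes Φ(X(t)/√V(t)). The function names and the three-step velocity profile are hypothetical, and the example uses noise-free evidence so that the result can be checked by hand.

```python
import math

def weighted_accumulate(dx, v, dt):
    """Return X(t) = sum_n v_n * dx_n and V(t) = sum_n v_n^2 * dt."""
    X = sum(v_n * dx_n for v_n, dx_n in zip(v, dx))
    V = sum(v_n ** 2 * dt for v_n in v)
    return X, V

def posterior_sin_h(X, V, k):
    """Mean and variance of the Gaussian posterior over sin(h)."""
    return X / (k * V), 1.0 / (k ** 2 * V)

def belief_rightward(X, V):
    """Posterior probability that sin(h) > 0, i.e. Phi(X / sqrt(V))."""
    return 0.5 * (1.0 + math.erf(X / math.sqrt(2.0 * V)))

# Hypothetical velocity profile and noise-free momentary evidence.
k, sin_h, dt = 2.0, 0.3, 0.01
v = [0.1, 0.4, 0.1]
dx = [v_n * k * sin_h * dt for v_n in v]

X, V = weighted_accumulate(dx, v, dt)
mean, var = posterior_sin_h(X, V, k)
print(mean, var, belief_rightward(X, V))
```

With a constant profile v_{n} = 1, X(t) and V(t) reduce to x(t) and t, which recovers the standard diffusion model. For the vestibular cue, the same functions would be called with the acceleration profile in place of `v`.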
When combining multiple cues into a single DM, ${\dot{X}}_{comb}=d\left(t\right)\left(d\left(t\right){k}_{comb}\mathrm{sin}\left(h\right)+{\eta}_{comb}\left(t\right)\right)$, we aim to find expressions for k_{comb} and d(t) that keep the posterior over sin(h) unchanged, that is
$p\left(\mathrm{sin}\left(h\right)|\delta {x}_{comb,1:t}\right)=p\left(\mathrm{sin}\left(h\right)|\delta {x}_{vis,1:t},\delta {x}_{vest,1:t}\right).$
Here, δx_{comb,1:t} is the sequence of momentary evidence in the combined condition, following $\delta {x}_{comb,n}\sim N\left({d}_{n}{k}_{comb}\left(c\right)\mathrm{sin}\left(h\right)\Delta ,\Delta \right)$. Expanding the probabilities reveals the equality to hold if the combined sensitivity is given by ${k}_{comb}^{2}\left(c\right)={k}_{vis}^{2}\left(c\right)+{{k}_{vest}}^{2}$, and d(t) is expressed by Equation 3, leading to Equation 1 for optimally combining the momentary evidence (see Supplementary file 1 for derivation).
Model fitting
The model used to fit the behavioral data is described in the main text. We never averaged data across subjects as they feature qualitatively different behavior, due to different speed-accuracy tradeoffs. Furthermore, for subjects performing both the three-coherence and the six-coherence version of the experiment, we treated each version as a separate data set. For each modality/coherence combination (7 combinations for 3 coherences, 13 combinations for 6 coherences) we fitted one bias parameter that prevents behavioral biases from influencing model fits. The fact that performance of subjects often fails to reach 100% correct even for the highest coherences and largest heading angles was modeled by a lapse rate, which describes the frequency with which the subject makes a random choice rather than one based on accumulated evidence. This lapse rate was assumed to be independent of stimulus modality or coherence, and so a single lapse rate parameter is shared among all modality/coherence combinations.
All model fits sought to find the model parameters φ that maximize the likelihood of the observed choices and reaction times for each dataset. As in Palmer et al. (2005), we have assumed the likelihood of the choices to follow a binomial distribution, and the reaction times of correct and incorrect choices to follow different Gaussian distributions centered on the empirical means and spread according to the standard error. Model predictions for choice fractions and reaction times for correct and incorrect choices were computed from the solution to integral equations describing first-passage times of bounded diffusion processes (Smith, 2000). See Supplementary file 1 for the exact form of the likelihood function that was used.
To avoid getting trapped in local maxima of this likelihood, we utilized a three-step maximization procedure. First, we found a (possibly local) maximum by pseudo-gradient ascent on the likelihood function. Starting from this maximum, we used a Markov Chain Monte Carlo procedure to draw 44,000 samples from the parameter posterior under the assumption of a uniform, bounded prior. After this, we used the highest-likelihood sample, which is expected to be close to the mode of this posterior, as a starting point to find the posterior mode by pseudo-gradient ascent. The resulting parameter vector is taken as the maximum-likelihood estimate. All pseudo-gradient ascent maximizations were performed with the Optimization Toolbox of Matlab R2013a (Mathworks), using stringent stopping criteria (TolFun = TolX = 10^{−20}) to prevent premature convergence.
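The logic of the three-step procedure can be illustrated on a toy problem. The sketch below is not the implementation used for the fits, which relied on the Matlab Optimization Toolbox. It assumes a one-dimensional, bimodal toy log-likelihood, uses a finite-difference gradient ascent as a stand-in for pseudo-gradient ascent, and uses a random-walk Metropolis sampler as a stand-in for the unspecified Markov Chain Monte Carlo procedure. All names, step sizes, and sample counts are hypothetical.

```python
import math
import random

def log_lik(theta):
    """Toy bimodal log-likelihood: local mode near -2, global mode near +2."""
    left = 0.3 * math.exp(-0.5 * ((theta + 2.0) / 0.5) ** 2)
    right = math.exp(-0.5 * ((theta - 2.0) / 0.5) ** 2)
    return math.log(left + right)

def gradient_ascent(f, theta, rate=0.05, n_iter=200, h=1e-5):
    """Ascent on a finite-difference gradient; finds the nearest maximum only."""
    for _ in range(n_iter):
        grad = (f(theta + h) - f(theta - h)) / (2.0 * h)
        theta += rate * grad
    return theta

def metropolis_best(f, start, n_samples, step_sd, lo, hi, rng):
    """Random-walk Metropolis under a uniform prior on [lo, hi];
    returns the highest-likelihood sample visited."""
    theta, ll = start, f(start)
    best_theta, best_ll = theta, ll
    for _ in range(n_samples):
        prop = theta + rng.gauss(0.0, step_sd)
        if lo <= prop <= hi:                      # zero prior mass outside the bounds
            ll_prop = f(prop)
            if rng.random() < math.exp(min(0.0, ll_prop - ll)):
                theta, ll = prop, ll_prop
                if ll > best_ll:
                    best_theta, best_ll = theta, ll
    return best_theta

rng = random.Random(1)
theta_local = gradient_ascent(log_lik, -3.0)                    # step 1: may be a local maximum
theta_best = metropolis_best(log_lik, theta_local, 5000, 2.0,   # step 2: explore the posterior
                             -5.0, 5.0, rng)
theta_hat = gradient_ascent(log_lik, theta_best)                # step 3: refine from best sample

print(theta_local, theta_hat)
```

In this toy case the first ascent stops at the local mode, whereas the sampler visits the higher mode and the final ascent converges to it, which is the failure mode the three-step procedure is designed to avoid.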
References

Bayesian integration of visual and auditory signals for spatial localization. Journal of the Optical Society of America A, Optics, Image Science, and Vision 20:1391–1397. https://doi.org/10.1364/JOSAA.20.001391

Area MST and heading perception in macaque monkeys. Cerebral Cortex 12:692–701. https://doi.org/10.1093/cercor/12.7.692

Representation of vestibular and visual cues to self-motion in ventral intraparietal cortex. The Journal of Neuroscience 31:12036–12052. https://doi.org/10.1523/JNEUROSCI.039511.2011

A two-stage model for visual-auditory interaction in saccadic latencies. Perception & Psychophysics 63:126–147. https://doi.org/10.3758/BF03200508

Auditory-visual interactions subserving goal-directed saccades in a complex scene. Journal of Neurophysiology 88:438–454.

Physiology of peripheral neurons innervating otolith organs of the squirrel monkey. III. Response dynamics. Journal of Neurophysiology 39:996–1008.

Neural correlates of reliability-based cue weighting during multisensory integration. Nature Neuroscience 15:146–154. https://doi.org/10.1038/nn.2983

Spatiotemporal properties of vestibular responses in area MSTd. Journal of Neurophysiology 104:1506–1522. https://doi.org/10.1152/jn.91247.2008

Dynamic reweighting of visual and vestibular cues during self-motion perception. The Journal of Neuroscience 29:15601–15612. https://doi.org/10.1523/JNEUROSCI.257409.2009

Toward evidence-based medical statistics. 2: the Bayes factor. Annals of Internal Medicine 130:1005–1013. https://doi.org/10.7326/000348191301219990615000019

Decoding the activity of neuronal populations in macaque primary visual cortex. Nature Neuroscience 14:239–245. https://doi.org/10.1038/nn.2733

Combination rule for redundant information in reaction time tasks with divided attention. Perception & Psychophysics 35:451–463. https://doi.org/10.3758/BF03203922

A functional link between area MSTd and heading perception based on vestibular signals. Nature Neuroscience 10:1038–1047. https://doi.org/10.1038/nn1935

Neural correlates of multisensory cue integration in macaque MSTd. Nature Neuroscience 11:1201–1210. https://doi.org/10.1038/nn.2191

Causal links between dorsal medial superior temporal area neurons and multisensory heading perception. The Journal of Neuroscience 32:2299–2313. https://doi.org/10.1523/JNEUROSCI.515411.2012

Linear responses to stochastic motion signals in area MST. Journal of Neurophysiology 98:1115–1124. https://doi.org/10.1152/jn.00083.2007

Bounded integration in parietal cortex underlies decisions even when viewing duration is dictated by the environment. The Journal of Neuroscience 28:3017–3029. https://doi.org/10.1523/JNEUROSCI.476107.2008

Visual motion analysis for pursuit eye movements in area MT of macaque monkeys. The Journal of Neuroscience 19:2224–2246.

Bayesian inference with probabilistic population codes. Nature Neuroscience 9:1432–1438. https://doi.org/10.1038/nn1790

A role for neural integrators in perceptual decision making. Cerebral Cortex 13:1257–1269. https://doi.org/10.1093/cercor/bhg097

Divided attention: evidence for coactivation with redundant signals. Cognitive Psychology 14:247–279. https://doi.org/10.1016/00100285(82)90010X

Near optimal multisensory integration with nonlinear probabilistic population codes using divisive normalization. The Society for Neuroscience annual meeting 2012.

Noise and correlations in parallel perceptual decision making. Current Biology 22:1391–1396. https://doi.org/10.1016/j.cub.2012.05.031

Probability, random variables, and stochastic processes. New York, London: McGraw-Hill.

Comparing acceleration and speed tuning in macaque MT: physiology and modeling. Journal of Neurophysiology 94:3451–3464. https://doi.org/10.1152/jn.00564.2005

Statistical facilitation of simple reaction times. Transactions of the New York Academy of Sciences 24:574–590. https://doi.org/10.1111/j.21640947.1962.tb01433.x

Theory of memory retrieval. Psychological Review 85:59–108. https://doi.org/10.1037/0033295X.85.2.59

A comparison of sequential sampling models for two-choice reaction time. Psychological Review 111:333–367. https://doi.org/10.1037/0033295X.111.2.333

Multisensory space representations in the macaque ventral intraparietal area. The Journal of Neuroscience 25:4616–4625. https://doi.org/10.1523/JNEUROSCI.045505.2005

Stochastic dynamic models of response time and accuracy: a foundational primer. Journal of Mathematical Psychology 44:408–463. https://doi.org/10.1006/jmps.1999.1260

Perceptual decision making in less than 30 milliseconds. Nature Neuroscience 13:379–385. https://doi.org/10.1038/nn.2485

Bayesian model selection for group studies. NeuroImage 46:1004–1017. https://doi.org/10.1016/j.neuroimage.2009.03.025

How humans combine simultaneous proprioceptive and visual position information. Experimental Brain Research 111:253–261. https://doi.org/10.1016/S00796123(08)604136

Combined auditory and visual stimuli facilitate head saccades in the barn owl (Tyto alba). Journal of Neurophysiology 96:730–745. https://doi.org/10.1152/jn.00072.2006

The psychometric function: II. Bootstrap-based confidence intervals and sampling. Perception & Psychophysics 63:1314–1329. https://doi.org/10.3758/BF03194545
Decision letter

Eve Marder, Reviewing Editor; Brandeis University, United States
eLife posts the editorial decision letter and author response on a selection of the published articles (subject to the approval of the authors). An edited version of the letter sent to the authors after peer review is shown, indicating the substantive concerns or comments; minor concerns are not usually shown. Reviewers have the opportunity to discuss the decision before the letter is sent (see review process). Similarly, the author response typically shows only responses to the major concerns raised by the reviewers.
Thank you for sending your work entitled “Optimal multisensory decision-making in a reaction-time task” for consideration at eLife. Your article has been favorably evaluated by Eve Marder (Senior editor) and 2 reviewers, one of whom, Emilio Salinas, has agreed to reveal his identity.
The Senior editor and the two reviewers discussed their comments before we reached this decision, and the Senior editor has assembled the following comments to help you prepare a revised submission.
The authors carry out a detailed theoretical analysis of a vestibular-visual cue integration task, in which subjects can make a response at any time after the stimulus comes on. Unlike in tasks that have fixed information delivery times, the behavioral thresholds in the combined condition of the present study are not better than both of the individual thresholds. The reason for this is that subjects terminate evidence accumulation more quickly in the combined condition. The authors develop a model which incorporates time-varying evidence across both cues up to the reaction time (minus baseline stimulus-response processing) and they show that this model accurately characterizes reaction times and accuracy. They also show that the subjects approximately optimally integrate evidence.
This study represents both theoretical and empirical advances. The behavior has been carefully carried out, the data analysis is detailed and thorough, and the modeling provides an important insight into the behavioral process. I think both the experimental data and the modeling insights are quite compelling and novel. On one hand, multisensory experiments have become quite popular, and performance improvements have been amply documented both in terms of reaction times and of response accuracy. But in retrospect it seems rather surprising that multisensory enhancement has not been studied for the more natural, simultaneous condition in which both time and accuracy are in play. This work not only fills in this gap, but also presents results that may seem quite paradoxical when RT and accuracy are analyzed separately from each other. Surprisingly, the combined condition does not produce better (i.e., more accurate) performance, as one may have thought based on previous results, but mostly faster performance.
The study also presents fits of the experimental data to a generalized version of the diffusion model that works with two independent streams of sensory evidence. The model may not be the ultimate one – it is rather abstract and provides little mechanistic intuition about the underlying neuronal coding schemes and circuit interactions – but it does serve its purpose at this point, which is to provide a quantitative statistical benchmark for measuring the effectiveness of those underlying neural interactions, as well as a framework for testing and generating hypotheses. The generalization to two evidence streams, rather than one, and to a time-dependent reliability of the sensory evidence is a clever and useful theoretical advance, and it describes the data quite well.
Minor comments:
1) How were the degrees of freedom calculated for the BIC? Was the model probability calculated by computing a BIC for each subject, and then summing these across subjects? Another approach that is used in functional imaging is to compute exceedance probabilities. Model evidence across multiple subjects inflates degrees of freedom, and exceedance probabilities have been developed to deal with that problem. This is similar to fixed effects vs. mixed effects (or hierarchical) models for analyzing behavioral data across multiple subjects.
2) Within the Results section there is a paragraph about how to set the noise terms in the model, but the reader finds that out several lines ahead. This would be easier to follow if an introductory sentence were added along the lines of “The noise terms eta_vis and eta_vest play crucial roles in the model, as they relate to the reliability of the momentary sensory evidence. To specify the manner in which such noise may depend on motion coherence, we relied on fundamental assumptions about how optic flow stimuli are represented by the brain...”
https://doi.org/10.7554/eLife.03005.017
Author response
1) How were the degrees of freedom calculated for the BIC? Was the model probability calculated by computing a BIC for each subject, and then summing these across subjects? Another approach that is used in functional imaging is to compute exceedance probabilities. Model evidence across multiple subjects inflates degrees of freedom, and exceedance probabilities have been developed to deal with that problem. This is similar to fixed effects vs. mixed effects (or hierarchical) models for analyzing behavioral data across multiple subjects.
As we fitted the model parameters for each subject separately, the BIC was computed for each subject separately and then summed. The details of this procedure are described in the Data Analysis subsection in Methods. All but one model discussed in the main text have the same number of parameters, such that other approaches to taking the number of model parameters into account would have led to the same result.
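The per-subject summation can be sketched as follows, using the standard definition of the Bayesian information criterion. The exact procedure is the one given in the Data Analysis subsection; the function names and the tuple layout below are hypothetical.

```python
import math


def bic(log_likelihood, n_params, n_observations):
    """Bayesian information criterion; lower values indicate a better model."""
    return -2.0 * log_likelihood + n_params * math.log(n_observations)


def summed_bic(per_subject_fits):
    """Sum per-subject BICs.

    Each fit is a tuple (max_log_likelihood, n_params, n_observations).
    """
    return sum(bic(ll, k, n) for ll, k, n in per_subject_fits)
```

When two models have the same number of parameters and are fitted to the same data, the penalty term cancels in the BIC difference, so the comparison reduces to a comparison of maximized log-likelihoods.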
As suggested by the reviewers, we have added a random-effects Bayesian model comparison to Figure 7–figure supplement 1 (panels b and c). The results of this analysis are consistent with our previous BIC analysis, adding strength to the conclusions. We thank the reviewers for this good suggestion.
2) Within the Results section there is a paragraph about how to set the noise terms in the model, but the reader finds that out several lines ahead. This would be easier to follow if an introductory sentence were added along the lines of “The noise terms eta_vis and eta_vest play crucial roles in the model, as they relate to the reliability of the momentary sensory evidence. To specify the manner in which such noise may depend on motion coherence, we relied on fundamental assumptions about how optic flow stimuli are represented by the brain...”
Thank you for this suggestion. We have modified the beginning of this paragraph as suggested.
https://doi.org/10.7554/eLife.03005.018
Article and author information
Funding
National Institutes of Health (R01 DC007620)
 Dora E Angelaki
National Institutes of Health (R01 EY016178)
 Gregory C DeAngelis
National Science Foundation (BCS0446730)
 Alexandre Pouget
U.S. Army Research Laboratory (Multidisciplinary University Research Initiative, N000140710937)
 Alexandre Pouget
Air Force Office of Scientific Research (FA95501010336)
 Alexandre Pouget
James S. McDonnell Foundation
 Alexandre Pouget
The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.
Ethics
Human subjects: Informed consent was obtained from all participants and all procedures were reviewed and approved by the Washington University Office of Human Research Protections (OHRP), Institutional Review Board (IRB; IRB ID# 201109183). Consent to publish was not obtained in writing, as it was not required by the IRB, but all subjects were recruited for this purpose and approved verbally. Of the initial seven subjects, three participated in a follow-up experiment roughly 2 years after the initial data collection. Procedures for the follow-up experiment were approved by the Institutional Review Board for Human Subject Research for Baylor College of Medicine and Affiliated Hospitals (BCM IRB, ID# H29411), and informed consent and consent to publish were given again by all three subjects.
Reviewing Editor
 Eve Marder, Brandeis University, United States
Publication history
 Received: April 4, 2014
 Accepted: June 12, 2014
 Accepted Manuscript published: June 14, 2014 (version 1)
 Version of Record published: July 22, 2014 (version 2)
Copyright
© 2014, Drugowitsch et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.