
Time preferences are reliable across time-horizons and verbal versus experiential tasks

  1. Evgeniya Lukinova
  2. Yuyue Wang
  3. Steven F Lehrer
  4. Jeffrey C Erlich (corresponding author)
  1. NYU-ECNU Institute of Brain and Cognitive Science at NYU Shanghai, China
  2. NYU Shanghai, China
  3. Queen’s University, Canada
  4. The National Bureau of Economic Research, United States
  5. East China Normal University, China
Cite this article as: eLife 2019;8:e39656 doi: 10.7554/eLife.39656

Abstract

Individual differences in delay-discounting correlate with important real world outcomes, for example education, income, drug use, and criminality. As such, delay-discounting has been extensively studied by economists, psychologists and neuroscientists to reveal its behavioral and biological mechanisms in both human and non-human animal models. However, two major methodological differences hinder comparing results across species. Human studies present long time-horizon options verbally, whereas animal studies employ experiential cues and short delays. To bridge these divides, we developed a novel language-free experiential task inspired by animal decision-making studies. We found that the ranks of subjects’ time-preferences were reliable across both verbal/experiential and second/day differences. Yet, discount factors scaled dramatically across the tasks, indicating a strong effect of temporal context. Taken together, this indicates that individuals have a stable, but context-dependent, time-preference that can be reliably assessed using different methods, providing a foundation to bridge studies of time-preferences across species.

Editorial note: This article has been through an editorial process in which the authors decide how to respond to the issues raised during peer review. The Reviewing Editor's assessment is that all the issues have been addressed (see decision letter).

https://doi.org/10.7554/eLife.39656.001

Introduction

Intertemporal choices involve a trade-off between a larger outcome received later and a smaller outcome received sooner. Many individual decisions have this temporal structure, such as whether to purchase a cheaper refrigerator, but forgo the ongoing energy savings. Since research has found that intertemporal preferences are predictive of a wide variety of important life outcomes, ranging from SAT scores, graduating from college, and income to anti-social behaviors, for example gambling or drug abuse (Frederick et al., 2002; Madden and Bickel, 2010; Casey et al., 2011; Golsteyn et al., 2014; Åkerlund et al., 2016), they are frequently studied in both humans and animals across multiple disciplines, including marketing, economics, psychology, psychiatry, and neuroscience.

A potential obstacle to understanding the biological basis of intertemporal decision-making is that human studies differ from non-human animal studies in two important ways: long versus short time-horizons and choices that are made based on verbal versus non-verbal (i.e. ‘experiential’) stimuli. In animal studies, subjects experience the delay between their choice and the reward (sometimes cued with a ramping sound or a diminishing visual stimulus) before they can proceed to the next trial (Cai et al., 2011; Blanchard et al., 2013; Tedford et al., 2015). Generally, there is nothing for the subject to do during this waiting period. In human studies, subjects usually make a series of choices (either via computer or a survey, often hypothetical) between smaller sooner and larger offers delayed by months or years (McClure et al., 2004; Andersen et al., 2014). (We are aware of only a handful of studies that have used delays of minutes (McClure et al., 2007) or seconds (Lane et al., 2003; Gregorios-Pippas et al., 2009; Prevost et al., 2010; Tanaka et al., 2014; Fung et al., 2017)). During the delay (e.g. if the payout is in 6 months) the human subjects go about their lives, likely forgetting about the delayed payment, just as individuals do not actively think about their retirement savings account each moment until their retirement.

Animal studies of delay-discounting take several forms (Dalley et al., 2011; Redish et al., 2008; Cai et al., 2011; Wikenheiser et al., 2013), but all require experiential learning that some non-verbal cue is associated with waiting. Subjects experience the cues, delays and rewards, and slowly build an internal map from the cues to the delays and magnitudes. Subjects may have only implicit knowledge of the map, which likely engages neural substrates distinct from those engaged by the explicit processes humans use when considering a verbal offer (Reber et al., 2003; Poldrack et al., 2001).

Whether animal studies can inform human studies depends on answers to the following questions. Do decisions that involve actively waiting for seconds invoke the same cognitive and neural processes as decisions requiring passively waiting for months? Do decisions made based on experience and perceptual decisions invoke the same cognitive and neural processes as decisions that are made based on explicitly written information?

The animal neuroscience literature on delay-discounting mostly accepts as a given that the behavior of animals will give insight into the biological basis for human impulsivity (Fineberg et al., 2010; Huang et al., 2015; Schoenbaum et al., 2009; Robison and Nestler, 2011) and rarely (Blanchard et al., 2013; Rosati et al., 2007; Vanderveldt et al., 2016) addresses the methodological gaps considered here. This view is not unfounded. Neural recordings from animals (Cai et al., 2011) and brain imaging studies in humans (McClure et al., 2004; Kable and Glimcher, 2007) both find that the prefrontal cortex and basal ganglia are involved in delay-discounting decisions, suggesting common neural mechanisms. Animal models of attention-deficit hyperactive disorder (ADHD) have reasonable construct validity: drugs that shift animal behavior in delay-discounting tasks can also improve the symptoms of ADHD in humans (Paterson et al., 2012; Fineberg et al., 2010). Thus, most neuroscientists would likely predict that our experiments would find high within-subject reliability across both time-horizons and verbal/experiential dimensions.

Reading the literature from economics, a different picture emerges. Traditional economic models (e.g. Samuelson, 1947) posit that agents make consistent intertemporal decisions, thereby implying a constant discount rate regardless of context. In contrast, growing evidence from behavioral economics provides support for the view that discounting over a given time delay changes with the time-horizon (Berns et al., 2007; Andreoni et al., 2015). Among human studies comparing short and long time-horizons only a few are within subject and incentivized, leaving this matter unresolved (Paglieri, 2013; Johnson et al., 2015; Vanderveldt et al., 2016; Horan et al., 2017). Yet, there remains debate in the empirical economics literature about how well discounting measures elicited in human studies truly reflect the rates of time-preference used in real-world decisions since measured discount rates have been found to vary by the type of task (hypothetical, potentially real, and real), stakes being compared, age of participants and across different domains (Chapman and Elstein, 1995). Thus, most economists surveying the empirical evidence would be surprised if a design that varied both type of tasks and horizons would generate results with high within-subject reliability.

Here, we have addressed these questions by measuring the discount factors of human subjects in three ways. First, we used a novel language-free task involving experiential learning with short delays. To our knowledge, this is the first time the time-preferences of human subjects have been measured in this way (Vanderveldt et al., 2016). Then, we measured discount factors more traditionally, with verbal offers over both short and long delays. This design allowed us to test whether, for each subject, a single process is used for intertemporal choice regardless of time-horizon or verbal vs. experiential stimuli, or whether the choices in different tasks could be better explained by distinct underlying mechanisms.

Results

In our main experiment, 63 undergraduate students from NYU Shanghai participated in five experimental sessions. In each session, subjects completed a series of intertemporal choices. Across sessions, at least 160 trials in each task were conducted after learning (Materials and methods, Figure 1—figure supplement 1). In each trial, irrespective of the task, subjects made a decision between the sooner (blue circle) and the later (yellow circle) options. In the non-verbal task (Figure 1A), the parameters of the later option were mapped to an amplitude modulated pure tone. The reward magnitude was mapped to the frequency of the tone (larger reward, higher frequency). The delay was mapped to the amplitude modulation rate (longer delay, slower modulation). Across trials, the delay and the magnitude of the sooner option were fixed (4 coins, immediately), while later options were drawn from all possible pairs of 5 magnitudes and 5 delays (25 different offers, Materials and methods). For the short delay tasks, when subjects chose the later option, a clock appeared on the screen, and only when the clock image disappeared could they collect their reward by clicking in the reward port. After clicking the reward port, the chosen number of coins appeared at the reward port and then a ‘dropping coins’ sound was played as the coins were added to a stack of coins on the right side of the screen that accumulated over the session. This stack gave subjects a visual indication of the total amount of rewards they had earned in the session. At the end of the session, the coins were converted to RMB as payment to the subject.

Figure 1. Behavioral Tasks.

(A) A novel language-free intertemporal choice task. This is an example sequence of screens that subjects viewed in one trial of the non-verbal task. First, the subject initiates the trial by pressing on the white-bordered circle. During fixation, the subject must keep the cursor inside the white circle. The subject hears an amplitude modulated pure tone (the tone frequency is mapped to reward magnitude and the modulation rate is mapped to the delay of the later option). The subject next makes a decision between the sooner (blue circle) and later (yellow circle) options. If the later option is chosen, the subject waits until the delay time finishes, which is indicated by the colored portion of the clock image. Finally, the subject clicks in the middle bottom circle (‘reward port’) to retrieve their reward. The reward is presented as a stack of coins of a specific size and a coin drop sound accompanies the presentation. (B) Stimuli examples in the verbal experiment during decision stage (the bottom row of circles is cropped). (C) Timeline of experimental sessions. Note: The order of short and long delay verbal tasks for sessions 4 and 5 was counter-balanced across subjects.

https://doi.org/10.7554/eLife.39656.002

In the verbal tasks, the verbal description of the offers appeared within the blue and yellow circles in place of the amplitude modulated sound (Figure 1B). In the verbal long delay task, after each choice, subjects were given feedback confirming their choice (e.g. "Your choice: 8 coins in 30 days") and then proceeded to the next trial. Unlike the short tasks, there was no sound of dropping coins nor visual display of coins. At the end of the session, a single long-verbal trial was selected randomly to determine the payment (e.g. a subject was notified that "Trial 10 from session one was randomly chosen to pay you. Your choice in that trial was 8 coins in 30 days"). If the selected trial corresponded to a subject having chosen the later option, she received her reward via an electronic transfer after the delay (e.g. in 30 days).

Subjects’ time-preferences are reliable across both verbal/experiential and second/day differences

Subjects’ impulsivity was estimated by fitting their choices with a Bayesian hierarchical model (BHM) of hyperbolic discounting with decision noise. The model had six population level parameters (log discount factor, $\log(k)$, and decision noise, $\tau$, for each of the three tasks, also known as fixed effects) and four parameters per subject: $\log(k_{NV})$, $\log(k_{SV})$, $\log(k_{LV})$ and $\tau$. We used this model to fit 32,707 choices across 63 subjects in the three tasks. We use the natural log of $k$, $\log(k)$, and not $k$ as a model parameter because we found that $k$ is approximately log-normally distributed over our subjects (as in Sanchez-Roige et al., 2018). The subject level effects are drawn from a normal distribution with mean zero. In other words, the subject level effects reflect the difference of each subject relative to the mean across subjects. As such, the actual discount factor for the $n$th subject in the SV task is $k_{n,SV} = e^{\log(\hat{k}_{SV}) + \log(\dot{k}_{n,SV})} = \hat{k}_{SV}\,\dot{k}_{n,SV}$, where $\log(\hat{k}_{SV})$ represents the population level log discount factor for SV and $\log(\dot{k}_{n,SV})$ represents the subject level effect for subject $n$ in SV. For the sake of brevity, we refer to ‘log discount factor’ as ‘discount factor’ throughout the text.
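The per-trial choice rule implied by this description can be sketched as follows. This is only a minimal illustration, assuming the standard hyperbolic form V = reward / (1 + k × delay) and a logistic (softmax) choice rule with noise τ; the function names and the exact parameterization of the noise term are assumptions made here, not the authors' fitted model.

```python
import math

def discounted_value(reward, delay, log_k):
    """Hyperbolic discounting: V = reward / (1 + k * delay), with k = exp(log_k)."""
    return reward / (1.0 + math.exp(log_k) * delay)

def subject_log_k(population_log_k, subject_effect):
    """A subject's log(k) is the population-level log(k) plus a subject-level effect."""
    return population_log_k + subject_effect

def p_choose_later(later_reward, later_delay, sooner_reward, log_k, tau):
    """Probability of choosing the later option under logistic (softmax) noise.

    tau > 0 is the decision noise: larger tau pushes choices toward 50/50.
    The sooner option is treated as immediate (delay = 0), as in the tasks above.
    """
    dv = discounted_value(later_reward, later_delay, log_k) - sooner_reward
    return 1.0 / (1.0 + math.exp(-dv / tau))

# Hypothetical subject: population log(k) of -3.5 plus a subject effect of +0.5,
# offered 8 coins after 30 s versus 4 coins now, with tau = 1 (all values illustrative).
log_k = subject_log_k(-3.5, 0.5)
print(p_choose_later(8, 30, 4, log_k, tau=1.0))
```

Because the subject-level effect is added on the log scale, it acts as a multiplicative factor on k, which matches the decomposition given in the paragraph above.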

The population level parameters reflect the mean over all subjects. For example, if the mean discount factor across subjects was equal in all tasks, then the population level discount factor parameters would also be equal. If all subjects were exactly twice as impulsive in short vs. non-verbal tasks, then that change would be reflected in the population level discount factor ($k_{SV} = 2k_{NV}$, i.e. $\log(k_{SV}) = \log(k_{NV}) + \log(2)$), and the subject level parameters would be the same across tasks. If, on the other hand, impulsive subjects (relative to the mean) became more impulsive, and patient subjects became more patient, that would result in clear changes to subject level parameters, with relatively little change in the population level parameters (assuming the same scaling factor for impulsive and patient subjects).

Subjects’ choices were well-fit by the model (Figure 2, Figure 2—figure supplement 1, Supplementary file 1). Since we did not ex ante have a strong hypothesis about how the subjects’ impulsivity measures in one task would translate across tasks, we fit subjects’ choices in the units of the task (i.e. seconds or days), first examined ranks of impulsivity, and found significant correlations across experimental tasks (Table 1). In other words, the most impulsive subject in one task was likely to be the most impulsive subject in another task. This result is robust to different functional forms of discounting (e.g. hyperbolic vs. exponential) and estimation methods (e.g. Bayesian hierarchical models vs. fitting subjects individually using maximum likelihood estimation vs. model-free) (Figure 2—figure supplement 1, Figure 2—figure supplement 2, Figure 2—figure supplement 3). For example, if we ranked the subjects by the fraction of trials in which they chose the later option in each task, we obtained a similar result (Spearman r: SV vs. NV r=0.71; SV vs. LV r=0.49; NV vs. LV r=0.30, all p<0.05). The correlations of discount factors across tasks extended to the Pearson correlation of $\log(k)$ (Figure 3, Table 1). This indicates that subjects’ preferences are reliable across the verbal/experiential gap and time-horizons.
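The model-free check mentioned above (ranking subjects by the fraction of later choices in each task and correlating the ranks) could be computed along these lines. This is a sketch using only the standard library; the per-subject fractions are made-up illustrative values, not data from the study.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for idx in order[i:j + 1]:
            ranks[idx] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical fractions of later ("yellow") choices for four subjects in two tasks.
frac_later_sv = [0.2, 0.5, 0.9, 0.4]
frac_later_nv = [0.1, 0.6, 0.8, 0.3]
print(spearman(frac_later_sv, frac_later_nv))
```

Working on ranks makes the measure insensitive to any monotonic rescaling of choices between tasks, which is why it is a natural first check when the tasks use different time units.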

Figure 2. A 50% median split (±1 standard deviation) of the softmax-hyperbolic fits.

(A–C) more patient and (D–F) less patient subjects. The values of k and τ are the means within each group. Average psychometric curves obtained from the model fits (lines) versus actual data (circles with error bars) for NV, SV and LV tasks for each delay value, where the x-axis is the reward magnitude and the y-axis is the probability (or proportion for actual choices) of later choice. Error bars are binomial 95% confidence intervals. We excluded the error in the model for visualization. Note: The lines here are not a model fit to aggregate data, but rather reflect the mean model parameters for each group. As such, discrepancies between the model and data here are not diagnostic. See individual subject plots (Supplementary file 1) to visualize the quality of the model fits.

https://doi.org/10.7554/eLife.39656.004
Figure 3. Comparison of discount factors across three tasks in the main experiment.

(A, B) Each circle is one subject (N = 63). The logs of discount factors in SV task (x-axis) plotted against the logs of discount factors in NV (A) and LV (B) tasks (y-axis). The color of the circles and the colorbar identify the ranks of discount factors in NV task. Pearson’s r is reported on the figure (p<0.01 - ‘**’). The error bars are the SD of the estimated coefficients (posterior means). Three lines (Huang et al., 2013) represent the vertical y(x), horizontal x(y) and perpendicular (or total) least squares (TLS) regression lines. (C) Distribution of posterior parameter estimates of $\log(k)$ and decision noise $\tau$ from the model fit for the three tasks in the main experiment (units: $k_{NV}$ in 1/sec, $k_{SV}$ in 1/sec, $k_{LV}$ in 1/day). The light blue shaded area marks the 80% interval of the posterior estimate. The outline of the distribution extends to the 99.99% interval. Thin grey lines are drawn through the mean of each distribution to ease comparison across tasks. Comparisons between tasks are reported in Table 3. Note, the units for $k_{SV}$ and $k_{NV}$ (1/sec) would need to be scaled by 86400 secs/day ($\log(86400)=11.37$) to be directly compared to $k_{LV}$.

https://doi.org/10.7554/eLife.39656.009
Table 1
Correlations of subjects’ discount factors [95% CI].

Corrected rank correlations of subjects’ discount factors were normalized using simulations to estimate the expected maximum correlation we could observe (Figure 3—figure supplement 4). The correlations between each task were significantly different from each other at p<0.05 using various methods as in the R package ‘cocor’ (Diedenhofen and Musch, 2015).

https://doi.org/10.7554/eLife.39656.014
| Comparison | Spearman rank correlation | Corrected rank correlation | Pearson correlation |
|---|---|---|---|
| SV vs. NV | 0.76 [0.61, 0.85] | 0.77 [0.62, 0.87] | 0.79 [0.65, 0.88] |
| SV vs. LV | 0.54 [0.30, 0.73] | 0.57 [0.31, 0.77] | 0.61 [0.41, 0.76] |
| NV vs. LV | 0.36 [0.11, 0.57] | 0.39 [0.12, 0.62] | 0.40 [0.18, 0.60] |

All p<0.01.

Having addressed our initially planned analysis, we continued with analyses to further understand the subjects’ choices within and across the tasks. Consistent with existing research, we found that time-preferences were stable in the same task within subjects between the first half of each reward block and the second half of the block within sessions (time-preferences are measured as % of yellow choices, Wilcoxon signed-rank test, p=0.35; Pearson r=0.81, $p<10^{-9}$) and also across experimental sessions that take place every two weeks: % of yellow choices between NV sessions (Wilcoxon signed-rank test, p=0.47; Pearson r=0.7, $p<10^{-9}$), between SV sessions (Wilcoxon signed-rank test, p=0.66; Pearson r=0.82, $p<10^{-9}$) and a slight difference between LV sessions (Wilcoxon signed-rank test, p<0.1; Pearson r=0.66, $p<10^{-9}$) (Meier and Sprenger, 2015; Augenblick et al., 2015). In our verbal experimental sessions, the short and long tasks were alternated and the order was counter-balanced across subjects. We did not find any order effects in either main (bootstrapped mean test, SV-LV-SV-LV vs. LV-SV-LV-SV for SV and LV $\log(k)$, all p>0.4) or control experiments (NC, bootstrapped mean test, SV-LV-SV-LV vs. LV-SV-LV-SV for SV and LV $\log(k)$, all p>0.6; DW, bootstrapped mean test, DV-WV-DV-WV vs. WV-DV-WV-DV for DV and WV $\log(k)$, all p>0.2).

In addition to the reliability of subjects’ choices, other aspects of their behavior were also consistent. We examined the total time it took subjects to finish each session. This time includes waiting time (i.e. the chosen delays in the short task) and also non-waiting time (i.e. intertrial intervals and subject reaction times). The total time taken did not change significantly across sessions (bootstrapped mean tests: between NV session 2 and 3, p=0.55; between verbal sessions 1 and 2, p=0.08). By definition, the waiting time is correlated with log(k). But we also found that for the short sessions non-waiting time (and total-time) were correlated with log(k) and also the fraction of total reward earned (relative to a subject that always picked the larger offer regardless of time; Figure 3—figure supplement 1). This suggests that impulsive subjects not only express their impatience in their choices of a sooner option, but also make their choices faster.

In our experimental design, the SV task shares features with both the NV and LV tasks. First, the SV task shares its time-horizon with the NV task. Second, the SV and LV tasks are both verbal and were undertaken at the same time, always following the NV task. The NV and LV tasks differ in both time-horizon and verbal/non-verbal presentation. The central feature that is shared between all tasks is delay-discounting. To test whether the correlation between NV and LV might be accounted for by their shared correlation with the SV task, we performed linear regressions of the discount factors in each task as a function of the other tasks (e.g. $\log(k_{NV}) = \beta_{SV}\log(k_{SV}) + \beta_{LV}\log(k_{LV}) + \beta_0 + \epsilon$). For NV the two predictors explained 63% of the variance ($F(60,2)=50.63$, $p<10^{-9}$). We found that $\log(k_{SV})$ significantly predicted $\log(k_{NV})$ ($\beta_{SV}=1.28\pm0.15$, $p<10^{-9}$) but $\log(k_{LV})$ did not ($\beta_{LV}=-0.12\pm0.09$, $p=0.18$). For LV we were able to predict 40% of the variance ($F(60,2)=19.64$, $p<10^{-6}$) and found that $\log(k_{SV})$ significantly predicted $\log(k_{LV})$ ($\beta_{SV}=1.26\pm0.26$, $p<10^{-5}$) but $\log(k_{NV})$ did not ($\beta_{NV}=-0.24\pm0.18$, $p=0.18$). For SV the two predictors explained 72% of the variance ($F(60,2)=78.93$, $p<10^{-9}$). Coefficients for both predictors were significant ($\beta_{NV}=0.44\pm0.05$, $p<10^{-9}$; $\beta_{LV}=0.22\pm0.05$, $p<10^{-5}$), where $\beta$ is reported as mean ± standard error.
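The regressions described above have the form "one task's log(k) as a linear function of the other two". A minimal ordinary least squares sketch of that form is given below, assuming nothing about the authors' software; the helper names are hypothetical, and standard errors and F tests are omitted for brevity.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_two_predictors(y, x1, x2):
    """Least-squares fit of y = b0 + b1*x1 + b2*x2; returns (b0, b1, b2, r_squared)."""
    rows = [[1.0, u, v] for u, v in zip(x1, x2)]
    # Normal equations: (X^T X) beta = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(3)]
    b0, b1, b2 = solve_linear_system(xtx, xty)
    mean_y = sum(y) / len(y)
    ss_res = sum((t - (b0 + b1 * u + b2 * v)) ** 2 for t, u, v in zip(y, x1, x2))
    ss_tot = sum((t - mean_y) ** 2 for t in y)
    return b0, b1, b2, 1.0 - ss_res / ss_tot

# Hypothetical log(k) values for five subjects (illustrative only).
log_k_sv = [-4.0, -3.5, -3.0, -2.5, -2.0]
log_k_lv = [-4.2, -3.0, -5.0, -3.8, -4.5]
log_k_nv = [-3.9, -3.1, -3.2, -2.0, -1.8]
print(fit_two_predictors(log_k_nv, log_k_sv, log_k_lv))
```

If the NV and LV tasks were related only through what each shares with SV, the LV coefficient in the NV regression should be near zero once SV is included, which is the pattern reported above.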

We further checked whether the correlations between discount factors in the three tasks may have arisen due to some undesirable features of our task design. For example, different subjects experienced the offers in different orders. Anchoring effects (Tversky and Kahneman, 1974; Wilson et al., 1996; Furnham and Boo, 2011) may have set a reference point in the early part of the experiment that guided choices throughout the rest. As such, we repeated the analyses described in the previous paragraph, but we added six additional factors: the mean rewards and delays presented in the first block of the second and third non-verbal sessions and also the % of yellow choices made in those blocks. We reasoned that if anchoring effects were playing a role, then for subjects who were presented with longer delays or smaller rewards early in the experiment there should be correlations between these factors and $\log(k_{SV})$ or $\log(k_{LV})$. Likewise, if subjects were simply trying to be consistent with their early choices, then the ‘% yellow’ in the early reward blocks would have an important influence. We tested the contribution of each factor by dropping it from the model to create a reduced nested model and using a likelihood ratio test against the full model (Figure 3—figure supplement 2). We found no evidence for anchoring effects or that subjects were simply trying to be consistent with their early choices.

In order to test whether the verbal/non-verbal gap or the time-horizons gap accounted for more variation in discounting, we used a linear mixed-effects model where we estimated $\log(k)$ as a function of the two gaps (as fixed effects) with subject as a random effect, using the ‘lme4’ R package (Bates et al., 2014). We created two predictors: days was false in NV and SV tasks for offers in seconds and was true in the LV task for offers in days; verbal was true for the SV and LV tasks and false for the NV task. We found that time-horizon ($\beta_{days}=-0.52\pm0.24$, $p=0.03$) but not verbal/non-verbal ($\beta_{verbal}=-0.32\pm0.24$, $p=0.18$) contributed significantly to the variance in $\log(k)$. This result was further supported by comparing the two-factor model with reduced one-factor models (i.e. that only contained either time or verbal fixed effects). Dropping the days factor significantly decreased the likelihood, but dropping the verbal factor did not (Table 2).

Table 2
Relative contributions of two gaps to variance in log(k) (two-factor model comparison with two reduced one-factor models).
https://doi.org/10.7554/eLife.39656.015
| Dropped factor | Δdf | AIC | LR test p |
|---|---|---|---|
| none | | 743.06 | |
| verbal | 1 | 742.88 | 0.18 |
| days | 1 | 745.99 | 0.03 |
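The p-values in Table 2 can be related to the AIC column with a short calculation. The sketch below assumes the usual definition AIC = 2 × (number of parameters) − 2 × log-likelihood and a chi-square reference distribution with one degree of freedom for the likelihood ratio statistic; both are standard conventions assumed here rather than details stated in the text.

```python
import math

def lr_test_p_from_aic(aic_full, aic_reduced, delta_df=1):
    """p-value of a 1-df likelihood ratio test, recovered from two AIC values.

    With AIC = 2*p - 2*logL and the reduced model having delta_df fewer
    parameters, the LR statistic is (aic_reduced - aic_full) + 2*delta_df.
    """
    if delta_df != 1:
        raise ValueError("this sketch only handles a 1-df chi-square test")
    lr_stat = max(aic_reduced - aic_full + 2.0 * delta_df, 0.0)
    # Survival function of a chi-square variable with 1 df: erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(lr_stat / 2.0))

print(round(lr_test_p_from_aic(743.06, 742.88), 2))  # dropping 'verbal'
print(round(lr_test_p_from_aic(743.06, 745.99), 2))  # dropping 'days'
```

Under these assumptions the two calls give values that round to the 0.18 and 0.03 listed in the table.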

We described above that subjects’ time-preferences were highly correlated across tasks. However, correlation is invariant to shifts or scales across tasks. Our hierarchical model allows us to directly estimate the posterior distributions of $\log(k)$ and $\tau$ (Figure 3C) and report posterior means and 95% credible intervals ($\log(k)$: NV = −3.2 [−3.77, −2.64], SV = −3.49 [−3.86, −3.11], LV = −3.95 [−4.55, −3.34]). Note that $k_{NV}$ and $k_{SV}$ have units of Hz (1/s), but $k_{LV}$ has units of 1/day. Thus, while the 95% credible intervals of the means of $\log(k)$ are overlapping for the three tasks when expressed in the units of each task, the mean $\log(k_{LV})$ is in fact shifted to −14.86 when $k_{LV}$ is expressed in units of 1/s. We further analyze and discuss this scaling subsequently, but first we compare $\log(k)$ in the units of each task, in consideration of subjects potentially ignoring the time units in their choices (Furlong and Opfer, 2009; Cox and Kable, 2014). We find that, on average, subjects were most patient in LV, then SV, then NV (Table 3). Note that a shift of 1 log-unit is substantial. For example, a subject with $\log(k_{SV}) = -3$ would value 10 coins at half its value in just 20 s. But for $\log(k_{SV}) = -4$ the coins would lose half their value in 55 s (Figure 3—figure supplement 3).
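The half-value delays quoted above follow directly from the hyperbolic form: if V = reward / (1 + k × delay), the value is halved when k × delay = 1, that is at delay = 1/k. The sketch below assumes that form; the function names are illustrative.

```python
import math

SECONDS_PER_DAY = 86400

def half_value_delay(log_k):
    """Delay at which a hyperbolically discounted reward is worth half: 1 / k."""
    return math.exp(-log_k)

def log_k_per_day_to_per_second(log_k_per_day):
    """Re-express log(k) from units of 1/day in units of 1/s."""
    return log_k_per_day - math.log(SECONDS_PER_DAY)

print(round(half_value_delay(-3)))         # about 20 (seconds, for k in 1/s)
print(round(half_value_delay(-4)))         # about 55
print(round(math.log(SECONDS_PER_DAY), 2)) # 11.37, the unit offset on the log scale
```

Because a change of units multiplies k by a constant, it only adds a constant to log(k), so it shifts the distribution of log(k) without changing its spread.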

Table 3
Shift and scale of log(k) between tasks.

Units: $k_{SV}$ and $k_{NV}$ (1/s); $k_{LV}$ (1/day). The evidence ratio (Ev. Ratio) is the Bayes factor of a hypothesis vs. its alternative, for example $P(a>b)/P(a<b)$. ‘*’ denotes p<0.01, one-sided test. Expressing $\log(k_{LV})$ in units of 1/s (for direct comparison with the other tasks) results in a negative shift in $\log(k_{LV})$ and even larger differences in means without changing the difference between standard deviations.

https://doi.org/10.7554/eLife.39656.016
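| Comparison | $\log_2$(Ev. Ratio) |
|---|---|
| Between means | |
| $\mu_{\log(k_{SV})} > \mu_{\log(k_{LV})}$ | 6.79 * |
| $\mu_{\log(k_{NV})} > \mu_{\log(k_{LV})}$ | 7.92 * |
| $\mu_{\log(k_{NV})} > \mu_{\log(k_{SV})}$ | 4.16 |
| Between standard deviations | |
| $\sigma_{\log(k_{LV})} > \sigma_{\log(k_{SV})}$ | 8.43 * |
| $\sigma_{\log(k_{NV})} > \sigma_{\log(k_{LV})}$ | 0.92 |
| $\sigma_{\log(k_{NV})} > \sigma_{\log(k_{SV})}$ | 11.48 * |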
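To make the evidence ratio concrete, the sketch below estimates $\log_2[P(a>b)/P(a<b)]$ from paired posterior draws of two quantities. It is an illustration of the definition given in the table legend, not the authors' code, and the draws are made-up numbers.

```python
import math

def log2_evidence_ratio(draws_a, draws_b):
    """log2 of P(a > b) / P(a < b), estimated from paired posterior draws."""
    greater = sum(1 for a, b in zip(draws_a, draws_b) if a > b)
    less = sum(1 for a, b in zip(draws_a, draws_b) if a < b)
    if greater == 0 and less == 0:
        raise ValueError("all draws are tied; the ratio is undefined")
    if less == 0:
        return math.inf   # every draw supports a > b
    if greater == 0:
        return -math.inf  # every draw supports a < b
    return math.log2(greater / less)

# Hypothetical paired draws: a exceeds b in 3 of 4 draws, so the ratio is 3.
print(log2_evidence_ratio([1.0, 2.0, 3.0, 4.0], [0.0, 0.0, 0.0, 5.0]))
```

On this scale, an entry of 6.79 corresponds to a ratio of $2^{6.79}$, roughly 110 to 1 in favor of the stated inequality.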

In addition to the shift, we observed significant scaling of $\log(k)$ between SV and the other two tasks (Table 3; note that scaling is insensitive to the units of $k$, since $\log(Ck)=\log(C)+\log(k)$). This is likely driven by subgroups that were exceptionally patient in the LV task (Figure 3B) or impulsive in the NV task (Figure 3A). We also observed a clear increase in the decision noise in the NV task, $\tau_{NV}$, compared to the other two tasks (Figure 3C), which is unsurprising given that in NV subjects have to make a perceptual decision (mapping the sound features to delay and magnitude) in addition to an economic decision. However, even in the verbal tasks subjects show stochasticity in choice. This is clearly evident for the longer delays (Supplementary file 1).

Controlling for visuo-motor confounds

In the main experiment, we held the following features constant across three tasks: the visual display and the use of a mouse to perform the task. However, after observing the strong correlations between the tasks (Figure 3) we were concerned that the effects could have been driven by the superficial (i.e. visuo-motor) aspects of the tasks. In other words, the visual and response features of the SV and LV tasks may have reminded subjects of the NV task context and nudged them to use a similar strategy across tasks. While this may be interesting in its own right, it would limit the generality of our results. To address this, we ran a control experiment (n = 25 subjects) where the NV task was identical to the original NV task, but the SV and LV tasks were run in a more traditional way, with a text display and keypress response (control experiment 1, Materials and methods, Figure 4—figure supplement 1). We replicated the main findings of our original experiment for ranks of $\log(k)$ (Figure 4) and correlation between $\log(k)$ in SV and LV tasks (Figure 4B). To determine whether the correlations observed were within the range expected by chance (given the difference in sample size), we repeatedly (10,000 times) randomly sampled 25 of the original 63 subjects (from Figure 3) and computed the correlations between tasks. Pearson’s r = 0.42 is lower than we would expect for NC (the 95% CI of the correlation between SV and NV in the main experiment assuming 25 subjects is [0.50, 0.92]). This suggests that some of the correlation between SV and NV tasks in the main experiment may be driven by visuo-motor similarity in experimental designs. We did not find shifts or scaling between the posterior distributions of $\log(k)$ across tasks in this control experiment (Figure 4C, mean [95% CI] NV = −3.98 [−5.44, −2.67], SV = −3.8 [−4.94, −2.75], LV = −3.76 [−4.79, −2.76]), but we found again that noise in NV was higher than in the other tasks.
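The subsampling check described above (repeatedly drawing 25 of the 63 subjects and recomputing the between-task correlation) can be sketched as follows. Sampling without replacement and the percentile-based interval are assumptions about details the text does not spell out, and the data in the usage example are synthetic.

```python
import math
import random

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def subsample_correlation_interval(x, y, n_sub=25, n_rep=10000, seed=0):
    """95% percentile interval of Pearson's r over random subsets of n_sub subjects."""
    rng = random.Random(seed)
    rs = []
    for _ in range(n_rep):
        pick = rng.sample(range(len(x)), n_sub)  # subjects drawn without replacement
        rs.append(pearson_r([x[i] for i in pick], [y[i] for i in pick]))
    rs.sort()
    return rs[int(0.025 * n_rep)], rs[int(0.975 * n_rep) - 1]

# Synthetic log(k) values for 63 subjects in two tasks (illustrative only).
rng = random.Random(1)
log_k_sv = [rng.gauss(-3.5, 1.0) for _ in range(63)]
log_k_nv = [v + rng.gauss(0.0, 0.6) for v in log_k_sv]
print(subsample_correlation_interval(log_k_sv, log_k_nv, n_rep=2000))
```

A control-experiment correlation falling below the lower end of such an interval indicates a drop larger than the smaller sample size alone would be expected to produce.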

Figure 4. Comparison of discount factors across three tasks in control experiment 1.

(A,B) Control experiment 1 (n = 25). The logs of discount factors in SV task (x-axis) plotted against the logs of discount factors in NV (A) and LV (B) tasks (y-axis). The color of the circles and the colorbar identify the discount factors in NV task. Each circle is one subject. Pearson’s r is reported on the figure (p<0.01 - ‘**’, p<0.05 - ‘*’). Spearman r: SV vs. NV r=0.52; SV vs. LV r=0.52 (all p<0.01). The error bars are the SD of the estimated coefficients. Three lines represent the vertical y(x), horizontal x(y) and total least squares (TLS) regression lines. See individual subject plots (Supplementary file 2) to visualize the quality of the model fits. (C) Distribution of posterior parameter estimates of $\log(k)$ and decision noise $\tau$ from the model fit for the three tasks in control experiment 1 (units: $k_{NV}$ in 1/s, $k_{SV}$ in 1/s, $k_{LV}$ in 1/day). The light blue shaded area marks the 80% interval of the posterior estimate. The outline of the distribution extends to the 99.99% interval. Thin grey lines are drawn through the mean of each distribution to ease comparison across tasks.

https://doi.org/10.7554/eLife.39656.017

Controlling for differences in reward experience

We designed our non-verbal task so that with minimal changes we could use it in animals: rats and mice in particular. In rodent decision-making, primary rewards are typically used (e.g. Carter and Redish, 2016; Wikenheiser et al., 2013; Erlich et al., 2015). In order to make the reward in the short tasks more like a primary reinforcer, we included visual and auditory cues at the time of the reward. This introduces a potential confound to one of our findings: that the correlation between the two short tasks is higher than the correlation between long and short tasks. Inter-subject variability in the experience of the audio-visual cues could lower the correlation between the short and long tasks while, because it is shared between the two short tasks, artificially inflating the correlation between them. In order to address this, we refit our model with the following change: we added a reward scaling parameter that multiplies the reward magnitude on each trial. This parameter has two levels (short/long), which can vary for each subject. This adds two population level parameters and 63×2=126 subject level parameters to the model. We compared the original and expanded models using 10-fold cross-validation (‘kfold’ function in the ‘brms’ R package). This process fits model parameters using 90% of the data, then produces a posterior predictive density for the left-out 10%, and repeats this 10 times (once for each left-out 10%). This procedure results in an expected log posterior density for the model (Vehtari et al., 2017), which is then multiplied by −2 to produce a K-fold Information Criterion (KfoldIC), on the same scale as other metrics such as the Akaike, Bayesian or deviance information criteria. The expanded model was substantially better than the original model (ΔKfoldIC ± SE = 2207.40 ± 81.06; r² = 0.595 ± 0.002 for the original model and r² = 0.640 ± 0.002 for the expanded model). This is strong evidence that an important component of the intersubject variability in our task comes from differences in experience of the reward.
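The arithmetic behind a K-fold information criterion can be sketched in a few lines. This is a minimal illustration, not the ‘brms’ implementation: the function names are hypothetical, the standard error follows the common convention of scaling the standard deviation of pointwise differences by the square root of the number of observations (an assumption about how the reported SE was obtained), and the numbers are made up.

```python
import math

def kfold_ic(pointwise_log_density):
    """KfoldIC = -2 * (sum of held-out log predictive densities)."""
    return -2.0 * sum(pointwise_log_density)

def delta_kfold_ic(pointwise_a, pointwise_b):
    """KfoldIC(A) - KfoldIC(B) and an approximate standard error.

    Both inputs hold the held-out log density of the SAME observations
    under model A and model B. The SE formula is an assumed convention.
    """
    n = len(pointwise_a)
    diffs = [-2.0 * (a - b) for a, b in zip(pointwise_a, pointwise_b)]
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    return sum(diffs), math.sqrt(n * var)

# Made-up held-out log densities for four trials (not data from the study).
original = [-0.70, -0.60, -0.90, -0.50]
expanded = [-0.50, -0.40, -0.80, -0.45]

delta, se = delta_kfold_ic(original, expanded)
print(f"delta KfoldIC = {delta:.2f} +/- {se:.2f}")
```

A positive difference here means the first model has the larger (worse) criterion, which is the direction in which the expanded model was reported to improve on the original.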

Having justified the additional parameters, we re-examined the correlations between log(k) in the three tasks in the expanded model. We found that the between-task correlations were slightly larger than, but highly overlapping with, the correlations in the original model (Table 4), thus supporting our findings about the relative reliability between tasks. The population log(k) and decision noise estimates also followed the same pattern as in the original model (compare Figure 3C with Figure 5). The log(k) estimates were shifted slightly higher (estimating subjects as more impulsive) with a corresponding increase in the experience of the rewards. That is, in both long and short tasks, the reward scaling was greater than one. Note, however, that the reward scaling values for the long and short tasks are not 5 orders of magnitude apart, so this cannot account for the massive scaling of discount factors between the long and short tasks.

Distribution of population level posterior parameter estimates from the expanded model fit with reward scaling for the three tasks in the main experiment.

The light blue shaded area marks the 80% interval of the posterior estimate. The outline of the distribution extends to the 99.99% interval. Thin grey lines are drawn through the mean of each distribution to ease comparison across tasks. Note, the units for kSV and kNV (1/s) would need to be scaled by 86400 s/day (log(86400) = 11.37) to be directly compared to kLV (1/day).

https://doi.org/10.7554/eLife.39656.019
Table 4
Correlations of subjects’ log discount factors [95% CI] in the original model (taken from Table 1) and the expanded model which included differential reward scaling between the short and long tasks.

The correlations between each task were significantly different from each other at p<0.05 for both the original and expanded models using various methods as in the R package ‘cocor’ (Diedenhofen and Musch, 2015).

https://doi.org/10.7554/eLife.39656.020
Tasks        Original model, Pearson correlation    Expanded model, Pearson correlation
SV vs. NV    0.79 [0.65, 0.88]                      0.84 [0.75, 0.90]
SV vs. LV    0.61 [0.41, 0.76]                      0.65 [0.48, 0.77]
NV vs. LV    0.40 [0.18, 0.60]                      0.50 [0.29, 0.66]
All correlations are significantly different from 0, p<0.01.

Strong effect of temporal context

As described above, we fit the discount factors for each task in the units of that task: kSV and kNV in units of 1/s and kLV in units of 1/day. Since there are 86,400 s in a day, classic economic theory would posit that we would find Δlog(k)=11.37 between the long and short tasks to account for the difference in units. However, we found that the discount factors in the LV task, kLV, were close to those in the other tasks (within 1 log-unit) (Figure 3C). This finding implies that for a specific reward value, if a subject would decrease their subjective utility of that reward by 50% for a 10 s delay in the SV task, they would also decrease their subjective utility of that reward by 50% for a 10-day delay in the LV task. This seems incredible, particularly from a neoclassical economics perspective, but has been previously reported (Navarick, 2004; Lane et al., 2003). What could explain this scaling effect? In addition to the change in time units, reward units also changed between the short and long tasks. In our sessions, the exchange rates in NV and SV were 0.1 and 0.05 CNY per coin, respectively (since all coins are accumulated and subjects are paid the total profit), whereas in LV, subjects were paid on the basis of a single trial chosen at random using an exchange rate of 4 CNY for each coin. These exchange rates were set to, on average, equalize the possible total profit between the short- and long-delay tasks. However, even accounting for both the magnitude effect (Green et al., 1999; Green et al., 2004) and unit conversion (calculations presented in Materials and methods), the discount rates are still scaled by 4 orders of magnitude from the short to the long time-horizon tasks (Navarick, 2004).
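The size of the expected unit-conversion shift can be checked numerically. The sketch below assumes a hyperbolic form for subjective value, reward / (1 + k × delay); that form, the function names and the example value of k are illustrative assumptions and not estimates from the study.

```python
import math

SECONDS_PER_DAY = 86_400

def hyperbolic_value(reward, k, delay):
    """Subjective value under an ASSUMED hyperbolic form: reward / (1 + k * delay)."""
    return reward / (1.0 + k * delay)

def per_second_to_per_day(k_per_second):
    """Convert a discount factor from 1/s to 1/day, as if units were fully respected."""
    return k_per_second * SECONDS_PER_DAY

k_short = 0.1  # illustrative value in 1/s: a 10 s delay halves the reward
k_long_if_consistent = per_second_to_per_day(k_short)
expected_log_shift = math.log(k_long_if_consistent) - math.log(k_short)

print(f"expected shift in log(k): {expected_log_shift:.2f}")
print(f"10 coins after 10 s at k_short: {hyperbolic_value(10, k_short, 10):.2f}")
print(f"10 coins after 10 days at the converted k: "
      f"{hyperbolic_value(10, k_long_if_consistent, 10):.5f}")
```

Under these assumptions a unit-consistent subject would treat a 10-day delay as almost worthless, whereas the reported behavior is closer to applying the same numerical k to days as to seconds.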

One possible explanation for this scaling is that subjects are simply ignoring the units and only focusing on the number. This would be consistent with an emerging body of evidence that numerical value, rather than conversion rate or units, matters to human subjects (Furlong and Opfer, 2009; Cox and Kable, 2014). A second possible explanation is that subjects normalize the subjective delay of the offers based on context, just as they normalize subjective value based on current context and recent history (Lau and Glimcher, 2005; Tymula and Glimcher, 2016; Louie et al., 2015; Khaw et al., 2017). A third possibility is that in the short delay tasks (NV and SV), subjects experience the wait for the reward on each trial as quite costly, in comparison to the postponement of reward in the LV task. This ‘cost of waiting’ may share some intersubject variability with delay-discounting but may effectively scale the discount factor in tasks with this feature (Paglieri, 2013).

To test the first hypothesis, that subjects ignore units of time, we ran a control experiment (n = 16 subjects) using two verbal discounting tasks (control experiment 2, Materials and methods). In one task, the offers were in days (DV). In the other, the offers were in weeks (WV). This way, we could directly test whether subjects would discount the same for 1 day as 1 week (i.e. ignore units) or 7 days as 1 week (i.e. convert units). For this experiment, we converted the delays from the weeks task into days (i.e. delay in days = 7× delay in weeks) before fitting the BHM. Subjects’ discount factors were highly correlated across the two tasks (Pearson r=0.92; Spearman r=0.92, all p<0.01). Moreover, there is a high degree of overlap in the population estimates of log(k) for the two tasks (Figure 6B). If subjects had ignored units then we would expect that log(kW)=log(kD)+log(7)=log(kD)+1.95. Comparing the posteriors with that predicted shift, we can say that the shift is highly unlikely (p<0.0001). Nonetheless, the discount factors in the two tasks were not equal. We observed a kind of amplification of preferences: the impulsive subjects were more impulsive in days than weeks and the patient subjects were more patient in days than weeks (Figure 6—figure supplement 1). We do not have an explanation for this effect, but overall this control experiment is consistent with and extends our main results: subjects’ time-preferences are reliable but context-dependent and the context dependence cannot be explained by subjects ignoring the units of time.

Figure 6 with 1 supplement
Control experiment 2.

(A) The discount factors in the WV task plotted against the discount factors in the DV task (n = 14; two out of 16 subjects who always chose the later option were excluded from the model). The color of the circles identifies the order of task appearance. Each circle is one subject. Pearson’s r is reported on the figure (p<0.01 - ‘**’). The error bars are the SD of the estimated coefficients. Three lines represent the vertical y(x), horizontal x(y) and total least squares (TLS) regression lines. See individual subject plots (Supplementary file 3) to visualize the quality of the model fits. (B) Distribution of posterior parameter estimates of log(k) and decision noise τ from the model fit for the two tasks in control experiment 2 (kDV in units of 1/day, kWV in 1/day). The light blue shaded area marks the 80% interval of the posterior estimate. The outline of the distribution extends to the 99.99% interval. Thin grey lines are drawn through the mean of each distribution to ease comparison across tasks.

https://doi.org/10.7554/eLife.39656.021

Having ruled out the possibility that subjects ignore units of time, we tested our second potential explanation: that subjects make decisions based on a subjective delay that is context dependent. We reasoned that if choices are context dependent then it may take some number of trials in each task before the context is set. Consistent with this reasoning, we found a small but significant adaptation effect in early trials in our main experiment: subjects were more likely to choose the later option in the first trials of the SV task (Figure 7A,B). It seems that, at first, seconds in the current task are interpreted as being smaller than days in the preceding task, but within several trials days are forgotten and time preferences adapt to a new time-horizon of seconds.

Evidence for context dependent temporal processing.

(A,B) Main experiment early trials adaptation effect. The offers for each subject were converted into a subjective utility, U, based on the subjects’ discount factors in each task. This allowed us to combine data across subjects to plot psychometric curves of the probability of choosing the later option, P(later), for SV and LV averaged across all subjects comparing late trials (Trial in task > 5) (A) to the first four trials (B). Using a generalized linear mixed effects model, we found a significant interaction between early/late and SV/LV (βSV−LV:early = 0.86 ± 0.17, p < 10^-6, nsubjects = 63, ntrials = 20387).

https://doi.org/10.7554/eLife.39656.023

Discussion

We set out to test whether the same delay-discounting process is employed regardless of the verbal/non-verbal nature of the task and the time-horizon. We found significant correlations between subjects’ discount factors across the three tasks, providing evidence that there are common cognitive (and presumably basal neural) mechanisms underlying the decisions made in the three tasks. In particular, the strong correlation between the short time-horizon non-verbal and verbal tasks (r=0.79, Figure 3A) provides the first evidence for generalizability of the non-verbal task, suggesting that this task can be applied to both human and animal research for direct comparison of cognitive and neural mechanisms underlying delay-discounting. However, the correlation between the short-delay/non-verbal task and the long-delay/verbal task, while significant, is weaker (r=0.40). Taken together, our results suggest that animal models of delay-discounting may have more in common with short time-scale consumer behavior, such as impulse purchases and ‘paying-not-to-wait’ in mobile gaming (Evans, 2016), and that some caution is warranted when drawing conclusions about the broader applicability of these models to long time-horizon real-world decisions, such as buying insurance or saving for retirement.

Reliability of preferences

The question of reliability is of central importance to applying in-lab studies to real-world behavior. There are several concepts of reliability that our study addresses: first, test/re-test reliability; second, reliability across the verbal/non-verbal gap; third, reliability across the second/day gap. Consistent with previous studies (Lane et al., 2003; Meier and Sprenger, 2015; Augenblick et al., 2015), we found high test/re-test reliability. Choices in the same task did not differ when made at the beginning or the end of the session, nor when they were made in sessions held on different days, even 2 weeks apart.

We found a high degree of reliability in time-preferences across the verbal/non-verbal gap (r=0.79, Figure 3A, Table 1, Table 2). This reliability has not been, to the best of our knowledge, previously measured and is of similar strength to the reported test-retest reliability of personality traits (Viswesvaran and Ones, 2000; Berns et al., 2007). The closest literature that we are aware of finds that value encoding (the convexity of the utility function) but not probability weighting is similar across the verbal/non-verbal gap in sessions that compare responses to a classic verbal risky economic choice task with an equivalent task in the motor domain (Wu et al., 2009). It may be that unlike time or value, probability is processed differently in verbal vs. non-verbal settings (Hertwig and Erev, 2009). The main difference between choices in the NV and SV tasks was the increase in noise in NV. A worthwhile future direction is to disentangle the neural substrates of perceptual noise vs. decision noise in a non-verbal task of economic preferences (Hanks et al., 2015; Constantinople et al., 2018).

We found a moderate degree of reliability across the second/day gap (r=0.61, Figure 3B, Table 1, Table 2). There are several aspects to the time-horizon gap that may contribute independently to the lower correlations observed between our short and long tasks (compared to the two short tasks). First, there is the difference in the orders of magnitude of the delays. Second, there is a difference in the experience of the delayed rewards: subjects must wait, staring at the clock, through all delays in the short tasks, whereas in the long task subjects wait for a single reward and can go about their lives while waiting. Paglieri (2013) described these as ‘waiting’ in seconds compared to ‘postponing’ in days. Third, our short tasks had a ‘coin drop’ sound at the time of the reward, which may have acted as a secondary reinforcer and contributed to the discounting of delayed rewards. The absence of this from the long task may have contributed to the decreased reliability between short and long tasks.

Our control study using delays of days vs. weeks compared tasks with different scales that did not differ in the experience of the delayed rewards: as in LV, only (at most) one delayed reward was experienced in both the days and weeks tasks. In that experiment, we found extremely high reliability between time-preferences across tasks (Figure 6A). That is, Figure 6 shows that, on average, subjects discounted 7 days as frequently as they discounted 1 week in the other task. While days and weeks differ only by a factor of seven and may be easily converted via preexisting rules of thumb, seconds and days differ by a factor of 86,400. Moreover, people have more practice at converting days and weeks than seconds and days. So while the days/weeks experiment provides some evidence that a difference in the magnitude of the delays does not, on its own, affect reliability, it may be that larger or unfamiliar differences (e.g. an experiment comparing hours vs. weeks) would do so. Still, we find the second hypothesis for the lower reliability across time-horizons more compelling: that individual differences in subjective costs of waiting are distinct from (but correlated with) individual differences in costs of postponing (discussed in more detail below).

The evidence from the literature on the issue of reliability across time-horizons is mixed. On the one hand, some have found that measures of discount factors on month-long delays are not predictive of discount factors for year-long horizons (a difference of one order of magnitude) (Thaler, 1981; Loewenstein and Thaler, 1989) but others have found consistent discounting for the same ranges (Johnson and Bickel, 2002). Other studies that compared the population distributions of discount factors for short (up to 28 days) to long (years) delays (2 orders of magnitude) found no differences in subjects’ discount factors (Eckel et al., 2005; Andersen et al., 2014). Some of these discrepancies can be attributed to the framing of choice options: standard larger later vs. smaller sooner compared to negative framework (Loewenstein and Thaler, 1989), where subjects want to be paid more if they have to worry longer about some negative events in the future.

Several previous studies have compared discounting in experienced delay tasks (as in our short tasks) with tasks where delays were hypothetical or just one was experienced (Johnson and Bickel, 2002; Lane et al., 2003; Reynolds and Schiffbauer, 2004; Navarick, 2004; Horan et al., 2017). For example, Lane et al. (2003) also used a within-subject design to examine short vs. long delays (e.g. similar to our short-verbal and long-verbal tasks) and found similar correlations (r ≈ 0.5 ± 0.1) with a smaller sample size (n = 16). (Interestingly, they also found, but did not discuss, a 5 order of magnitude scaling factor between subjects’ discounting of seconds and days, suggesting that this is a general phenomenon.)

Subjective scaling of time

It may seem surprising that human subjects would discount later rewards, that is, choose small immediate rewards, in a task where delays are in seconds. After all, subjects cannot consume earnings immediately. Yet, this result is consistent with earlier work that suggests individuals derive utility from receiving money irrespective of when it is consumed (Reuben et al., 2010; McClure et al., 2004; McClure et al., 2007). In our design, a pleasing (as reported by subjects) ‘slot machine’ sound accompanied the presentation of the coins in the short-delay tasks. This sound may be experienced as an instantaneous secondary reinforcer (Kelleher and Gollub, 1962). Whether or not the secondary reinforcer used in our task is experienced in an analogous way to primary reinforcers used in animal studies may limit the degree of overlap in underlying neural mechanisms. On the other hand, our subjects’ behavior would not be surprising for those who develop (or study) ‘pay-not-to-wait’ video games (Evans, 2016), which exploit players’ impulsivity to acquire virtual goods with no actual economic value.

The range of rates of discounting we observed in the long-verbal task was consistent with that observed in other studies. For example, in a population of more than 23,000 subjects the log of the discount factors ranged from −8.75 to 1.4 (Sanchez-Roige et al., 2018), which is similar to the ranges presented in Figure 3B. This implies that, in our short tasks, subjects are discounting extremely steeply: they discount rewards per second at approximately the same rate at which they discount rewards per day. This discrepancy has been previously found (Lane et al., 2003; Navarick, 2004). We consider three (non-mutually exclusive) explanations for this scaling. First, subjects may ignore units. However, by testing overlapping time-horizons of days and weeks we confirmed that subjects can pay attention to units.

Second, it may be that with short delay tasks we are capturing the cost of waiting, while long delay tasks measure delay-discounting. The costs of waiting could take several forms (Paglieri, 2013). One form is the cost of boredom (Mills and Christoff, 2018), a feeling which animals may also experience (Wemelsfelder, 1984). Subjects could find it painful to sit and wait, staring at the clock on the computer screen, during the delay. Additionally, there could be opportunity costs related to how much subjects value their own time. We found that in the short tasks, subjects with large discount factors also performed the task faster (Figure 3—figure supplement 1). If these subjects value their time more and thus have higher costs of waiting, then given our results (Figure 3B) there is a surprisingly large correlation between how much subjects value their time (in the short tasks) and how much they discount postponed rewards (in the long task). Regardless of the precise form of the costs of waiting (Chapman, 2001; Paglieri, 2013; Navarick, 2004), in order for these costs to explain the temporal scaling we observed between short and long tasks, they would have to be, relative to the costs of postponing and purely by coincidence, close in value to the number of seconds in a day.

We feel this coincidence is unlikely, and thus favor the third explanation for the scaling: temporal context. When making decisions about seconds, subjects ‘wait’ for seconds and when making decisions about days subjects ‘postpone reward’ for days (Paglieri, 2013). Although our experiments were not designed to test whether the strong effect of temporal context was due to normalizing, the existence of extra costs for waiting in real time, or both, we did find some evidence for the former (Figure 7). Consistent with this idea, several studies have found that there are both systematic and individual level biases that influence how objective time is mapped to subjective time for both short and long delays (Wittmann and Paulus, 2009; Zauberman et al., 2009). Thus, subjects may both normalize delays to a reference point and introduce a waiting cost at the individual level that will lead short delays to seem as costly as the long ones.

Conclusion

We have shown for the first time that there is a high degree of reliability across verbal and non-verbal delay-discounting tasks. In the analysis of experimental data, we found several interesting phenomena which warrant further examination at both the behavioral and the neural level: the extreme scaling effects from seconds to days; the compression toward the mean of discount factors in weeks vs. days; and the adaptation observed at the beginning of tasks. Nonetheless, these effects were consistent across the subject population: affecting the quantitative estimate of discount factor but not the subjects’ impulsivity relative to the group. Overall, this work provides support for connecting non-verbal animal studies with verbal human studies of delay-discounting.

Materials and methods

Participants

For the main experiment, participants were recruited from the NYU Shanghai undergraduate student population on two occasions, leading to a total sample of 67 (45 female, 22 male) NYU Shanghai students. Using posted flyers, we initially recruited 35 students but added 32 more to increase statistical power. The power analysis indicates that for an expected correlation of r=0.5 and 80% power (the ability of a test to detect an effect, if the effect actually exists; Cohen, 1988; Bonett and Wright, 2000) the required sample size is N = 29, whereas for a medium-size correlation of r=0.3 the required sample size is N = 84.
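The sample sizes quoted in the power analysis can be roughly reproduced with the Fisher z approximation for a two-sided test of a Pearson correlation at α = 0.05. This is a sketch under that assumption: the source does not state which formula or software was used, the function name is hypothetical, and the result is left unrounded because the rounding convention is likewise unknown.

```python
import math
from statistics import NormalDist

def approx_n_for_correlation(r, power=0.80, alpha=0.05):
    """Approximate sample size needed to detect correlation r (two-sided test).

    Uses the Fisher z approximation n = ((z_alpha + z_power) / atanh(r))**2 + 3.
    This formula is an assumption, not necessarily the authors' procedure.
    """
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_power = NormalDist().inv_cdf(power)
    return ((z_alpha + z_power) / math.atanh(r)) ** 2 + 3.0

for r in (0.5, 0.3):
    print(f"r = {r}: n is approximately {approx_n_for_correlation(r):.1f}")
```

Under these assumptions the approximation lands within one subject of the reported N = 29 and N = 84.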

The study was approved by the IRB of NYU Shanghai. The subjects were between 18 and 23 years old; 34 of the 67 subjects were Chinese nationals. They received a participation fee of 30 CNY (about 5 USD) per hour as well as up to an additional 50 CNY (about 8 USD) per session based on their individual performance in the task (either in the NV task, or in total across the SV and LV tasks, considering the delay of payment in the LV task). The experiment involved five sessions per subject (three non-verbal sessions followed by two verbal sessions), permitting us to perform within-subject analyses. The sessions were scheduled bi-weekly and took place in the NYU Shanghai Behavioral and Experimental Economics Laboratory. In each session, all decisions involved a choice between a later (delay in seconds and days) option and an immediate (now) option. Three subjects did not pass the learning stages of the NV task. One subject did not participate in all of the sessions. These four subjects were excluded from all analyses.

Experimental design

The experiments were constructed to match the design of tasks used for rodent behavior in Prof. Erlich’s lab. We provided relatively minimal instructions for the subjects other than explaining that coins were worth real money (see subject instructions in Supplementary file 4 and Supplementary file 5). For the temporal discounting task, the value of the later option is mapped to the frequency of a pure tone (frequency ∝ reward magnitude) and the delay is mapped to the amplitude modulation (modulation period ∝ delay). The immediate option was the same on all trials for a session and was unrelated to the sound. There were 25 different ‘later’ options presented in each task: all possible combinations of 5 delays (3, 6.5, 14, 30, 64) and 5 reward magnitudes (1, 2, 5, 8, 10). The immediate option was fixed at 4 coins, so later offers of 1 or 2 were considered ‘smaller-later’ offers that were created to encourage subjects to pay attention to the sound in the non-verbal task, and to make sure subjects were paying attention in the verbal tasks. In the non-verbal task the two ‘smaller-later’ options made up 25% of later options, whereas in the verbal experiment they made up 10% of later options. All ‘larger-later’ offers were equally likely to be presented. Given that the smaller-later option is always strictly worse than the larger immediate option, if ‘smaller-later’ is chosen in such a trial, economic theory would classify this choice as reflecting a first-order violation. The offers in each task were structured in short blocks. Each block used the same reward magnitude for a ‘later’ option offered at different randomly ordered delays. For each subject the order of reward blocks was chosen randomly. Jittering the number of trials in each reward block resulted in 160 trials on average for each subject in each of the two verbal tasks and up to 200 trials in the non-verbal task.
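The offer set described above can be enumerated directly. The sketch below uses the delays, reward magnitudes and immediate reward stated in the text; the function and variable names are hypothetical, and the presentation frequencies (25% or 10% ‘smaller-later’ trials) and block structure are not modeled.

```python
from itertools import product

DELAYS = (3, 6.5, 14, 30, 64)   # seconds in the short tasks, days in the long task
REWARDS = (1, 2, 5, 8, 10)      # coins offered for the later option
IMMEDIATE_REWARD = 4            # coins, fixed for the immediate option

def later_options():
    """All delay x reward combinations offered as the 'later' option."""
    return [{"delay": d, "reward": r} for d, r in product(DELAYS, REWARDS)]

def is_smaller_later(option):
    """True when the later reward is below the fixed immediate reward."""
    return option["reward"] < IMMEDIATE_REWARD

options = later_options()
smaller_later = [o for o in options if is_smaller_later(o)]
print(len(options), "later options,", len(smaller_later), "of them smaller-later")
```

Choosing any option flagged by `is_smaller_later` over the immediate 4 coins corresponds to what the text calls a first-order violation.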

Through experiential learning, subjects learned the map from visual and sound attributes to values and delays. This was accomplished via six learning stages (0, 1, 2, 3, 4, 5; Video 1) that build up to the final non-verbal task (NV) that was used to estimate subjects’ discount-factors. Briefly, the stages were designed to (0) learn that a mouse-click in the middle bottom ‘reward-port’ produced coins (that subjects knew would be exchanged for money), (1) learn to initiate a trial by a mouse-click in a highlighted port, (2) learn ‘fixation’: to keep the mouse-cursor in the highlighted port, (3) associate a mouse-click in the blue port with the sooner option (a reward of a fixed 4 coin magnitude that is received instantly), (4) associate varying tone frequencies with varying reward at the yellow port, and (5) associate varying amplitude modulation frequencies with varying delays at the yellow port. In stage 4, subjects are primed to the sound frequency to learn the variability of reward magnitudes: first, the lower and upper bounds, then, in ascending and descending order and, finally, in random order. In the final stage 5, subjects heard the AM of a sound during fixation that is now mapped to the delay of the later option. The order of the stimuli presented was the same as in the previous stage. On each trial of stages 3, 4 and 5 there was either a blue port or a yellow port (but not both). The exact values for reward and delay parameters experienced in the learning stages correspond to values that are used throughout the experiment. After selecting the yellow port (i.e. the delayed option), a countdown clock appeared on the screen and the subject had to wait for the delay which had been indicated by the amplitude modulation of the sound for that trial. Any violation (i.e. a mouse-click in an incorrect port or moving the mouse-cursor during fixation) was indicated by flashing black circles over the entire ‘poke’ wall accompanied by an unpleasant sound (for further demonstration of the experimental time flow, please see Video 1).

Video 1
Learning.

A video of the learning stages, showing the examples of violations that can be made. The video starts with stage 0 and continues with stage 1 at 00:14, stage 2 at 00:31, stage 3 at 00:55, stage 4 (trimmed) at 01:18 and stage 5 (trimmed) at 01:41.

https://doi.org/10.7554/eLife.39656.024

When a subject passed the learning stages (i.e. four successive trials without a violation in each stage, Figure 1—figure supplement 1), they progressed to the decision stages of the non-verbal task (NV). In the decision stages, a two-choice decision is presented in which the subject can choose between an amount now (blue choice) and a different amount in some number of seconds (yellow choice). During the decision stages the position of the blue and yellow circles on the poke wall was randomized between left and right and was always symmetrical (Figure 1, Video 2). Each of the three non-verbal sessions began with learning stages and continued to the decision stages. In the 2nd and the 3rd non-verbal sessions, the learning stages were shorter in duration.

Video 2
NV.

A video of the several consecutive trials of the non-verbal task.

https://doi.org/10.7554/eLife.39656.025

The final two sessions involved verbal stimuli (Video 3, Video 4). During each session, subjects experienced an alternating set of tasks: short delay (SV)–long delay (LV)–SV–LV (or LV-SV-LV-SV, counter-balanced per subject). An example of a trial from the short time-horizon task (SV) is shown in the sequence of screens presented in Figure 1. The verbal task in the long time-horizon (LV) includes Initiation, Decision (as in Figure 1) and the screen that confirms the choice. There are two differences in the implementation of these sessions relative to the non-verbal sessions. First, the actual reward magnitude and delay are written within the yellow and blue circles presented on the screen, in place of using sounds. Second, in the non-verbal and verbal short delay sessions, subjects continued to accumulate coins (following experiential learning stages) and the total earned was paid via electronic payment at the end of each experimental session. In the long-verbal sessions, a single trial was selected at random at the conclusion of the session for payment (a method of payment commonly used in human studies with long delays; Cox and Kable, 2014). The associated payment is made now or later depending on the subject’s choice in the selected trial.
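The two payment rules can be summarized in a short sketch. The exchange rates are those reported for the main experiment (0.1, 0.05 and 4 CNY per coin for NV, SV and LV); the function names, the data layout and the example choices are hypothetical, and participation fees and session caps are not modeled.

```python
import random

# Exchange rates (CNY per coin) as reported for the main experiment.
CNY_PER_COIN = {"NV": 0.1, "SV": 0.05, "LV": 4.0}

def short_task_payment(coins_per_trial, task):
    """Short tasks (NV, SV): all coins accumulate and are paid at session end."""
    return sum(coins_per_trial) * CNY_PER_COIN[task]

def long_task_payment(chosen_offers, rng):
    """Long task (LV): one trial is drawn at random and paid after its delay.

    chosen_offers holds (coins, delay_in_days) for the option the subject
    chose on each trial; a delay of 0 stands for the immediate option.
    """
    coins, delay_days = rng.choice(chosen_offers)
    return coins * CNY_PER_COIN["LV"], delay_days

# Hypothetical choices, for illustration only.
sv_earnings = short_task_payment([4, 10, 5], "SV")
lv_choices = [(4, 0), (10, 30), (8, 14)]
lv_amount, lv_delay = long_task_payment(lv_choices, random.Random(0))
print(f"SV: {sv_earnings:.2f} CNY now; LV: {lv_amount:.2f} CNY in {lv_delay} days")
```

Paying a single randomly drawn trial keeps every long-delay choice consequential without requiring dozens of delayed payments per subject.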

Video 3
SV.

A video of the several consecutive trials of the short delay task.

https://doi.org/10.7554/eLife.39656.026
Video 4
LV.

A video of the several consecutive trials of the long delay task.

https://doi.org/10.7554/eLife.39656.027

Control experiment 1: No Circles (NC)

In total, 25 (29 started, 4 withdrew) undergraduate students from NYU Shanghai participated in five experimental sessions (three non-verbal and two verbal sessions, in this sequence, that were scheduled bi-weekly). The study requirements in order to meet the IRB protocol conditions remained the same as in the main experiment. In each session, subjects completed a series of intertemporal choices. Across sessions, at least 160 trials were conducted in each of the following tasks mimicking the main experiment: (i) non-verbal (NV), (ii) verbal short delay (SV, 3–64 s), and (iii) verbal long delay (LV, 3–64 days). In each trial, irrespective of the task, subjects made a decision between the sooner and the later options. The NV task was exactly the same as in the main experiment. All subjects passed the learning stages. The SV and LV tasks differed from the main experiment in exactly two ways. First, the stimulus presentation did not include a display of circles of different colors. Instead, the two choices were presented on the left or on the right side (counterbalanced) of the screen (Figure 4—figure supplement 1). Second, the subjects did not have to click on the circles using the mouse; instead, they used a keyboard to indicate an ‘L’ or ‘R’ choice. Everything else stayed the same as in the main experiment, that is, the last two sessions included an alternating set of verbal tasks: SV-LV-SV-LV (or LV-SV-LV-SV, for a random half of subjects), the payment was done differently for SV and LV (randomly picked trial for payment in LV), etc. The purpose of this control experiment is to confirm that the significant correlation between non-verbal tasks and verbal tasks we report in Results is not an artifact of our main experimental design: subjects experience the same visual display and motor responses in the non-verbal and verbal tasks, and this design similarity might drive the correlation between time-preferences in these tasks. Instead, in this control experiment the verbal tasks are made as similar as possible (keeping our experiment structure) to typical intertemporal choice tasks used in human subjects.

Control experiment 2: Days and weeks (DW)

In total, 16 subjects took part in this experiment (2 of 16 were excluded from analyses because their choices were insensitive to delay). Subjects were undergraduate students from NYU Shanghai. This experiment was approved under the same IRB protocol as control experiment 1 and the main experiment. This experiment included the following two experimental tasks: (i) verbal days delay (DV, 1–64 days) and (ii) verbal weeks delay (WV, 1–35 weeks). Subjects underwent only one session in which the verbal tasks were alternated: DV-WV-DV-WV (or WV-DV-WV-DV, for roughly half of subjects; 200 trials per task). For each of the tasks in this control experiment the stimuli and procedures were exactly the same as for the LV task in control experiment 1. The purpose of this control task is to check whether subjects pay attention to units.

Significance tests of demographic and psychological categories


We did not find any significant differences between any of the categorical subject groups, including gender and nationality, in learning stages (Figure 1—figure supplement 1), intertemporal decisions or first-order violations. For the proportion of ‘yellow’ choices (mean ± std. dev.), there is no significant difference between females and males (females: 0.56±0.24; males: 0.53±0.29; Wilcoxon rank sum test, p=0.22) or between Chinese and Non-Chinese subjects (Chinese: 0.57±0.23; Non-Chinese: 0.54±0.29; Wilcoxon rank sum test, p=0.33). Similarly, for first-order violations there is no significant difference between females and males (violations per session, females: 1.14±1.97; males: 1.07±2.20; Wilcoxon rank sum test, p=0.2607) and only a slight difference between Chinese and Non-Chinese subjects (Chinese: 1.21±2.07; Non-Chinese: 1.00±2.00; Wilcoxon rank sum test, p<0.1).

We used the Barratt Impulsiveness Scale (BIS-11; Patton et al., 1995) as a standard measure of impulsivity. Scores on this test are often reported to correlate with biological, psychological, and behavioral characteristics. The mean total score for our student sample was 61.79 (std = 9.53), which is consistent with other reports in the literature (e.g., Stanford et al., 2009). The BIS-11 did not correlate significantly with the estimated discount factors (BIS vs. log(kNV): Pearson r=0.2, p=0.1180; BIS vs. log(kSV): Pearson r=0.19, p=0.1384; BIS vs. log(kLV): Pearson r=0.15, p=0.2521). Prior research finds mixed evidence of an association between the BIS-11 and delay-discounting: some studies report significant positive correlations (Mobini et al., 2007; Beck and Triplett, 2009; Cosenza and Nigro, 2015), whereas others do not find significant correlations and suggest that delay-discounting tasks might measure a different aspect of impulsivity (Mitchell, 1999; Fellows and Farah, 2005; Reynolds et al., 2006; Saville et al., 2010). Following earlier research reporting that components of the BIS score might drive the correlation with discounting (Fellows and Farah, 2005; Mobini et al., 2007; Beck and Triplett, 2009; Ahn et al., 2016), we next decomposed the score. Similar to others, we found that the correlation between the BIS nonplanning component and log(kNV) is positive and significant (Pearson r=0.3, p<0.05).

Time and reward re-scaling


In our main experimental tasks we used two units for delays, seconds and days, where 1 day = 86400 s. We also used three exchange rates: 1 coin = 0.1 CNY for the non-verbal task, 1 coin = 0.05 CNY for the verbal short delay task, and 1 coin = 4 CNY for the verbal long delay task. Humans tend to discount large rewards less steeply than small rewards; that is, discount rates tend to increase as amounts decrease (Green et al., 1999; Green et al., 2004). We therefore re-calculated the median BHM fits of the softmax-hyperbolic model in three steps: 1) we converted them to the same units (1/days), giving kNV=4173.1 (obtained by multiplying the per-second k by the number of seconds per day), kSV=2548.8 and kLV=0.0356; 2) we considered reward re-scaling, "going from $10 to $.20, a factor of 50, k values would increase by a factor of 2" (Navarick, 2004), giving kNV=4173.1, kSV=2548.8 and kLV=0.0712; and 3) we concluded that the discrepancy in discount rates between time-horizons cannot be accounted for by magnitude effects. Thus, the discount rate revealed in the verbal short delay task is more than 10^4 times larger than the rate describing the choices made by the same participants in the verbal long delay task.
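The arithmetic in this paragraph can be checked with a few lines of Python. This is only a sketch: the three median rates and the factor of 2 are the values quoted in the text, while the function and variable names are illustrative and do not come from the authors' analysis code.

```python
SECONDS_PER_DAY = 86400  # 1 day = 86400 s, as stated in the text


def per_second_to_per_day(k_per_second):
    """Convert a hyperbolic discount rate from units of 1/s to 1/day."""
    return k_per_second * SECONDS_PER_DAY


# Median discount rates in 1/day, as reported in the text.
k_nv_per_day = 4173.1
k_sv_per_day = 2548.8
k_lv_per_day = 0.0356

# Magnitude adjustment quoted from Navarick (2004): k roughly doubles when
# the amount shrinks by a factor of 50. Applying it to k_LV follows the text.
MAGNITUDE_FACTOR = 2.0
k_lv_rescaled = k_lv_per_day * MAGNITUDE_FACTOR

# How much steeper is discounting in the short verbal task than in the long one?
ratio_sv_to_lv = k_sv_per_day / k_lv_rescaled
print(f"k_LV after re-scaling: {k_lv_rescaled:.4f} per day")
print(f"k_SV / k_LV: {ratio_sv_to_lv:.0f}")
```

Even after the magnitude adjustment, the ratio stays above 10^4, which is the point the paragraph makes.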

Analysis

To ensure that our results and main conclusions did not depend on the estimation method (e.g. Bayesian hierarchical vs. maximum likelihood estimation of individual subject parameters) or the functional form (e.g. exponential vs. hyperbolic), we validated our results with several methods. We estimated subjects’ time-preferences individually (since discount factors differ among people) for each experimental task with maximum likelihood estimation (MLE) and used leave-one-trial-out cross-validation for model comparison. In the delay-discounting literature, there is no consensus on which functional form of discounting best describes human behavior: the exponential model (Samuelson, 1937) of time discounting has a straightforward economic meaning, a constant probability of loss of reward per waiting time, whereas the hyperbolic model (Mazur, 1987) seems to describe more accurately how individuals discount future rewards, in particular preference reversals (Berns et al., 2007). We considered both a shift-invariant softmax rule and a scale-invariant matching rule to transform the subjective utilities of the sooner and later offers into a probability of choosing the later offer. Thus, we considered four model classes: (1) hyperbolic utility with softmax, (2) exponential utility with softmax, (3) hyperbolic utility with matching rule and (4) exponential utility with matching rule. We also considered models that account for utility curvature, that is, V is replaced by $V^{\alpha_i}$, and models that account for trial number and cumulative waiting time. Based on the Bayesian information criterion (BIC; top three models by BIC: (2) −179.47 (SE = 4.99), (1) −191.16 (SE = 4.96), and (4) −192.03 (SE = 4.82)) and the number of subjects that were well described by the models, the softmax-hyperbolic model (1) was selected.

Following modern statistical convention, we used a Bayesian hierarchical model (BHM; brms 2.0.1; Carpenter et al., 2016; Bürkner, 2017), which allows for pooling data across subjects, recognizing individual differences and estimating posterior distributions rather than point estimates of the parameters. We validated that our results were not sensitive to the model-fitting method used: the means of the BHM posteriors of the individual discount-factors for each task are almost identical to the individual fits done for each experimental task separately using maximum likelihood estimation through fmincon in Matlab (Figure 2—figure supplement 1, Figure 2—figure supplement 2). We further validated the BHM method by simulating choices from a population of ‘agents’ with known parameters and demonstrating that we could recover those parameters given the same number of choices per agent as in our actual dataset (not shown). To assess the goodness of fit for individual subjects in each task, we computed the Bayesian R² using the ‘bayes_R2’ function in the ‘brms’ package in R.
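The parameter-recovery check described above can be illustrated with a much simpler stand-in. The sketch below simulates one agent with a known log(k) under the softmax-hyperbolic model and recovers it by grid-search maximum likelihood. It is not the authors' BHM or their fmincon fit: the offer set, the fixed decision noise and all function names are assumptions made for illustration.

```python
import math
import random


def p_later(later_reward, delay, sooner_reward, log_k, tau):
    """Probability of choosing the later option (softmax-hyperbolic model)."""
    u_later = later_reward / (1.0 + math.exp(log_k) * delay)
    return 1.0 / (1.0 + math.exp(-(u_later - sooner_reward) / tau))


def simulate_choices(trials, log_k, tau, rng):
    """Draw one binary choice (1 = later) per trial for an agent with known parameters."""
    return [int(rng.random() < p_later(lr, d, sr, log_k, tau)) for lr, d, sr in trials]


def neg_log_likelihood(trials, choices, log_k, tau):
    """Negative log-likelihood of the observed choices for a candidate log(k)."""
    total = 0.0
    for (lr, d, sr), choice in zip(trials, choices):
        p = min(max(p_later(lr, d, sr, log_k, tau), 1e-9), 1.0 - 1e-9)
        total -= math.log(p if choice else 1.0 - p)
    return total


def fit_log_k(trials, choices, tau, grid):
    """Grid-search MLE of log(k); tau is held at its true value for simplicity."""
    return min(grid, key=lambda lk: neg_log_likelihood(trials, choices, lk, tau))


# Hypothetical offer set: delays span 3-64 s as in the short tasks, but the
# specific delays, reward sizes and repeat count are illustrative assumptions.
delays = [3, 6.5, 14, 30, 64]
later_rewards = [5, 6, 8, 10]
sooner_reward = 4
trials = [(lr, d, sooner_reward) for lr in later_rewards for d in delays] * 8

rng = random.Random(0)
true_log_k, tau = -3.0, 1.0
choices = simulate_choices(trials, true_log_k, tau, rng)
grid = [-6.0 + 0.05 * i for i in range(121)]  # candidate log(k) from -6 to 0
recovered_log_k = fit_log_k(trials, choices, tau, grid)
print(f"true log(k) = {true_log_k}, recovered log(k) = {recovered_log_k:.2f}")
```

With 160 simulated trials the recovered value should land close to the true one; a full recovery study would repeat this over a population of agents and also estimate the decision noise.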

The first non-verbal session data was excluded from model-fitting due to a comparatively high proportion of first-order violations relative to the following two non-verbal sessions (from 26% of trials in the first non-verbal session (NV1) to 19% and 13% for the next two non-verbal sessions, NV2 and NV3, respectively, Wilcoxon signed-rank test, NV1 vs. NV2 & NV1 vs. NV3, p<0.01). Compliance with first-order stochastic dominance means that, in principle, this observed behavior can be adequately modeled with a utility-function style analysis (Tymula et al., 2013; Yamada et al., 2013). In the non-verbal task, violations could result from lapses in attention, motor errors or difficulty in transforming the perceptual stimuli into offers (in particular, early in the first session, before learning was complete). In the verbal tasks, inattention and/or misunderstanding are likely explanations of violations. Importantly, NV1 did not differ significantly in choice consistency (the number of preference reversals was not significantly different between NV1 and later non-verbal sessions, Wilcoxon signed-rank test, all p>0.2).

A mixed-effects model with six population-level and four subject-level parameters was used to estimate discount-factors and decision-noise from choices. The ‘brms’ package in R (Bürkner, 2017) allows nonlinear multilevel BHMs to be fit in Stan (Carpenter et al., 2016) with the standard R formula syntax:

choice ~ inv_logit((later_reward / (1 + exp(logk) * delay) - sooner_reward) / noise),
noise ~ task + (1|subjid),
logk ~ task + (task|subjid)

where later_reward is the later reward, sooner_reward is the sooner reward; logk is the natural logarithm of the discounting parameter k and noise (τ) is the decision noise (as in Equation 1 and Equation 2, respectively). The population level effects estimate shared shifts in delay discounting log(k) and decision noise τ (e.g. if all subjects are more impulsive in one task).

At the subject level, this model transforms the rewards and delays on each trial, together with individual preferences, into a probability distribution over the subject’s choice. For the non-verbal task, we assumed that the subjects had an unbiased estimate of the meaning of the frequency and AM modulation of the sound. Rewards and delays are converted into the subjective value of each choice option using the hyperbolic utility model (Equation 1). Then, Equation 2 (a logit, or softmax function) translates the difference between the subjective value of the later option and the subjective value of the sooner option (estimated using Equation 1) into a probability of a later choice for each subject. The two functions below rely on four parameters per subject: $k_{i,t}$, that is $(k_{i,NV}, k_{i,SV}, k_{i,LV})$, the discount factor of subject $i$ in each task $t$, and $\tau_i$, the individual decision noise. For example, for subject 12 in task NV the effective discount factor is the product of the population-level discount factor in NV and the effect of subject 12 in NV, $k_{12,NV} = \hat{k}_{NV} \times \dot{k}_{12,NV}$.

Hyperbolic utility model:

(1) $U_i = \frac{V}{1 + k_{i,t} T}$

where V is the current value of the delayed asset and T is the delay time.

Softmax rule:

(2) $P(L_i) = \frac{e^{U_{L_i}/\tau_i}}{e^{U_{L_i}/\tau_i} + e^{U_{S_i}/\tau_i}}$

where L is the later, S is the sooner offer and τi is the individual decision noise.
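As a concrete illustration of Equations 1 and 2, the sketch below computes the hyperbolic utility of two offers and the resulting probability of choosing the later one. The function names and the example offer are hypothetical. The logistic form used here is algebraically identical to the two-option softmax in Equation 2 and matches the inv_logit expression in the model formula, where the sooner reward enters undiscounted.

```python
import math


def hyperbolic_utility(value, delay, k):
    """Equation 1: U = V / (1 + k * T)."""
    return value / (1.0 + k * delay)


def softmax_p_later(u_later, u_sooner, tau):
    """Equation 2, written as a logistic function of the utility difference."""
    z = (u_later - u_sooner) / tau
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)  # numerically stable branch for negative z
    return ez / (1.0 + ez)


# Hypothetical offer: 10 coins after 10 time units vs. 5 coins now, with k = 0.1.
k, tau = 0.1, 1.0
u_later = hyperbolic_utility(10.0, 10.0, k)   # 10 / (1 + 0.1 * 10) = 5
u_sooner = hyperbolic_utility(5.0, 0.0, k)    # an immediate reward is not discounted
p_later = softmax_p_later(u_later, u_sooner, tau)
print(f"U_later = {u_later}, U_sooner = {u_sooner}, P(later) = {p_later}")
```

In this example the two utilities are equal, so the agent is indifferent and the probability of choosing the later option is one half. A larger k or a longer delay pushes that probability below one half.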

To test for differences across tasks we examined the BHM fits using the ‘hypothesis’ function in the ‘brms’ R package. This function allows us to directly test the posterior probability that the log(k) is shifted and/or scaled between treatments. This function returns an ‘evidence ratio’ which tells us how much we should favor the hypothesis over the inverse (e.g. P(a>b)/P(a<b)) and we used Bayesian confidence intervals to set a threshold (p<0.05) to assist frequentists in assessing statistical significance.
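For a one-sided comparison, the evidence ratio is the posterior odds of the hypothesis against its alternative, which can be approximated directly from posterior draws. The sketch below shows that calculation on made-up draws. It is an illustration of the quantity, not the ‘brms’ implementation, and the function name is hypothetical.

```python
import math


def evidence_ratio(samples_a, samples_b):
    """Posterior odds P(a > b) / P(a < b), estimated from paired posterior draws."""
    greater = sum(1 for a, b in zip(samples_a, samples_b) if a > b)
    less = sum(1 for a, b in zip(samples_a, samples_b) if a < b)
    if less == 0:
        return math.inf  # every draw favors the hypothesis
    return greater / less


# Made-up draws of log(k) in two tasks (not data from the study).
draws_task_a = [-3.1, -2.9, -3.4, -2.7, -3.0, -3.2, -2.8, -3.6]
draws_task_b = [-3.3, -3.0, -3.2, -3.1, -3.5, -3.4, -3.0, -3.3]
print(evidence_ratio(draws_task_a, draws_task_b))
```

An evidence ratio well above 1 favors a > b, a ratio near 1 indicates that the posterior does not distinguish the two, and a ratio well below 1 favors a < b.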

The bootstrapped (mean, median and variance) tests are done by sampling with replacement and calculating the sample statistic for each of the 10000 boots, therefore creating a distribution of bootstrap statistics and (i) testing where 0 falls in this distribution for unpaired tests or (ii) doing a permutation test to see whether the means are significantly different for paired tests.
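The unpaired version of this bootstrap procedure can be sketched as follows. The text does not spell out how the position of 0 in the bootstrap distribution is converted into a p-value, so the two-sided rule below is an assumption, as are the function names and the example data.

```python
import random
import statistics


def bootstrap_distribution(values, statistic, n_boot=10000, seed=0):
    """Resample with replacement and compute the statistic for each bootstrap sample."""
    rng = random.Random(seed)
    n = len(values)
    return [statistic(rng.choices(values, k=n)) for _ in range(n_boot)]


def two_sided_p_from_zero(boot_stats):
    """Two-sided p-value from where 0 falls in the bootstrap distribution (assumed rule)."""
    n = len(boot_stats)
    at_or_below = sum(1 for s in boot_stats if s <= 0) / n
    at_or_above = sum(1 for s in boot_stats if s >= 0) / n
    return min(1.0, 2.0 * min(at_or_below, at_or_above))


# Hypothetical per-subject differences (not data from the study).
differences = [1.0, 1.2, 0.8, 1.1, 0.9]
boot_means = bootstrap_distribution(differences, statistics.mean)
boot_medians = bootstrap_distribution(differences, statistics.median)
print(two_sided_p_from_zero(boot_means), two_sided_p_from_zero(boot_medians))
```

Passing a different statistic, such as statistics.variance, reuses the same resampling loop. The paired case in the text uses a permutation test instead and is not covered by this sketch.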

Simulations for both the model-based and model-free analyses are described in detail in Figure 2—figure supplement 3 and Figure 3—figure supplement 4.

To estimate the effect of adaptation (Figure 6B,C), we first used the fitted parameters from the hierarchical model to transform each offer to each subject into a difference in utility, ΔU. We classified the first four trials in a long or short task as early trials. Then, we fit a generalized linear mixed model (using the function ‘glmer’ from the ‘lme4’ R package) where we fit the choice of the subjects with fixed effects ΔU, early/late and LV/SV, and the interactions LV/SV:ΔU and early/late:LV/SV. We also included a slope and intercept for each subject as random effects. To test for the significance of this adaptation effect, we compared this model to a reduced nested model where we removed the early/late term and the early/late:LV/SV interaction.

Full model with adaptation:

choice ~ ΔU + LVSV:ΔU + LVSV*early + (1 + ΔU | subjid)

Reduced model without adaptation:

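choice ~ ΔU + LVSV:ΔU + LVSV + (1 + ΔU | subjid)

| | Df | AIC | BIC | logLik | Deviance | Chisq | Chi df | Pr(>chisq) |
|---|---|---|---|---|---|---|---|---|
| Reduced Model | 7 | 10955.51 | 11010.97 | −5470.76 | 10941.51 | | | |
| Full Model | 9 | 10932.40 | 11003.70 | −5457.20 | 10914.40 | 27.11 | 2 | <10^-4 |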
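The model comparison reported above is a likelihood-ratio test between nested models, and its numbers can be reproduced from the two log-likelihoods. The sketch below uses the reported log-likelihoods and parameter counts. The function names are illustrative, and the p-value uses the closed-form chi-square tail for two degrees of freedom, so it applies only to this two-parameter difference.

```python
import math


def likelihood_ratio_test_df2(loglik_reduced, loglik_full):
    """Chi-square statistic and p-value for nested models differing by 2 parameters."""
    chisq = 2.0 * (loglik_full - loglik_reduced)
    p_value = math.exp(-chisq / 2.0)  # chi-square survival function for df = 2
    return chisq, p_value


def aic(loglik, n_params):
    """Akaike information criterion: 2 * (number of parameters) - 2 * logLik."""
    return 2.0 * n_params - 2.0 * loglik


# Log-likelihoods and parameter counts as reported in the table.
loglik_reduced, df_reduced = -5470.76, 7
loglik_full, df_full = -5457.20, 9

chisq, p_value = likelihood_ratio_test_df2(loglik_reduced, loglik_full)
print(f"Chisq = {chisq:.2f}, p = {p_value:.2e}")
print(f"AIC reduced = {aic(loglik_reduced, df_reduced):.2f}, AIC full = {aic(loglik_full, df_full):.2f}")
```

The recomputed statistic agrees with the tabulated Chisq up to rounding of the log-likelihoods, and the p-value falls below the reported bound of 10^-4.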

Software

Tasks were written in Python using the PsychoPy toolbox (1.83.04; Peirce, 2007). All analyses and statistics were performed either in Matlab (version 8.6 or higher, The Mathworks, MA) or in R (3.3.1 or higher, R Foundation for Statistical Computing, Vienna, Austria). The R package ‘brms’ (2.0.1) was used as a wrapper for Rstan (Guo et al., 2016) for Bayesian nonlinear multilevel modeling (Bürkner, 2017), and shinystan (Gabry, 2015) was used to diagnose and develop the brms models. The R package ‘lme4’ was used for linear mixed-effects modeling (Bates et al., 2014).

Data availability


Software for running the task, as well as the data and analysis code for regenerating our results are available at https://github.com/erlichlab/delay3ways/tree/v1.0 (Lukinova and Erlich, 2018; copy archived at https://github.com/elifesciences-publications/delay3ways).

References

  6. D Bates, M Maechler, B Bolker, S Walker (2014) Package Lme4: Linear Mixed-Effects Models Using Eigen and S4, version 1. R Package.
  16. GB Chapman, AS Elstein (1995) Valuing the future: temporal discounting of health and money. Medical Decision Making : An International Journal of the Society for Medical Decision Making 15:373–386. https://doi.org/10.1177/0272989X9501500408
  18. J Cohen (1988) Statistical Power Analysis for the Behavioral Sciences (2). Hillsdale, United States: Lawrence Erlbaum Associates.
  24. C Eckel, C Johnson, C Montmarquette (2005) Saving Decisions of the Working Poor: Short-and Long-Term Horizons. Emerald Group Publishing Limited.
  30. BJ Fung, C Murawski, S Bode (2017) Caloric primary rewards systematically alter time perception. Journal of Experimental Psychology. Human Perception and Performance 43:1925–1936. https://doi.org/10.1037/xhp0000418
  33. J Gabry (2015) Shinystan: Interactive Visual and Numerical Diagnostics and Posterior Analysis for Bayesian Models, version 2. R Package.
  54. GJ Madden, WK Bickel (2010) Impulsivity: The behavioral and neurological science of discounting. American Psychological Association.
  55. JE Mazur (1987) An adjusting procedure for studying delayed reinforcement. Commons, ML.; Mazur, JE.; Nevin, JA, Pp 73:55.
  87. AA Tymula, PW Glimcher (2016) Expected subjective value theory (ESVT): A representation of decision under risk and certainty. SSRN Electronic Journal, 10.2139/ssrn.2783638.
  88. A Vanderveldt, L Oliveira, L Green (2016) Delay discounting: pigeon, rat, human—does it matter? Journal of Experimental Psychology: Animal Learning and Cognition 42:141–162. https://doi.org/10.1037/xan0000097
  91. F Wemelsfelder (1984) Animal Boredom: Is a Scientific Study of the Subjective Experiences of Animals Possible? In: M. W Fox, L. D Mickley, editors. Advances in Animal Welfare Science. Springer Netherlands. pp. 115–154.
  94. M Wittmann, MP Paulus (2009) Temporal horizons in decision making. Journal of Neuroscience, Psychology, and Economics 2:1–11. https://doi.org/10.1037/a0015460

Decision letter

  1. Daeyeol Lee
    Reviewing Editor; Yale School of Medicine, United States
  2. Timothy E Behrens
    Senior Editor; University of Oxford, United Kingdom

In the interests of transparency, eLife includes the editorial decision letter, peer reviews, and accompanying author responses.


Thank you for submitting your article "Time preferences are reliable across time-horizons and verbal vs. experiential tasks" for consideration by eLife. Your article has been reviewed by three peer reviewers, and the evaluation has been overseen by a Reviewing Editor and Timothy Behrens as the Senior Editor. The following individual involved in review of your submission has agreed to reveal his identity: Mehrdad Jazayeri (Reviewer #2). The other reviewers remain anonymous.

The Reviewing Editor has highlighted the concerns that require revision and/or responses, and we have included the separate reviews below for your consideration. If you have any questions, please do not hesitate to contact us.

Summary:

Presently, there is no universal consensus as to whether the discount rates estimated in humans and other animals using different behavioral paradigms of inter-temporal choice all concern the same underlying process or not. This study manipulated two important factors related to this important question, namely, how the delay is cued or indicated, and the range of temporal intervals between choice and reward delivery. The main results support that there is a common core mechanism.

Major concerns:

Reviewers raised several major concerns regarding the description of the methods and experimental design, and also proposed some additional control analyses.

Separate reviews (please respond to each point):

Reviewer #1:

Time preferences are investigated in both humans and animals, but temporal discounting tasks used in humans and animals differ with regard to duration of temporal delays and stimulus modalities. To empirically test the impact of these factors, the authors compared discount rates in humans between a novel non-verbal temporal discounting task with verbal tasks including short and long delays. The results reveal moderate to strong correlations between discount factors across tasks.

There is much to like about the well-written manuscript: it addresses the important topic of the relevance of animal studies for understanding human behavior, the three temporal discounting tasks were rigorously matched with regard to choice structure, and several control analyses and control experiments were conducted to test alternative hypotheses. My main concern is that the current results allow no clear conclusions with regard to the theoretical question which mechanisms are shared (and which not) by the different temporal discounting tasks and that the parallels may be more imposed than it currently seems.

Theoretical impact:

The goal of the current experiments was to assess the impact of two methodological differences between human and animal studies, time horizon and stimulus modality (language-based vs. language-free). However, as the authors themselves write in the Discussion section, comparing discount rates across short and long time delays is not really novel, as this issue has already been addressed by previous studies. This leaves the comparison of different stimulus modalities as the main novel aspect of the current study, and unfortunately the results are not very conclusive here. This is because the correlations of the SN with the SV (0.54) and LV (0.36) are moderate in effect size, which suggests that the tasks measure both partially overlapping and at the same time also clearly dissociable constructs. While this is an interesting empirical finding in its own right, the implications of such moderate correlations for comparisons between human and animal studies are less clear.

The authors themselves appear to oscillate between different interpretations: for example, according to the manuscript title time preferences are reliable across verbal and experiential tasks, and also the Discussion section acknowledges that the three tasks share common cognitive and neural mechanisms, whereas at the same time "caution is warranted" when reaching conclusions from experiential short-delay paradigms in animals to verbal long-delay intertemporal decisions in humans. All this is true and at the same time somewhat trivial. From a theoretical perspective it would have been informative to get some idea about what these common and dissociable mechanisms might be. However, the current study provides no answer to this question and further experiments would be needed.

Methodological concerns:

The authors used the monetary rewards in the different temporal discounting tasks, but it seems puzzling why participants chose smaller-sooner rewards at all in the short-delay tasks (SV and SN), given that subjects received the rewards only at the end of the experiment instead of after the delays. Could participants finish an experimental session earlier by choosing the SS instead of the LL options? This would point to possible opportunity costs associated with the LL reward option. Alternatively, could a demand effect contribute to the behavior? In addition, the authors report that participants might have experienced the "slot machine sound" that accompanied (virtual) reward deliveries as rewarding. This suggests that short-delay and long-delay tasks might not have been matched with regard to reward magnitude and modality, because in the LV task participants made choices between SS and LL monetary rewards, whereas in the SV and SN tasks they decided between sooner or later sounds that were experienced as secondary reinforcers (even though these had additionally also consequences for their payoffs).

Participants performed the three tasks in a fixed order (starting with the SN task, then performing the SV and NV tasks in alternating order). It seems possible that after the SN task participants internally set a decision criterion regarding the delay length/waiting time (relative to the given context) that they considered acceptable for a specific LL reward magnitude, which might be an alternative explanation for the significant correlations between the tasks. Due to the fixed task order, I see no way for empirically ruling out such an anchoring effect.

Following on from the previous point, I would assume that the implemented delays and magnitudes were distributed in the same fashion in the different tasks? If so, it may be less surprising that there are correlations between short and long-horizon tasks, at least when relative processing dominates (as in the primary experiment). Participants may simply transfer their choice patterns across similarly structured environments, possibly facilitated by participants desiring to be consistent in their proportions of choices. Conversely, if they were exposed to both short and long (hypothetical or real) delays within the same experiment, I would predict that their choice patterns look rather different.

Is the lack of correlation with BIS-11 not an indication of limited validity of the task?

Statistical analysis:

The authors claim that their tasks show a high test-retest reliability, but it seems that the authors only tested for significant differences in discount parameters between the first and second half of blocks and between experimental sessions. To assess the reliability of a measure, however, it seems more appropriate to compute correlations, that is, to test whether the most impulsive subjects in the first experimental session stay the most impulsive ones in the other sessions.

Regarding the context effect observed in control experiment 2: first of all, I wonder whether such a context effect occurred also in the main experiment or in control experiment 1? The sample size is rather small in control experiment 2 (N=16), so it would be good to test the robustness of this effect in a larger sample size. Furthermore, in the main text, this effect is described as a "small but significant adaptation effect", which seems to contradict the headline of this section ("Strong effect of temporal context"). Given that this effect appears rather weak, whereas the discount factors for the SV and LV tasks are surprisingly similar, I think the authors should be more careful with rejecting the "costs of waiting"-hypothesis. I do not doubt the existence of such a context adaptation effect, but it might just be an additional factor besides the higher costs of waiting/opportunity costs in the SV relative to the LV task.

Minor Comments:

Please specify which learning stage the excluded participants did not pass.

Reviewer #2:

Authors examine patterns of delay discounting across three tasks, one non-verbal (experiential) and two verbal (one with long delay and one with short delay), and argue that individuals exhibit similar behavioral patterns scaled by the temporal context despite differences across conditions (verbal vs. non-verbal and short versus long delays).

The manuscript makes two valuable contributions. First, it provides evidence in support of comparing experiential studies in animal models to a large body of literature using verbal tests in humans. Second, it provides evidence that temporal context plays an important role in delay-discounting behavior.

I am generally positive, and have some specific comments that should help the authors improve the manuscript and make it more accessible.

Comments:

Results section, first paragraph: The description of reward for LV can be made clearer.

Results section first paragraph: The model description in the Results is too terse. I suggest explaining a little bit more the parameters since comparison of model parameters is an important part of Results. For example, "We found that k, for all tasks, had a log-normal distribution across our subjects," would not make a whole lot of sense to a non-expert.

Along the same lines, it would be good to start with a simulation of the model showing the expected effects of various key parameters. That would make understanding of the rest of the paper easier.

The explanation of model is clearer in Materials and methods but if the authors wish to make the results accessible to a larger readership, more details would help. Basically, I recommend unpacking the term "A 6 population level and 4 subject level parameters model." The subject level is explained well but the population level needs clarification. It appears that certain aspects of the package used for modeling are described as a black box.

Figure 2: It would help to discuss in Results the discrepancies between data and model. In some cases, there seem to be systematic errors that are ignored. Also, it would help to provide a measure of goodness of fit across subjects and tasks to get a better sense of how widespread such systematic errors were across subjects.

Figure 3, 4 and 5: Are the linear fits based on total least squares?

Paragraph three of subsection “Strong effect of temporal context”: HBA must be defined (it is defined in later use).

Overall, a very nice paper!

Reviewer #3:

This paper presents evidence that individual differences in temporal discounting have a degree of stability across large differences in both the scale of delays at stake and the format of the decision task. Human participants performed 3 different intertemporal choice tasks. The sample size (n=63) is on the modest side for an individual differences study, although there is quite a bit of data per individual. The tasks ranged from resembling standard rodent paradigms (with a nose-port-like interface, nonverbal cues, and directly experienced delays on the order of seconds) to resembling standard human paradigms (verbally cued delays on the order of days).

The findings add to our knowledge about the stability and generality of temporal discounting, and the comparability of human and animal experimental paradigms. This work is analogous and complementary to other recent research that compared risky decision making for probabilities derived from experience versus description.

I have a few comments and suggestions for potentially strengthening the manuscript.

1) As the authors aptly note (subsection “Subjects’ time-preferences are reliable across both verbal/experiential and second/day differences”), the correlation between two variables is independent of any difference in their means. There were a couple places where only one of these aspects of the data was quantified, and I thought it would be useful to see both:

a) It would be useful to see test-retest reliability reported in terms of the correlation across sessions, in addition to the non-significant Wilcoxon signed-ranks test. This would more thoroughly support the statements about within-task reliability in the conclusions section (subsection “Stability of preferences”).

b) For the days-vs-weeks control experiment, it would be nice to see the results of a paired-samples test comparing k-values in the two conditions, not just the correlation (subsection “Strong effect of temporal context” and Figure 5A). Although the correlation is high, the data in Figure 5A look like they might be systematically offset from the unity line.

2) It would be helpful to have more information about how the magnitude-delay pairs were constructed. For instance, what were the ranges of amounts and delays? Were they paired so as to cover a particular range of indifference k-values? (Subsection “Experimental Design” paragraph five and Figure 3—figure supplement 2 give partial information about this but not the complete picture.)

3) Subsection “Strong effect of temporal context” says k-values in the LV task were "almost equivalent (ignoring unexplained variance) to those in the SV task." I found this confusing because the preceding paragraphs emphasized that the LV k-values were significantly lower on average and also had higher variance than the SV k-values. Maybe this sentence just means to refer to the fact that LV and SV were correlated?

I also found it odd that the mismatch in units wasn't dealt with until this paragraph (i.e., whether k-value represents discounting per day or per second). I had assumed common units were being used when I first read the comparisons of k-values between tasks (paragraph five subsection “Subjects’ time-preferences are reliable across both verbal/experiential and second/day differences”). I think it would be helpful either to use matching units throughout, or point out (and explain) the choice not to at the outset.

4) A striking aspect of the results is the large difference in discount rates between short, directly experienced delays and long, non-experienced delays. In addition to considering the possibility that experienced delay is uniquely aversive (subsection “Cost of waiting vs. discounting future gains”), it would be interesting to consider the possible role of opportunity costs. I gather the NV and SV conditions didn't involve direct opportunity costs within the context of the experiment; that is, choosing longer delays didn't reduce the total number of trials, so the reward-maximizing strategy would always be to choose the larger reward? Did participants know in advance that they had a fixed number of trials (rather than having a time budget)? It would be interesting to know how participants' earnings compared to what they could have earned by following the reward-maximizing strategy. It would also be interesting to know whether they managed to finish the session and leave earlier than they would have by following the reward-maximizing strategy.

Minor Comments:

1) In the first paragraph of the Results section please say whether the given number of trials (160) is total or per task.

2) In the third paragraph of subsection “Stability of preferences” I would replace "eliminated" with "matched" or something similar. I initially read it as "ruled out," which is the opposite of the intended meaning.

3) Tables 2 and 3 would benefit from more descriptive legends. In particular, I initially misunderstood the Table 3 legend as meaning the outcome variable for this analysis was k-value variance (along the lines of the scaling effect mentioned for Table 4).

4) In Figure 5, I suggest noting explicitly in the legend that panels B-C pertain to the main experiment (and not the same experiment represented in panel A).

[Editors' note: further revisions were suggested prior to acceptance, as described below.]

Thank you for submitting your article "Time preferences are reliable across time-horizons and verbal vs. experiential tasks" for consideration by eLife. Your article has been reviewed by three peer reviewers, and the evaluation has been overseen by a Reviewing Editor and Timothy Behrens as the Senior Editor. The following individual involved in the review of your submission has agreed to reveal their identity: Mehrdad Jazayeri (Reviewer #2).

The manuscript has been improved but there are some remaining issues that we suggest you address before this is published.

1) The authors mention that boredom or opportunity costs may play a role with short delay. One may wonder whether these factors point at a potential difference between human and animal tasks as primary rewards could be ingested as they arrive. By extension, the presumed parallel of the present task with animal tasks may be smaller than assumed. This potential limitation could be mentioned in the discussion as the comparison of human and animal research is a major motivation for the present study.

2) In the analysis in the final paragraph of subsection “Subjects’ time-preferences are reliable across both verbal/experiential and second/day differences”, which compares discount rates across tasks, it's now stated clearly that different units are used for k-values in the different tasks. But it might be beneficial to more fully describe the motivation for the analysis in light of this. Why is it of interest to test whether per-second discount rates in one task differ from per-day discount rates in another?

3) The Table 3 legend seems to have a typo (the 2nd occurrence of k_NV should be k_LV), and the abbreviation "Ev. Ratio" should be defined and explained (the evidence ratio is not introduced until paragraph seven of the “Analysis” section).

4) In subsection “Time and Reward Re-Scaling” I didn't understand why the k-values were referred to as "unit-free".

https://doi.org/10.7554/eLife.39656.036

Author response

Reviewer #1:

Time preferences are investigated in both humans and animals, but temporal discounting tasks used in humans and animals differ with regard to duration of temporal delays and stimulus modalities. To empirically test the impact of these factors, the authors compared discount rates in humans between a novel non-verbal temporal discounting task with verbal tasks including short and long delays. The results reveal moderate to strong correlations between discount factors across tasks.

There is much to like about the well-written manuscript: it addresses the important topic of the relevance of animal studies for understanding human behavior, the three temporal discounting tasks were rigorously matched with regard to choice structure, and several control analyses and control experiments were conducted to test alternative hypotheses. My main concern is that the current results allow no clear conclusions with regard to the theoretical question which mechanisms are shared (and which not) by the different temporal discounting tasks and that the parallels may be more imposed than it currently seems.

This is an important comment and in our revised draft we make clear where our study contributes to the literature. Specifically, in the Discussion section we say: “this task can be applied to both human and animal research for direct comparison of cognitive and neural mechanisms underlying delay-discounting … animal models of delay-discounting may have more in common with short time-scale consumer behavior such as impulse purchases.… caution is warranted when reaching conclusions from the broader applicability of these models [short time-scale] to long-time horizon real-world decisions.”

Theoretical impact:

The goal of the current experiments was to assess the impact of two methodological differences between human and animal studies, time horizon and stimulus modality (language-based vs. language-free). However, as the authors themselves write in the Discussion section, comparing discount rates across short and long time delays is not really novel, as this issue has already been addressed by previous studies.

This comment builds on the prior point that the earlier draft did not make explicit the contribution of this study. To the best of our knowledge, and from undertaking a second review of the literature while completing the revision, our study offers several advantages relative to earlier work that compared decisions across time horizons. Our within-subject design removes the possibility that any differences arise due to unobserved confounds that differ between subjects. While a few previous studies also considered within-subject variation, these studies either had much smaller sample sizes and lower statistical power or used hypotheticals where none of the delayed rewards were actually experienced (e.g. Johnson et al., 2015). For those with low power, it is not a surprise that prior findings with small samples of within-subject variation did not yield conclusive results.

This leaves the comparison of different stimulus modalities as the main novel aspect of the current study, and unfortunately the results are not very conclusive here. This is because the correlations of the SN with the SV (0.54) and LV (0.36) are moderate in effect size, which suggests that the tasks measure both partially overlapping and at the same time also clearly dissociable constructs. While this is an interesting empirical finding in its own right, the implications of such moderate correlations for comparisons between human and animal studies are less clear.

We apologize if our descriptions of the correlations were unclear. We report correlations for the main experiment in Table 1 and for the control experiments in Figure 4 and Figure 5. The Spearman correlation between non-verbal (NV) and short-verbal (SV) for the main experiment was 0.76 (Table 1); the Pearson correlation was 0.79. We have added the following text to the Discussion section on reliability: "This correlation [0.79] is on par with test-retest reliability of personality traits (Viswesvaran and Ones, 2000)." The correlation between the short and long tasks was lower, and we acknowledge and discuss that difference at length in the discussion. We agree with the reviewer that the lower correlation between short and long tasks suggests that there are partially overlapping (but dissociable) mechanisms underlying short vs. long tasks. However, before our study the relative contribution of time-horizon vs. verbal-experiential to reliability of measured time-preferences was unknown.

The authors themselves appear to oscillate between different interpretations: for example, according to the manuscript title time preferences are reliable across verbal and experiential tasks, and also the Discussion section acknowledges that the three tasks share common cognitive and neural mechanisms, whereas at the same time "caution is warranted" when reaching conclusions from experiential short-delay paradigms in animals to verbal long-delay intertemporal decisions in humans. All this is true and at the same time somewhat trivial. From a theoretical perspective it would have been informative to get some idea about what these common and dissociable mechanisms might be. However the current study provides no answer to this question and further experiments would be needed.

We respectfully disagree with the reviewer. As to our "oscillatory" appearance, we have written the manuscript in a tone which we feel is honest to the data. The correlations between all three tasks were significant but they are not equal (Table 1). As to the "trivial" nature of our conclusions, there is no existing literature to provide guidance as to the relative reliability of time-preferences across the verbal/non-verbal gap. "To date, no published study with humans has examined discounting under a condition in which only symbolically presented information, and no specifically stated information, is provided about the delays and amounts." Vanderveldt et al., 2016.

Our main and 2nd control experiment (days vs. weeks) do provide evidence that the difference between waiting and postponing is what drives the differences across the time-horizons. We agree with the reviewer that future studies are required to fully understand the common and dissociable mechanisms between waiting and postponing. We are currently collecting brain imaging data to address that question to be published in a future manuscript.

Methodological concerns:

The authors used the monetary rewards in the different temporal discounting tasks, but it seems puzzling why participants chose smaller-sooner rewards at all in the short-delay tasks (SV and SN), given that subjects received the rewards only at the end of the experiment instead of after the delays. Could participants finish an experimental session earlier by choosing the SS instead of the LL options? This would point to possible opportunity costs associated with the LL reward option.

We have addressed these questions in Figure 3—figure supplement 1 and in the Discussion: “…there could be opportunity costs related to how much subjects value their own time. We found that in the short tasks subjects with large discount factors also performed the task faster (Figure 3—figure supplement 1). If these subjects value their time more and thus have higher costs of waiting, then given our results Figure 3B there is a surprisingly large correlation between how much subjects value their time (in the short tasks) and how much they discount postponed rewards (in the long task)”.

Alternatively, could a demand effect contribute to the behavior?

A frequent critique of laboratory studies using human subjects is the above idea of a demand effect. One of the coauthors (Lehrer) has done some work with bargaining experiments where they created a design where one choice was strictly dominated by the other options (Embrey et al., 2014). Hence, if selected, it could reflect an experimenter demand effect. In their setting they found that it was chosen less than 4% of the time and by very few subjects. In general, our sense from reading the broader literature in experimental economics is that the magnitude of demand effects is quite small.

Moreover, all of our tasks are computer-controlled and there is no interaction between the experimenter and the subject other than the instructions. Given that the vast majority of subjects' choices are sensitive to delays and rewards and are well-fit by a ~2 parameter model (per task) and that our estimates of these parameters are continuously distributed over a range that is consistent with large scale studies of delay-discounting (Sanchez-Roige et al., 2018), we think that it is more likely that subjects are expressing their preferences than trying to adjust their behavior to a perceived belief in the purpose of the experiment.

We added our instructions for the nonverbal and verbal experiments as Supplementary files to the manuscript for transparency. We read the instructions out loud and answered the participants' questions. If questions about strategy arose, we left them unanswered. In the nonverbal instructions there were no cues, except for “The only thing we can say now is USE THE MOUSE and react to different visual stimuli that appear on the screen and sound stimuli that you hear.” In the verbal experiments subjects were told about delays and waiting time. Both sets of instructions describe the experiment similarly: “play the game and earn coins”.

Embrey, M., Fréchette, G. R., and Lehrer, S. F. (2014). Bargaining and reputation: An experiment on bargaining in the presence of behavioural types. The Review of Economic Studies, 82(2), 608-631.

In addition, the authors report that participants might have experienced the "slot machine sound" that accompanied (virtual) reward deliveries as rewarding. This suggests that short-delay and long-delay tasks might not have been matched with regard to reward magnitude and modality, because in the LV task participants made choices between SS and LL monetary rewards, whereas in the SV and SN tasks they decided between sooner or later sounds that were experienced as secondary reinforcers (even though these had additionally also consequences for their payoffs).

This is an interesting point which we hadn’t considered. Inspired by the reviewer’s comment, we fit an expanded model to our data, which scales the reward differentially (per subject) in the short vs. long tasks. We found that although subjects did perceive the two rewards as slightly different, the correlations between discount factors were slightly increased, and the temporal scaling between the short and long tasks was unchanged, thus supporting our original conclusions. See section “Controlling for differences in reward experience”.

Participants performed the three tasks in a fixed order (starting with the SN task, then performing the SV and NV tasks in alternating order). It seems possible that after the SN task participants internally set a decision criterion regarding the delay length/waiting time (relative to the given context) that they considered acceptable for a specific LL reward magnitude, which might be an alternative explanation for the significant correlations between the tasks. Due to the fixed task order, I see no way for empirically ruling out such an anchoring effect.

We thank the reviewer for bringing up the potential confounds of anchoring and reference point effects. First, we apologize if it wasn't clear, but subjects were put into two groups. All subjects did three sessions of NV first. These were followed by two sessions. Half of the subjects performed LV-SV-LV-SV in each session. Half of the subjects performed SV-LV-SV-LV in each session. To make this clearer, we have added this to the caption of Figure 1 "Note: The order of short and long delay verbal for sessions 4 and 5 was counter-balanced across subjects".

As such, the temporal proximity of SV and LV to NV differed between the groups. If our results were simply an anchoring effect, we would expect the group that performed LV first to have a higher LV-NV correlation and the group that performed SV first to have a higher SV-NV correlation. Instead we found:

SLSL NV vs SV r = 0.7662 p = 8.0880e-08

LSLS NV vs SV r = 0.8173 p = 1.1087e-07

SLSL NV vs LV r = 0.3742 p = 0.0268

LSLS NV vs LV r = 0.4313 p = 0.0219

Both types of correlations are slightly higher for LSLS than for SLSL. This is not consistent with an anchoring effect.

We have added an additional supplemental figure (Figure 3—figure supplement 2) to the paper to address these issues and added additional text.

“We further checked whether the correlations between discount factors in the three tasks may have arisen due to some undesirable features of our task design. For example, different subjects experienced the offers in different orders. Anchoring effects (Tversky and Kahneman, 1974; Furnham and Boo, 2011) may have set a reference point in the early part of the experiment that guided choices throughout the rest. As such, we repeated the analyses described in the previous paragraph, but we added 6 additional factors: the mean rewards, delays presented in the first block of the 2nd and 3rd non-verbal session and also the % of yellow choices made in those blocks. We reasoned that if anchoring effects were playing a role then subjects that were presented longer delays, or smaller rewards early in the experiment should have correlations between these factors and log(k_SV) or log(k_NV). Likewise, if subjects were simply trying to be consistent with early choices, then the ‘% yellow’ in the early blocks would have an important influence. We tested the contribution of each factor by dropping it from the model to create a reduced nested model and using a likelihood ratio test against the full model. We found no evidence for anchoring effects or that subjects were simply trying to be consistent with their early choices.”
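The nested-model comparison described in this passage can be sketched as follows. This is a minimal illustration, assuming that each test drops exactly one factor (so the statistic is referred to a chi-square distribution with 1 degree of freedom); the function name and the log-likelihood values are hypothetical and are not taken from the paper's analysis.

```python
import math


def likelihood_ratio_test_1df(loglik_full, loglik_reduced):
    """Compare a full model with a nested model that drops ONE parameter.

    Returns the statistic 2 * (loglik_full - loglik_reduced) and its p-value
    under a chi-square distribution with 1 degree of freedom.
    """
    stat = 2.0 * (loglik_full - loglik_reduced)
    stat = max(stat, 0.0)  # guard against tiny negative values from numerical noise
    # For 1 degree of freedom the chi-square survival function is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Hypothetical log-likelihoods, for illustration only.
stat, p = likelihood_ratio_test_1df(loglik_full=-120.0, loglik_reduced=-120.6)
```

A large p-value for a dropped factor means the reduced model fits about as well as the full one, which is the pattern one would expect if that factor (for example, an early-block mean delay) carried no anchoring information.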

Following on from the previous point, I would assume that the implemented delays and magnitudes were distributed in the same fashion in the different tasks? If so, it may be less surprising that there are correlations between short and long-horizon tasks, at least when relative processing dominates (as in the primary experiment). Participants may simply transfer their choice patterns across similarly structured environments, possibly facilitated by participants desiring to be consistent in their proportions of choices. Conversely, if they were exposed to both short and long (hypothetical or real) delays within the same experiment, I would predict that their choice patterns look rather different.

We have added details to the Materials and methods about how the offers were presented. Although all subjects experienced all pairs of rewards and delays, the order was chosen randomly.

We included early choices in our regression mentioned above to minimize the potential that subjects were adopting a "consistency" strategy. However, even if subjects were driven by a desire to be internally consistent, that doesn't explain the wide distribution of time-preferences across subjects. Additionally, since the subjects first did the non-verbal (NV) task, it would be hard for them to be consistent (as a strategy) as they were themselves learning the mapping between sound features and rewards and delays.

Is the lack of correlation with BIS-11 not an indication of limited validity of the task?

We initially shared the concern of the reviewer, but after a closer read of the literature, we do not think that this is the case. The literature has mixed evidence for correlation between BIS and delay-discounting: some papers report significant positive correlations (Beck and Triplett, 2009; Mobini et al., 2007; Cosenza and Nigro, 2015), others don’t find significant correlations and suggest that delay discounting tasks might measure a different aspect of impulsivity (Mitchell, 1999; Fellows and Farah, 2005; Reynolds et al., 2006; Saville et al., 2010). We looked into components of BIS, since some researchers suggest that only some of them correlate with the discounting coefficient (Fellows and Farah, 2005; Mobini et al., 2007; Beck and Triplett, 2009; Ahn et al., 2016). We found that the correlation between NV log(k) and the BIS nonplanning component is positive and significant at the .05 level (r = 0.3, p < 0.05). We added additional text and analyses as follows:

“Prior research finds mixed evidence of the association between the BIS-11 and delay: some report significant positive correlations (Mobini et al., 2007; Beck and Triplett, 2009; Cosenza and Nigro, 2015), others don’t find significant correlations and suggest that delay discounting tasks might measure a different aspect of impulsivity (Mitchell, 1999; Fellows and Farah, 2005; Reynolds et al., 2006; Saville et al., 2010). Following earlier research that reports that components of the BIS score might drive the correlation with discounting (Fellows and Farah, 2005; Mobini et al., 2007; Beck and Triplett, 2009; Ahn et al., 2016), we next decomposed the score. Similar to others we found that correlation between BIS nonplanning component and log(k_NV) is positive and significant (Pearson r = 0.3, p < 0.05).”

Statistical analysis:

The authors claim that their tasks show a high test-retest reliability, but it seems that the authors only tested for significant differences in discount parameters between the first and second half of blocks and between experimental sessions. To assess the reliability of a measure, however, it seems more appropriate to compute correlations, that is, to test whether the most impulsive subjects in the first experimental session stay the most impulsive ones in the other sessions.

We have computed correlations as recommended. We report these correlations in paragraph four of “Subjects’ time-preferences are reliable across both verbal/experiential and second/day differences”: “Consistent with existing research, we find that time-preferences are stable in the same task within subjects between the first half of the block and the second half of the block within sessions (time-preferences are measured as percent ‘yellow’ choices, Wilcoxon signed-rank test, p = 0.35; Pearson r = 0.81, p < 10^-9) and also across experimental sessions that take place every two weeks: percent ‘yellow’ choice between NV sessions (Wilcoxon signed-rank test, p = 0.47; Pearson r = 0.7, p < 10^-9), between SV sessions (Wilcoxon signed-rank test, p = 0.66; Pearson r = 0.82, p < 10^-9) and a slight difference between LV sessions (Wilcoxon signed-rank test, p < 0.1; Pearson r = 0.66, p < 10^-9) (Meier and Sprenger, 2015; Augenblick et al., 2015).”

Regarding the context effect observed in control experiment 2: first of all, I wonder whether such a context effect occurred also in the main experiment or in control experiment 1? The sample size is rather small in control experiment 2 (N=16), so it would be good to test the robustness of this effect in a larger sample size. Furthermore, in the main text, this effect is described as a "small but significant adaptation effect", which seems to contradict the headline of this section ("Strong effect of temporal context"). Given that this effect appears rather weak, whereas the discount factors for the SV and LV tasks are surprisingly similar, I think the authors should be more careful with rejecting the "costs of waiting"-hypothesis. I do not doubt the existence of such a context adaptation effect, but it might just be an additional factor besides the higher costs of waiting/opportunity costs in the SV relative to the LV task.

We apologize for the confusion. The "Strong effect of temporal context" refers to the 5-order of magnitude shift in discount factors between short and long delays that we observed in the main experiment and also control experiment 1 (Figures 3, 4). The "small but significant" adaptation effect refers to the difference between the first few trials of a context and the rest of the trials in that context for the main results (n=63 subjects). We have moved this to its own figure (Figure 6) to make it clearer that it is not part of the days/weeks experiment.

We agree that the cost of waiting plays a key role in the two short tasks. In fact, we believe that the correlation between the two short tasks is higher than that between the two verbal tasks because of the importance of the cost of waiting, as distinct from postponing reward. What we intended to communicate was that it seemed like too much of a coincidence that the shift in the mean discount factors between the short and long tasks was exactly the days to seconds shift.
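For reference, the size of the days-to-seconds shift can be checked with a little arithmetic. The sketch below assumes the standard hyperbolic form V = R / (1 + kD), in which k carries units of 1/time; the function names and the example value of k are illustrative only and do not come from the paper's analysis code.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60  # 86400


def hyperbolic_value(reward, delay, k):
    """Subjective value under hyperbolic discounting (assumed form): V = R / (1 + k * D)."""
    return reward / (1.0 + k * delay)


def per_second_to_per_day(k_per_second):
    """Re-express a discount factor given in 1/second in units of 1/day."""
    return k_per_second * SECONDS_PER_DAY


# A pure change of units shifts log10(k) by log10(86400), just under 5 orders of magnitude.
unit_shift_log10 = math.log10(SECONDS_PER_DAY)

# The same physical delay yields the same value in either unit (hypothetical k).
k_sec = 0.05
v_seconds = hyperbolic_value(10, 30, k_sec)
v_days = hyperbolic_value(10, 30 / SECONDS_PER_DAY, per_second_to_per_day(k_sec))
```

Under these assumptions, a subject whose numerical k is the same per second in a short task and per day in a long task would show a shift of roughly this size once both are expressed in common units.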

Minor Comments:

Please specify which learning stage the excluded participants did not pass.

Stage 2 of the learning stages, learning 'fixation', was the most difficult stage. This is reported in the caption of Figure 1—figure supplement 1. We also added to the same caption that this is the stage that the excluded subjects could not pass.

“Some subjects experienced difficulty with the learning to ‘fixate’ during learning stage 2. Subjects that didn’t pass learning stages stopped at this stage.”

In summary, we would like to thank this referee for a comprehensive review that alerted us to the possibility of alternative explanations for the observed patterns in the data. These comments led us to undertake additional robustness exercises, increasing our confidence in the main findings.

Reviewer #2:

Authors examine patterns of delay discounting across three tasks, one non-verbal (experiential) and two verbal (one with long delay and one with short delay), and argue that individuals exhibit similar behavioral patterns scaled by the temporal context despite differences across conditions (verbal vs. non-verbal and short versus long delays).

The manuscript makes two valuable contributions. First, it provides evidence in support of comparing experiential studies in animal models to a large body of literature using verbal tests in humans. Second, it provides evidence that temporal context plays an important role in delay-discounting behavior.

I am generally positive, and have some specific comments that should help the authors improve the manuscript and make it more accessible.

Comments:

Results section, first paragraph: The description of reward for LV can be made clearer.

We revised the description of reward for LV task by adding examples in the Results section:

“In the verbal long delay task, after each choice, subjects were given feedback confirming their choice (e.g. “Your choice: 8 coins in 30 days”) and then proceeded to the next trial. Unlike the short tasks, there was no sound of dropping coins nor visual display of coins. At the end of the session, a single long-verbal trial was selected randomly to determine the payment (e.g. a subject was notified that “Trial 10 from session 1 was randomly chosen to pay you. Your choice in that trial was 8 coins in 30 days”). If the selected trial corresponded to a subject having chosen the later option, she received her reward via an electronic transfer after the delay (e.g. in 30 days).”

The exchange rates from coins to money also differed between tasks. This is pointed out in the section “Strong effect of temporal context”

Results section first paragraph: The model description in the Results is too terse. I suggest explaining a little bit more the parameters since comparison of model parameters is an important part of Results. For example, "We found that k, for all tasks, had a log-normal distribution across our subjects," would not make a whole lot of sense to a non-expert.

We have made changes throughout the manuscript in order to make the results (and the model) easier to understand for a general audience.

For example, we provided additional explanation for model parameters:

“The subject level effects are drawn from a normal distribution with mean zero. In other words, the subject level effects reflect the difference of each subject relative to the mean across subjects. As such, the actual discount factor for the $n$-th subject in the SV task is $k_{n,SV} = e^{\log(\hat{k}_{SV}) + \log(\dot{k}_{n,SV})} = \hat{k}_{SV} \cdot \dot{k}_{n,SV}$, where $\log(\hat{k}_{SV})$ represents the population level log discount factor for SV and $\log(\dot{k}_{n,SV})$ represents the subject level effect for subject $n$ in SV. For the sake of brevity, we use the term ‘discount factor’ to refer to ‘log discount factor’ throughout the text. The population level parameters reflect the mean over all subjects. For example, if the mean discount factor across subjects was equal in all tasks, then the population level discount factor parameters would also be equal. If all subjects were exactly twice as impulsive in short vs. long tasks, then that change would be reflected in the population level discount factor ($k_{SV} = 2 \cdot k_{LV} \rightarrow \log(k_{SV}) = \log(k_{LV}) + \log(2)$), and the subject level parameters would be the same across tasks. If, on the other hand, impulsive subjects (relative to the mean) became more impulsive, and patient subjects became more patient, that would result in clear changes to subject level parameters, with relatively little change in the population level parameters (assuming the same scaling factor for impulsive and patient subjects).”
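The decomposition in the passage above can be made concrete with a few lines of code. This is only a sketch: the numerical values are hypothetical and chosen for illustration, and the function name is not from the paper.

```python
import math


def subject_discount_factor(log_k_population, log_k_subject):
    """k_n = exp(log k_hat + log k_dot_n) = k_hat * k_dot_n."""
    return math.exp(log_k_population + log_k_subject)


# Hypothetical values, for illustration only (not estimates from the paper).
log_k_hat_lv = math.log(0.02)        # population level log discount factor, long task
subject_effects = [-0.8, 0.0, 0.8]   # subject level effects on the log scale, mean zero

# Case from the text: every subject is exactly twice as impulsive in the short task.
# Only the population level parameter changes; the subject level effects are reused as-is.
log_k_hat_sv = log_k_hat_lv + math.log(2)

k_lv = [subject_discount_factor(log_k_hat_lv, s) for s in subject_effects]
k_sv = [subject_discount_factor(log_k_hat_sv, s) for s in subject_effects]
ratios = [sv / lv for sv, lv in zip(k_sv, k_lv)]  # each ratio equals 2 up to rounding
```

Because the shift lives entirely in the population level term, the ordering of subjects (and hence any rank correlation between tasks) is untouched in this scenario.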

As well, we added Supplementary files that visually show the data and model fits and report (in text) the point estimates of the model parameters for each subject in all three experiments. We feel that these give an interested reader a very concrete connection between the data and the model parameters.

+ For main: Supplementary File 1

+ For 1st control exp: Supplementary File 2

+ For 2nd control exp: Supplementary File 3

Additionally, in order to help readers visualize how different discount factors lead to different subjective values of reward as a function of time we have added Figure 3—figure supplement 3.

Along the same lines, it would be good to start with a simulation of the model showing the expected effects of various key parameters. That would make understanding of the rest of the paper easier.

The expected effects of various key parameters are shown in the supplemental plot (Figure 2—figure supplement 4).

The explanation of model is clearer in Materials and methods but if the authors wish to make the results accessible to a larger readership, more details would help. Basically, I recommend unpacking the term "A 6 population level and 4 subject level parameters model." The subject level is explained well but the population level needs clarification. It appears that certain aspects of the package used for modeling are described as a black box.

The population level parameters are further described in subsection “Subjects’ time-preferences are reliable across both verbal/experiential and second/day differences”, cited above.

Figure 2: It would help to discuss in Results the discrepancies between data and model. In some cases, there seem to be systematic errors that are ignored. Also, it would help to provide a measure of goodness of fit across subjects and tasks to get a better sense of how widespread such systematic errors were across subjects.

We appreciate the reviewer's comments. In the submitted version, we ignored these systematic errors because we found that our main results were robust to different methods of estimation. We apologize for that.

The differences between the model and data in Figure 2 are not very informative. We added the following text to the caption to explain: "Note: The lines here are not a model fit to aggregate data, but rather reflect the mean model parameters for each group. As such, discrepancies between the model and data here are not diagnostic. See individual subject plots (Supplementary File 1) to visualize the quality of the model fits."

The subject plots provide a visual check on goodness of fit and also report the Bayesian r^2 of the fit per subject-task. We found that for some subjects the Bayesian r^2 was quite low. For NV, 9 subjects, for SV, 1 subject and for LV, 2 subjects had r^2 < 0.2. We re-checked the Pearson correlations of log(k) excluding those subjects and did not find any change in our main effects:

NV vs. SV

r = 0.76 [0.62 0.85]

n = 54, p-value = 1.759e-11

SV vs. LV

r = 0.61 [0.42 0.74]

n = 60, p-value = 2.97e-07

NV vs. LV

r = 0.36 [0.10 0.58]

n = 52, p-value = 0.00841

The goodness of fit for each experiment and each task is depicted by plotting individual fits against real data for each subject of the main (Figure 2—figure supplement 2), control 1 (Figure 4—figure supplement 2) and control 2 (Figure 5—figure supplement 2) experiments and computing the Bayesian r^2 for each subject-task (shown in figures).

Figure 3, 4 and 5: Are the linear fits based on total least squares?

We did not use total least squares. That was an oversight, since the X-values are not under experimental control. We have re-generated the figures (and corresponding text) using total least squares together with the x(y) and y(x) lines, since three lines are better than one (Huang et al., 2013).

For Figures 3AB, 4AB, 5A caption text added: “Three lines are vertical, horizontal and perpendicular (or total) least squares.”
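To illustrate what the three lines represent, the sketch below computes their slopes from the centered sums of squares, using the closed-form expression for the perpendicular (total) least squares slope. It is a generic textbook calculation with hypothetical data, not the code used to generate the figures; the function name is an assumption.

```python
import math


def three_line_slopes(x, y):
    """Slopes (dy/dx) of the y(x), x(y) and total least squares lines.

    All three lines pass through the point (mean(x), mean(y)).
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    if sxy == 0:
        raise ValueError("x and y are uncorrelated; the slopes are not well defined")
    slope_y_on_x = sxy / sxx  # minimizes vertical distances
    slope_x_on_y = syy / sxy  # minimizes horizontal distances, expressed as dy/dx
    # Perpendicular (total) least squares: minimizes orthogonal distances.
    slope_tls = (syy - sxx + math.sqrt((syy - sxx) ** 2 + 4 * sxy ** 2)) / (2 * sxy)
    return slope_y_on_x, slope_x_on_y, slope_tls


# Hypothetical data, for illustration only.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [0.1, 0.9, 2.2, 2.8, 4.0]
slope_yx, slope_xy, slope_tls = three_line_slopes(x, y)
```

For positively correlated data the total least squares slope falls between the other two, and the three lines coincide only when the points are perfectly collinear, so the spread between them gives a quick visual sense of how noisy the relationship is.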

Paragraph three of subsection “Strong effect of temporal context”: HBA must be defined (it is defined in later use).

Thank you for noticing that. We have now defined hierarchical Bayesian analysis model at its first appearance, but also decided to change the notation to a more common one:

“Subjects' impulsivity was estimated by fitting their choices with a Bayesian hierarchical model (BHM) of hyperbolic discounting with decision noise.”
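To make "hyperbolic discounting with decision noise" concrete, the sketch below shows one common single-subject form: a hyperbolic value function combined with a logistic choice rule. This is an assumption for illustration, not the authors' Bayesian hierarchical model; the noise parameter `tau` and both function names are hypothetical. The sooner option of 4 coins delivered immediately follows the task description quoted in this response.

```python
import math

def subjective_value(reward, delay, k):
    """Hyperbolic discounting: value of `reward` delivered after `delay`."""
    return reward / (1.0 + k * delay)

def p_choose_later(later_reward, delay, k, tau, sooner_reward=4.0):
    """Probability of choosing the later option under logistic decision noise.

    tau (hypothetical name) scales the noise: a larger tau means choices
    depend less on the difference in subjective value.
    """
    diff = subjective_value(later_reward, delay, k) - sooner_reward
    # Logistic function written with tanh so that extreme diff/tau cannot overflow
    return 0.5 * (1.0 + math.tanh(diff / (2.0 * tau)))

# At the indifference point (k * delay = 1 for an 8-coin offer) the choice is a coin flip
print(p_choose_later(8, delay=10, k=0.1, tau=1.0))   # 0.5
```

In a hierarchical fit, per-subject values of k and the noise parameter would be drawn from group-level distributions rather than estimated independently.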

Overall, a very nice paper!

Reviewer #3:

[…] I have a few comments and suggestions for potentially strengthening the manuscript.

1) As the authors aptly note (subsection “Subjects’ time-preferences are reliable across both verbal/experiential and second/day differences”), the correlation between two variables is independent of any difference in their means. There were a couple places where only one of these aspects of the data was quantified, and I thought it would be useful to see both:

a) It would be useful to see test-retest reliability reported in terms of the correlation across sessions, in addition to the non-significant Wilcoxon signed-ranks test. This would more thoroughly support the statements about within-task reliability in the conclusions section (subsection “Stability of preferences”).

We apologize for that oversight. We have now included those checks. “Consistent with existing research, we find that time-preferences are stable in the same task within subjects between the first half of the block and the second half of the block within sessions (time-preferences are measured as percent ‘yellow’ choices, Wilcoxon signed-rank test, p = 0.3491; Pearson r = 0.81, p < 10^-9) and also across experimental sessions that take place every two weeks: percent ‘yellow’ choice between NV sessions (Wilcoxon signed-rank test, p = 0.4721; Pearson r = 0.7, p < 10^-9), between SV sessions (Wilcoxon signed-rank test, p = 0.6613; Pearson r = 0.82, p < 10^-9) and a slight difference between LV sessions (Wilcoxon signed-rank test, p < 0.1; Pearson r = 0.66, p < 10^-16) (Meier and Sprenger, 2015; Augenblick et al., 2015).”

b) For the days-vs-weeks control experiment, it would be nice to see the results of a paired-samples test comparing k-values in the two conditions, not just the correlation (subsection “Strong effect of temporal context” and Figure 5A). Although the correlation is high, the data in Figure 5A look like they might be systematically offset from the unity line.

Indeed, there is a systematic offset from the unity line for control experiment 2, so the conversion from weeks to days does not follow 7 days = 1 week. We now report this and provide an additional supplemental figure (Figure 5—figure supplement 1) to discuss the effect.

“If subjects had ignored units then we would expect that log(𝑘𝑊) = log(𝑘𝐷) + log(7) = log(𝑘𝐷) + 1.95. Comparing the posteriors with that predicted shift, we can say that the shift is highly unlikely (𝑝 < 0.0001). Nonetheless, the discount factors in the two tasks were not equal. We observed a kind of amplification of preferences: the impulsive subjects were more impulsive in days than weeks and the patient subjects were more patient in days than weeks (Figure 5—figure supplement 1).”

2) It would be helpful to have more information about how the magnitude-delay pairs were constructed. For instance, what were the ranges of amounts and delays? Were they paired so as to cover a particular range of indifference k-values? (Subsection “Experimental Design” paragraph five and Figure 3—figure supplement 2 give partial information about this but not the complete picture.)

We are sorry for the confusion. In the updated manuscript we write:

“Across trials, the delay and the magnitude of the sooner option were fixed (4 coins, immediately), later options were drawn from all possible pairs of five magnitudes and delays (25 different offers, Materials and methods).”

“There were 25 different “later” options presented in each task: all possible combinations of 5 delays (3, 6.5, 14, 30, 64) and 5 reward magnitudes (1, 2, 5, 8, 10).”
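The offer set described above can be enumerated directly. The sketch below builds the 25 later options and, assuming the hyperbolic form V = r / (1 + kD) used elsewhere in the paper, computes the discount factor at which each offer is exactly as good as 4 coins now. The helper name `indifference_k` is hypothetical, and delays are left unitless because the unit (seconds or days) depends on the task.

```python
from itertools import product

DELAYS = [3, 6.5, 14, 30, 64]      # from the quoted Methods text; unit depends on the task
MAGNITUDES = [1, 2, 5, 8, 10]      # coins
SOONER = 4                          # coins, delivered immediately

# All (magnitude, delay) combinations for the later option
offers = [(m, d) for d, m in product(DELAYS, MAGNITUDES)]
print(len(offers))                  # 25

def indifference_k(magnitude, delay, sooner=SOONER):
    """k solving magnitude / (1 + k * delay) == sooner.

    Returns None when the later reward does not exceed the sooner one,
    since no non-negative k makes such an offer equally attractive.
    """
    if magnitude <= sooner:
        return None
    return (magnitude / sooner - 1.0) / delay

for m, d in offers:
    k = indifference_k(m, d)
    if k is not None:
        print(f"{m} coins after {d}: indifferent at k = {k:.4f}")
```

Under this assumption only the offers of 5, 8 and 10 coins are informative about k; offers of 1 or 2 coins are smaller and later than the sooner option.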

3) Subsection “Strong effect of temporal context” says k-values in the LV task were "almost equivalent (ignoring unexplained variance) to those in the SV task." I found this confusing because the preceding paragraphs emphasized that the LV k-values were significantly lower on average and also had higher variance than the SV k-values. Maybe this sentence just means to refer to the fact that LV and SV were correlated?

We apologize for the confusion: yes we intended to convey the message that they were surprisingly close given the differences between seconds and days. We have edited that text as follows:

“But, we found that the discount factors in the LV task, 𝑘𝐿𝑉, were close to those in the other tasks (within ~1 log-unit) (Figure 3C).”

I also found it odd that the mismatch in units wasn't dealt with until this paragraph (i.e., whether k-value represents discounting per day or per second). I had assumed common units were being used when I first read the comparisons of k-values between tasks (paragraph five subsection “Subjects’ time-preferences are reliable across both verbal/experiential and second/day differences”). I think it would be helpful either to use matching units throughout, or point out (and explain) the choice not to at the outset.

We thank you for your comment and have tried to clarify the use of units throughout the paper. We first fit in the units of each experiment, rather than converting to common units, because we had no a priori hypothesis about the relationship between the tasks. Thus, we first checked rank correlations between discount factors in different tasks. After seeing the results, we decided that using the units of each task makes it easier to see that the scaling between short and long tasks almost matches the unit conversion. We now mention the units in many places throughout the manuscript. They first appear in paragraph two of subsection “Subjects’ time-preferences are reliable across both verbal/experiential and second/day differences”:

“Since we did not ex ante have a strong hypothesis about how the subjects' impulsivity measures in one task would translate across tasks, we fit subjects choices in the units of the task (i.e. seconds and days), examined ranks of impulsivity at first and found significant correlations across experimental tasks (Table 1).”

Units are also indicated on Figure 3A,B and in the caption.

4) A striking aspect of the results is the large difference in discount rates between short, directly experienced delays and long, non-experienced delays. In addition to considering the possibility that experienced delay is uniquely aversive (subsection “Cost of waiting vs. discounting future gains”), it would be interesting to consider the possible role of opportunity costs. I gather the NV and SV conditions didn't involve direct opportunity costs within the context of the experiment; that is, choosing longer delays didn't reduce the total number of trials, so the reward-maximizing strategy would always be to choose the larger reward? Did participants know in advance that they had a fixed number of trials (rather than having a time budget)? It would be interesting to know how participants' earnings compared to what they could have earned by following the reward-maximizing strategy. It would also be interesting to know whether they managed to finish the session and leave earlier than they would have by following the reward-maximizing strategy.

We thank the reviewer for this feedback. We agree that the difference in discount rates between short and long tasks is quite striking! We have reviewed additional literature on opportunity costs and added some relevant references to the Discussion: “Second, it may be that with short delay tasks we are capturing cost of waiting while long delay tasks measure delay-discounting. The costs of waiting could take several forms (Paglieri, 2013). One form is the cost of boredom (Mills and Christoff, 2018). Subjects could find it painful to sit and wait, staring at the clock, during the delay. Additionally, there could be opportunity costs related to how much subjects value their own time. We found that in the short tasks subjects with large discount factors also performed the task faster (Figure 3—figure supplement 1). If these subjects value their time more and thus have higher costs of waiting, then given our results (Figure 3B) there is a surprisingly large correlation between how much subjects value their time (in the short tasks) and how much they discount postponed rewards (in the long task). Regardless of the precise form of the costs of waiting (Chapman, 2001; Paglieri, 2013; Navarick, 2004), in order for these costs to explain the temporal scaling we observed between short and long tasks, relative to the costs of postponing, they would have to be, coincidentally, close in value to the number of seconds in a day.”

As to your specific questions: subjects experienced a fixed number of trials but were not told so. Choosing a longer delay did not reduce the total number of trials. We have added the subject instructions as an appendix to make this clear to readers. The only information provided on this matter was in the consent form, which said “the session won’t take longer than 1 hour”.

We looked into the fraction of total rewards and the total time spent by subjects for each task (Figure 3—figure supplement 1). As expected, both were correlated with discount factor. In addition, subjects who took longer to finish the session were also slower (mouse clicking, etc.) during non-waiting time, as shown in the supplemental figure.

Minor Comments:

1) In the first paragraph of the Results section please say whether the given number of trials (160) is total or per task.

Thank you for this comment. We reported in the Results section that the given number of trials is per task:

“Across sessions, at least 160 trials in each task were conducted after learning (Materials and Methods, Figure 1—figure supplement 1).”

We also report the total number of trials in the main experiment.

2) In the third paragraph of subsection “Stability of preferences” I would replace "eliminated" with "matched" or something similar. I initially read it as "ruled out," which is the opposite of the intended meaning.

We see how that could be confusing. We have changed the text to "Our control study using delays of days vs. weeks compared tasks with different scales but did not differ in the experience of the delayed rewards, as in LV, only (at most) one delayed reward was experienced for both days and weeks tasks."

3) Tables 2 and 3 would benefit from more descriptive legends. In particular, I initially misunderstood the Table 3 legend as meaning the outcome variable for this analysis was k-value variance (along the lines of the scaling effect mentioned for Table 4).

We updated Table 3 (now Table 2) legend for clarity:

“Relative contributions of two gaps to variance in log(k) (2-factor model comparison with two reduced 1-factor models)”

4) In Figure 5, I suggest noting explicitly in the legend that panels B-C pertain to the main experiment (and not the same experiment represented in panel A).

We realized that was confusing. We decided to split this Figure into two: Figure 5 now has only control experiment 2 results, whereas Figure 6 only displays the adaptation effect.

Figure 6 caption reads: “(A,B) Main experiment early trials adaptation effect.”

[Editors' note: further revisions were suggested prior to acceptance, as described below.]

The manuscript has been improved but there are some remaining issues that we suggest you address before this is published.

1) The authors mention that boredom or opportunity costs may play a role with short delay. One may wonder whether these factors point at a potential difference between human and animal tasks as primary rewards could be ingested as they arrive. By extension, the presumed parallel of the present task with animal tasks may be smaller than assumed. This potential limitation could be mentioned in the discussion as the comparison of human and animal research is a major motivation for the present study.

Animals certainly also experience opportunity costs and there is substantial evidence that they experience boredom as well. There is interesting work on the evolutionary pressures that selected for time-discounting, but we feel that this is outside the scope of our discussion. We have added the following lines to the Discussion:

In the first paragraph of “Subjective Scaling of Time”:

“Whether or not the secondary reinforcer used in our task is experienced in an analogous way to primary reinforcers used in animal studies may limit the degree of overlap in underlying neural mechanisms.”

In the third paragraph:

“One form is the cost of boredom (Mills and Christoff, 2018); a feeling which animals may also experience (Wemelsfelder, 1984).”

2) In the analysis in the final paragraph of subsection “Subjects’ time-preferences are reliable across both verbal/experiential and second/day differences”, which compares discount rates across tasks, it's now stated clearly that different units are used for k-values in the different tasks. But it might be beneficial to more fully describe the motivation for the analysis in light of this. Why is it of interest to test whether per-second discount rates in one task differ from per-day discount rates in another?

Re-reading that section, we can appreciate the reviewers’ comment. We now include this in that section:

“Note that 𝑘𝑁𝑉 and 𝑘𝑆𝑉 have units of Hz (1∕𝑠), but 𝑘𝐿𝑉 has units of 1∕𝑑𝑎𝑦. Thus, while the 95% credible intervals of the means of log(𝑘) are overlapping for the three tasks when expressed in the units of each task, the mean log(𝑘𝐿𝑉) is in fact shifted to -14.86 when 𝑘𝐿𝑉 is expressed in units of 1/s. We further analyze and discuss this scaling subsequently, but first we compare log(𝑘) in the units of each task, in consideration of subjects potentially ignoring the time units in their choices (Furlong and Opfer, 2009; Cox and Kable, 2014). We find that, on average, subjects were most patient in LV, then SV then NV (Table 3). Note that a shift of 1 log-unit is substantial. For example, a subject with log(𝑘𝑆𝑉) ≈ −3 would value 10 coins at half its value in just 20 seconds. But for log(𝑘𝑆𝑉) ≈ −4 the coins would lose half their value in 55 seconds (Figure 3—figure supplement 3).”
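The numbers in this passage can be checked with a few lines of arithmetic. The sketch below assumes natural logarithms and the hyperbolic form V = r / (1 + kD), under which a reward loses half its value at D = 1/k. Both assumptions are consistent with the 20 s and 55 s figures quoted above; the function names are hypothetical.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60      # 86400

def half_value_delay(log_k):
    """Delay at which r / (1 + k*D) equals r / 2, i.e. D = 1 / k."""
    return math.exp(-log_k)

def log_k_per_second_to_per_day(log_k_per_s):
    """Convert log(k) from units of 1/s to 1/day: k_day = k_s * 86400."""
    return log_k_per_s + math.log(SECONDS_PER_DAY)

print(round(half_value_delay(-3)))                         # 20 (seconds)
print(round(half_value_delay(-4)))                         # 55 (seconds)
print(round(log_k_per_second_to_per_day(-14.86), 2))       # -3.49 (log of 1/day)
```

The last line shows that the shift between units is log(86400) ≈ 11.37 log-units, which is the scale of the difference discussed in the text.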

And in the caption for Table 3 we added:

“Expressing log(𝑘𝐿𝑉) in units of 1/s (for direct comparison with the other tasks) results in a negative shift in log(𝑘𝐿𝑉) and even larger differences in means without changing the difference between standard deviations.”

3) The Table 3 legend seems to have a typo (the 2nd occurrence of k_NV should be k_LV), and the abbreviation "Ev. Ratio" should be defined and explained (the evidence ratio is not introduced until paragraph seven of the “Analysis” section).

Thank you for pointing that out. The typo is fixed and evidence ratio is explained in the caption.

4) In subsection “Time and Reward Re-Scaling” I didn't understand why the k-values were referred to as "unit-free".

We apologize for the confusion. What we meant was that we originally compared k in the units of the respective task. “Unit-free” was a poor word choice and has been removed from the two places it appeared in the paper.

https://doi.org/10.7554/eLife.39656.037

Article and author information

Author details

  1. Evgeniya Lukinova

    1. NYU-ECNU Institute of Brain and Cognitive Science at NYU Shanghai, Shanghai, China
    2. NYU Shanghai, Shanghai, China
    Contribution
    Conceptualization, Resources, Data curation, Software, Formal analysis, Supervision, Funding acquisition, Validation, Investigation, Visualization, Methodology, Writing—original draft, Project administration, Writing—review and editing
    Competing interests
    No competing interests declared
    ORCID: 0000-0002-8357-9307
  2. Yuyue Wang

    1. NYU-ECNU Institute of Brain and Cognitive Science at NYU Shanghai, Shanghai, China
    2. NYU Shanghai, Shanghai, China
    Contribution
    Software, Investigation
    Competing interests
    No competing interests declared
  3. Steven F Lehrer

    1. NYU-ECNU Institute of Brain and Cognitive Science at NYU Shanghai, Shanghai, China
    2. NYU Shanghai, Shanghai, China
    3. School of Policy Studies and Department of Economics, Queen’s University, Kingston, Canada
    4. The National Bureau of Economic Research, Cambridge, United States
    Contribution
    Conceptualization, Resources, Supervision, Funding acquisition, Writing—review and editing
    Competing interests
    No competing interests declared
  4. Jeffrey C Erlich

    1. NYU-ECNU Institute of Brain and Cognitive Science at NYU Shanghai, Shanghai, China
    2. NYU Shanghai, Shanghai, China
    3. Shanghai Key Laboratory of Brain Functional Genomics (Ministry of Education), East China Normal University, Shanghai, China
    Contribution
    Conceptualization, Resources, Data curation, Software, Formal analysis, Supervision, Funding acquisition, Validation, Investigation, Visualization, Methodology, Writing—original draft, Project administration, Writing—review and editing
    For correspondence
    jerlich@nyu.edu
    Competing interests
    No competing interests declared
    ORCID: 0000-0001-9073-7986

Funding

National Science Foundation of China (NSFC-31750110461)

  • Evgeniya Lukinova

Shanghai Eastern Scholar Program

  • Evgeniya Lukinova

NYU Shanghai (Research Challenge Grant)

  • Steven F Lehrer
  • Jeffrey C Erlich

Science and Technology Commission of Shanghai Municipality (15JC1400104)

  • Jeffrey C Erlich

NYU-ECNU Joint Institute for Brain and Cognitive Science at NYU Shanghai

  • Jeffrey C Erlich

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

We thank NYU Shanghai undergraduate students Stephen Mathew, Xirui Zhao, Wanning Fu and Jonathan Lin and Research Specialist at Behavior and Experimental Economics Lab Siyan Yao who helped with data collection. Members of the Erlich lab provided thoughtful feedback throughout the project. Paul Glimcher, Ming Hsu and Joseph Kable contributed helpful advice about this project. We also thank Mehrdad Jazayeri and the other reviewers for their suggestions which substantially improved the paper through the review process.

Ethics

Human subjects: The study was approved by the institutional review board of NYU Shanghai following all Chinese and USA regulations regarding human subjects research (IRB Protocol #003-2015). All subjects were NYU Shanghai students recruited on campus and gave informed consent and consent to publish the results (with anonymized data) before participation in the study.

Senior Editor

  1. Timothy E Behrens, University of Oxford, United Kingdom

Reviewing Editor

  1. Daeyeol Lee, Yale School of Medicine, United States

Publication history

  1. Received: June 29, 2018
  2. Accepted: January 16, 2019
  3. Version of Record published: February 5, 2019 (version 1)

Copyright

© 2019, Lukinova et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

Metrics

  • 664
    Page views
  • 65
    Downloads
  • 1
    Citations

Article citation count generated by polling the highest count across the following sources: Crossref, PubMed Central, Scopus.
