Introduction

What is the worth of a pursuit? At the most universal level, temporal decision-making concerns weighing the return of pursuits against their cost in time. The fields of economics, psychology, behavioral ecology, neuroscience, and artificial intelligence have endeavored to understand how animals, humans, and learning agents evaluate the worth of pursuits: how they factor the cost of time into temporal decision-making. A central step in doing so is to identify a normative principle and then to solve for how an agent, abiding by that principle, would best invest time in the pursuits that compose a world. A normative principle with broad appeal, identified in behavioral ecology, is that of reward-rate maximization, as expressed in Optimal Foraging Theory (OFT), where animals seek to maximize reward rate while foraging in an environment (Charnov, 1976a, 1976b; Krebs et al., 1977; Pyke et al., 1977; Pyke, 1984). Solving for the optimal decision-making behavior under this objective provides the means to examine the curious pattern of adherence and deviation that humans and animals exhibit with respect to that ideal behavior. This difference provides clues into the process that animals and humans use to learn the value of, and represent, pursuits. Therefore, it is essential to analyze reward rate maximizing solutions for the worth of initiating a pursuit to clarify which behavioral signs are—and are not—deviations from optimal performance in identifying the process (and its sources of error) actually used by humans and animals.

Equivalent immediate reward (subjective value, sv)

To ask, ‘what is the value of a pursuit?’ is to quantify by some metric the worth of a future state—the pursuit’s outcome—at the time of a prior one, the pursuit’s initiation. A sensible metric for the worth of a pursuit is the magnitude of immediate reward that an agent would treat as equivalent to a policy of investing the requisite time in the pursuit and obtaining its reward. This equivalent immediate reward, as judged by the agent, is the pursuit’s “Subjective Value” (sv), in the parlance of the field (Mischel et al., 1969). It is widely assumed that decisions about which pursuits should be taken are made on the basis of their subjective value (Niv, 2009). However, a decision-making algorithm needn’t calculate subjective value in its evaluation of the worth of initiating a pursuit. It could, for instance, compare the reward rate of the pursuit against the reward rate received in the world as a whole. Indeed, algorithms leading to reward rate optimization can arise from different underlying processes, each with its own controlling variables. Nonetheless, any algorithm’s evaluation can be re-expressed in terms of equivalent immediate reward, providing a ready means to compare evaluation across different learning algorithms and representational architectures, whether biologically realized in animals and humans or artificially implemented in silico.

Decisions to initiate pursuits

As decisions occur at branch points between pursuits, the value of initiating a pursuit is of particular importance, as it is on this basis that an agent would decide 1) whether to accept or forgo an offered pursuit; or, 2) how to choose between mutually exclusive pursuits. Though ‘Forgo’ decisions are regarded as near-optimal, as in prey selection (Krebs et al., 1977; Stephens and Krebs, 1986; Blanchard and Hayden, 2014), ‘Choice’ decisions—as commonly tested in laboratory settings—reveal a suboptimal bias for smaller-sooner rewards when selection of later-larger rewards would maximize global reward rate (Logue et al., 1985; Blanchard and Hayden, 2015; Carter and Redish, 2016; Kane et al., 2019). This curious pattern of behavior, wherein forgo decisions can present as optimal while choice decisions as suboptimal, poses a challenge to any theory purporting to rationalize temporal decision-making as observed in animals and humans.

Temporal Discounting Functions

Historically, temporal decision-making has been examined using a temporal discounting function to describe how delays in rewards influence their valuation. The “temporal discounting function” describes the magnitude-normalized subjective value of an offered reward as a function of when the offered reward is realized. An understanding of the form of temporal discounting has important implications in life, as steeper temporal discounting has been associated with many negative life outcomes (Bretteville-Jensen, 1999; Critchfield and Kollins, 2001; Bickel et al., 2007, 2012; Story et al., 2014), most notably the risk of developing an addiction. Psychologists and behavioral scientists have long found that animals’ temporal discounting in intertemporal choice tasks is well-fit by a hyperbolic discounting function (Ainslie, 1974; Mazur, 1987; Richards et al., 1997; Monterosso and Ainslie, 1999; Green and Myerson, 2004; Hwang et al., 2009; Louie and Glimcher, 2010). Other examples of motivated behavior also show hyperbolic temporal discounting (Haith et al., 2012).

Often, this perspective assumes that the delay in and of itself devalues a pursuit’s reward, failing to carefully distinguish the impact of its delay from the impact of the time required and reward obtained outside the considered pursuit. As a result, the discounting function tends to be treated as a process unto itself rather than the consequence of a process. Consequently, the field has concerned itself with the form of the discounting function—exponential (Glimcher et al., 2007; McClure et al., 2007), hyperbolic (Rachlin et al., 1972; Ainslie, 1975; Thaler, 1981; Mazur, 1987; Benzion et al., 1989; Green et al., 1994; Frederick et al., 2002; Kobayashi and Schultz, 2008; Calvert et al., 2010), pseudo-hyperbolic (Laibson, 1997; Montague et al., 2006; Berns et al., 2007), etc., as either derived from some normative principle, or as fit to behavioral observation. An exponential discounting function, for instance, was derived by Samuelson from the normative principle of time consistency (Samuelson 1937) and is widely held as rational (Samuelson, 1937; Koopmans, 1960; Laibson, 1997; Montague and Berns, 2002; McClure et al., 2004, 2007; Mazur, 2006; Schweighofer et al., 2006; Berns et al., 2007; Nakahara and Kaveri, 2010; Kane et al., 2019), and by implication, reward rate maximizing. Observed temporal decision-making behavior, however, routinely exhibits time inconsistencies (Strotz, 1956; Ainslie, 1975; Laibson, 1997; Frederick et al., 2002) and is better fit by a hyperbolic discounting function (Ainslie, 1975; Mazur et al., 1985; Frederick et al., 2002; Green and Myerson, 2004), and on that contrasting basis, humans and animals have commonly been regarded as irrational (Takahashi and Han, 2012; Kane et al., 2019). 
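As an illustration of the two canonical forms discussed above, the following sketch contrasts exponential and hyperbolic discounting and shows the preference reversal that constitutes time inconsistency. The discount parameters k and the reward magnitudes are illustrative values chosen for the example, not quantities from the text.

```python
import math

def exponential_discount(t, k=0.1):
    # Samuelson's time-consistent exponential form: D(t) = e^(-k*t)
    return math.exp(-k * t)

def hyperbolic_discount(t, k=0.5):
    # Mazur's hyperbolic form: D(t) = 1 / (1 + k*t)
    return 1.0 / (1.0 + k * t)

# A smaller-sooner (SS) reward of 1 at delay 2 versus a larger-later (LL)
# reward of 2 at delay 10, evaluated now (front_end_delay = 0) and with
# both delays pushed 20 time units into the future (front_end_delay = 20).
def prefers_ss(discount, front_end_delay):
    ss = 1.0 * discount(2.0 + front_end_delay)
    ll = 2.0 * discount(10.0 + front_end_delay)
    return ss > ll

# Hyperbolic: prefers SS up close but LL from afar (a reversal);
# exponential: the preference ordering never changes with a common delay.
hyperbolic_reversal = (prefers_ss(hyperbolic_discount, 0.0)
                       and not prefers_ss(hyperbolic_discount, 20.0))
exponential_reversal = (prefers_ss(exponential_discount, 0.0)
                        and not prefers_ss(exponential_discount, 20.0))
```

Under the exponential form, a common front-end delay multiplies both options by the same factor, so the ordering is preserved; under the hyperbolic form it is not, which is the time inconsistency routinely observed in behavior.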
In addition, the case that humans and animals are irrational is, ostensibly, furthered by the observation of the ‘Magnitude Effect’ (Green et al., 1997; Baker et al., 2003; Estle et al., 2006; Yi et al., 2006; Grace et al., 2012; Kinloch and White, 2013) and the ‘Sign Effect’ (Thaler, 1981; Loewenstein and Thaler, 1989; Loewenstein and Prelec, 1992; Frederick et al., 2002; Baker et al., 2003; Estle et al., 2006; Kalenscher and Pennartz, 2008), where the apparent discounting function is affected by the magnitude and the sign of the offered pursuit’s outcome, respectively.

Here, we aim to identify equations for evaluating the worth of initiating pursuits that an agent could implement to enable reward-rate maximization. We wish to gain deeper insight into how a considered pursuit, with its defining features (its reward and time), relates to the world of pursuits in which it is embedded, in determining the pursuit’s worth. Specifically, we investigate how pursuits and the pursuit-to-pursuit structure of a world interact with policies of investing time in particular pursuits to determine the global reward rate reaped from an environment. We aim to provide greater clarity into what constitutes time’s cost and how it can be understood with respect to the reward and temporal structure of an environment and to counterfactual time investment policies. We propose that, by determining optimal decision-making equations and converting them to their equivalent subjective value and temporal discounting functions, actual (rather than assumed) deviations from optimality exhibited by humans and animals can be truly determined. We speculate that purported anomalies deviating from ostensibly ‘rational’ decision-making may in fact be consistent with reward rate optimization. Further, by identifying parameters enabling reward rate maximization and assessing resulting errors in valuation caused by their misestimation, we aim to gain insight into which parameters humans and animals may (mis)-represent that most parsimoniously explains the pattern of temporal decision-making actually observed.

Results

To gain insight into the manner by which animals and humans attribute value to pursuits, it is essential to first understand how a reward rate maximizing agent would evaluate the worth of any pursuit within a temporal decision-making world. Here, by considering Forgo and Choice temporal decisions, we re-conceptualize how an ideal reward rate maximizing agent ought to evaluate the worth of initiating pursuits. We begin by formalizing temporal decision-making worlds as constituted of pursuits, with pursuits described as having reward rates and weights (their relative occupancy). Then, we analyze Forgo decisions to determine what composes the cost of time and how a policy of taking/forgoing pursuits factors into the global reward rate of an environment and thus the worth of a pursuit. Having done so, we derive two equivalent expressions for the worth of a pursuit and from them re-express the worth of a pursuit as its equivalent immediate reward (its ‘subjective value’) in terms of the global reward rate achieved under policies of 1) accepting or 2) forgoing the considered pursuit type. We next examine Choice worlds to investigate the apparent nature of a reward rate optimizing agent’s temporal discounting function. Finally, having identified reward rate maximizing equations, we examine what parameter misestimation leads to the suboptimal pursuit evaluation that best explains behavior observed in humans and animals. Together, by considering the temporal structure of a time investment world as one composed of pursuits described by their rates and weights (relative occupancy), we seek to identify equations for how a reward rate maximizing agent could evaluate the worth of any pursuit comprising a world and how those evaluations would be affected by misestimation of enabling parameters.

Temporal decision worlds are composed of pursuits with reward rates and weights

A temporal decision-making world is one composed of pursuits. A pursuit is a defined path that an agent can traverse by investing time, one that often (but not necessarily) results in reward but that always leads to a state from which one or more other potential pursuits are discoverable. Pursuits have a reward magnitude (r) and a time (t). A pursuit therefore has 1) a reward rate (ρ, rho) and 2) a weight (w), being its relative occupancy with respect to all other pursuits. To refer to the reward, the time, the reward rate, or the weight of a given pursuit, the symbol r, t, ρ, or w, respectively, takes the name of the pursuit as a subscript (ρPursuit, wPursuit). In this way, the pursuit structure of temporal decision-making worlds, and the qualities defining pursuits, can be adequately referenced.
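The nomenclature above can be captured in a minimal sketch. The class and field names here are illustrative, not from the text; a pursuit carries its reward magnitude r and time t, from which its local reward rate ρ follows, while its weight (relative occupancy) depends on the surrounding world and policy and so is not a property of the pursuit alone.

```python
from dataclasses import dataclass

@dataclass
class Pursuit:
    """A pursuit, per the nomenclature above (names illustrative)."""
    name: str
    r: float  # reward magnitude
    t: float  # time required to traverse the pursuit

    @property
    def rho(self) -> float:
        # local reward rate of the pursuit: rho = r / t
        return self.r / self.t

# Two pursuits echoing the Figure 1 worlds (values illustrative):
gold = Pursuit("gold", r=1.0, t=4.0)
purple = Pursuit("purple", r=3.0, t=8.0)
```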

The temporal decision-making worlds considered are recurrent in that an agent traversing a world under a given policy will eventually return back to its current location. As pursuits constitute an environment, the environment itself then has a reward rate, the ‘global reward rate’ ρg, achieved under a given decision policy, ρgPolicy. Whereas the global reward rate realized under a given policy of choosing one or another pursuit path may or may not be reward-rate optimal, the global reward rate achieved under a reward-rate maximal policy will be denoted as ρg*.

Forgo and Choice decision topologies

Having established a nomenclature for the properties of a temporal decision-making world, we now identify two fundamental types of decisions regarding whether to initiate a pursuit: “Forgo” decisions, and “Choice” decisions. In a Forgo decision (Figure 1, left), the agent is presented with one of possibly many pursuits that can either be accepted or rejected. After either the conclusion of the pursuit, if accepted, or immediately after rejection, the agent returns to a pursuit by default (the “default” pursuit), which effectively can be a waiting period, until the next pursuit opportunity becomes available. Rejecting the offered pursuit constitutes a policy of spending less time to make a traversal across that decision-making world, whereas accepting the offered pursuit constitutes a policy of spending more time to make a traversal. In a Choice decision (Figure 1, right), the agent is presented with a choice between at least two simultaneous and mutually exclusive pursuits, typically differing in their respective rewards’ magnitudes and delays. Under any decision, upon exit from a pursuit, the agent returns to the same environment that it would have entered were the pursuit rejected. In the Forgo case in Figure 1, a policy of spending less time to traverse the world by rejecting the purple pursuit to return to the gold pursuit—and thus obtaining a smaller amount of reward (left)—must be weighed against a policy of acquiring more reward by accepting the purple pursuit at the expense of spending more time to traverse the world (right). In the Choice case in Figure 1, a policy of spending less time to traverse the world (left) by taking the smaller-sooner pursuit (aqua) must be weighed against a policy of spending more time to traverse the world (right) by accepting the larger-later pursuit (purple).

Fundamental classes of temporal decision-making regarding initiating a pursuit: “Forgo” and “Choice”. 1st row- Topologies. The temporal structure of worlds exemplifying Forgo (left) and Choice (right) decisions mapped as their topologies. Forgo: A forgo decision to accept or reject the purple pursuit. When exiting the gold pursuit having obtained its reward (small blue circle), an agent is faced with 1) a path to re-enter gold, or 2) a path to enter the purple pursuit, which, on its completion, re-enters gold. Choice: A choice decision between an aqua pursuit, offering a small reward after a short amount of time, or a purple pursuit offering a larger amount of reward after a longer time. When exiting the gold pursuit, an agent is faced with a path to enter 1) the aqua or 2) the purple pursuit, both of which lead back to the gold pursuit upon their completion. 2nd row-Policies. Decision-making policies chart a course through the pursuit-to-pursuit structure of a world. Policies differ in the reward obtained, and in the time required, to complete a traversal of that world under that policy. Policies of investing less (left) or more (right) time to traverse the world are illustrated for the considered Forgo and Choice worlds. Forgo: A policy of rejecting the purple pursuit to re-enter the gold pursuit (left) acquires less reward though it requires less time to make a traversal of the world than a policy of accepting the purple option (right). Choice: A policy of choosing the aqua pursuit (left) results in less reward though requires less time to traverse the world than a policy of choosing the purple pursuit (right). 3rd row-Time/reward investment. 
The times (solid horizontal lines, colored by pursuit) and rewards (vertical blue lines) of pursuits, and their associated reward rates (dashed lines) acquired under a policy of forgo or accept in the Forgo world, or, of choosing the sooner smaller or later larger pursuit in the Choice world.

Behavioral observations under Forgo and Choice decisions

These classes of temporal decisions have been investigated by ecologists, behavioral scientists, and psychologists for decades. Forgo decisions describe instances likened to prey selection (Krebs et al., 1977; Stephens and Krebs, 1986; Blanchard and Hayden, 2014). Choice decisions have extensively been examined in intertemporal choice experiments (Rachlin et al., 1972; Ainslie, 1974; Bateson and Kacelnik, 1996; Stephens and Anderson, 2001; Frederick et al., 2002; Hayden and Platt, 2007; McClure et al., 2007; Carter et al., 2015; Carter and Redish, 2016). Experimental observation in temporal decision-making demonstrates that animals are optimal (or virtually so) in Forgo (Krebs et al., 1977; Stephens and Krebs, 1986; Blanchard and Hayden, 2014), taking the offered pursuit when its rate exceeds the “background” reward rate, and are as if sub-optimally impatient in choice, selecting the smaller-sooner (SS) pursuit when the larger-later (LL) pursuit is just as good if not better (Logue et al., 1985; Blanchard and Hayden, 2015; Carter and Redish, 2016; Kane et al., 2019).

Deriving optimal policy from forgo decision-making worlds

We begin our examination of how to maximize the global reward rate reaped from a landscape of rewarding pursuits by examining forgo decisions. A general formula for the global reward rate of an environment in which agents must invest time in obtaining rewards is needed in order to formally calculate a policy’s ability to accumulate reward. Optimal policies maximize reward accumulation over the time spent foraging in that environment. In a forgo decision, an agent is faced with the decision to take, or to forgo, pursuit opportunities. We sought to determine the reward rate an agent would achieve were it to pursue rewards of magnitudes r1, r2, …, rn, each requiring an investment of time t1, t2, …, tn. At any particular time, the agent is either 1) investing time in a pursuit of a specific reward and time, or 2) available to encounter and take new pursuits from a pursuit to which it defaults. With the assumption that reward opportunities are randomly encountered by the agent at frequencies f1, f2, …, fn from the default pursuit, it becomes possible to calculate the total global reward rate of the environment, ρg, as in Equation 1 (Ap. 1 - Derivation of global reward rate under multiple pursuits)…

…where ρd is the rate of reward attained in the default pursuit. Should rewards not occur while in the default pursuit, ρd will be zero. Equation 1 allows for the calculation of the global reward rate achieved by any policy accepting a particular set of pursuits from the environment. This derivation of global reward rate is akin to those derived for prey selection models (see Charnov and Orians, 1973; Stephens and Krebs, 1986).
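A sketch of the computation Equation 1 describes can clarify the bookkeeping. The exact equation is derived in Ap. 1; the form used here is an assumption borrowed from standard prey-selection models, in which each unit of time spent in the default pursuit yields, on average, f_i encounters with pursuit i, each adding r_i reward and t_i time.

```python
def global_reward_rate(rs, ts, fs, rho_d=0.0):
    """Global reward rate under a policy of accepting pursuits with
    rewards rs, times ts, encountered from the default pursuit at
    frequencies fs; rho_d is the default pursuit's reward rate.
    (Form assumed from standard prey-selection models: per unit of
    default time, the f_i encounters add f_i*r_i reward and f_i*t_i time.)
    """
    total_reward = rho_d + sum(f * r for f, r in zip(fs, rs))
    total_time = 1.0 + sum(f * t for f, t in zip(fs, ts))
    return total_reward / total_time

# Example (illustrative values): two pursuit types encountered equally
# often; accepting both yields a global rate of 2.5 reward / 2.5 time.
rate = global_reward_rate(rs=[4.0, 1.0], ts=[1.0, 2.0], fs=[0.5, 0.5])  # 1.0
```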

Parceling the world into the considered pursuit type (“in” pursuit) and everything else (“out” of pursuit)

In order to simplify representations of policies governing any given pursuit opportunity, we reformulate the above expression for the global reward rate, ρg, from the perspective of a policy of accepting any given pursuit. The environment may be parcellated into the time spent and rewards achieved, on average, inside the considered pursuit for every instance that time is spent and rewards are achieved, on average, outside of it. We can pull the inside reward (rin) and inside time (tin) out of the equation above to isolate the inside and outside components of the equation.

From there, we define tout as the average time spent outside the considered pursuit for each instance that the considered pursuit is experienced.

Similarly, the outside reward, rout, encompasses the average amount of reward collected from all sources outside the considered pursuit.

Parceling a pursuit world into a considered pursuit (all instances “inside” the considered pursuit type) and everything else (i.e., everything “outside” the considered pursuit type), then gives the generalized form for the reward rate of an environment under a given policy as…

…which depends on the average reward earned and the average time spent between opportunities to make the decision, in addition to the average reward returned and average time spent in the considered pursuit (Ap. 3 & Figure 3).
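A minimal numeric sketch of this parcellation (with illustrative values) computes the global reward rate from the average in/out rewards and times, and checks it against the weighted-average reading described below, in which each pursuit's local rate is weighted by its relative time occupancy:

```python
def global_rate_in_out(r_in, t_in, r_out, t_out):
    # Equation 5, as described in the text: the global reward rate is
    # total reward over total time per traversal, parceled into the
    # considered pursuit ("in") and everything else ("out").
    return (r_in + r_out) / (t_in + t_out)

# Illustrative values: reward 3 over 8 time units inside, reward 1 over
# 4 time units outside, per traversal.
r_in, t_in, r_out, t_out = 3.0, 8.0, 1.0, 4.0

# The same rate as a weighted average of the local rates, with weights
# given by relative time occupancy: w_in = t_in / (t_in + t_out).
w_in = t_in / (t_in + t_out)
weighted = w_in * (r_in / t_in) + (1.0 - w_in) * (r_out / t_out)
```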

Figure 2 depicts the global reward rate achieved with respect to the time and reward obtained from a considered pursuit (“Inside”) and the time and reward obtained outside that considered pursuit type (“Outside”). By so parsing the world into “inside” and “outside” the considered pursuit, it can also be appreciated from Figure 2 that the fraction of time in the environment invested in the considered pursuit, win, can be expressed as win = tin/(tin + tout), and the fraction of time spent outside the considered pursuit as 1 − win. A world can thus be understood in terms of its composing pursuits’ reward rates and weights (their relative occupancy), with the global reward rate being a weighted average of the reward rate of the considered pursuit, ρin, and the reward rate outside the considered pursuit, ρout: ρg = win ρin + (1 − win) ρout.

Global reward rate with respect to parceling the world into “in” and “outside” the considered pursuit.

A-C as in Figure 1 “Forgo”. D) The world divided into “Inside” and “Outside” the purple pursuit, as the agent decides whether to forgo or accept. The axes are centered on the position of the agent, just before the purple pursuit, where the upper right quadrant shows the inside (purple) pursuit’s reward rate (ρin), while the bottom left quadrant shows the outside (gold) pursuit reward rate (ρout). The global reward rate (ρg) is shown in magenta, calculated from the equation in the box to the right. The agent may determine the higher reward rate yielding policy by comparing the outside reward rate (ρout) with the resulting global reward rate (ρg) under a policy of accepting the considered pursuit.

Therefore, the global reward rate is the sum of the local reward rates of the world’s constituent pursuits under a given policy when weighted by their relative occupancy: the weighted average of the local reward rates of the pursuits constituting the world.

Reward-rate optimizing forgo policy: compare a pursuit’s local reward rate to its outside reward rate

We can now compare two competing policies to identify the one that maximizes reward rate, such that it achieves the maximum possible reward rate, ρg*. A policy of taking or forgoing a given pursuit type may improve the reward rate reaped from the environment as a whole (Figure 3). Using Equation 5, the policy achieving the greatest global reward rate can be realized through an iterative process in which pursuits whose reward rates are lower than the reward rate obtained from everything other than the considered pursuit type are sequentially removed from the policy. The optimal forgo policy can therefore be calculated directly from the considered pursuit’s reward rate, ρin, and the reward rate outside of that pursuit type, ρout. The global reward rate can be maximized by iteratively forgoing the considered pursuit when its reward rate is less than its outside reward rate, ρin < ρout, treating forgoing and taking a considered pursuit as equivalent when ρin = ρout, and taking the considered pursuit when ρin > ρout (Ap. 5).
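The iterative process just described can be sketched as follows. The global-rate form used here is an assumption borrowed from standard prey-selection models (default-pursuit encounter frequencies f_i), and all numeric values are illustrative; the policy logic itself is the one stated above: repeatedly forgo any pursuit whose local rate falls below its outside rate.

```python
def global_rate(accept, rs, ts, fs, rho_d=0.0):
    # Global rate under an accept set (indices), using the prey-model
    # form (assumed): (rho_d + sum f_i*r_i) / (1 + sum f_i*t_i).
    num = rho_d + sum(fs[i] * rs[i] for i in accept)
    den = 1.0 + sum(fs[i] * ts[i] for i in accept)
    return num / den

def optimal_policy(rs, ts, fs, rho_d=0.0):
    # Start by accepting every pursuit type; repeatedly forgo any whose
    # local rate rho_in = r/t is below its outside rate (the global rate
    # of the policy that excludes it), until no pursuit changes status.
    accept = set(range(len(rs)))
    changed = True
    while changed:
        changed = False
        for i in sorted(accept):
            rho_in = rs[i] / ts[i]
            rho_out = global_rate(accept - {i}, rs, ts, fs, rho_d)
            if rho_in < rho_out:
                accept.remove(i)
                changed = True
    return accept, global_rate(accept, rs, ts, fs, rho_d)

# Illustrative world: a rich pursuit (rate 4) and a poor one (rate 0.5).
# The poor pursuit is forgone because its rate is below its outside rate.
accept, best_rate = optimal_policy([4.0, 1.0], [1.0, 2.0], [0.5, 0.5])
```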

Forgo Decision-making.

A) When the reward rate of the considered pursuit exceeds its outside rate, the global reward rate will be greater than the outside rate, and the agent should therefore accept the considered pursuit. B) When the reward rates inside and outside the considered pursuit are equivalent, the global reward rate will be the same whether accepting or forgoing: the policies are equivalent. C) When the reward rate of the considered pursuit is less than its outside rate, the resulting global reward rate if accepting the considered pursuit will be less than its outside reward rate, and the pursuit should therefore be forgone.

Following this policy would be equivalent to comparing the local reward rate of a pursuit to the global reward rate obtained under the reward rate optimal policy: forgo the pursuit when its local reward rate is less than the global reward rate under the optimal policy, ρin < ρg*; take or forgo the pursuit when its reward rate equals the global reward rate under the optimal policy, ρin = ρg*; and take the pursuit when its local reward rate is greater than the global reward rate under the optimal policy, ρin > ρg* (Ap. 5). The maximum reward rate reaped from the environment can thus eventually be obtained by comparing the local reward rate of a considered pursuit to its outside reward rate (i.e., the global reward rate of a policy of not accepting the considered pursuit type).

Equivalent immediate reward: the ‘subjective value’, sv, of a pursuit

Having recognized how a world can be decomposed into pursuits described by their rates and weights and identifying optimal policies under forgo decisions, we may now ask anew, “What is the worth of a pursuit?” Figure 2D illustrates that the global reward rate obtained under a policy of taking a pursuit is not just a function of the time and return of the pursuit itself, but also the time spent and return gained outside of that pursuit type. Therefore, the worth of a pursuit relates to how much the pursuit would add (or detract) from the global reward rate realized in its acquisition.

Subjective Value of the considered pursuit with respect to the global reward rate

This relationship between a considered pursuit type, its outside, and the global reward rate can be re-expressed in terms of an immediate reward magnitude requiring no time investment that yields the same global reward rate as that arising from a policy of taking the pursuit (Figure 4). Thus, for any pursuit in a world, the amount of immediate reward that would be accepted in place of its initiation and attainment could serve, then, as a metric of the pursuit’s worth at the time of its initiation. Given the optimal policy above, an expression for this immediate reward magnitude can be derived (Ap. 6). This global reward-rate equivalent immediate reward (see Figure 4) is the subjective value of a pursuit, svPursuit (or simply, sv, when the referenced pursuit can be inferred).

The Subjective Value (sv) of a pursuit is the global reward rate-equivalent immediate reward magnitude.

The subjective value of a pursuit is that amount of reward requiring no investment of time that the agent would take as equivalent to accepting and acquiring the considered pursuit. For this amount to be equivalent, the immediate reward magnitude must result in the same global reward rate as that of accepting the pursuit. The global reward rate obtained under a policy of accepting the considered pursuit type is the slope of the line connecting the average times and rewards obtained in and outside the considered pursuit type. Therefore, the global reward rate equivalent immediate reward (i.e., the subjective value of the pursuit) can be depicted graphically as the y-axis intercept of the line representing the global reward rate achieved under a policy of accepting the considered pursuit.

Equation 8. The Subjective Value of a pursuit expressed in terms of the global reward rate achieved under a policy of accepting that pursuit

The subjective value of a pursuit under the reward-rate optimal policy will be denoted as sv*Pursuit.

The calculation of the subjective value of a pursuit, sv, quantifies precisely the worth of a pursuit in terms of an immediate reward that would result in the same global reward rate as that pursuant to its attainment. Thus, choosing either an immediate reward of magnitude sv, or choosing to pursue the considered pursuit, investing the required time and acquiring its reward, would produce an equivalent global reward rate. An agent pursuing an optimal policy would therefore prefer the considered pursuit over immediate rewards of magnitude less than sv, and would prefer immediate rewards of magnitude greater than sv over the pursuit.
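This equivalence can be checked numerically. The sketch below (with illustrative values) computes sv from the text's Equation 8 form, sv = rin − ρg·tin, and confirms that substituting sv as an immediate, zero-time reward in place of the pursuit reproduces exactly the same global reward rate:

```python
def subjective_value(r_in, t_in, r_out, t_out):
    # Equation 8: sv = r_in - rho_g * t_in, where rho_g is the global
    # reward rate under a policy of accepting the considered pursuit.
    rho_g = (r_in + r_out) / (t_in + t_out)
    return r_in - rho_g * t_in

# Illustrative values for the considered pursuit and its outside:
r_in, t_in, r_out, t_out = 3.0, 8.0, 1.0, 4.0
sv = subjective_value(r_in, t_in, r_out, t_out)

# Taking the pursuit vs. taking sv immediately (zero time investment)
# must yield the same global reward rate per traversal of the world.
rate_with_pursuit = (r_in + r_out) / (t_in + t_out)
rate_with_sv = (sv + r_out) / (0.0 + t_out)
```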

The forgo decision can also be made from subjective value

With this understanding, in the case that the considered pursuit’s reward rate is greater than its outside reward rate, it will be greater than the optimal global reward rate, and therefore the subjective value under an optimal policy will be greater than zero (Figure 3A).

Should the considered pursuit’s reward rate be equal to its outside reward rate, it will be equal to the optimal global reward rate, and the subjective value of the considered pursuit will be zero (Figure 3B).

Finally, if the considered pursuit’s reward rate is less than the outside reward rate, it must also be less than the global optimal reward rate; therefore, the subjective value of the considered pursuit under the optimal policy will be less than zero (Figure 3C).
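The three cases above can be illustrated numerically. Holding the outside fixed (illustrative values: reward 1 over 4 time units, so ρout = 0.25) and varying the considered pursuit's reward around that rate shows sv moving from positive, through zero, to negative:

```python
def subjective_value(r_in, t_in, r_out, t_out):
    # sv = r_in - rho_g * t_in, with rho_g = (r_in + r_out)/(t_in + t_out),
    # per the text's Equation 8 form.
    rho_g = (r_in + r_out) / (t_in + t_out)
    return r_in - rho_g * t_in

# Outside: r_out = 1 over t_out = 4, so rho_out = 0.25.
# Considered pursuit takes t_in = 8 throughout; only r_in varies.
sv_above = subjective_value(3.0, 8.0, 1.0, 4.0)  # rho_in = 0.375 > rho_out
sv_equal = subjective_value(2.0, 8.0, 1.0, 4.0)  # rho_in = 0.250 = rho_out
sv_below = subjective_value(1.0, 8.0, 1.0, 4.0)  # rho_in = 0.125 < rho_out
```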

While the brains of humans and animals may not in fact calculate subjective value, converting to the equivalent immediate reward, sv, 1) connects to temporal decision-making experiments in which such equivalences between delayed and immediate rewards are assessed, 2) serves as a common scale of comparison irrespective of the underlying decision-making process, and 3) deepens an understanding of how the worth of a pursuit is affected by the temporal structure of the environment’s reward-time landscape.

Subjective value with respect to the pursuit’s outside: insights into the cost of time

To the latter point, Equation 8 has a (deceptively) simple appeal: the worth of a pursuit ought to be its reward magnitude less its cost of time (Figure 5A). But what is the cost of time? The cost of time of a considered pursuit is the global reward rate of the world under a policy of accepting the pursuit, times the time that the pursuit would take, ρgtin (Figure 5B). Therefore, the equivalent immediate reward of a pursuit, its subjective value, corresponds to the subtraction of the cost of time from the pursuit’s reward. The subjective value of a pursuit is thus how much more reward the pursuit yields than would, on average, be earned by investing that same amount of time in that world under a policy of accepting the considered pursuit.

Equivalent expressions for subjective value reveal time’s cost comprises an opportunity as well as apportionment cost.

A. The subjective value of a pursuit can be expressed in terms of the global reward rate obtained under a policy of accepting the pursuit. It is how much more reward is earned from the pursuit over its duration than would on average be earned under a policy of accepting the pursuit. B. The cost of time of a pursuit is the amount of reward earned on average in the environment over the time needed for its obtainment under a policy of accepting the pursuit. The reward rate earned on average is the global reward rate (slope of the maroon line). Projecting that global reward rate over the time of the considered pursuit (dashed maroon line) provides the cost of time for the pursuit (vertical maroon bar). Therefore, the subjective value of a pursuit is equivalent to its reward magnitude less the pursuit’s cost of time. C. Expressing subjective value with respect to the outside reward rate rather than the global reward rate reveals that a portion of a pursuit’s time cost arises from an opportunity cost (orange bar). The opportunity cost of a pursuit is the amount of reward earned over the considered pursuit’s time, on average, under a policy of not taking the considered pursuit (the outside reward rate, the slope of the gold line). Projecting the slope of the gold line over the time of the considered pursuit (dashed gold line) provides the opportunity cost of the pursuit (vertical orange bar). The opportunity cost-subtracted reward (cyan bar) can then be scaled to a magnitude of reward requiring no time investment that would be equivalent to investing the time and acquiring the reward of the pursuit, i.e., its subjective value. The equation’s denominator provides this scaling term: the proportion that the outside time is to the total time to traverse the world. D. The difference between time’s cost and the opportunity cost of a pursuit is the pursuit’s apportionment cost (brown bar). The apportionment cost is the amount of the opportunity cost-subtracted reward that would accrue, on average, over the pursuit’s time under a policy of accepting the pursuit. E&F. Whether expressed in terms of the global reward rate achieved under a policy of not accepting the considered pursuit (E) or of accepting it (F), the subjective value expressions are equivalent.

While appealing in its simplicity, the terms on the right-hand side of Equation 8, rin and ρg·tin, are not independent of one another: the reward of the considered pursuit type contributes to the global reward rate, ρg. Subjective value can be understood more deeply by re-expressing it in terms that are independent of one another. Rather than expressing the worth of a pursuit in terms of the global reward rate obtained when accepting it, as in Equation 8, the worth of a pursuit can be expressed in terms of the rate of reward obtained outside the considered pursuit type (Figure 5C), as in Equation 9 (see Ap. 8 for derivation).

Equation 9. Subjective value of a pursuit from the perspective of the considered pursuit and its outside:

sv = (rin − ρout·tin) / (1 + tin/tout)

These expressions are equivalent to one another (see Ap. 8 and Figure 5).

For an interactive exploration of the effects of changing the outside and inside reward and time on subjective value, see Supplemental GUI.

Time’s cost: opportunity & apportionment costs determine a pursuit’s subjective value

By decomposing the global reward rate into components ‘inside’ and ‘outside’ the considered pursuit, the cost of time is revealed to be determined by 1) an opportunity cost and 2) an apportionment cost (Figure 5). The opportunity cost associated with a considered pursuit, ρout·tin, is the reward rate of the world under a policy of not accepting the considered pursuit (its outside rate), ρout, times the time of the considered pursuit, tin (Figure 5C). In the numerator of Equation 9 (right-hand side), this opportunity cost is subtracted from the reward obtained from accepting the considered pursuit. In addition to this opportunity cost, the cost of time is also determined by time’s apportionment cost (Figure 5D). The apportionment cost relates to time’s allocation in the world: the time spent within a pursuit type relative to the time spent outside that pursuit type, which appears in the denominator. The denominator uses time’s apportionment to scale the opportunity cost-subtracted reward of the pursuit to its global reward rate-equivalent magnitude requiring no time investment. The amount of reward by which this downscaling decreases the opportunity cost-subtracted reward is the apportionment cost of time. In so downscaling, the subjective value of a considered pursuit (green) is to the time it would take to traverse the world were the pursuit not taken, tout, as its opportunity cost-subtracted reward (cyan) is to the time to traverse the world were it taken (tin + tout) (Figure 5E). Let us now consider the impact that changing the outside reward and/or outside time has on these two determinants of time’s cost, opportunity and apportionment cost, to further our understanding of the subjective value of a pursuit.
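The equivalence of the two expressions can be checked numerically. Below is a minimal Python sketch (not from the paper; function and variable names are ours), assuming Equation 8’s form sv = rin − ρg·tin with ρg = (rin + ρout·tout)/(tin + tout), and Equation 9’s form sv = (rin − ρout·tin)/(1 + tin/tout):

```python
def sv_eq8(r_in, t_in, rho_out, t_out):
    # Equation 8: sv = r_in - rho_g * t_in, where rho_g is the global
    # reward rate under a policy of accepting the pursuit.
    rho_g = (r_in + rho_out * t_out) / (t_in + t_out)
    return r_in - rho_g * t_in

def sv_eq9(r_in, t_in, rho_out, t_out):
    # Equation 9: subtract the opportunity cost (rho_out * t_in), then
    # scale by the apportionment term in the denominator.
    return (r_in - rho_out * t_in) / (1 + t_in / t_out)

# The two expressions agree for arbitrary world parameters:
for params in [(5.0, 2.0, 0.5, 8.0), (3.0, 4.0, 2.0, 1.0), (1.0, 1.0, 0.0, 9.0)]:
    assert abs(sv_eq8(*params) - sv_eq9(*params)) < 1e-12
```

The advantage of the Equation 9 form is that its terms, unlike those of Equation 8, are independent of one another.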

The effect of increasing the outside reward on the subjective value of a pursuit

Figure 6 illustrates the impact of changing the reward reaped from outside the pursuit on the pursuit’s subjective value. Holding the time spent outside the considered pursuit constant, changing the outside reward changes the outside reward rate. When the considered pursuit’s reward rate is greater than the outside reward rate, the subjective value is positive (Figure 6A). The subjective value diminishes linearly (Figure 6B, green dots) to zero as the outside reward rate increases to match the pursuit’s reward rate, and turns negative as the outside reward rate exceeds the pursuit’s reward rate, indicating that a policy of accepting the considered pursuit would result in a lower global reward rate than that garnered under a policy of forgoing the pursuit. Under these conditions, the subjective value decreases linearly as the outside reward increases because the cost of time increases linearly (Figure 6B, shaded region).

The impact of outside reward on the subjective value of a pursuit.

A) Increasing the outside reward while holding the outside time constant increases the outside reward rate (slope of gold lines), thereby increasing the global reward rate (slope of the purple lines) and decreasing the subjective value (green dots) of the pursuit. As the reward rate of the environment outside the considered pursuit type increases from lower than to higher than that of the considered pursuit, the subjective value of the pursuit decreases, becomes zero when the inside and outside rates are equivalent, and goes negative when ρout exceeds ρin. B) Plotting the subjective value of the pursuit as a function of increasing outside reward (while holding tout constant) reveals that the subjective value of the pursuit decreases linearly. This linear decrease is due to the linear increase in the cost of time of the pursuit (purple dotted region). C) Time’s cost (the area, as in B, between the pursuit’s reward magnitude and its subjective value) is the sum of the opportunity cost of time (orange dotted region) and the apportionment cost of time (plum annuli region). When the outside reward rate is zero, time’s cost is composed entirely of an apportionment cost. As the outside reward increases, opportunity cost increases linearly as apportionment cost decreases linearly, until the reward rates in and outside the pursuit become equivalent, at which point the subjective value of the pursuit is zero. When subjective value is zero, the cost of time is entirely composed of opportunity cost. As the outside rate exceeds the inside rate, opportunity cost continues to increase, while the apportionment cost becomes negative (which is to say, the apportionment cost of time becomes an apportionment gain of time). Summing the positive opportunity cost and the negative apportionment cost (graphically, subtracting the region of purple and orange overlap from the opportunity cost) yields time’s cost, which, subtracted from the pursuit’s reward, yields the subjective value of the pursuit.

Time’s cost is the sum of the opportunity cost and apportionment cost of time (Figure 6C). When the outside reward is zero, there is zero opportunity cost of time, and time’s cost is constituted entirely by the apportionment cost of time. Apportionment cost (Figure 6C, left-hand y-axis) decreases as outside reward increases because the difference between the inside and outside reward rates diminishes, making how time is apportioned in and outside the pursuit less relevant. At the same time, as outside reward increases, the opportunity cost of time increases (Figure 6C, right-hand y-axis). When the inside and outside rates are the same, how the agent apportions its time in or outside the pursuit does not impact the global rate of reward. At this point, the apportionment cost of time has fallen to zero, and the opportunity cost has come to constitute time’s cost entirely. Further increases in the outside reward result in the outside rate being increasingly greater than the inside rate, making the apportionment of time in and outside the pursuit increasingly relevant once more. Now, though the opportunity cost of time continues to grow, the apportionment cost of time grows increasingly negative (which is to say, the pursuit has an apportionment gain). Subtracting the sum of the opportunity cost and the negative apportionment cost (i.e., the apportionment gain) from the pursuit’s reward yields the subjective value of the pursuit.
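This decomposition can be sketched numerically. Assuming Equation 9’s form, the opportunity cost is ρout·tin and the apportionment cost is the opportunity cost-subtracted reward scaled by the pursuit’s share of the total time (a sketch, not the paper’s code; names are ours):

```python
def time_cost_components(r_in, t_in, rho_out, t_out):
    sv = (r_in - rho_out * t_in) / (1 + t_in / t_out)  # Equation 9
    opportunity = rho_out * t_in
    # Apportionment cost: the opportunity cost-subtracted reward, scaled
    # by the pursuit's share of the total time to traverse the world.
    apportionment = (r_in - rho_out * t_in) * t_in / (t_in + t_out)
    # Time's cost (reward minus subjective value) is exactly their sum.
    assert abs((r_in - sv) - (opportunity + apportionment)) < 1e-12
    return opportunity, apportionment

# Inside rate = 4/2 = 2. As rho_out rises from 0 to above 2, opportunity
# cost grows while apportionment cost falls, hits zero, then goes negative.
assert time_cost_components(4.0, 2.0, 0.0, 8.0) == (0.0, 0.8)
assert time_cost_components(4.0, 2.0, 2.0, 8.0) == (4.0, 0.0)
opp, app = time_cost_components(4.0, 2.0, 3.0, 8.0)
assert opp == 6.0 and app < 0  # apportionment "gain"
```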

The effect of changing the outside time on the subjective value of the considered pursuit

Figure 7 examines the effect of changing the outside time on the subjective value of a pursuit while holding the outside reward constant at a value of zero. Doing so affords a means to examine the apportionment cost of time in isolation from the opportunity cost of time. Despite there being no opportunity cost, there is still a cost of time (Figure 7B), composed entirely of the apportionment cost (Figure 7C). When the portion of time spent outside the pursuit dominates, the pursuit’s apportionment cost of time is small. As the portion of time spent outside the pursuit decreases and the relative apportionment of time spent in the pursuit increases, the apportionment cost of the pursuit increases purely hyperbolically, resulting in the subjective value of the pursuit decreasing purely hyperbolically (Figure 7). As the time spent outside the considered pursuit becomes diminishingly small, the pursuit comprises more and more of the world, until time is apportioned entirely to the pursuit, at which point the apportionment cost of time equals the pursuit’s reward rate times its duration (i.e., the pursuit’s reward magnitude) and subjective value falls to zero.
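In this zero-outside-reward case, Equation 9 collapses to sv = rin·tout/(tin + tout), a pure hyperbola in the outside time. A small sketch (names are ours) confirms the limiting behavior described above:

```python
def sv_no_outside_reward(r_in, t_in, t_out):
    # Equation 9 with rho_out = 0: a pure hyperbola in the outside time.
    return r_in * t_out / (t_in + t_out)

r_in, t_in = 6.0, 3.0
assert sv_no_outside_reward(r_in, t_in, 0.0) == 0.0   # time's cost = full r_in
assert abs(sv_no_outside_reward(r_in, t_in, 1e9) - r_in) < 1e-6
# Hallmark of a hyperbola: sv is half its maximum when t_out equals t_in.
assert sv_no_outside_reward(r_in, t_in, t_in) == r_in / 2
```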

The impact of the apportionment cost of time on the subjective value of a pursuit.

A) The apportionment cost of time is best illustrated dissociated from the contribution of the opportunity cost of time by considering the special instance in which the outside has no reward, and therefore a reward rate of zero. B) Even in such instances, the pursuit still has a cost of time. C) Here, the cost of time is composed entirely of apportionment cost, which arises from the fact that the considered pursuit contributes its proportion to the global reward rate. The size of the pursuit’s time cost is thus determined by the ratio of the time spent in the pursuit versus outside it: the more time spent outside the pursuit, the smaller the apportionment cost of time of the pursuit, and therefore, the greater the subjective value of the pursuit. When apportionment cost solely composes the cost of time, the cost of time decreases hyperbolically as the outside time increases, resulting in the subjective value of the pursuit increasing hyperbolically.

The effect of changing the outside time and the outside reward rate on the subjective value of a pursuit

Having examined the effect of varying outside reward (Figure 6) and outside time (Figure 7), let us now consider the impact of varying, jointly, the outside time and the outside reward rate (Figure 8). Changing the outside time while holding the outside reward constant varies the reward rate obtained outside the pursuit while also changing the apportionment of time in and outside the pursuit (Figure 8A), thus impacting both the opportunity and the apportionment cost of time. Plotting the subjective value-by-outside time function (Figure 8B) reveals that subjective value increases hyperbolically under these conditions as outside time increases, which is to say, time’s cost decreases hyperbolically. Decomposing time’s cost into its constituent opportunity and apportionment costs (Figure 8C) illustrates how these components vary with outside time. Opportunity cost (orange dots) decreases hyperbolically as the outside time increases. Apportionment cost varies as the difference of two hyperbolas (plum annuli area), initially decreasing to zero as the outside and inside rates become equal, and then increasing. Taken together, their sum decreases hyperbolically as outside time increases, resulting in subjective values that increase hyperbolically, spanning from the negative of the outside reward magnitude to the inside reward magnitude.
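Under these conditions, substituting ρout = rout/tout into Equation 9 gives sv = (rin·tout − rout·tin)/(tin + tout), whose limits reproduce the span described above (a sketch under the stated assumptions; names are ours):

```python
def sv_fixed_outside_reward(r_in, t_in, r_out, t_out):
    # Equation 9 with rho_out = r_out / t_out substituted and simplified.
    return (r_in * t_out - r_out * t_in) / (t_in + t_out)

r_in, t_in, r_out = 4.0, 2.0, 1.0
# As t_out shrinks, sv approaches -r_out; as t_out grows, sv approaches r_in.
assert abs(sv_fixed_outside_reward(r_in, t_in, r_out, 1e-9) - (-r_out)) < 1e-6
assert abs(sv_fixed_outside_reward(r_in, t_in, r_out, 1e9) - r_in) < 1e-6
```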

The effect of changing the outside time and the outside reward rate on the subjective value of a pursuit.

A) The subjective value (green dots) of the considered pursuit when changing the outside time and outside reward rate. B) As outside time increases under these conditions (holding positive outside reward constant), the subjective value of the pursuit increases hyperbolically, from the negative of the outside reward magnitude to, in the limit, the inside reward magnitude. Conversely, time’s cost (purple annuli) decreases hyperbolically. C) Opportunity cost decreases hyperbolically as outside time increases. Apportionment cost initially decreases to zero as the outside and inside rates become equal, and then increases as the difference of two hyperbolas (plum annuli area). When the outside reward rate is greater than the inside reward rate, apportionment could be said to have a gain (a negative cost). Summing opportunity cost and apportionment cost yields time’s cost.

The value of initiating pursuits in choice decision-making

Above, we determined how a reward-rate maximizing agent would evaluate the worth of a pursuit, identifying the impact of a policy of taking (or forgoing) that pursuit on the realized global reward rate, and expressing that pursuit’s worth as subjective value. We did so by opposing a pursuit with its equivalent offer requiring no time investment, a special and instructive case. In this section we consider what decision should be made when an agent is simultaneously presented with a choice between more than one pursuit, of any potential magnitude and time investment. Using subjective value in these choice decisions, we more thoroughly examine how the duration and magnitude of a pursuit, and the context in which it is embedded (its ‘outside’), impact reward-rate optimal valuation. We then re-express subjective value as a temporal discounting function, revealing that the apparent temporal discounting function of a reward-rate maximizing agent is determined wholly by the temporal structure and magnitude of rewards in the environment. Finally, we assess whether hyperbolic discounting and the “Magnitude” and “Sign” effects, purported signs of suboptimal decision-making (Thaler, 1981; Loewenstein and Thaler, 1989; Estle et al., 2006), are in fact consistent with optimal decision-making.

Choice decision-making

Consider a temporal decision in which two or more mutually exclusive options are simultaneously presented following a period that is common to policies of choosing one or another of the considered options (Figure 9). In such scenarios, subjects choose between outcomes differing in magnitude and the time at which they will be delivered. Of particular interest are choices between a smaller, sooner reward pursuit (“SS” pursuit) and a larger, later reward pursuit (“LL” pursuit) (Myerson and Green, 1995; Frederick et al., 2002; Madden and Bickel, 2010; Peters and Büchel, 2011). Such intertemporal decision-making is commonplace in the laboratory setting (McDiarmid and Rilling, 1965; Rachlin et al., 1972; Ainslie, 1974; Snyderman, 1983; Myerson and Green, 1995; Bateson and Kacelnik, 1996; Ostaszewski, 1996; Stephens and Anderson, 2001; Cheng et al., 2002; Frederick et al., 2002; Hayden and Platt, 2007; Hayden et al., 2007; McClure et al., 2007; Beran and Evans, 2009; Peters and Büchel, 2011; Stevens and Mühlhoff, 2012; Carter et al., 2015; Carter and Redish, 2016).

Policy options considered during the initiation of pursuits in worlds with a “Choice” topology.

A-C) Choice topology, and policies of choosing the small-sooner or larger-later pursuit, as in Figure 1 “Choice”. D) The world divided into “Inside” and “Outside” the selected pursuit, as the agent decides whether to accept SS (aqua) or LL (purple) pursuit. The global reward rate (ρg) under a policy of choosing the SS or LL (slopes of the magenta lines), calculated from the equation in the box to the right.

Global reward rate equation and Optimal Choice Policy

With the global reward rate equation previously derived, we can identify which choice policy (choosing SS or choosing LL) maximizes the global reward rate. The optimal choice between the SS and the LL pursuit is as follows…

These policies’ optimality is intuitive. By choosing option LL, the subject earns rLL − rSS more reward than when choosing SS, but spends tLL − tSS more time. If the rate of reward earned from that extra time exceeds the reward rate of the environment generally, it is optimal to spend the extra time on the larger-later option. In other words, if the agent were to choose pursuit SS, the tLL − tSS of extra time would be spent earning reward at the global reward rate under that policy, ρg,choose SS, yielding the magnitude ρg,choose SS·(tLL − tSS). If ρg,choose SS·(tLL − tSS) exceeds the extra reward, rLL − rSS, that could be earned by investing that extra time in the LL pursuit, more reward would be earned in the same amount of time by choosing the SS pursuit.
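This intuition can be written as a decision rule and checked against a direct comparison of global reward rates. A minimal sketch (not the paper’s code; names are ours), assuming the global rate under a policy is (r + rout)/(t + tout):

```python
def global_rate(r, t, r_out, t_out):
    # Global reward rate under a policy of always choosing this pursuit.
    return (r + r_out) / (t + t_out)

def choose_LL(r_SS, t_SS, r_LL, t_LL, r_out, t_out):
    # Take LL iff the marginal rate of its extra reward over its extra
    # time beats the global rate attained under a choose-SS policy.
    marginal = (r_LL - r_SS) / (t_LL - t_SS)
    return marginal > global_rate(r_SS, t_SS, r_out, t_out)

# The marginal-rate rule agrees with directly comparing global rates:
for r_out, t_out in [(0.0, 5.0), (2.0, 3.0), (6.0, 2.0)]:
    direct = global_rate(5.0, 4.0, r_out, t_out) > global_rate(2.0, 1.0, r_out, t_out)
    assert choose_LL(2.0, 1.0, 5.0, 4.0, r_out, t_out) == direct
```

Algebraically, (rLL − rSS)/(tLL − tSS) > ρg,choose SS holds exactly when ρg,choose LL > ρg,choose SS, which is why the two tests always agree.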

Optimal Choice Policies based on Subjective Value

As under forgo decision-making, we can now also identify the global reward rate optimizing choice policies based on subjective value (Figure 9). The following policies would optimize reward rate when choosing between two options of different magnitude that require different amounts of time invested:

The impact of opportunity & apportionment costs on choice decision-making

With optimal policies for choice expressed in terms of subjective value, the impact of time’s opportunity and apportionment costs on choice decision-making can now be more deeply appreciated. Keeping the outside time constant, the opportunity cost of time increases as the outside reward (and thus the outside reward rate) increases, linearly decreasing the subjective value of the considered pursuits (Figure 10). However, because the opportunity cost of the LL pursuit is greater than that of the SS pursuit, owing to its greater time requirement, the slope of its subjective value function is steeper, resulting in a switch in preference from the LL pursuit to the SS pursuit at some critical outside reward rate.

Effect of opportunity cost on subjective value in choice decision-making.

The effect of increasing the outside reward while holding the outside time constant is to linearly increase the opportunity cost of time, thus decreasing the subjective value of pursuits considered in choice decision-making. When the outside reward is sufficiently small, the subjective value of the LL pursuit can exceed the SS pursuit, indicating that selection of the LL pursuit would maximize the global reward rate. As outside reward increases, however, the subjective value of pursuits will decrease linearly as the opportunity cost of time increases. Since a policy of choosing the LL pursuit will have the greater opportunity cost, the slope of its function relating subjective value to outside reward will be greater than that of a policy of choosing the SS pursuit. Thus, outside reward can be increased sufficiently such that the subjective value of the LL and SS pursuits will become equal, past which the agent will switch to choosing the SS pursuit.
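The linear decline and the resulting preference reversal can be verified with Equation 9 (illustrative parameter values; names are ours):

```python
def sv_choice(r, t, r_out, t_out):
    # Equation 9 with rho_out = r_out / t_out.
    return (r * t_out - r_out * t) / (t + t_out)

# sv falls linearly in r_out with slope -t/(t + t_out), so the longer LL
# pursuit (t = 4) loses value faster than SS (t = 1) as r_out grows.
t_out = 10.0
assert sv_choice(5.0, 4.0, 0.0, t_out) > sv_choice(2.0, 1.0, 0.0, t_out)    # LL preferred
assert sv_choice(5.0, 4.0, 20.0, t_out) < sv_choice(2.0, 1.0, 20.0, t_out)  # SS preferred
```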

A switch in preference between the SS and LL pursuits will also occur when the time spent outside the considered pursuit increases past some critical threshold, even if the outside reward rate remains constant (Figure 11). As any inside time constitutes a greater fraction of the total time under an LL than under an SS pursuit policy, the apportionment cost of the LL pursuit will be greater. This can result in the subjective value of the SS pursuit initially exceeding that of the LL pursuit. As the outside time increases, however, the ordering of subjective values will switch as apportionment costs become diminishingly small.

Effect of apportionment cost on subjective value in choice decision-making.

The effect of increasing the outside time (while maintaining the outside rate) is to decrease the apportionment cost of the considered pursuit, thus increasing its subjective value. When the outside time is sufficiently small, the apportionment cost for both LL and SS pursuits will be large, but greater still for the LL pursuit given its proportionally longer duration relative to the outside time. As outside time increases, however, the subjective value of each pursuit increases as the apportionment cost of time decreases. As apportionment costs diminish and the magnitudes of pursuits’ rewards become more fully realized, the subjective value of the LL pursuit will eventually exceed that of the SS pursuit at sufficiently long outside times.
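This apportionment-driven reversal can likewise be sketched by holding the outside rate fixed while varying the outside time (illustrative values; names are ours):

```python
def sv_at_rate(r, t, rho_out, t_out):
    # Equation 9: (r - rho_out * t) scaled by t_out / (t + t_out).
    return (r - rho_out * t) * t_out / (t + t_out)

rho_out = 0.25
# Short outside time: the LL pursuit's larger apportionment cost favors SS.
assert sv_at_rate(2.0, 1.0, rho_out, 0.5) > sv_at_rate(5.0, 4.0, rho_out, 0.5)
# Long outside time: apportionment costs shrink and LL wins out.
assert sv_at_rate(5.0, 4.0, rho_out, 10.0) > sv_at_rate(2.0, 1.0, rho_out, 10.0)
```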

Finally, the effect of varying opportunity and apportionment costs on subjective value in Choice behavior is considered (Figure 12). Opportunity and apportionment costs can simultaneously be varied, for instance, by maintaining outside reward but increasing outside time. Doing so decreases the apportionment as well as the opportunity cost of time by changing the proportion of time in and outside the considered pursuit, which, in turn, lowers the outside reward rate. A switch in preference will then occur from the SS to the LL pursuit as they are differentially impacted by both the opportunity as well as the apportionment cost of time.

Effect of varying opportunity and apportionment costs on Choice behavior.

The effect of increasing the outside time while maintaining the outside reward is to decrease the apportionment as well as the opportunity cost of time, thus increasing the pursuits’ subjective value. Increasing outside time, which in turn decreases the outside reward rate, results in the agent appearing as if more patient, switching from a policy of selecting the SS pursuit to a policy of selecting the LL pursuit past some critical threshold (vertical dashed black line).
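The joint case (outside reward held fixed, outside time increasing) can be sketched the same way (illustrative values; names are ours):

```python
def sv_joint(r, t, r_out, t_out):
    # Equation 9 with the outside reward magnitude held fixed.
    return (r * t_out - r_out * t) / (t + t_out)

r_out = 2.0
assert sv_joint(2.0, 1.0, r_out, 1.0) > sv_joint(5.0, 4.0, r_out, 1.0)     # SS at short t_out
assert sv_joint(5.0, 4.0, r_out, 10.0) > sv_joint(2.0, 1.0, r_out, 10.0)   # LL at long t_out
```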

A reward rate optimal agent will thus appear as if more patient the longer the time spent outside a considered pursuit, the lower the outside reward rate, or both, switching from a policy of choosing the SS to choosing the LL option at some critical outside reward rate and/or time. Having analyzed the impact of time spent and reward obtained outside a pursuit on a pursuit’s valuation, we now examine the impact time spent within a pursuit has on its valuation.

The Discounting Function of a reward rate optimal agent

How does the value of a pursuit change as the time required for its obtainment grows? Intertemporal decision-making between pursuits requiring differing time investments and resulting in different reward magnitudes has typically been examined using a ‘temporal discounting function’ to describe how delays in reward influence their valuation. This question has been investigated experimentally by pitting smaller-sooner options against larger-later options to determine the subjective value of the delayed reward (Mischel, Grusec, & Masters, 1969), with the best fit to many such observations across delays determining the subjective value function. After normalizing by the magnitude of reward, the curve of subjective values as a function of delay is the “temporal discounting function” (for review see Frederick et al., 2002). While the temporal discounting function has historically been used in many fields, including economics, psychology, ethology, and neuroscience, to describe how delays influence rewards’ subjective value, its origins, from a normative perspective, remain unclear (Hayden, 2015). What, then, is the temporal discounting function of a reward-rate optimal agent? And would its determination provide insight into why experimentally derived discounting functions present the way they do, with their varied forms and curious sensitivity to the context, magnitude, and sign of pursuit outcomes?

Discounting Function of an Optimal Agent is a Hyperbolic Function

The temporal discounting function of an optimal agent can be expressed by normalizing its subjective value-time function by the considered pursuit’s magnitude.

Equation 10. The Discounting Function of a Global Reward Rate Optimal Agent:

sv/rin = (1 − (ρout/rin)·tin) / (1 + tin/tout)

To illustrate the discounting function of a reward-rate maximizing agent, Figure 13 depicts how the worth of a pursuit’s reward changes as its required time investment increases, in three different world contexts: a world in which there is A) zero outside reward rate & large outside time, B) zero outside reward rate & small outside time, and C) positive outside reward rate & small outside time. Figure 13 first graphically depicts the subjective values of the pursuit’s reward at increasing temporal delays (the y-intercepts of the lines depicting the resulting global reward rates, green dots) in each of these world contexts (A-C). Then, by replotting these subjective values at their corresponding delays, the subjective value-time function is created for this increasingly delayed reward in each of these worlds (D-F). By normalizing by the reward magnitude, these subjective value-time functions are then converted to their corresponding discounting functions (color coded) and overlaid so that their shapes may be compared (G).

The temporal discounting function of a global reward-rate optimal agent is a hyperbolic function relating the apportionment and opportunity cost of time.

A-C) The effect, as exemplified in three different worlds, of varying the outside time and reward on the subjective value of a pursuit as its reward is displaced into the future. The subjective value, sv, of this pursuit as its temporal displacement into the future increases is indicated by the green dots along the y-intercept in three different contexts: a world in which there is A) zero outside reward rate & large outside time, B) zero outside reward rate & small outside time, and C) positive outside reward rate & small outside time (as in B). D-F) Replotting these subjective values at their corresponding temporal displacements yields the subjective value function of the offered reward in each of these contexts. G) Normalizing these subjective value functions by the reward magnitude and superimposing the resulting temporal discounting functions reveals how the steepness and curvature of the apparent discounting function of a reward-rate maximizing agent change with the average reward and time spent outside the considered pursuit. When the time spent outside is increased (compare B to A), thus decreasing the apportionment cost of time, the temporal discounting function becomes less curved, making the agent appear as if more patient. When the outside reward is increased (compare B to C), thus increasing the opportunity cost of time, the temporal discounting function becomes steeper, making the agent appear as if less patient.

Doing so illustrates that the mathematical form of the temporal discounting function, as it appears for the optimal agent, is hyperbolic. This form depends wholly on the temporal reward structure of the environment and is composed of hyperbolic and linear components relating to the apportionment and to the opportunity cost of time, respectively. To best appreciate the contributions of opportunity and apportionment costs to the discounting function of a reward-rate optimal agent, consider the following instances exemplified in Figure 13. First, in worlds in which no reward is received outside a considered pursuit, the apparent discounting function is purely hyperbolic (Figure 13A). Purely hyperbolic discounting is therefore optimal when the subjective value function follows sv = r/(1 + t/ITI) (ITI: an intertrial interval with no reward), as in many experimental designs. Second, as less time is apportioned outside the considered pursuit type (Figure 13B), this hyperbolic curve becomes more curved as the pursuit’s time apportionment cost increases. The curvature of the hyperbolic component is thus controlled by how much time the agent spends in versus outside the considered pursuit: the more time spent outside the pursuit, the gentler the curvature of apparent hyperbolic discounting, and the more patient the agent appears for the considered pursuit. Third, in worlds in which reward is received outside a considered pursuit (compare B to C), the apparent discounting function becomes steeper the more outside reward is obtained, as the linear component relating to the opportunity cost of time increases (while the apportionment cost of time decreases).
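These three observations can be condensed into the normalized form of Equation 9, D(t) = (1 − (ρout/r)·t)/(1 + t/tout): a hyperbola whose curvature is set by tout, multiplied by a linear term whose slope is set by the outside rate. A sketch (names are ours):

```python
def discount(t, t_out, rho_out_over_r):
    # Apparent discounting function of a reward-rate optimal agent:
    # a hyperbola in t (apportionment) times a linear term (opportunity).
    return (1.0 - rho_out_over_r * t) / (1.0 + t / t_out)

# With no outside reward the function is purely hyperbolic, 1/(1 + t/t_out):
assert discount(5.0, 5.0, 0.0) == 0.5
# A positive outside rate adds the linear component and steepens the curve:
assert discount(5.0, 5.0, 0.05) < discount(5.0, 5.0, 0.0)
# More outside time flattens the hyperbola (the agent appears more patient):
assert discount(5.0, 20.0, 0.0) > discount(5.0, 5.0, 0.0)
```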

Thus, by expressing the worth of a pursuit as evaluated by a reward-rate optimal agent in terms of its discounting function, we find that its form is consonant with what is commonly reported experimentally in humans and animals, and that it exhibits apparent changes in curvature and steepness that relate directly to the reward acquired and the time spent outside the considered pursuit for every instance of time spent within it.

Magnitude effect and the Sign Effect

With this insight into how opportunity and apportionment costs impact the cost of time, and therefore the subjective value of pursuits in choice decision-making, reward-rate optimal agents are now understood to exhibit a hyperbolic form of discounting, as commonly exhibited by humans and animals (Rachlin et al., 1972; Ainslie, 1975; Thaler, 1981; Mazur, 1987; Benzion et al., 1989; Green et al., 1994; Rachlin et al., 2000; Kobayashi and Schultz, 2008; Calvert et al., 2010; Fedus et al., 2019). Given that hyperbolic discounting is thus not, as widely asserted, a sign of suboptimal decision-making, are other purported signs of suboptimality, namely the “Magnitude” and “Sign” effects, also consistent with optimal temporal decisions?

Magnitude effect

The Magnitude Effect refers to the observation that the temporal discounting function, as experimentally determined, becomes less steep the larger the offered reward. If brains apply a discounting function to account for the delay to reward, why, as the question is posed, do different magnitudes of reward appear as if discounted with different temporal discounting functions? Figure 14 considers how a reward-rate maximizing agent would appear to discount rewards of two magnitudes (large, top row; small, bottom row): first by determining the subjective value (green dots) of the differently sized rewards (Figure 14A & D) across a range of delays, and second by replotting the subjective values at their corresponding delays (Figure 14B & E) to form their subjective value functions (blue and red curves, respectively). After normalizing these subjective value functions by their corresponding reward magnitudes, the resulting temporal discounting functions that would be fit to a reward-rate maximizing agent are shown in Figure 14C. The pursuit with the larger reward outcome (blue) would thus appear as if discounted by a less steep discounting function than the pursuit with the smaller reward (red), under what are otherwise the same circumstances. Therefore, the ‘Magnitude Effect’, as observed in humans and animals, would also be exhibited by a reward-rate maximizing agent.
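The Magnitude Effect falls directly out of this normalization: the linear (opportunity cost) term in the discounting function scales as ρout/r, so a larger reward shrinks it. A sketch under these assumptions (names are ours):

```python
def discounting_function(t, t_out, rho_out, r):
    # Equation 9 normalized by the reward magnitude r.
    return (1.0 - (rho_out / r) * t) / (1.0 + t / t_out)

# Same world (rho_out = 0.2, t_out = 5), two reward sizes: at every delay
# the larger reward's apparent discounting function is shallower.
for t in [1.0, 2.0, 5.0, 10.0]:
    assert discounting_function(t, 5.0, 0.2, r=10.0) > discounting_function(t, 5.0, 0.2, r=2.0)
```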

Reward-rate maximizing agents would exhibit the “Magnitude effect”.

A&B) The global reward rate (the slope of magenta vectors) that would be obtained when acquiring a considered pursuit’s reward of a given size (either relatively large as in A or small as in B) but at varying temporal removes, depicts how a considered pursuit’s subjective value (green dots, y-intercept) would decrease as the time needed for its obtainment increases in environments that are otherwise the same. C&D) Replotting the subjective values of the considered pursuit to correspond to their required delay forms the subjective value-time function for the “large” reward case (C), and the “small” reward case (D). E) Normalizing the subjective value-time functions by their reward magnitude transforms these functions into their corresponding discounting functions (blue: large reward DF; red: small reward DF), and reveals that a reward-rate maximizing agent would exhibit the “Magnitude Effect” as the steepness of the apparent discounting function would change with the size of the pursuit, and manifest as being less steep the greater the magnitude of reward.

The Sign Effect

The Sign Effect refers to the observation that discounting functions for outcomes of the same magnitude but opposite valence (rewards and punishments) appear to discount at different rates, with punishments discounting less steeply than rewards. Should the brain apply a discounting function to outcomes to account for their temporal delays, why does it seemingly use different discounting functions for rewards and punishments of the same magnitude? Figure 15 considers how a reward-rate maximizing agent would appear to discount outcomes of the same magnitude but opposite valence while obtaining a positive reward rate outside the pursuit. By determining the subjective value of these oppositely signed outcomes across a range of delays and plotting their normalized subjective values at their corresponding delays, the apparent discounting functions for reward and punishment, as expressed by a reward-rate maximizing agent, exhibit the “Sign Effect” observed in humans and animals. In addition, we note that the difference in discounting function slopes between rewards and punishments of equal magnitude would diminish as the outside reward rate approached zero, vanish when it is zero, and even invert when the outside reward rate is negative (which is to say, rewards would then appear to discount less steeply than punishments).
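A sketch of the same normalization applied to signed outcomes (illustrative values; names are ours) shows the asymmetry and its dependence on the outside rate:

```python
def discount_signed(t, t_out, rho_out, outcome):
    # Subjective value of a signed outcome (Equation 9), normalized by it.
    sv = (outcome - rho_out * t) / (1.0 + t / t_out)
    return sv / outcome

# Positive outside rate: the punishment (-4) appears to discount less
# steeply (its normalized value stays closer to 1) than the reward (+4).
assert discount_signed(2.0, 5.0, 0.5, -4.0) > discount_signed(2.0, 5.0, 0.5, 4.0)
# Zero outside rate: the two discounting functions coincide.
assert abs(discount_signed(2.0, 5.0, 0.0, -4.0) - discount_signed(2.0, 5.0, 0.0, 4.0)) < 1e-12
# Negative outside rate: the ordering inverts.
assert discount_signed(2.0, 5.0, -0.5, 4.0) > discount_signed(2.0, 5.0, -0.5, -4.0)
```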

Reward-rate maximizing agents would exhibit the “Sign effect”.

A&B) The global reward rate (the slope of magenta lines) that would be obtained when acquiring a considered pursuit’s outcome of a given magnitude but differing in sign (either rewarding as in A, or punishing as in B), depicts how the subjective value (green dots, y-intercept) would decrease as the time of its obtainment increases in environments that are otherwise the same (one in which the agent spends the same amount of time and receives the same amount of reward outside the considered pursuit for every instance within it). C&D) Replotting the subjective values of the considered pursuit to correspond to their required delay forms the subjective value-time function for the reward (C) and for the punishment (D). E) Normalizing the subjective value-time functions by their outcome transforms these functions into their corresponding discounting functions (blue: reward DF; red: punishment DF). This reveals that a reward-rate maximizing agent would exhibit the “Sign Effect”, as the steepness of the apparent discounting function would change with the sign of the pursuit, manifesting as being less steep for punishing than for rewarding outcomes of the same magnitude.
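Under the same subjective-value expression (again with illustrative parameter values of our own choosing), normalizing by the outcome itself reproduces the asymmetry described above, including its disappearance and inversion as the outside reward rate passes through zero:

```python
def apparent_discount(outcome, t_in, rho_out, t_out):
    # Subjective value of an outcome (positive or negative), normalized by
    # the outcome so that both valences start at 1.0 at zero delay.
    sv = (outcome - rho_out * t_in) * t_out / (t_in + t_out)
    return sv / outcome

r, t, t_out = 4.0, 6.0, 10.0
# Positive outside rate: punishments appear to discount less steeply.
assert apparent_discount(-r, t, 0.3, t_out) > apparent_discount(r, t, 0.3, t_out)
# Zero outside rate: the two discounting functions coincide.
assert apparent_discount(-r, t, 0.0, t_out) == apparent_discount(r, t, 0.0, t_out)
# Negative (net punishing) outside rate: the Sign Effect inverts.
assert apparent_discount(-r, t, -0.3, t_out) < apparent_discount(r, t, -0.3, t_out)
```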

Summary

In the above sections, we provide a richer understanding of the origins of time’s cost in evaluating the worth of initiating a pursuit. We demonstrate that the intuitive, if deceptively simple, equation for subjective value (Equation 8) that subtracts time’s cost is equivalent to subtracting an opportunity cost and an apportionment cost of time (Equation 9). Whereas the simple equation’s time cost is calculated from the global reward rate under a policy of accepting the considered pursuit (Equation 8), parceling the world into the contribution from in and outside the considered pursuit type (Equation 9) reveals that the opportunity cost of time arises from the global reward rate achieved under a policy of not accepting the considered pursuit (its outside reward rate), and that the apportionment cost of time arises from the allocation of time spent in, versus outside, the considered pursuit. These equivalent expressions for the normatively-defined (reward-rate maximizing) subjective value of a pursuit give rise to an apparent discounting function that is a hyperbolic function of time, whose hyperbolic component constitutes the apportionment cost, and whose linear component constitutes the opportunity cost of time. By re-expressing reward rate maximization as its apparent temporal discounting function, we demonstrate how fits of hyperbolic discounting, as well as observations of the Magnitude and Sign Effects—commonly taken as signs of suboptimal decision-making—are in fact consistent with optimal temporal decision-making.
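One consistent reading of this equivalence can be checked numerically. In the sketch below (Python, with illustrative values), the first form subtracts time’s cost computed from the global reward rate under an accepting policy, while the second subtracts an opportunity cost plus an apportionment cost; the apportionment term written here is our algebraic reconstruction from the surrounding text. The two forms agree to machine precision:

```python
r_in, t_in = 5.0, 8.0      # considered pursuit (illustrative values)
r_out, t_out = 2.0, 10.0   # reward and time spent outside the pursuit
rho_out = r_out / t_out    # outside reward rate

# Form 1: time's cost from the global rate under a policy of acceptance.
rho_g = (r_in + r_out) / (t_in + t_out)
sv_global = r_in - rho_g * t_in

# Form 2: opportunity cost plus apportionment cost of time.
opportunity = rho_out * t_in
apportionment = (r_in - rho_out * t_in) * t_in / (t_in + t_out)
sv_parceled = r_in - opportunity - apportionment

assert abs(sv_global - sv_parceled) < 1e-12
```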

Sources of error and their consequences

While these added insights enrich our understanding of time’s cost and reveal how purported signs of irrationality can in fact be consistent with a reward-rate maximizing agent, it nonetheless remains true that animals and humans are suboptimal temporal decision makers—exhibiting an “impatience” by selecting smaller, sooner (SS) options in cases where selecting larger, later (LL) options would maximize global reward rate. However, when decisions to accept or reject pursuits are presented in Forgo situations, they are observed to be optimal. As the equivalent immediate reward equations enabling global reward rate optimization may potentially be instantiated by neural representations of their underlying variables, we conjecture that misrepresentation of one or another variable may best explain the particular ways in which observed behavior deviates from, as well as accords with, optimality. Therefore, we now ask what errors in temporal decision-making behavior would result from misestimating these variables, with the aim of identifying the nature of misestimation that best accounts for the pattern actually observed in animals and humans regarding whether to initiate a given pursuit.

To understand how systematic error in an agent’s estimation of different time and/or reward variables would affect its behavior, we examine the agent’s pattern of behavior in both Choice and Forgo decisions across different outside reward rates. First, we ask whether the agent would choose a SS or LL pursuit as in a choice task. Then we ask whether the agent would take or forgo the same LL and SS pursuits when either are presented alone in a forgo task. The actions taken by the agent can therefore be described as a triplet of policies referring to the two pursuits (e.g., choose SS, forgo LL, forgo SS).

Let us first consider how a reward rate optimal agent would transition from one to another pattern of decision-making as outside reward rate increases for the situation of fundamental interest: where the reward rate of the SS pursuit is greater than that of the LL pursuit (Figure 16). When the outside reward rate (slope of golden line) is sufficiently low (Figure 16A), the agent should prefer LL in Choice, be willing to take the LL pursuit in Forgo, and be willing to take the SS pursuit in Forgo (choose LL, take LL, take SS). Here, a “sufficiently low” outside rate is one such that the resulting global reward rate (slope of magenta line) is less than the difference in the reward rates of the SS and LL pursuits. When the outside reward rate increases to greater than this difference in the pursuits’ reward rates but is less than the reward rate of the LL option, the agent should choose SS in Choice and be willing to take either in Forgo (choose SS, take LL, take SS) (Figure 16B). Further increases in outside rate up to that equaling the reward rate of the SS results in the agent selecting the SS in Choice, forgoing LL in Forgo, and taking SS in Forgo (choose SS, forgo LL, take SS) (Figure 16C). Finally, any additional increase in outside rate would result in choosing the SS pursuit under Choice, and forgoing both pursuits in Forgo (choose SS, forgo LL, forgo SS) (Figure 16D). Colored regions thus describe the pattern of decision-making behavior exhibited by a reward rate optimal agent under any combination of outside reward and time.
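The four regimes described above can be reproduced with a short script (a sketch using the pursuit values stated in the figure legend; the function names are our own). A pursuit is worth taking in Forgo exactly when its own reward rate exceeds the outside rate, and Choice goes to whichever pursuit yields the higher global reward rate:

```python
def policies(r_ss, t_ss, r_ll, t_ll, rho_out, t_out):
    # Reward-rate-maximizing policy triplet: (Choice, Forgo LL, Forgo SS).
    def global_rate(r, t):
        # Global reward rate under a policy of accepting the pursuit.
        return (r + rho_out * t_out) / (t + t_out)
    choice = 'LL' if global_rate(r_ll, t_ll) > global_rate(r_ss, t_ss) else 'SS'
    # Taking beats forgoing exactly when the pursuit's rate exceeds the outside rate.
    forgo_ll = 'take' if r_ll / t_ll > rho_out else 'forgo'
    forgo_ss = 'take' if r_ss / t_ss > rho_out else 'forgo'
    return choice, forgo_ll, forgo_ss

# SS: 2.5 units after 2.5 s; LL: 5 units after 8.5 s; outside time 10 s.
args = dict(r_ss=2.5, t_ss=2.5, r_ll=5.0, t_ll=8.5, t_out=10.0)
assert policies(rho_out=0.10, **args) == ('LL', 'take', 'take')    # low outside rate
assert policies(rho_out=0.40, **args) == ('SS', 'take', 'take')
assert policies(rho_out=0.80, **args) == ('SS', 'forgo', 'take')
assert policies(rho_out=1.20, **args) == ('SS', 'forgo', 'forgo')  # high outside rate
```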

Relationship between outside time and reward with optimal temporal decision-making behavioral transitions.

An agent may be presented with three decisions: the decision to take or forgo a smaller, sooner reward of 2.5 units after 2.5 seconds (SS pursuit), the decision to take or forgo a larger, later reward of 5 units after 8.5 seconds (LL pursuit), and the decision to choose between the SS and LL pursuits. The slope of the purple line indicates the global reward rate (ρg) resulting from a Choice or Take policy, while the slope of the “outside” segment (golden line) indicates the outside reward rate (i.e., the global reward rate resulting from a Forgo policy). In each panel (A-D), an example outside reward rate is plotted, illustrating the relative ordering of ρg slopes for each policy. Location in the lower left quadrant is thereby shaded according to the combination of global rate-maximizing policies for each of the three decision types.

With this understanding of the optimal thresholds between behavior policies, we can now examine the impact on decision-making behavior of different types of error in the agent’s understanding of the world (Figure 17). We introduce an error term, ω, such that different parameters impacting the global reward rate of each considered policy are underestimated (ω<1) or overestimated (ω>1) (Figure 17 column 1, see Ap. 11 for formal definitions). Resulting global reward rate misestimations are equivalent to introducing error in the considered pursuit’s subjective value, which will result in various deviations from reward-rate maximization (Figure 17). Conditions wherein overestimation of global reward rate would lead to suboptimal choice behavior are identified formally in Ap. 12.

Patterns of suboptimal temporal decision-making behavior resulting from time and/or reward misestimation.

Patterns of temporal decision-making in Choice and Forgo situations deviate from optimal (top row) under various parameter misestimations (subsequent rows). Characterization of the nature of suboptimality is aided by the use of the outside reward rate as the independent variable influencing decision-making (x-axis), plotted against the degree of error (y-axis) of a given parameter (ω<1 underestimation, ω=1 actual, ω>1 overestimation). The leftmost column provides a schematic exemplifying true outside (gold) and inside (blue) pursuit parameters and the nature of parameter error (dashed red) investigated per row (all showing an instance of underestimation). For each error case, the agent’s resulting choice between SS and LL pursuits (2nd column), decision to take or forgo the LL pursuit (3rd column), and decision to take or forgo the SS pursuit (4th column) are indicated by the shaded color (legend, bottom of columns) for a range of outside rates and degrees of error. The rightmost column depicts the triplet of behavior observed, combined across tasks. Rows: A) “No error” - Optimal choice and forgo behavior. Vertical white lines show outside reward rate thresholds for optimal forgo behavior. B-G) Suboptimal behavior resulting from parameter misestimation. B-D) The impact of outside pursuit parameter misestimation. B) “Outside Time”- The impact of misestimating outside time (and thus misestimating outside reward rate). C) “Outside Reward”- The impact of misestimating outside reward (and thus misestimating outside reward rate). D) “Outside Time & Reward”- The impact of misestimating outside time and reward, but maintaining outside reward rate. E-G) The impact of inside pursuit parameter misestimation. E) “Pursuit Time”- The impact of misestimating inside pursuit time (and thus misestimating inside pursuit reward rate). F) “Pursuit Reward” - The impact of misestimating the pursuit reward (and thus misestimating the pursuit reward rate).
G) “Pursuit Time and Reward” - The impact of misestimating the pursuit reward and time, but maintaining the pursuit’s reward rate. For this illustration, we determined the policies for a SS pursuit of 2 reward units after 2.5 seconds, a LL pursuit of 4.75 reward units after 8 seconds, and an outside time of 10 seconds. The qualitative shape of each region and resulting conclusions are general for all situations where the SS pursuit has a higher rate than the LL pursuit (and where a region exists where the optimal agent would choose LL at low outside rates).

The sources of error considered are mis-estimations of the reward obtained and/or time spent “outside” (rows B-D) and “inside” (rows E-G) the considered pursuit. When both reward and time are misestimated, we examine the case in which the reward rate of that portion of the world is maintained (rows D & G). The agent’s resulting policies in Choice (second column) and both Forgo situations (third and fourth columns) are determined across a range of outside reward rates (x-axes) and degrees of parameter misestimation (y-axes) and color-coded, with the boundary between the colored regions indicating the outside reward rate threshold for transitions in the agent’s behavior. These individual policies are collapsed into the triplet of behavior expressed across the decision types (fifth column). In this way, characterization of the nature of suboptimality is aided by the use of the outside reward rate as the independent variable influencing decision-making, with the outside reward rate thresholds for optimal behavior being compared to the outside reward rate thresholds under any given parameter misestimation (comparing top “optimal” row A, against any subsequent row B-G). Any deviations in this pattern of behavior from that of the optimal agent (row A) are suboptimal, resulting in a failure to maximize reward rate in the environment.

While misestimation of any of these parameters will lead to suboptimal behavior, only specific sources and directions of error may result in behavior that qualitatively matches human and animal behavior observed experimentally. Misestimation of outside time (B), outside reward (C), inside time (E), and inside reward (F) all display Choice behavior that is qualitatively similar to experimentally observed behavior, either via underestimation or overestimation of the key variable. For example, underestimation of the outside time (B, ω<1) leads to selection of the SS pursuit at sub-optimally low outside reward rates. However, agents with these types of error never display optimal Forgo behavior. By contrast, misestimation of either outside time and reward (D) or inside time and reward (G) displays suboptimal Choice while maintaining optimal Forgo. Specifically, underestimation of outside time and reward (D, ω<1) and overestimation of inside time and reward (G, ω>1) both result in suboptimal preference for SS at low outside rates. Therefore, and critically, if the rates of both inside and outside are maintained despite misestimating reward and time magnitudes, the resulting errors allow for optimal Forgo behavior while displaying suboptimal “impatience” in Choice, and thus match experimentally observed behavior.
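A minimal simulation of this contrast (Python; the error model ω and the pursuit values from the figure legend above are used illustratively, and the helper names are our own):

```python
def behavior(omega_t, omega_r, rho_out, t_out=10.0):
    # Policy under misestimated outside time/reward (omega = 1.0: veridical).
    # SS: 2 units after 2.5 s; LL: 4.75 units after 8 s (values from the text).
    t_hat = omega_t * t_out                    # believed outside time
    r_hat = omega_r * rho_out * t_out          # believed outside reward
    rho_hat = r_hat / t_hat                    # believed outside rate
    def rate(r, t):                            # believed global rate if taken
        return (r + r_hat) / (t + t_hat)
    choice = 'LL' if rate(4.75, 8.0) > rate(2.0, 2.5) else 'SS'
    forgo_ll = 'take' if 4.75 / 8.0 > rho_hat else 'forgo'
    return choice, forgo_ll

# At a low outside rate the optimal agent chooses LL and takes LL in Forgo.
assert behavior(1.0, 1.0, rho_out=0.1) == ('LL', 'take')
# Underestimating outside time alone inflates the believed outside rate:
# Forgo itself becomes suboptimal (the LL pursuit is wrongly forgone).
assert behavior(0.1, 1.0, rho_out=0.1)[1] == 'forgo'
# Underestimating outside time AND reward together preserves the outside
# rate: Forgo stays optimal, yet Choice shows the observed impatience.
assert behavior(0.1, 0.1, rho_out=0.1) == ('SS', 'take')
```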

Discussion

In order to understand why humans and animals factor time the way they do in temporal decision-making, our initial step has been to understand how a reward-rate maximizing agent would evaluate the worth of initiating a pursuit within a temporal decision-making world. We did so in order to identify what are and are not signs of suboptimality and to gain insight into how animals’ and humans’ valuation of pursuits actually deviates from optimality. By analyzing fundamental temporal decisions, we identified equations enabling reward-rate maximization that evaluate the worth of initiating a pursuit. We first considered Forgo decisions to appreciate that a world can be parcellated into its constituent pursuits, revealing how pursuits’ rates and relative occupancies (their ‘weights’), along with the decision policy, determine the global reward rate. In doing so, we derived an expression for the worth of a pursuit in terms of the resulting global reward rate, and from it, re-expressed the pursuit’s worth in terms of its global reward rate-equivalent immediate reward, i.e., its ‘subjective value’. We further show that time’s cost, rather than being calculated from the global reward rate under a policy of accepting the considered pursuit, can equally be calculated in terms of the outside reward rate and time (a policy of not accepting the considered pursuit type). Expressing subjective value in terms of a pursuit’s outside reward rate and time reveals that time’s cost is constituted by an apportionment cost, as well as an opportunity cost. By then examining Choice decisions, we provide a deeper understanding of the nature of apparent temporal discounting in reward rate maximizing agents and establish that hyperbolic discounting, the Magnitude Effect, and the Sign Effect are not signs of suboptimal decision-making, but rather are consistent with reward-rate maximization.
While these purported signatures of suboptimality would in fact arise from reward-rate maximization, humans and animals are, nonetheless, suboptimal temporal decision makers, exhibiting apparent discounting functions that are too steep. By examining misestimation of the parameters that enable reward-rate maximization identified here, we implicate overestimation of the relative time spent in versus outside the considered pursuit type as the likely source of error committed by animals and humans in temporal decision-making that underlies their suboptimal pursuit valuation. We term this “The Malapportionment Hypothesis”.

Temporal decision-making theories and frameworks

Two theories have predominated over the course of theorizing about how animals should invest time when pursuing rewards of a diversity of magnitudes and delays: a theory of exponential discounting (Samuelson, 1937; Frederick et al., 2002; Kalenscher and Pennartz, 2008), and a theory of optimal foraging (Charnov, 1976b; Pyke et al., 1977; Stephens and Krebs, 1986; Stephens, 2008). According to the former, exhibiting a permanent preference for one option over another through time was argued to be rational (Montague and Berns, 2002; Mazur, 2006; Nakahara and Kaveri, 2010), as in Discounted Utility Theory (DUT) (Samuelson, 1938). Discounting functions operating under this principle would then be exponential, with the best fit exponent controlling and embodying the agent’s appreciation of the cost of time. In contrast, Optimal Foraging Theory (OFT) invoked reward rate maximization as the normative principle. Referenced by a wide assortment of ethologists and ecologists (for review see (Pyke, 1984)), the specific formulation proponents of OFT generally use would result in an apparent discounting function that is hyperbolic. Indeed, in controlled laboratory experiments in which animals make decisions about how to spend time between rewarding options (Hariri et al., 2006; Hayden et al., 2011; Wikenheiser et al., 2013; Blanchard and Hayden, 2014, 2015; Carter et al., 2015; Carter and Redish, 2016), experimental observations have demonstrated that hyperbolic functions are better fits to choice behavior in intertemporal choice tasks than exponential functions (Ainslie, 1975; Thaler and Shefrin, 1981; Frederick et al., 2002; Green and Myerson, 2004; Kim et al., 2008; Blanchard and Hayden, 2015). Nonetheless, and problematically for OFT, in most intertemporal choice tasks, animal behavior is far from optimal for maximizing reward rate (Reynolds and Schiffbauer, 2004; Hayden et al., 2011; Blanchard et al., 2013; Blanchard and Hayden, 2015).

Hyperbolic Temporal Discounting Functions

Indeed, with respect to global reward rate maximization, animals and humans typically exhibit much too great a preference for smaller-sooner rewards (SS) in apparent discounting of delayed rewards (Chung and Herrnstein, 1967; Rachlin et al., 1972; Ainslie, 1974; Thaler, 1981; Ito and Asaki, 1982; Grossbard and Mazur, 1986; Mazur, 1988; Benzion et al., 1989; Loewenstein and Prelec, 1992; Green et al., 1994; Bateson and Kacelnik, 1996; Kacelnik and Bateson, 1996; Cardinal et al., 2001; Stephens and Anderson, 2001; Bennett, 2002; Frederick et al., 2002; Holt et al., 2003; Winstanley et al., 2004; Kalenscher et al., 2005; Roesch et al., 2007; Kobayashi and Schultz, 2008; Louie and Glimcher, 2010; Pearson et al., 2010). More precisely, what is meant by this suboptimal bias for SS is that the switch in preference from LL to SS occurs at an outside reward rate that is lower—and/or an outside time that is less—than what an optimal agent would exhibit. To account for this departure from optimality, a free-fit parameter, k, controlling the steepness of temporal discounting was introduced, accommodating the variability observed across and within subjects; it is commonly interpreted as a psychological trait, such as patience, or willingness to delay gratification (Ainslie, 1975).

In this way, the Discounting Function framework has often been reified into a function possessed by the brain, an intrinsic property used to reduce, in a manner idiosyncratic to the agent, the value of delayed reward. Indeed, discounting functions have been directly incorporated into numerous models (Nakahara and Kaveri, 2010; Kane et al., 2019), motivating the search for its neurophysiological signature (Montague et al., 2006). In addition to accommodating intra- and inter-subject variability through the use of this free-fit parameter, discounting function formulations must also contend with the fact that best fits differ in steepness 1) when the time spent and reward gained outside the pursuit changes (Lea, 1979; Stephens and Dunlap, 2009; Blanchard et al., 2013; Blanchard and Hayden, 2015; Carter et al., 2015; Smethells and Reilly, 2015; Carter and Redish, 2016), 2) when the reward magnitude of the pursuit changes (the Magnitude Effect), and 3) when considering the sign of the outcome of the pursuit (the Sign Effect). This sensitivity to conditions and variability across and within subjects has spurred a hunt for the ‘perfect’ discounting function (Namboodiri and Hussain Shuler, 2016) in an effort to better fit behavioral observations, resulting in formulations of increasing complexity (Laibson, 1997; McClure et al., 2004; al-Nowaihi and Dhami, 2008; Killeen, 2009). While such accommodations may provide for better fits of data, the uncertain origins of discounting functions (Hayden, 2016) pose a challenge to the utility of this framework in rationalizing observed behavior.

The apparent discounting function of global reward-rate optimal agents exhibits purported signs of suboptimality

Of the array of temporal decision-making behaviors commonly observed and viewed through the lens of discounting, what might be better accounted for by a deeper understanding of how a reward rate optimal agent would evaluate the worth of initiating a pursuit? To address this, we derived expressions of reward rate maximization, translated them into subjective value, and then re-expressed subjective value in terms of the apparent discounting function that would be exhibited by a reward-rate maximizing agent. We demonstrate that a simple and intuitive equation subtracting time’s cost is equivalent to a hyperbolic discounting equation. This analysis determines that the form and sensitivity to conditions that temporal discounting is experimentally observed to exhibit would actually be expressed by a reward-rate maximizing agent. In doing so, we emphasize how discounting functions should be considered as descriptions of the result of a process, rather than being the process itself.

Regarding form, our analysis reveals that the apparent discounting function of a reward-rate maximizing agent is a hyperbolic function. The diminishment of the value of a pursuit as its time investment increases is thus due to time’s cost―itself hyperbolic―which is shown to be composed of an apportionment (hyperbolic – linear) as well as an opportunity cost (linear) (Figure 18 & Table 1, right column).

Opportunity cost, apportionment cost, time cost, and subjective value functions by change in outside and inside reward and time.

Functions assume positive inside and outside rewards and times. ¹If outside reward rate is zero, opportunity cost becomes a constant at zero. ²If outside reward rate is zero, as outside or inside time is varied, apportionment cost becomes purely hyperbolic.

The cost of time of a pursuit comprises both an opportunity as well as an apportionment cost.

The global reward rate under a policy of accepting the considered pursuit type (slope of the magenta line), times the time that that pursuit takes (tin), is the pursuit’s time’s cost (height of maroon bar). The subjective value of a pursuit (height of green bar) is its reward magnitude (height of the purple bar) less its cost of time. Opportunity and apportionment costs are shown to compose the cost of time of a pursuit. Opportunity cost associated with a considered pursuit, ρout*tin, (height of orange bar) is the reward rate of the world under a policy of not accepting the considered pursuit (its outside rate), ρout, times the time of the considered pursuit, tin. The amount of reward that would be (on average) obtained over the time of accepting the considered pursuit—were there to be no opportunity cost—is the apportionment cost of time (height of brown bar).

In addition to demonstrating the form of the discounting function of an optimal agent, we can now also rationalize why it would appear to change in relationship to the features of the temporal decision-making world. First, rather than being a free-fit parameter like k in hyperbolic discounting models (Figure 19A), the reciprocal of the time spent outside the considered pursuit type controls the degree of curvature in reward-rate optimizing agents (Figure 19B, denominator). Therefore, changes in the apparent ‘willingness’ of a reward-rate optimal agent to wait for reward would accompany any change in the amount of time that that agent needs to spend outside the considered pursuit, making the agent act as if more patient the greater the time spent outside a pursuit for every instance it spends within it.
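This dependence on the outside time is easy to exhibit numerically (a sketch with illustrative values of our own choosing): holding the outside reward rate fixed, increasing the time spent outside the pursuit flattens the apparent discounting curve, making the agent act as if more patient.

```python
def apparent_discount(t_in, rho_out, t_out, r_in=5.0):
    # Apparent discounting function of a reward-rate-maximizing agent.
    sv = (r_in - rho_out * t_in) * t_out / (t_in + t_out)
    return sv / r_in

# At a fixed delay, the discounted fraction rises monotonically with the
# time spent outside the considered pursuit (outside rate held constant).
curves = [apparent_discount(t_in=6.0, rho_out=0.2, t_out=t) for t in (2, 5, 10, 40)]
assert curves == sorted(curves)
```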

Comparison of typical hyperbolic discounting versus apparent discounting of a reward-rate optimal agent.

Whereas (A) the curvature of hyperbolic discounting models is typically controlled by the free fit parameter k, (B) the curvature and steepness of the apparent discounting function of a reward rate optimal agent is controlled by the time spent and reward rate obtained outside the considered pursuit. Understanding the shape of discounting models from the perspective of a reward-rate optimal agent reveals that k ought to relate to the apportionment of time spent in, versus outside, the considered pursuit, underscoring how typical hyperbolic discounting models fail to account for the opportunity cost of time (and thus cannot yield negative sv’s no matter the temporal displacement of reward). Should k be understood as representing time’s apportionment cost, the failure to account for the opportunity cost of time would lead to aberrantly high values of k.

Second, discounting frameworks must also rationalize why the apparent steepness of discounting changes as the reward rate acquired outside the considered pursuit changes, which we show here to be related to the linear opportunity cost of time in a reward rate maximizing agent (Figure 19B, subtraction of opportunity cost occurring in the numerator). The greater the opportunity cost of time, the steeper the apparent discounting function, and the less patient the agent would appear to be, even forgoing pursuits resulting in reward (when their acceptance would yield rates less than the outside rate, i.e., when sv < 0). Hyperbolic discounting functions that lack a proper accounting of the opportunity cost cannot then fit negative subjective values, and thus must compensate by overestimating k (which rightfully should only relate to the apportionment cost). In this way, such hyperbolic discounting models are only appropriate in worlds with no “outside” reward, or, where being in a pursuit does not exclude the agent from receiving rewards at the rate that occurs outside of it (Ap. 13).
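A crude numerical check of this compensation (Python; the grid search below is a stand-in for a proper fitting routine, and all values are illustrative): fitting a pure hyperbola r/(1 + kt), which has no opportunity-cost term, to the subjective values of a reward-rate-maximizing agent facing a positive outside rate forces the fitted k above the apportionment-only value of 1/t_out.

```python
def sv_optimal(t, r=5.0, rho_out=0.3, t_out=10.0):
    # sv of a reward-rate-maximizing agent (can go negative at long delays).
    return (r - rho_out * t) * t_out / (t + t_out)

def sv_hyperbolic(t, k, r=5.0):
    # Pure hyperbolic model lacking an opportunity cost (always positive).
    return r / (1 + k * t)

# Crude grid-search fit of k to the optimal agent's sv at sample delays.
delays = [1, 2, 4, 6, 8, 10, 12]
def sse(k):
    return sum((sv_hyperbolic(t, k) - sv_optimal(t)) ** 2 for t in delays)
k_fit = min((k / 1000 for k in range(1, 2001)), key=sse)

# Apportionment alone would give k = 1/t_out = 0.1; ignoring the opportunity
# cost inflates the fitted k beyond that value.
assert k_fit > 1 / 10.0
```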

Third and fourth, discounting frameworks must make an accounting of the Magnitude Effect and Sign Effect, respectively, as they are considered important “anomalous” departures from microeconomic theory (Loewenstein and Thaler, 1989). To do so, rationalizations from previous work have invoked additional assumptions, such as separate processes for small and large rewards (Thaler, 1981), or the inclusion of a utility function (Loewenstein and Prelec, 1992b; Killeen, 2009). We demonstrate here how the ‘Magnitude Effect’ would be a natural consequence of a process that would maximize reward rate, without invoking specialized processes or additional functions. This analysis predicts that the size of the Magnitude Effect would be observed, experimentally, to diminish the greater the outside time and/or the smaller the outside reward rate. Whereas discounting frameworks need invoke separate discounting functions to contend with different discounting rates for positive (rewarding) and negative (punishing) outcomes of the same magnitude (the Sign Effect), here too, we demonstrate how this is consistent with a reward-rate maximizing process, wherein the asymmetry in the steepness of apparent discounting to rewards and punishments results from the average time and magnitude of rewards (or punishments) received outside the considered pursuit. The average of rewards and punishments experienced outside the considered pursuit type thus forms a bias in evaluating equivalently sized outcomes of opposite sign. From the global reward-rate maximizing perspective, we then also predict that the size of the Sign effect would diminish as the outside reward rate decreases (and as the outside time increases), and in fact would invert should the outside reward rate turn negative (become net punishing), such that punishments would appear to discount more steeply than rewards.

Collectively, our analysis of discounting functions reveals that features typically taken as signs of suboptimal/irrational decision-making are, in fact, consistent with reward-rate maximization. In this way, the general form and sensitivity to conditions of discounting functions, as observed experimentally, can be better understood from the perspective of a reward-rate optimal agent (Table 1), providing a more parsimonious accounting of a confusing array of temporal decision-making behaviors reported.

Humans and animals are nonetheless suboptimal. What is the nature of this suboptimality?

These insights into the behavior of a reward-rate maximizing agent inform on the meaning of the concept “patience”. Patience oughtn’t imply a willingness to wait a longer time, as it is not correct to say that an agent that chooses a pursuit requiring a long time investment is more patient than one that does not, for the amount of time a reward-rate maximizing agent is willing to invest isn’t an intrinsic property of the agent itself. Rather, it is a consequence of the temporal decision-making world’s reward-time structure. So, if patience is to mean investing the ‘correct’ amount of time (i.e., the reward-rate maximizing time), then a reward-rate optimal agent doesn’t become more or less patient as the context of what is otherwise the same pursuit changes; rather, it is precisely patient, under all circumstances. Impatience and over-patience, then, are terms to describe the behavior of a global reward-rate suboptimal agent that invests either too little or too much time into a pursuit relative to the policy that would maximize global reward rate.

Having clarified what behaviors are and are not signs of suboptimality, actual differences from optimal performance exhibited by humans and animals can now be identified and quantified. So, what then are the decision-making behaviors of humans and animals when tasked with valuing the initiation of a pursuit, as in forgo and choice decisions? In controlled experimental situations, forgo decision-making is observed to be near optimal, consistent with observations from the field of behavioral ecology (Krebs et al., 1977; Stephens and Krebs, 1986; Blanchard and Hayden, 2014). In contrast, a suboptimal bias for smaller-sooner rewards is widely reported in Choice decision-making in situations where selection of later-larger rewards would maximize global reward rate (Logue et al., 1985; Blanchard and Hayden, 2015; Carter and Redish, 2016; Kane et al., 2019). Collectively, the pattern of temporal decision-making behavior observed under forgo and choice decisions shows that humans and animals act as if sub-optimally impatient under choice, while exhibiting near-optimal decision-making under forgo decisions.

The Malapportionment Hypothesis

How can animals and humans be sub-optimally impatient in choice, but optimal in forgo decisions? We postulated that previous behavioral findings of suboptimality can be understood from the perspective of overestimating the global reward rate. While misestimation of any variable underlying global reward rate calculation will lead to errors, not all misestimations will lead to errors that match the behavioral pattern of decisions observed experimentally. Having identified equations and their variables enabling reward-rate maximization, we sought to identify the likely source of error committed by animals and humans by analyzing the pattern of behavior consequent to misestimating one or another parameter. To do so, we identified the reward rate obtained outside a considered pursuit type as a useful variable to characterize departure from optimal decision-making behavior. Sweeping over a range of these values as the independent variable, we determined change points in decision-making behavior that would arise from misestimation (over- and under-estimations) of given reward-rate maximizing parameters.

Our analysis shows precisely how misestimation of the inside or outside time or reward leads to suboptimal temporal decision-making behavior. What errors, however, result in decisions that best accord with what is observed experimentally (i.e., suboptimal impatience in choice together with optimal forgo decision-making)? Overestimating outside time, underestimating outside reward, underestimating inside time, or overestimating inside reward would fail to match suboptimal ‘impatience’ in choice and would result in suboptimal forgo. Underestimating outside time, overestimating outside reward, overestimating inside time, or underestimating inside reward would match experimentally observed ‘impatience’ in choice, but fail to match experimentally observed optimal forgo behavior. To exhibit optimal forgo behavior, the inside and outside reward rates must be accurately appreciated. Therefore, misestimations of reward and time that preserve the true reward rates inside and outside the pursuit would permit optimal forgo decisions while still misestimating the global reward rate. Overestimating the outside time or underestimating the inside time, while maintaining reward rates, fails to match experimentally observed ‘impatience’ in choice tasks even though it achieves optimal forgo decisions. However, underestimating the outside time or overestimating the inside time, while maintaining the true inside and outside reward rates, would allow optimal forgo decision-making while producing impatient choice behavior, as experimentally observed.
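This pattern can be sketched numerically. The code below is a minimal illustration, with invented numbers and variable names, assuming the global reward rate takes the form identified in the appendices (reward in plus reward out, over time in plus time out): underestimating the outside time while keeping the outside reward rate accurate inflates the estimated global reward rate, flipping choice from larger-later to smaller-sooner, while leaving forgo decisions (a pure rate comparison) intact.

```python
# Sketch (illustrative numbers, not from the text): an agent that underestimates
# the time spent outside a pursuit, while holding the outside reward RATE
# accurate, overestimates the global reward rate under each policy.

def global_rate(r_in, t_in, rho_out, t_out):
    # Global reward rate under a policy of taking the pursuit:
    # reward in plus reward out, over time in plus time out.
    return (r_in + rho_out * t_out) / (t_in + t_out)

rho_out, t_out_true = 0.2, 10.0        # accurate outside rate and time
SS, LL = (1.0, 1.0), (4.0, 8.0)        # (reward, time) pairs

# Accurate apportionment: the larger-later policy yields the higher global rate.
ll_optimal = global_rate(*LL, rho_out, t_out_true) > global_rate(*SS, rho_out, t_out_true)

# Malapportionment: outside time underestimated (10 -> 2), outside rate intact.
t_out_err = 2.0
ss_chosen = global_rate(*SS, rho_out, t_out_err) > global_rate(*LL, rho_out, t_out_err)

# Forgo is untouched: it compares the pursuit's rate to the outside rate,
# and both rates are still accurately estimated.
take_ss_in_forgo = (SS[0] / SS[1]) > rho_out

print(ll_optimal, ss_chosen, take_ss_in_forgo)  # all True
```

Note that the flip in choice arises purely from the misweighted time apportionment; no reward magnitude or rate was misestimated.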

Previous experimental observations are consistent with, and have been interpreted as, an agent underestimating the time spent outside the considered pursuit (Stephens and Dunlap, 2009; Blanchard et al., 2013; Smethells and Reilly, 2015), as would occur with underestimation of post-reward delays (Stephens and Dunlap, 2009; Smethells and Reilly, 2015; Hayden, 2016). Therefore, observed behavioral errors point to misestimating time apportionment in/outside the pursuit, either by 1) overestimating the occupancy of the considered choice or 2) underestimating the time spent outside the considered pursuit type, but not by 3) a misestimation of either the inside or outside reward rate. Only errors in time apportionment that underweight the outside time (or, equivalently, overweight the inside time), while maintaining the true inside and outside reward rates, will accord with experimentally observed temporal decision-making regarding whether to initiate a pursuit.

Thus, when a temporal decision-making world can effectively be bisected into two components, as is often the case in experimental situations, only the reward rates of those portions, and not their weights, need be accurately appreciated for the agent to perform forgo decisions optimally. Therefore, when tested in such situations, even agents that misestimate the apportionment of time can make optimal forgo decisions based solely on a comparison of the reward rate inside versus outside the pursuit. However, when faced with a choice between two or more pursuits emerging from a path common to every choice policy, optimal pursuit selection based on relative rate comparisons is no longer guaranteed: not only the reward rates of pursuits, but their weights as well, must then be accurately appreciated. Misestimation of the weights of the pursuits comprising a world then results in errors in valuing the initiation of a pursuit under choice instances. We term this reckoning of the source of error committed by animals and humans the Malapportionment Hypothesis, which identifies the underweighting of time spent outside versus inside a considered pursuit, but not the misestimation of pursuit rates, as the root of the error (Figure 20). This hypothesis therefore captures previously published behavioral observations showing that animals can make decisions to take or forgo reward options that optimize reward accumulation (Krebs et al., 1977; Stephens and Krebs, 1986; Blanchard and Hayden, 2014), but make suboptimal decisions when presented with simultaneous and mutually exclusive choices between rewards of different delays (Blanchard and Hayden, 2015; Calhoun and Hayden, 2015; Carter and Redish, 2016).

The Malapportionment Hypothesis.

The Malapportionment Hypothesis holds that suboptimal decision-making, as revealed under choice decision-making, arises in humans and animals as a consequence of the valuation process underweighting the contribution of accurately assessed reward rates outside versus inside the considered pursuit type. A) An example choice situation in which the global reward rate is maximized by choosing a larger-later reward over a smaller-sooner reward. B) An agent that underweights the outside time but accurately appreciates the outside and inside reward rates overestimates the global reward rate resulting from each policy, and thus exhibits suboptimal impatience by selecting the smaller-sooner reward. C) Similarly, an agent that overweights the time inside the considered pursuit but accurately appreciates the outside and inside reward rates also overestimates the global reward rate and selects the smaller-sooner reward. Because the inside and outside reward rates are accurately assessed, forgo decisions can still be made correctly despite any misappreciation of the relative time spent in/outside the considered pursuit.

Comparisons to prior models

As our description of global reward-rate-optimizing valuation is motivated by the same normative principle, how does our formalism differ from OFT and, more generally, from other models proposing some form of reward-rate maximization? First, the specific formulation used by proponents of OFT fails to adequately recognize how outside rewards influence the value of considered pursuits. Additionally, the relationship between time’s cost and apparent temporal discounting has not been explicitly identified in prior OFT explanations. By contrast, our formulation, because of its specificity, can potentially align with neural representations of the variables we propose, and misestimations of those variables may explain the ways in which observed animal behavior deviates from optimality. Models inspired by OFT’s objective of global reward rate maximization that seek a better accounting of observed deviations concede that, while global reward rate maximization is sought, it is not achieved; rather, the agent maximizes some non-global reward rate (Bateson and Kacelnik, 1996; Blanchard et al., 2013; Namboodiri et al., 2014b; Fung et al., 2021). Of particular interest, the TIMERR model (Namboodiri et al., 2014c) and the Heuristic model (Blanchard et al., 2013) both assume non-global reward-rate maximization.

TIMERR Model

The essential feature of the TIMERR model (Namboodiri et al., 2014b) is that the agent looks back into its near past to estimate the reward rate of the environment, with this ‘look-back’ time, Time, being the model’s free-fit parameter. In contrast to the reward-rate-optimal agent, this look-back time is not a basic feature of the external world, but rather relates to how the animal uses its experience. TIMERR’s policy is then determined by the reward rate obtained across this interval and that of the considered pursuit. In this way, TIMERR includes sources outside of the considered pursuit type in its evaluation and, because of this, exhibits many of the behaviors that the reward-rate-optimal agent is demonstrated here to express (Namboodiri et al., 2014a, 2014b, 2014c; Shuler and Namboodiri, 2018). Indeed, the TIMERR model and the optimal agent share the same mathematical form, though, critically, the meaning of their terms differs. An important additional difference lies in the manner in which TIMERR uses reward obtained outside the current instance of the considered pursuit: because recently experienced rewards contribute to the estimate of the environment’s average reward rate, the ‘look-back’ window can include rewards from the pursuit type currently under consideration. Therefore, TIMERR overestimates the outside reward rate, and thus the global reward rate, manifesting as suboptimal impatience in both choice and forgo decisions. So, while TIMERR is appealing in assuming that the recent past is used to estimate the global reward rate, and reproduces a number of behaviorally observed sensitivities, it is not in accordance with the Malapportionment Hypothesis, as it mistakes pursuits’ rates as well as their weights.
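The “same mathematical form, different meaning of terms” point can be made concrete. The sketch below assumes TIMERR’s subjective value can be written sv = (r − a·t)/(1 + t/T), with T the look-back time and a the reward rate experienced over it (per Namboodiri et al., 2014); the optimal agent’s sv takes the same algebraic form with the true outside reward rate and outside time in those slots. All numbers are invented for illustration.

```python
# One function serves both models; only the interpretation of the two
# non-pursuit terms differs (an assumed but standard writing of TIMERR's sv).

def sv_shared_form(r, t, rate_term, time_term):
    # Shared form: sv = (r - rate_term * t) / (1 + t / time_term)
    return (r - rate_term * t) / (1 + t / time_term)

# Optimal agent: rate_term is the true outside reward rate, time_term the
# true time spent outside the pursuit.
sv_optimal = sv_shared_form(4.0, 8.0, rate_term=0.2, time_term=10.0)

# TIMERR: rate_term is the recently experienced rate (which can fold in the
# considered pursuit's own rewards, inflating it), time_term the free-fit
# look-back time.
sv_timerr = sv_shared_form(4.0, 8.0, rate_term=0.35, time_term=6.0)

# Inflating the non-pursuit rate term lowers sv for delayed rewards,
# producing TIMERR's characteristic impatience.
print(round(sv_optimal, 3), round(sv_timerr, 3))
```

The design point is that identical algebra can host different psychological claims: the optimal agent’s terms are facts about the world, whereas TIMERR’s are facts about how experience is sampled.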

Heuristic Model

In the “Heuristic” model (Blanchard et al., 2013), as in Ecological Rationality Theory (ERT; Stephens et al., 2004), animals are thought to prioritize the local reward rate of considered pursuits rather than the global reward rate. In the Heuristic model, however, suboptimal “impatience” is rationalized as a consequence of the animal’s inability to fully appreciate post-reward delays (time subsequent to reward until re-entry into states/pursuits common to one or another policy). Indeed, while animals are demonstrably sensitive to post-reward delays, they act as if they significantly underestimate the post-reward delays incurred, exhibiting a suboptimal bias for SS pursuits when LL pursuits would maximize global reward rate (Blanchard et al., 2013). Through a parameter, ω, which adjusts the degree to which post-reinforcer delays are underestimated, the Heuristic model is sufficient to capture observed animal behavior in intertemporal choice tasks (Blanchard et al., 2013). However, as the Heuristic model is quite specific as to the source of error, namely the underestimation of post-reward delays, it would fit observed behavior well only under certain experimental conditions. Should appreciable 1) reward be obtained or 2) time be spent outside of a considered pursuit type and its post-reward interval, the Heuristic model would fail to make a good accounting of observed behavior.
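The role of ω can be sketched as follows. This parameterization, valuing each option by its local rate with the post-reward delay scaled by ω, is an illustrative simplification rather than necessarily Blanchard et al.’s exact fitting equation, and the numbers are invented.

```python
# Illustrative parameterization of the Heuristic model's omega: options are
# valued by local reward rate, with the post-reward delay downweighted.

def heuristic_rate(r, t_delay, t_post, omega):
    # omega = 1: post-reward delay fully counted; omega = 0: ignored entirely.
    return r / (t_delay + omega * t_post)

SS = dict(r=1.0, t_delay=1.0, t_post=9.0)   # trial lengths equalized:
LL = dict(r=4.0, t_delay=8.0, t_post=2.0)   # 10 s total either way

full = (heuristic_rate(**SS, omega=1.0), heuristic_rate(**LL, omega=1.0))
# Full accounting compares 1/10 vs 4/10: LL wins, matching rate maximization.

discounted = (heuristic_rate(**SS, omega=0.1), heuristic_rate(**LL, omega=0.1))
# Underweighted post-reward delays compare 1/1.9 vs 4/8.2: SS wins --
# the suboptimal impatience the model was built to capture.
print(full, discounted)
```

Because the error is pinned entirely on post-reward delays, the fit degrades as soon as appreciable reward or time falls outside the pursuit and its post-reward interval, as the text notes.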

The Heuristic model can be modified to specify the uniform downscaling of all non-pursuit intervals (rather than just post-reward delays), as in the implementation by Carter and Redish (Carter and Redish, 2016). This modification would bring the Heuristic model closer into alignment with the Malapportionment Hypothesis. But, as temporal underestimation would not apply to pursuits occurring outside the currently considered one, fits to observed behavior would be strained in worlds composed predominantly of pursuits with little non-pursuit time. Further, by underestimating the time spent outside the considered pursuit without a corresponding underestimation of reward earned outside the considered pursuit, the Heuristic model ought to overestimate the outside reward rate and thus the global reward rate.

So, while impatience under choice could be fit under some experimental circumstances, behavior under forgo instances would then also be expected to be sub-optimally impatient. Therefore, to bring the Heuristic model fully into alignment with the Malapportionment Hypothesis, it must further be assumed that the reward rate of the considered pursuit is compared against the true outside or true global reward rate of the environment (Carter and Redish, 2016), and the model must be expanded to incorporate all intervals of time occurring outside a considered pursuit.

Conclusion

An enriched understanding of how a reward-rate-optimal agent evaluates temporal decisions empowers insight into the nature of human and animal valuation. It does so not by advancing the claim that we are optimal, but by clarifying what are and are not signs of optimality, which then permits quantification of the intriguing pattern of adherence to and deviation from this normative expectation. Therein lie clues for deducing the learning algorithm and representational architecture used by brains to attribute value to representations of the temporal structure of the world. Here we have conceptualized and generalized temporal decision-making worlds as composed of pursuits, described by their rates and weights, and in so doing have come to better appreciate the cost of time, how policies impact the reward rates reaped from those worlds, and how processes that fail to accurately appreciate those features would misvalue the worth of initiating pursuits. To reckon with the curious pattern of behavior observed regarding whether to initiate a pursuit, we propose the Malapportionment Hypothesis, which identifies a failure to accurately appreciate the weights, rather than the rates, of pursuits as the root cause of the errors made. We postulate that the value-learning algorithm and representational architecture selected for by evolution have favored the ability to appreciate the reward rates of pursuits over that of their weights.

Appendices

Ap 1. Derivation of equation for global reward rate given a menu of options

E(r): the expected reward magnitude for each reward opportunity

E(t): the expected time between the initiation of reward pursuits

global reward rate: the average reward per pursuit divided by the average time per pursuit.

ρd: the average rate of collecting rewards while in the default pursuit

pi : reward opportunities i as a proportion of total pursued rewards

the average reward received per reward opportunity

the average time invested per reward opportunity

: the average time spent in the default pursuit between reward opportunities

: the average reward received in the default pursuit between reward opportunities

the global reward rate of the reward opportunity landscape
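The display equation this appendix derives did not survive extraction. A reconstruction assembling the quantities defined above is given below; the symbol choice $\bar{t}_d$ for the average default-pursuit time is mine, since the original symbols were stripped, and the default pursuit’s average reward per opportunity is written as $\rho_d\,\bar{t}_d$.

```latex
\rho_g \;=\;
\frac{\sum_i p_i\, E(r_i) \;+\; \rho_d\,\bar{t}_d}
     {\sum_i p_i\, E(t_i) \;+\; \bar{t}_d}
```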

Ap 2. Average time spent, tout, outside the considered pursuit type, in, and the average reward rate, ρout, earned outside that pursuit type

ρout is the reward rate achieved from all the time spent outside the considered pursuit, in, which is also the reward rate achieved if the considered pursuit, in, is never pursued.

Ap 3. Reformulation of global reward rate in terms of ρout and tout
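The reformulated expression itself was stripped in extraction; term-by-term, using Ap 2’s ρout and tout, it should read:

```latex
\rho_g \;=\; \frac{r_{in} + \rho_{out}\, t_{out}}{t_{in} + t_{out}}
```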

Ap 4. Global reward rate is a weighted average of an option’s reward rate and its outside reward rate
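The weighted average stated by this title, reconstructed: each reward rate is weighted by the fraction of total time apportioned to it, which is algebraically equivalent to the Ap 3 form since $\rho_{in} t_{in} = r_{in}$.

```latex
\rho_g \;=\; \rho_{in}\,\frac{t_{in}}{t_{in}+t_{out}}
        \;+\; \rho_{out}\,\frac{t_{out}}{t_{in}+t_{out}}
```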

Ap 5. Derivation of reward-rate maximizing forgo policies

Forgo the considered pursuit in if ρin < ρout

Choose the considered pursuit in if ρin > ρout

Choosing and forgoing the considered pursuit in are equivalent if

Ap 6. Derivation of the equivalent immediate reward (i.e. the subjective value) for optimal global reward rate

Pursuit in1 and pursuit in2 produce equivalent global reward rates if (rin1 + ρout tout)/(tin1 + tout) = (rin2 + ρout tout)/(tin2 + tout)

By definition, if tin2 = 0, pursuit in2 is an immediate reward. Finding rin2 such that ρg(in1) = ρg(in2) gives the equivalent immediate reward, i.e., the subjective value, of pursuit in1.

Therefore, for a considered pursuit, in,…
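The concluding expression was elided in extraction. Setting the two global reward rates equal with $t_{in2}=0$ and solving for the immediate reward $r_{in2}$ yields the following reconstruction: the subjective value is the pursuit’s reward less time’s cost.

```latex
\frac{r_{in1} + \rho_{out}\,t_{out}}{t_{in1} + t_{out}}
\;=\;
\frac{sv + \rho_{out}\,t_{out}}{t_{out}}
\quad\Longrightarrow\quad
sv_{in} \;=\; r_{in} \;-\; \rho_g\, t_{in}
```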

Ap 7. Equivalent immediate subjective value need not be calculated from option-specific estimations of global reward rate

If svin1 < svin2

If svin1 = svin2

If svin1 > svin2

If ρin < ρout

If ρin = ρout

If ρin > ρout

Ap 8. Reformulation of equivalent immediate subjective value in terms of outside parameters
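The reformulated expression itself is missing. Writing the subjective value as the pursuit’s reward less time’s cost ($sv_{in} = r_{in} - \rho_g t_{in}$, consistent with Ap 6’s definition of sv as the equivalent immediate reward), substituting the Ap 3 form of $\rho_g$, and simplifying gives this reconstruction in outside terms only:

```latex
sv_{in} \;=\; \bigl(r_{in} - \rho_{out}\,t_{in}\bigr)\,
              \frac{t_{out}}{t_{in} + t_{out}}
\;=\; \frac{r_{in} - \rho_{out}\,t_{in}}{\,1 + t_{in}/t_{out}\,}
```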

Ap 9. Derivation of choice policies that optimize global reward rate

Let tin1 > tin2

Choose option in1 if ρin1+∀iin1,in2 > ρin2+∀iin1,in2

fin1,in2: the frequency at which the choice between options in1 and in2 is presented.

ρout: the reward rate earned outside of the in1 v. in2 choice

tout: the average time per choice spent outside of in1 or in2.

relationship between ρin1+∀iin1,in2 and ρin2+∀iin1,in2

ρg*: the maximum reward rate

Choose option in2 if ρin1+∀iin1,in2 < ρin2+∀iin1,in2

Option in2 and option in1 are equivalent if ρin1+∀iin1,in2 = ρin2+∀iin1,in2

Ap 10. Equivalent immediate subjective value policies that optimize global reward rate

Choose option in1 over pursuit in2 if ρg(in1) > ρg(in2)

Choose option in2 over option in1 if ρg(in2) > ρg(in1)

Option in2 and option in1 are equivalent if ρg(in1) = ρg(in2)

Ap 11. Definitions for misestimating global reward rate-enabling parameters

Each misestimated variable (column 1) is multiplied by an error term, ω, to give the misestimated global reward rate (column 2). When ω ∈ (0,1) the variable is underestimated, when ω ∈ (1,2) it is overestimated, and when ω = 1 it is correctly estimated and the true global reward rate is recovered.

Ap 12. Conditions wherein overestimation of global reward rate leads to suboptimal choice behavior

If tLL > tSS and rLL > rSS

pursuit LL is optimal if ρg(LL) > ρg(SS), that is, if svLL > svSS

Policy from global reward rate overestimation

If svLL > svSS when both are computed under the overestimated global reward rate, the animal will choose pursuit LL

If svLL < svSS when both are computed under the overestimated global reward rate, the animal will choose pursuit SS

pursuit LL is optimal if svLL > svSS under the true global reward rate, and pursuit LL is chosen if svLL > svSS under the overestimated rate

pursuit LL is optimal if svLL > svSS under the true global reward rate, but pursuit SS is chosen if svLL < svSS under the overestimated rate

The policy from overestimation is suboptimal if it reverses the ordering of the subjective values

The policy from overestimation is suboptimal if svLL < svSS under the overestimated rate but svLL > svSS under the true rate

Ap 13. Situations in which the rewarding option does not exclude the animal from receiving outside reward