Homeostatic reinforcement learning for integrating reward collection and physiological stability
Abstract
Efficient regulation of internal homeostasis and defending it against perturbations requires adaptive behavioral strategies. However, the computational principles mediating the interaction between homeostatic and associative learning processes remain undefined. Here we use a definition of primary rewards, as outcomes fulfilling physiological needs, to build a normative theory showing how learning motivated behaviors may be modulated by internal states. Within this framework, we mathematically prove that seeking rewards is equivalent to the fundamental objective of physiological stability, defining the notion of physiological rationality of behavior. We further suggest a formal basis for temporal discounting of rewards by showing that discounting motivates animals to follow the shortest path in the space of physiological variables toward the desired setpoint. We also explain how animals learn to act predictively to preclude prospective homeostatic challenges, and several other behavioral patterns. Finally, we suggest a computational role for interaction between hypothalamus and the brain reward system.
https://doi.org/10.7554/eLife.04811.001
eLife digest
Our survival depends on our ability to maintain internal states, such as body temperature and blood sugar levels, within narrowly defined ranges, despite being subject to constantly changing external forces. This process, which is known as homeostasis, requires humans and other animals to carry out specific behaviors—such as seeking out warmth or food—to compensate for changes in their environment. Animals must also learn to prevent the potential impact of changes that can be anticipated.
A network that includes different regions of the brain allows animals to perform the behaviors that are needed to maintain homeostasis. However, this network is distinct from the network that supports the learning of new behaviors in general. These two systems must, therefore, interact so that animals can learn novel strategies to support their physiological stability, but it is not clear how animals do this.
Keramati and Gutkin have now devised a mathematical model that explains the nature of this interaction, and that can account for many behaviors seen among animals, even those that might otherwise appear irrational. There are two assumptions at the heart of the model. First, it is assumed that animals are capable of guessing the impact of the outcome of their behaviors on their internal state. Second, it is assumed that animals find a behavior rewarding if they believe that the predicted impact of its outcome will reduce the difference between a particular internal state and its ideal value. For example, a form of behavior for a human might be going to the kitchen, and an outcome might be eating chocolate.
Based on these two assumptions, the model shows that animals stabilize their internal state around its ideal value by simply learning to perform behaviors that lead to rewarding outcomes (such as going into the kitchen and eating chocolate). Their theory also explains the physiological importance of a type of behavior known as ‘delay discounting’. Animals displaying this form of behavior regard a positive outcome as less rewarding the longer they have to wait for it. The model proves mathematically that delay discounting is a logical way to optimize homeostasis.
In addition to making a number of predictions that could be tested in experiments, Keramati and Gutkin argue that their model can account for the failure of homeostasis to limit food consumption whenever foods loaded with salt, sugar or fat are freely available.
https://doi.org/10.7554/eLife.04811.002
Introduction
Survival requires living organisms to maintain their physiological integrity within the environment. In other words, they must preserve homeostasis (e.g. body temperature, glucose level, etc.). Yet, how might an animal learn to structure its behavioral strategies to obtain the outcomes necessary to fulfill and even preclude homeostatic challenges? Such efficient behavioral decisions surely should depend on two brain circuits working in concert: the hypothalamic homeostatic regulation (HR) system, and the cortico-basal ganglia reinforcement learning (RL) mechanism. However, the computational mechanisms underlying this obvious coupling remain poorly understood.
The previously developed classical negative feedback models of HR have tried to explain the hypothalamic function in behavioral sensitivity to the ‘internal’ state, by axiomatizing that animals minimize the deviation of some key physiological variables from their hypothetical setpoints (Marieb & Hoehn, 2012). To this end, a direct corrective response is triggered when a deviation from setpoint is sensed or anticipated (Sibly & McFarland, 1974; Sterling, 2012). A key lacuna in these models is how a simple corrective action (e.g. ‘go eat’) in response to a homeostatic deficit might be translated into a complex behavioral strategy for interacting with the dynamic and uncertain external world.
On the other hand, the computational theory of RL has proposed a viable computational account for the role of the cortico-basal ganglia system in behavioral adaptation to the ‘external’ environment, by exploiting experienced environmental contingencies and reward history (Sutton & Barto, 1998; Rangel et al., 2008). Critically, this theory is built upon one major axiom, namely, that the objective of behavior is to maximize reward acquisition. Yet, this suite of theoretical models does not resolve how the brain constructs the reward itself, and how the variability of the internal state impacts overt behavior.
Accumulating neurobiological evidence indicates intricate intercommunication between the hypothalamus and the reward-learning circuitry (Palmiter, 2007; Yeo & Heisler, 2012; Rangel, 2013). The integration of the two systems is also behaviorally manifest in the classical behavioral pattern of anticipatory responding, in which animals learn to act predictively to preclude prospective homeostatic challenges. Moreover, the ‘good regulator’ theoretical principle implies that ‘every good regulator of a system must be a model of that system’ (Conant & Ashby, 1970), accentuating the necessity of learning a model (either explicit or implicit) of the environment in order to regulate internal variables, and thus, the necessity of associative learning processes being involved in homeostatic regulation.
Given the apparent coupling of homeostatic and learning processes, here, we propose a formal hypothesis for the computations, at an algorithmic level, that may be performed in this biological integration of the two systems. More precisely, inspired by previous descriptive hypotheses on the interaction between motivation and learning (Hull, 1943; Spence, 1956; Mowrer, 1960), we suggest a principled model for how the rewarding value of outcomes is computed as a function of the animal's internal state, and of the approximated need-reduction ability of the outcome. The computed reward is then made available to RL systems that learn over a state-space including both internal and external states, resulting in approximate reinforcement of instrumental associations that reduce or prevent homeostatic imbalance.
The paper is structured as follows: After giving a heuristic sketch of the theory, we show several analytical, behavioral, and neurobiological results. On the basis of the proposed computational integration of the two systems, we prove analytically that reward-seeking and physiological stability are two sides of the same coin, and also provide a normative explanation for temporal discounting of reward. Behaviorally, the theory gives a plausible unified account for anticipatory responding and the rise-fall pattern of the response rate. We show that the interaction between the two systems is critical in these behavioral phenomena and thus, neither classical RL nor classical HR theories can account for them. Neurobiologically, we show that our model can shed light on recent findings on the interaction between the hypothalamus and the reward-learning circuitry, namely, the modulation of dopaminergic activity by hypothalamic signals. Furthermore, we show how orosensory information can be integrated with internal signals in a principled way, resulting in accounting for experimental results on consummatory behaviors, as well as the pathological condition of overeating induced by hyperpalatability. Finally, we discuss limitations of the theory, compare it with other theoretical accounts of motivation and internal state regulation, and outline testable predictions and future directions.
Results
Theory sketch
A self-organizing system (i.e. an organism) can be defined as a system that opposes the second law of thermodynamics (Friston, 2010). In other words, biological systems actively resist the natural tendency to disorder by regulating their physiological state to fall within narrow bounds. This general process, known as homeostasis (Cannon, 1929; Bernard, 1957), includes adaptive behavioral strategies for counteracting and preventing self-entropy in the face of constantly changing environments. In this sense, one would expect organisms to reinforce responses that mitigate deviation of the internal state from desired ‘setpoints’. This is reminiscent of the drive-reduction theory (Hull, 1943; Spence, 1956; Mowrer, 1960) according to which one of the major mechanisms underlying reward is the usefulness of the corresponding outcome in fulfilling the homeostatic needs of the organism (Cabanac, 1971). Inspired by these considerations (i.e. preservation of self-order and reduction of deviations), we propose a formal definition of primary reward (equivalently: reinforcer, economic utility) as the approximated ability of an outcome to restore the internal equilibrium of the physiological state. We then demonstrate that our formal homeostatic reinforcement learning framework accounts for some phenomena that classical drive-reduction was unable to explain.
We first define ‘homeostatic space’ as a multidimensional metric space in which each dimension represents one physiologically regulated variable (the horizontal plane in Figure 1). The physiological state of the animal at each time t can be represented as a point in this space, denoted by ${H}_{t}=({h}_{1,t},{h}_{2,t},\ldots,{h}_{N,t})$, where ${h}_{i,t}$ indicates the state of the $i$th physiological variable. For example, ${h}_{i,t}$ can refer to the animal's glucose level, body temperature, plasma osmolality, etc. The homeostatic setpoint, as the ideal internal state, can be denoted by ${H}^{*}=({h}_{1}^{*},{h}_{2}^{*},\ldots,{h}_{N}^{*})$. As a mapping from the physiological to the motivational state, we define the ‘drive’ as the distance of the internal state from the setpoint (the three-dimensional surface in Figure 1):
(1) $D\left({H}_{t}\right)=\sqrt[m]{\sum_{i=1}^{N}{\left|{h}_{i}^{*}-{h}_{i,t}\right|}^{n}}$
m and n are free parameters that induce important nonlinear effects on the mapping between homeostatic deviations and their motivational consequences. Note that for the simple case of m = n = 2, the drive function reduces to Euclidean distance. We will later consider more general nonlinear mappings in terms of classical utility theory. We will also discuss that the drive function can be viewed as equivalent to the information-theoretic notion of surprise, defined as the negative log-probability of finding an organism in a certain state ($D\left({H}_{t}\right)=-\mathrm{ln}\,p\left({H}_{t}\right)$).
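The drive can be illustrated with a few lines of Python. This is a minimal sketch, assuming the drive is the m-th root of the summed n-th powers of the absolute deviations from the setpoint; the function name `drive`, the default parameter values, and the example states are illustrative assumptions rather than values taken from the simulations.

```python
def drive(H, H_star, m=3.0, n=4.0):
    """Drive as a distance of the internal state H from the setpoint H_star.

    Assumed form: (sum_i |h*_i - h_i|^n)^(1/m); m and n are free parameters.
    """
    if len(H) != len(H_star):
        raise ValueError("H and H_star must have the same dimension")
    total = sum(abs(hs - h) ** n for h, hs in zip(H, H_star))
    return total ** (1.0 / m)


# Hypothetical two-variable state (arbitrary units), with the setpoint at the origin.
setpoint = (0.0, 0.0)
print(drive((0.0, 0.0), setpoint))             # no drive at the setpoint
print(drive((3.0, 4.0), setpoint, m=2, n=2))   # m = n = 2 gives the Euclidean distance
print(drive((-2.0, 0.0), setpoint))            # larger deviation, larger drive
```

Setting m and n to other values changes how steeply the drive grows with deviation, which is what produces the nonlinear motivational effects discussed later.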
Having defined drive, we can now provide a formal definition for primary reward. Let's assume that as the result of an action, the animal receives an outcome ${o}_{t}$ at time t. The impact of this outcome on different dimensions of the animal's internal state can be denoted by ${K}_{t}=({k}_{1,t},{k}_{2,t},\ldots,{k}_{N,t})$. For example, ${k}_{i,t}$ can be the quantity of glucose received as a result of outcome ${o}_{t}$. Hence, the outcome results in a transition of the physiological state from ${H}_{t}$ to ${H}_{t+1}={H}_{t}+{K}_{t}$ (see Figure 1) and thus, a transition of the drive state from $D\left({H}_{t}\right)$ to $D\left({H}_{t+1}\right)=D({H}_{t}+{K}_{t})$. Accordingly, the rewarding value of this outcome can be defined as the consequent reduction of drive:
(2) $r\left({H}_{t},{K}_{t}\right)=D\left({H}_{t}\right)-D\left({H}_{t+1}\right)=D\left({H}_{t}\right)-D\left({H}_{t}+{K}_{t}\right)$
Intuitively, the rewarding value of an outcome depends on the ability of its constituting elements to reduce the homeostatic distance from the setpoint or equivalently, to counteract self-entropy. As discussed later, the additive effect (${K}_{t}$) of these constituting elements on the internal state can be approximated by the orosensory properties of outcomes. We will also discuss how erroneous estimation of drive reduction can potentially be a cause for maladaptive consumptive behaviors.
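The definition of reward as drive reduction can be sketched as follows. The code is a hedged illustration: the form of `drive`, the parameter values m = 3 and n = 4, and the one-dimensional example states are assumptions made for the example only.

```python
def drive(H, H_star, m=3.0, n=4.0):
    """Assumed drive: (sum_i |h*_i - h_i|^n)^(1/m)."""
    return sum(abs(hs - h) ** n for h, hs in zip(H, H_star)) ** (1.0 / m)


def reward(H, K, H_star, m=3.0, n=4.0):
    """Reward of an outcome with impact K received in state H: D(H) - D(H + K)."""
    H_next = tuple(h + k for h, k in zip(H, K))
    return drive(H, H_star, m, n) - drive(H_next, H_star, m, n)


setpoint = (0.0,)
print(reward((-5.0,), (2.0,), setpoint))  # positive: the outcome moves the state toward the setpoint
print(reward((-1.0,), (4.0,), setpoint))  # negative: the outcome overshoots the setpoint
```

The same outcome can therefore be rewarding or punishing depending on the internal state in which it is received.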
We hypothesize in this paper that the primary reward constructed as proposed in Equation 2 is used by the brain's reward learning machinery to structure behavior. Incorporating this physiological reward definition in a normative RL theory allows us to derive one major result of our theory, which is that the rationality of behavioral patterns is geared toward maintaining physiological stability.
Rationality of the theory
Here we show that our definition of reward reconciles the RL and HR theories in terms of their normative assumptions: reward acquisition and physiological stability are mathematically equivalent behavioral objectives. More precisely, given the proposed definition of reward and given that animals discount future rewards (Chung & Herrnstein, 1967), any behavioral policy, π, that maximizes the sum of discounted rewards ($SDR$) also minimizes the sum of discounted deviations from the setpoint, and vice versa. In fact, starting from an initial internal state ${H}_{0}$, the sum of discounted deviations (SDD) for a certain behavioral policy π that causes the internal state to move in the homeostatic space along the trajectory p(π), can be defined as:
(3) $SD{D}_{\pi}\left({H}_{0}\right)=\int_{p\left(\pi\right)}{\gamma}^{t}\cdot D\left({H}_{t}\right)\,dt$
Similarly, the sum of discounted rewards (SDR) for a policy π can be defined as:
(4) $SD{R}_{\pi}\left({H}_{0}\right)=\int_{p\left(\pi\right)}{\gamma}^{t}\cdot {r}_{t}\,dt$
It is then rather straightforward to show that for any initial state ${H}_{0}$, we will have (See ‘Materials and methods’ for the proof):
(5) $\underset{\pi}{\mathrm{argmin}}\,SD{D}_{\pi}\left({H}_{0}\right)=\underset{\pi}{\mathrm{argmax}}\,SD{R}_{\pi}\left({H}_{0}\right)\quad\text{if }\gamma<1$
where γ is the discount factor. In other words, the same behavioral policy satisfies optimal reward-seeking as well as optimal homeostatic maintenance. In this respect, reward acquisition sought by the RL system is an efficient means to guide an animal's behavior toward fulfilling the basic objective of defending homeostasis. Thus, our theory suggests a physiological basis for the rationality of reward seeking.
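The equivalence can be checked numerically in a discrete-time, finite-horizon setting. The sketch below is not the proof given in ‘Materials and methods’; it assumes that rewards are one-step drive reductions and uses two hypothetical drive sequences with the same start and end points. Under these assumptions the two sums are linked by the identity SDR = D_0 − (1 − γ)·SDD − γ^T·D_T, so the trajectory with the smaller sum of discounted deviations has the larger sum of discounted rewards whenever γ < 1.

```python
def sdd(drives, gamma):
    """Discounted sum of the drives visited after the initial state."""
    return sum(gamma ** t * d for t, d in enumerate(drives[1:]))


def sdr(drives, gamma):
    """Discounted sum of rewards, each reward being a one-step drive reduction."""
    steps = zip(drives[:-1], drives[1:])
    return sum(gamma ** t * (d0 - d1) for t, (d0, d1) in enumerate(steps))


gamma = 0.9
direct = [4.0, 3.0, 2.0, 1.0, 0.0]   # drive falls steadily to the setpoint
detour = [4.0, 6.0, 5.0, 2.0, 0.0]   # same endpoints, but a large excursion

for path in (direct, detour):
    T = len(path) - 1
    identity = path[0] - (1 - gamma) * sdd(path, gamma) - gamma ** T * path[-1]
    print(sdd(path, gamma), sdr(path, gamma), identity)
```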
Normative role of temporal discounting
In the domain of animal behavior, one fundamental question is why animals should discount rewards the further they are in the future. Our theory indicates that reward seeking without discounting (i.e., if γ = 1) would not lead to, and may even be detrimental to, physiological stability (See ‘Materials and methods’). Intuitively, this is because a future-discounting agent would always tend to expedite bigger rewards and postpone punishments. Such an agent, therefore, tries to reduce homeostatic deviations (which is rewarding) as soon as possible, and thus, tries to find the shortest path toward the setpoint. A non-discounting agent, in contrast, can always compensate for a deviation-induced punishment by reducing that deviation any time in the future.
While the formal proof of the necessity of discounting is given in the ‘Materials and methods’, let us give an intuitive explanation. Imagine you had to plan a 1-hr hill walk from a drop-off point toward a pickup point, during which you wanted to minimize the height (equivalent to drive) summed over the path you take. In this summation, if you give higher weights to your height in the near future as compared to later times, the optimum path would be to descend the hill and spend as long as possible at the bottom (i.e. homeostatic setpoint) before returning to the pickup point. Equation 5 shows that this optimization is equivalent to optimizing the total discounted rewards along the path, given that descending and ascending steps are defined as being rewarding and punishing, respectively (Equation 2).
In contrast, if at all points in time you give equal weights to your height, then the summed reward over the path only depends on the drop-off and pickup points, since every ascent can be compensated by a descent at any time. In other words, in the absence of discounting, the rewarding value of a behavioral policy that changes the internal state only depends on the initial and final internal states, regardless of its trajectory in the homeostatic space. Thus, when γ = 1, the values of any two behavioral policies with equal net shifts of the internal state are equal, even if one policy moves the internal state along the shortest path, whereas the other policy results in large deviations of the internal state from the setpoint and threatens survival. These results hold for any form of temporal discounting (e.g., exponential, hyperbolic). In this respect, our theory provides a normative explanation for the necessity of temporal discounting of reward: to maintain internal stability, it is necessary to discount future rewards.
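The hill-walk intuition can be made concrete with a short calculation. The two walks below are hypothetical height profiles chosen for illustration; each step's reward is assumed to be the drop in height on that step.

```python
def discounted_reward(heights, gamma):
    """Sum of discounted rewards, where each reward is the drop in height on one step."""
    steps = zip(heights[:-1], heights[1:])
    return sum(gamma ** t * (h0 - h1) for t, (h0, h1) in enumerate(steps))


# Two hypothetical walks with the same drop-off (3) and pickup (2) heights.
valley_walk = [3.0, 1.0, 0.0, 0.0, 0.0, 1.0, 2.0]   # descends early, climbs late
ridge_walk = [3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 2.0]    # stays high until the end

for gamma in (1.0, 0.9):
    print(gamma, discounted_reward(valley_walk, gamma), discounted_reward(ridge_walk, gamma))
```

With γ = 1 both walks earn the same total, equal to the net drop between the endpoints, so nothing favors the walk that spends its time near the bottom. With γ = 0.9 the walk that descends early earns more.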
A normative account of anticipatory responding
A paradigmatic example of behaviors governed by the internal state is the anticipatory responses geared to preclude perturbations in regulated variables even before any physiological depletion (negative feedback) is detectable. Anticipatory eating and drinking that occur before any discernible homeostatic deviation (Woods & Seeley, 2002), anticipatory shivering in response to a cue that predicts the cold (Mansfield et al., 1983; Hjeresen et al., 1986), and insulin secretion prior to meal initiation (Woods, 1991), are only a few examples of anticipatory responding.
One clear example of a conditioned homeostatic response is animals' progressive tolerance to ethanol-induced hypothermia. Experiments show that when ethanol injections are preceded by a distinctive cue (i.e., are predictable), the ethanol-induced drop of the body core temperature of animals diminishes across trials (Mansfield & Cunningham, 1980). Figure 2 shows that when the temperature was measured 30, 60, 90, and 120 min after daily injections, the drop of temperature below the baseline was significant on the first day, but gradually disappeared over 8 days. Interestingly, in the first extinction trial on the ninth day where the ethanol was omitted, the animal's temperature exhibited a significant increase above normal after cue presentation. This indicates that the enhanced tolerance response to ethanol is triggered by the cue, and results in an increase of temperature in order to compensate for the forthcoming ethanol-induced hypothermia. Thus, this tolerance response is mediated by associative learning processes, and is aimed at regulating temperature. Here we demonstrate that the integration of HR and RL processes accounts for this phenomenon.
We simulate the model in an artificial environment where on every trial, the agent can choose between initiating a tolerance response and doing nothing, upon observing a cue (Figure 3A). The cue is then followed by a forced drop of temperature, simulating the effect of ethanol (Figure 3B). We also assume that in the absence of injection, the temperature does not change. However, if the agent chooses to initiate the tolerance response in this condition, the temperature increases gradually (Figure 3D). Thus, if ethanol injection is preceded by a cue-triggered tolerance response, the combined effect (Figure 3F, as superposition of Figure 3B,D) will have less deviation from the setpoint as compared to when no response is taken (Figure 3B). As punishment (as the opposite of reward) in our model is defined by the extent to which the deviation from the setpoint increases, the ‘null’ response will have a bigger punishing value than the ‘tolerance’ response and thus, the agent gradually reinforces the ‘tolerance’ action (Figure 3C) (More precisely, the rewarding value of each action is defined by the sum of discounted drive-reductions during the 24 hr upon taking that action). This results in a gradual fading of the ethanol-induced deviation of temperature from setpoint (Figure 3E; See Figure 3—source data 1 for simulation details).
Clearly, if after this learning process cue presentation is no longer followed by ethanol injection (as in the first extinction trial, E1), the cue-triggered tolerance response increases the temperature beyond the setpoint (Figure 3E).
In general, these results show that the tolerance response caused by predicted hypothermia is an optimal behavior in terms of minimizing homeostatic deviation and thus, maximizing reward. Thus, this optimal homeostatic maintenance policy is acquired by associative learning mechanisms.
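The comparison between the ‘null’ and ‘tolerance’ actions can be sketched by evaluating each action as a sum of discounted drive reductions. This is not the simulation reported in Figure 3: the temperature profiles, the discount factor, and the drive parameters below are hypothetical values chosen only to illustrate the argument, and no learning is simulated.

```python
def drive(dev, m=3.0, n=4.0):
    """One-dimensional drive for a temperature deviation from the setpoint."""
    return abs(dev) ** (n / m)


def action_value(deviations, gamma=0.9):
    """Sum of discounted drive reductions along a trajectory that starts at the setpoint."""
    path = [0.0] + list(deviations)
    return sum(
        gamma ** t * (drive(a) - drive(b))
        for t, (a, b) in enumerate(zip(path[:-1], path[1:]))
    )


# Hypothetical temperature deviations (degrees from setpoint) at successive time steps.
ethanol = [-1.0, -2.0, -1.5, -0.5, 0.0]    # forced hypothermia after injection
tolerance = [0.5, 1.0, 0.8, 0.3, 0.0]      # warming produced by the tolerance response
combined = [e + r for e, r in zip(ethanol, tolerance)]

print("null, ethanol given:       ", action_value(ethanol))
print("tolerance, ethanol given:  ", action_value(combined))
print("tolerance, ethanol omitted:", action_value(tolerance))
```

Under these assumptions the tolerance response is less punishing than doing nothing when ethanol follows the cue, but it becomes punishing in its own right when ethanol is omitted, which matches the overshoot seen in the extinction trial.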
Our theory implies that animals are capable of learning not only Pavlovian (e.g. shivering, or tolerance to ethanol), but also instrumental anticipatory responding (e.g., pressing a lever to receive warmth, in response to a cold-predicting cue). This prediction is in contrast to the theory of predictive homeostasis (also known as allostasis) where anticipatory behaviors are only reflexive responses to the predicted homeostatic deprivation upon observing cues (Woods & Ramsay, 2007; Sterling, 2012).
Behavioral plausibility of drive: accounting for key phenomena
The definition of the drive function (Equation 1) in our model has two degrees of freedom: m and n are free parameters whose values determine the properties of the homeostatic space metric. Appropriate choice of m and n (n > m > 1) permits our theory to account for the following four key behavioral phenomena in a unified framework. First, it accounts for the fact that the reinforcing value of an appetitive outcome increases as a function of its dose (${K}_{t}$) (Figure 4A):
(6) $\frac{dr\left({H}_{t},{K}_{t}\right)}{d{k}_{j,t}}>0\quad\text{for }{k}_{j,t}>0$
This is supported by the fact that in progressive ratio schedules of reinforcement rats maintain higher breakpoints when reinforced with bigger appetitive outcomes, reflecting higher motivation toward them (Hodos, 1961; Skjoldager et al., 1993). Secondly, the model accounts for the potentiating effect of the deprivation level on the reinforcing value (i.e., food will be more rewarding when the animal is hungrier) (Figure 4B,C):
(7) $\frac{dr\left({H}_{t},{K}_{t}\right)}{d\left|{h}_{j}^{*}-{h}_{j,t}\right|}>0\quad\text{for }{k}_{j,t}>0$
This is consistent with experimental evidence showing that the level of food deprivation in rats increases the breakpoint in a progressive ratio schedule (Hodos, 1961). Note that this point effectively establishes a formal extension for the ‘incentive’ concept as defined by incentive salience theory (Berridge, 2012) (Discussed later).
Thirdly, the theory accounts for the inhibitory effect of irrelevant drives, which is consistent with a large body of behavioral experiments showing competition between different motivational systems (See Dickinson & Balleine, 2002 for a review). In other words, as the deprivation level for one need increases, it inhibits the rewarding value of other outcomes that satisfy irrelevant motivational systems (Figure 4D):
(8) $\frac{dr\left({H}_{t},{K}_{t}\right)}{d\left|{h}_{i}^{*}-{h}_{i,t}\right|}<0\quad\text{for all }i\ne j\text{ and }{k}_{j,t}>0$
Intuitively, one does not play chess, or even search for sex, on an empty stomach. As some examples, calcium deprivation reduces the appetite for phosphorus, and hunger inhibits sexual behavior (Dickinson & Balleine, 2002).
Finally, the theory naturally captures the risk-aversive nature of behavior. The rewarding value in our model is a concave function of the corresponding outcome magnitude:
(9) $\frac{{d}^{2}r\left({H}_{t},{K}_{t}\right)}{d{k}_{j,t}^{2}}<0\quad\text{for }{k}_{j,t}>0$
It is well known that the concavity of the economic utility function is equivalent to risk aversion (Mas-Colell et al., 1995). Indeed, simulating the model shows that when faced with two options with equal expected payoffs, the model learns to choose the more certain option as opposed to the risky one (Figure 5; See Figure 5—source data 1 for simulation details). This is because frequent small deviations from the setpoint are preferable to rare drastic deviations. In fact, our theory suggests the intuition that when the expected physiological instability caused by two behavioral options is equal, organisms do not choose the risky option, because the severe, though unlikely, physiological instabilities that it can cause might be life-threatening.
Our unified explanation for the above four behavioral patterns suggests that they may all arise from the functional form of the mapping from the physiological to the motivational state. In this sense, we propose that these behavioral phenomena are signatures of the coupling between the homeostatic and the associative learning systems. We will discuss later that m, n, and ${H}^{*}$ can be regarded as free parameters of an evolutionary process, which eventually determine the equilibrium density of the species.
Note that the equations in this section hold only when the internal state remains below the setpoint. However, the drive function is symmetric with respect to the setpoint and thus, analogous conclusions can be derived for the other three quarters of the homeostatic space.
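The link between concavity and risk aversion can be illustrated by comparing expected rewards directly, without simulating learning. The sketch assumes a one-dimensional drive of the form |deficit|^(n/m) with the hypothetical values m = 3 and n = 4, and two hypothetical options with equal expected payoff.

```python
def reward(deficit, k, m=3.0, n=4.0):
    """Drive reduction when an outcome of size k is received at a given deficit (1-D)."""
    return abs(deficit) ** (n / m) - abs(deficit - k) ** (n / m)


deficit = 20.0                      # hypothetical distance below the setpoint
certain = reward(deficit, 5.0)      # always 5 units
risky = 0.5 * reward(deficit, 10.0) + 0.5 * reward(deficit, 0.0)  # 10 units or nothing

print(certain, risky)
```

Because the reward is concave in the outcome magnitude, the certain option has the higher expected reward even though both options deliver 5 units on average.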
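The first three behavioral properties can be checked numerically for a two-dimensional homeostatic space. This is a sketch under assumptions: the drive is taken to be the m-th root of the summed n-th powers of the two deficits, with hypothetical values m = 3 and n = 4 (so that n > m > 1), and all states lie below the setpoint.

```python
def drive(a, b, m=3.0, n=4.0):
    """Drive for two deficits a and b (distances below the setpoint on each axis)."""
    return (abs(a) ** n + abs(b) ** n) ** (1.0 / m)


def reward(a, b, k, m=3.0, n=4.0):
    """Drive reduction when an outcome reduces the first deficit by k."""
    return drive(a, b, m, n) - drive(a - k, b, m, n)


base = reward(10.0, 5.0, 2.0)
print("larger dose is more rewarding:        ", reward(10.0, 5.0, 4.0) > base)
print("higher deprivation potentiates reward:", reward(15.0, 5.0, 2.0) > base)
print("irrelevant drive inhibits reward:     ", reward(10.0, 10.0, 2.0) < base)
```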
Stepping back from the brink
Since learning requires experience, learning whether an action in a certain internal state decreases or increases the drive (i.e. is rewarding or punishing, respectively) would require our model to have experienced that internal state. Living organisms, however, cannot just experience internal states with extreme and life-threatening homeostatic deviations in order to learn that the actions that cause them are bad. For example, once the body temperature goes beyond 45°C, the organism can never return.
We now show how our model manages this problem; that is, it avoids voluntarily experiencing extreme homeostatic deviations and hence ensures that the animal does not voluntarily endanger its physiological integrity (simulations in Figure 6). In the simplest case, let us assume that the model is tabula rasa: it starts from absolute ignorance about the value of state–action pairs, and can freely change its internal state in the homeostatic space. In a one-dimensional space, it means that the agent can freely increase or decrease the internal state (Figure 6—figure supplement 1). As the values of ‘increase’ and ‘decrease’ actions at all internal states are initialized to zero, the agent starts by performing a random walk in the homeostatic space. However, the probability of choosing the same action $z$ times in a row decreases exponentially as z increases ($p\left(z\right)={2}^{-z}$): for example, the probability of choosing ‘increase’ is 2^{−1} = 0.5, the probability of choosing two successive ‘increases’ is 2^{−2} = 0.25, the probability of choosing three successive ‘increases’ is 2^{−3} = 0.125, and so on. Thus, it is highly likely for the agent to return at least one step back, before getting too far from its starting point. When the agent returns to a state it had previously experienced, going in the same deviation-increasing direction will be less likely than the first time (i.e., than 50–50), since the agent has already experienced the punishment caused by that state–action pair once. Repetition of this process results in the agent gradually getting more and more attracted to the setpoint, without ever having experienced internal states that are beyond a certain limit (i.e. the brink of death).
Simulating the model in a one-dimensional space shows that even after starting from a rather deviated internal state (initial state = 30, setpoint = 0), the agent never visits states with a deviation of more than 40 units after ${10}^{6}$ trials (every action is assumed to change the state by one unit) (Figure 6A; See Figure 6—figure supplements 1,2, and Figure 6—source data 1 for simulation details). Also, simulating 10^{5} agents over 1500 trials (starting from state 30) shows that the mean value of the internal state across all agents converges to the setpoint (Figure 6C), and its variance converges to a steady-state level (Figure 6D). This shows that all agents stay within certain bounds around the setpoint (The maximum deviation from the setpoint among all the 10^{5} agents over the 1500 trials was 61). Also, this property of the model is shown to be insensitive to the parameters of the model, like the initial internal state (Figure 6—figure supplement 3), the rate of exploration (Figure 6—figure supplement 4), m and n (Figure 6—figure supplement 5), or the discount factor (Figure 6—figure supplements 6,7). These parameters only affect the rate of convergence or the distribution over visited states, but not the general property of never visiting drastic deviations (existence of a boundary). Moreover, this property can be generalized to multidimensional homeostatic spaces. Therefore, our theory suggests a potential normative explanation for how animals (who might be a priori naïve about potential dangers of certain internal states) would learn to avoid extreme physiological instability, without ever exploring how good or bad they are.
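The arithmetic behind the random-walk argument can be verified by brute force. The snippet assumes a naive agent that picks between two actions with equal probability and cross-checks the closed-form value against an enumeration of all action sequences; the function name is an illustrative choice.

```python
from itertools import product


def p_same_action(z):
    """Probability that a uniformly random two-action policy picks 'increase' z times in a row."""
    return 2.0 ** (-z)


for z in (1, 2, 3):
    # Cross-check by enumerating all equally likely action sequences of length z.
    sequences = list(product(("increase", "decrease"), repeat=z))
    all_increase = sum(1 for s in sequences if all(a == "increase" for a in s))
    print(z, p_same_action(z), all_increase / len(sequences))
```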
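A simulation of this kind can be sketched as tabular Q-learning over a one-dimensional internal state. The code below is only an illustration under assumptions and is not the simulation behind Figure 6: the learning rate, discount factor, softmax temperature, number of steps, and drive parameters are all hypothetical, and the largest deviation it reports will depend on these choices and on the random seed.

```python
import math
import random
from collections import defaultdict


def drive(s, m=3.0, n=4.0):
    """One-dimensional drive for an internal state s, with the setpoint at 0."""
    return abs(s) ** (n / m)


def simulate(n_steps=5000, start=30, alpha=0.2, gamma=0.9, beta=1.0, seed=0):
    """Tabular Q-learning over a 1-D internal state with drive-reduction rewards."""
    rng = random.Random(seed)
    q = defaultdict(float)            # q[(state, action)], initialised to zero
    actions = (-1, +1)
    s, trajectory = start, [start]
    for _ in range(n_steps):
        # Softmax action selection; equal values give a 50-50 choice.
        prefs = [beta * q[(s, a)] for a in actions]
        top = max(prefs)
        weights = [math.exp(p - top) for p in prefs]
        a = rng.choices(actions, weights=weights)[0]
        s_next = s + a
        r = drive(s) - drive(s_next)  # positive when moving toward the setpoint
        best_next = max(q[(s_next, b)] for b in actions)
        q[(s, a)] += alpha * (r + gamma * best_next - q[(s, a)])
        s = s_next
        trajectory.append(s)
    return trajectory


trajectory = simulate()
print("largest deviation visited:", max(abs(s) for s in trajectory))
```

Varying `start`, `beta`, or `gamma` in this sketch is one way to explore informally how the visited range depends on the parameters.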
Orosensory-based approximation of postingestive effects
As mentioned, we hypothesize that orosensory properties of food and water provide the animal with an estimate, ${\widehat{K}}_{t}$, of their true postingestive effect, ${K}_{t}$, on the internal state. Such association between sensory and postingestive properties could have been developed through prior learning (Swithers et al., 2009; Swithers et al., 2010; Beeler et al., 2012) or evolutionary mechanisms (Breslin, 2013). Based on this sensory approximation, the only information required to compute the reward (and thus the reward prediction error) is the current physiological state (${H}_{t}$) and the sensory-based approximation of the nutritional content of the outcome (${\widehat{K}}_{t}$):
(10) $\widehat{r}\left({H}_{t},{\widehat{K}}_{t}\right)=D\left({H}_{t}\right)-D\left({H}_{t}+{\widehat{K}}_{t}\right)$
Clearly, the evolution of the internal state itself depends only on the actual (${K}_{t}$) postingestive effects of the outcome. That is ${H}_{t+1}={H}_{t}+{K}_{t}$.
According to Equation 10, the reinforcing value of food and water outcomes can be approximated as soon as they are sensed/consumed, without having to wait for the outcome to be digested and the drive to be reduced. This proposition is compatible with the fact that dopamine neurons exhibit instantaneous, rather than delayed, burst activity in response to unexpected food reward (Schneider, 1989; Schultz et al., 1997). Moreover, it might provide a formal explanations for the experimental fact that intravenous injection (and even intragastric intubation, in some cases) of food is not rewarding even though its drive reduction effect is equal to when it is ingested orally (Miller & Kessen, 1952) (See also Ren et al., 2010). In fact, if the postingestive effect of food is estimated by its sensory properties, the reinforcing value of intravenously injected food that lacks sensory aspects will be effectively zero. In the same line of reasoning, the theory suggests that animals' motivation toward palatable foods, such as saccharine, that have no caloric content (and thus no needreduction effect) is due to erroneous overestimation of their drivereduction capacity, misguided by their taste or smell. Note that the rationality of our theory, as shown in Equation 5, holds only as long as ${\widehat{K}}_{t}$ is an unbiased estimation of ${K}_{t}$. Otherwise, pathological conditions could emerge.
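The dissociation between estimated reward and actual state change can be sketched in a few lines. The example assumes a one-dimensional drive with hypothetical parameters m = 3 and n = 4, and three hypothetical outcomes; `consume` and its arguments are illustrative names.

```python
def drive(h, m=3.0, n=4.0):
    """One-dimensional drive for an internal state h, with the setpoint at 0."""
    return abs(h) ** (n / m)


def consume(h, k_true, k_sensed):
    """Return (estimated reward, next internal state) for one outcome.

    The reward uses the sensory estimate k_sensed; the state evolves with k_true.
    """
    estimated_reward = drive(h) - drive(h + k_sensed)
    return estimated_reward, h + k_true


h = -10.0  # hypothetical energy deficit
for label, k_true, k_sensed in [
    ("oral food", 2.0, 2.0),          # unbiased sensory estimate
    ("intravenous food", 2.0, 0.0),   # nutrients without any sensory signal
    ("saccharin", 0.0, 2.0),          # taste without calories
]:
    print(label, consume(h, k_true, k_sensed))
```

In this sketch intravenous food changes the internal state but yields zero estimated reward, whereas saccharin yields the same estimated reward as real food while leaving the deficit untouched.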
Last but not least, the orosensorybased approximation provides a computational hypothesis for the separation of reinforcement and satiation effects. A seminal series of experiments (McFarland, 1969) demonstrated that the reinforcing and satiating (i.e., need reduction) effects of drinking behavior, dissociable from one another, are governed by the orosensory and alimentary components of the water, respectively. Two groups of waterdeprived animals learned to press a green key to selfadminister water orally. After this pretraining session, pressing the green key had no consequence anymore, whereas pressing a novel yellow key resulted in the oral delivery of water in one group, and intragastric (through a fistula) delivery of water in the second group. Results showed that the green key gradually extinguished in both groups (Figure 7A,B). During this time, responding on the yellow key in the oral group initially increased but then gradually extinguished (risefall pattern; Figure 7A). The second group, however, showed no motivation for the yellow key (Figure 7B). This shows that only oral, but not intragastric, selfadministration of water is reinforcing for thirsty animals. Our model accounts for these behavioral dynamics.
Simulating the model shows that the agent's subjective probability of receiving water upon pressing the green key gradually decreases to zero in both groups (Figure 8C,D). As this predicted outcome (alimentary content) decreases, its approximated thirst-reduction effect (equal to reward in our framework) decreases as well, resulting in the extinction of pressing the green key (Figure 8A,B). As for the yellow key, the oral agent initially increases the rate of responding (Figure 8A) as the subjective probability of receiving water upon pressing the yellow key increases (Figure 8C). Gradually, however, the internal state of the animal reaches the homeostatic setpoint (Figure 8E), resulting in diminishing motivation (thirst-reduction effect) for seeking water (Figure 8A). Thus, our model shows that whereas the ascending limb of the response curve represents a learning effect, the descending limb is due to mitigated homeostatic imbalance (i.e., satiation rather than unlearning). Notably, classical RL models only explain the ascending limb, and classical HR models only explain the descending limb.
In contrast to the oral agent, the fistula agent never learns to press the yellow key (Figure 8B). This is because the approximated alimentary content attributed to this response remains zero (Figure 8D) and so does its drive-reduction effect. Note that as above, the sensory-based approximation (${\widehat{K}}_{t}$) of the alimentary effect of water in the oral and fistula cases is assumed to be equal to its actual effect (${K}_{t}$) and zero, respectively (see Figure 8—figure supplements 1,2, and Figure 8—source data 1 for simulation details).
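The contrast between the two delivery routes can be illustrated with a minimal sketch. It assumes a single regulated variable, a setpoint of 0, and illustrative drive exponents `n=4`, `m=3`; the function names (`drive`, `approx_reward`, `next_state`) and the numerical values are hypothetical and are not taken from the simulation details cited above.

```python
def drive(h, setpoint=0.0, n=4, m=3):
    """Drive for one regulated variable: (|h* - h|^n)^(1/m). Exponents are assumed."""
    return abs(setpoint - h) ** (n / m)


def approx_reward(h, k_hat, setpoint=0.0):
    """Reward computed from the sensed outcome k_hat, before any digestion."""
    return drive(h, setpoint) - drive(h + k_hat, setpoint)


def next_state(h, k_true):
    """The internal state moves by the water actually delivered."""
    return h + k_true


h_thirsty = -10.0  # hypothetical deprived state, 10 units below the setpoint
k_true = 1.0       # one unit of water is delivered in both groups

oral_reward = approx_reward(h_thirsty, k_hat=k_true)  # oral: sensed, so k_hat = K
fistula_reward = approx_reward(h_thirsty, k_hat=0.0)  # fistula: unsensed, so k_hat = 0

print(oral_reward > 0, fistula_reward == 0.0)
print(next_state(h_thirsty, k_true))  # the state change is the same for both routes
```

In this sketch the fistula route moves the internal state exactly as the oral route does, yet yields no reward signal, which is the dissociation between satiation and reinforcement described in the text.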
Our theory also suggests that in contrast to reinforcement (above), satiation is independent of the sensory aspects of water and only depends on its postingestive effects. In fact, experiments show that when different proportions of water were delivered via the two routes in different groups, satiation (i.e., suppression of responding) only depended on the total amount of water ingested, regardless of the delivery route (McFarland, 1969).
Our model accounts for these data (Figure 9), since the evolution of the internal state only depends on the actual water ingested. For example, whether water is administered completely orally (Figure 9, left column) or half-orally-half-intragastrically (Figure 9, right column), the agent stops seeking water when the setpoint is reached. As only oral delivery is sensed, the subjective outcome magnitude converges to 1 (Figure 9C) and 0.5 (Figure 9D) units for the two cases, respectively. When the setpoint is reached, consuming more water results in overshooting the setpoint (increasing homeostatic deviation) and thus, is punishing. Therefore, both agents self-administer the same total amount of water, equal to what is required for reaching the setpoint.
However, as the sensed amount of water is bigger in the completely-oral case, water-seeking behavior is approximated to have a higher thirst-reduction effect. As a result, the reinforcing value of water-seeking is higher in the oral case (as compared to the half-oral-half-intragastric case) and thus, the rate of responding is higher. This, in turn, results in faster convergence of the internal state to the setpoint (compare Figure 9E,F). In this respect, we predict that the oral/fistula proportion affects the speed of satiation: the higher the proportion is, the faster the satiety state is reached and thus, the faster the descending limb of responding emerges.
Discussion
Theories of conditioning are founded on the argument that animals seek reward, while reward may be defined, at least in the behaviorist approach, as what animals seek. This apparently circular argument relies on the hypothetical and out-of-reach axiom of reward-maximization as the behavioral objective of animals. Physiological stability, however, is an observable fact. Here, we develop a coherent mathematical theory where physiological stability is taken as the basic axiom, and reward is defined in physiological terms. We demonstrated that reinforcement learning algorithms under such a definition of physiological reward lead to optimal policies that both maximize reward collection and minimize homeostatic needs. This argues for behavioral rationality of physiological integrity maintenance and further shows that temporal discounting of rewards is paramount for homeostatic maintenance. Furthermore, we demonstrated that such integration of the two systems can account for several behavioral phenomena, including anticipatory responding, the rise-fall pattern of food-seeking response, risk-aversion, and competition between motivational systems. Here we argue that our framework may also shed light on the computational role of the interaction between the brain reward circuitry and the homeostatic regulation system; namely, the modulation of midbrain dopaminergic activity by hypothalamic signals.
Neural substrates
Homeostatic regulation critically depends on sensing the internal state. In the case of energy regulation, for example, the arcuate nucleus of the hypothalamus integrates peripheral hormones including leptin, insulin, and ghrelin, whose circulating levels reflect the internal abundance of fat, abundance of carbohydrate, and hunger, respectively (Williams & Elmquist, 2012). In our model, the deprivation level has an excitatory effect on the rewarding value of outcomes (Equation 7) and thus on the reward prediction error (RPE). Consistently, recent evidence indicates neuronal pathways through which energy state-monitoring peptides modulate the activity of midbrain dopamine neurons, which supposedly carry the RPE signal (Palmiter, 2007).
Namely, orexin neurons, which project from the lateral hypothalamic area to several brain regions including the ventral tegmental area (VTA) (Sakurai et al., 1998), have been shown to have an excitatory effect on dopaminergic activity (Korotkova et al., 2003; Narita et al., 2006), as well as on feeding behavior (Rodgers et al., 2001). Orexin neurons are responsive to peripheral metabolic signals as well as to the animal's deprivation level (Burdakov et al., 2005), as they are innervated by orexigenic and anorexigenic neural populations in the arcuate nucleus where circulating peptides are sensed. Accordingly, orexin neurons are suggested to act as an interface between internal states and the reward learning circuit (Palmiter, 2007). In parallel with the orexinergic pathway, ghrelin, leptin and insulin receptors are also expressed on the VTA dopamine neurons, providing a further direct interface between the HR and RL systems. Consistently, whereas leptin and insulin inhibit dopamine activity and feeding behavior, ghrelin has an excitatory effect on them (see Palmiter, 2007 for a review).
The reinforcing value of food outcome (and thus the RPE signal) in our theory is not only modulated by the internal state, but also by the orosensory information that approximates the need-reduction effects. In this respect, endogenous opioids and μ-opioid receptors have long been implicated in the hedonic aspects of food, signaled by its orosensory properties. Systemic administration of opioid antagonists decreases subjective pleasantness rating and affective responses for palatable foods in humans (Yeomans & Wright, 1991) and rats (Doyle et al., 1993), respectively. Supposedly through modulating palatability, opioids also control food intake (Sanger & McCarthy, 1980) as well as instrumental food-seeking behavior (Cleary et al., 1996). For example, opioid antagonists decrease the breakpoint in progressive ratio schedules of reinforcement with food (Barbano et al., 2009), whereas opioid agonists produce the opposite effect (Solinas & Goldberg, 2005). This reflects the influence of orosensory information on the reinforcing effect of food. Consistent with our model, these influences have mainly been attributed to the effect of opiates on increasing extracellular dopamine levels in the nucleus accumbens (NAc) (Devine et al., 1993) through their action on μ-opioid receptors in the VTA and NAc (Noel & Wise, 1993; Zhang & Kelley, 1997).
Such orosensory-based approximation of nutritional content, as discussed before, could have been obtained through evolutionary processes (Breslin, 2013), as well as through prior learning (Beeler et al., 2012; Swithers et al., 2009, 2010). In the latter case, approximations based on orosensory or contextual cues can be updated so as to match the true nutritional value, resulting in a rational neural/behavioral response to food stimuli (de Araujo et al., 2008).
Irrational behavior: the case of overeating
Above, we developed a normative theory for reward-seeking behaviors that lead to homeostatic stability. However, animals do not always follow rational behavioral patterns, notably as exemplified in eating disorders, drug addiction, and many other psychiatric diseases. Here we discuss one prominent example of such irrational behavior within the context of our theory.
Binge eating is a disorder characterized by compulsive eating even when the person is not hungry. Among the many risk factors of developing binge eating, a prominent one is having easy access to hyperpalatable foods, commonly defined as those loaded with fat, sugar, or salt (Rolls, 2007). As an attempt to explain this risk factor, we discuss one of the points of vulnerability of our theory that can induce irrational choices and thus, pathological conditions.
Over-seeking of hyperpalatable foods is suggested to be caused by motivational systems escaping homeostatic constraints, supposedly as a result of the inability of internal satiety signals to block the opioid-based stimulation of DA neurons (Zhang & Kelley, 2000). Stimulation of μ-opioid receptors in the NAc, for example, is demonstrated to preferentially increase the intake of high-fat food (Glass et al., 1996; Zhang & Kelley, 2000), and hyperpalatable foods are shown to trigger potent release of DA into the NAc (Nestler, 2001). Moreover, stimulation of the brain reward circuitry (Will et al., 2006), as well as DA receptor agonists (Cornelius et al., 2010), are shown to induce hedonic overeating long after energy requirements are met, suggesting the hyperpalatability factor to be drive-independent.
Motivated by these neurobiological findings, one way to formulate the overriding of the homeostatic satiety signals by hyperpalatable foods is to assume that the drive-reduction reward for these outcomes is augmented by a drive-independent term, T (T > 0 for palatable foods, and T = 0 for ‘normal’ foods):
$r\left({H}_{t},{K}_{t}\right)=D\left({H}_{t}\right)-D\left({H}_{t}+{K}_{t}\right)+T$
In other words, even when the setpoint is reached and thus, the drive-reduction effect of food is zero or even negative, the term T overrides this signal and results in further motivation for eating (see ‘Materials and methods’ for alternative formulations of Equation 11). Simulating this hypothesis shows that when a deprived agent (initial internal state = −50) is given access to normal food, the internal state converges to the setpoint (Figure 10C). When hyperpalatable food with equal caloric content (K is the same for both types of food) is made available instead, the steady level of the internal state goes beyond the setpoint (Figure 10C). Moreover, the total consumption of food is higher in the latter case (Figure 10D), reflecting overeating. In fact, the inflated hedonic aspect of the hyperpalatable food causes it to be sought and consumed to a certain extent, even after metabolic demands are fulfilled. One might speculate that such persistent overshoot would result in excess energy storage, potentially leading to obesity.
Simulating the model in another condition where the agent has ‘concurrent’ access to both types of foods shows a significant preference for the hyperpalatable food over the normal food (Figure 10E), and the internal state again converges to a higher-than-setpoint level (Figure 10F). This is in agreement with the evidence showing that animals strongly prefer highly palatable to less palatable foods (McCrory et al., 2002). (See Figure 10—source data 1 for simulation details.)
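The overshoot produced by a drive-independent term can be illustrated with a deliberately simplified sketch. It is not the learning agent used for Figure 10: it assumes a myopic agent that keeps eating while the immediate reward of the next bite is positive, a setpoint of 0, illustrative drive exponents `n=4`, `m=3`, and hypothetical values `k=1.0` and `t=3.0`.

```python
def drive(h, setpoint=0.0, n=4, m=3):
    """Drive for one regulated variable; the exponents are illustrative assumptions."""
    return abs(setpoint - h) ** (n / m)


def reward(h, k, t=0.0):
    """Drive-reduction reward plus a drive-independent term t (t = 0 for normal food)."""
    return drive(h) - drive(h + k) + t


def eat_until_unrewarding(h0, k, t, max_steps=10_000):
    """Myopic agent: keep eating while the next bite has positive immediate reward."""
    h, eaten = h0, 0
    while eaten < max_steps and reward(h, k, t) > 0:
        h += k
        eaten += 1
    return h, eaten


normal_state, normal_eaten = eat_until_unrewarding(-50.0, k=1.0, t=0.0)
hyper_state, hyper_eaten = eat_until_unrewarding(-50.0, k=1.0, t=3.0)

print(normal_state, normal_eaten)
print(hyper_state, hyper_eaten)
```

Under these assumptions the normal-food agent stops at the setpoint, whereas the agent with `t > 0` stops only once the growing drive cost of one more bite exceeds `t`, so both its final state and its total intake are higher.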
Relationship to classical drive-reduction theory
Our model is inspired by the drive-reduction theory of motivation, initially proposed by Clark Hull (Hull, 1943), which became the dominant theory of motivation in psychology during the 1940s and 1950s. However, major criticisms have been leveled against this theory over the years (McFarland, 1969; Savage, 2000; Berridge, 2004; Speakman et al., 2011). Here we propose that our formal theory alleviates some of the major faults of classical drive-reduction theory. Firstly, classical drive-reduction theory does not explain anticipatory responding, in which animals paradoxically and voluntarily increase (rather than decrease) their drive deviation, even in the absence of any physiological deficit. As we demonstrated, such apparently maladaptive responses are optimal in terms of both reward-seeking and ensuring physiological stability, and are thus acquired by animals.
Secondly, drive-reduction theory could not explain how secondary reinforcers (e.g., money, or a light that predicts food) gain motivational value, since they do not reduce the drive per se. Because our framework integrates an RL module with the HR reward computation, the drive-reduction-induced reward of primary reinforcers can be readily transferred through the learning process to secondary reinforcers that predict them (i.e., Pavlovian conditioning) as well as to behavioral policies that lead to them (i.e., instrumental conditioning).
Finally, Hull's original theory is in contradiction with the fact that intravenous injection of food is not rewarding, despite its drive-reduction effect. As we showed, this could be due to the orosensory-based approximation mechanism required for computing the reward.
Despite its limitations (discussed later), we would suggest that our modern reformulation of the drive-reduction theory, subject to specific assumptions (i.e., orosensory approximation, connection to RL, drive form), can serve as a framework to understand the interaction between internal states and motivated behaviors.
Relationship to other theoretical models
Several previous RL-based models have also tried to incorporate the internal state into the computation of reward by proposing that reward increases as a linear function of deprivation level. That is, $r=w\overline{r}$, where $\overline{r}$ is a constant and $w$ is proportional to the deprivation level.
Interestingly, a linear approximation of our proposed drive-reduction reward is equivalent to assuming that the rewarding value of outcomes is equal to the multiplication of the deprivation level and the magnitude of the outcome. In fact, by rewriting Equation 2 for the continuous case we will have:
Using Taylor expansion, this reward can be approximated by:
Where ∇ is the gradient operator, and ${\nabla}^{2}$ is the Laplace operator. Thus, a linear approximation of our proposed drive-reduction reward is equivalent to assuming that the rewarding value of outcomes is linearly proportional to their need-reduction capacity (${K}_{t}$), as well as to a function (the gradient of drive) of the deprivation level. In this respect, our framework generalizes and provides a normative basis for multiplicative forms of deprivation-modulated reward (e.g., decision field theory (Busemeyer et al., 2002), intrinsically motivated RL theory (Singh et al., 2010), and MOTIVATOR theory (Dranias et al., 2008)), where reward increases as a linear function of deprivation level. Moreover, those previous models cannot account for the non-linearities arising from our model; that is, the inhibitory effect of irrelevant drives and risk aversion.
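As a sketch of the expansion referred to above, a first-order Taylor expansion of the drive around ${H}_{t}$ gives the linear form directly. The norm notation and the one-dimensional special case are added here for illustration and are assumptions of the sketch rather than the original typesetting.

```latex
\begin{align}
D(H_t + K_t) &= D(H_t) + K_t \cdot \nabla D(H_t) + O\!\left(\lVert K_t \rVert^{2}\right) \\
r(H_t, K_t) &= D(H_t) - D(H_t + K_t) \approx -K_t \cdot \nabla D(H_t)
\end{align}
```

For a single regulated variable below its setpoint, with $D(h)={({h}^{*}-h)}^{n/m}$ and $n/m>1$, this becomes $r\approx {K}_{t}\frac{n}{m}{({h}^{*}-{h}_{t})}^{n/m-1}$, which grows with both the outcome magnitude and the deprivation level.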
Whether the brain implements a nonlinear drive-reduction reward (as in Equation 2) or a linear approximation of it (as in Equation 13) can be examined experimentally. Assuming that an animal is in a slightly deprived state (Figure 11A), a linear model predicts that as the magnitude of the outcome increases, its rewarding value will increase linearly (Figure 11B). A nonlinear reward, however, predicts an inverted U-shaped economic utility function (Figure 11B). That is, the rewarding value of a large outcome can be negative, if it results in overshooting the setpoint.
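The two predictions can be compared numerically in a small sketch. It assumes one regulated variable with setpoint 0, illustrative exponents `n=4`, `m=3`, a hypothetical slightly deprived state `h = -2.0`, and a finite-difference gradient for the linear model; none of these values come from Figure 11.

```python
def drive(h, setpoint=0.0, n=4, m=3):
    """Drive for one regulated variable; the exponents are illustrative assumptions."""
    return abs(setpoint - h) ** (n / m)


def nonlinear_reward(h, k):
    """Drive-reduction reward: D(h) - D(h + k)."""
    return drive(h) - drive(h + k)


def linear_reward(h, k, eps=1e-6):
    """First-order approximation: -k times the gradient of the drive at h."""
    grad = (drive(h + eps) - drive(h - eps)) / (2 * eps)
    return -k * grad


h = -2.0                                   # hypothetical slightly deprived state
magnitudes = [0.5 * i for i in range(17)]  # outcome magnitudes 0.0, 0.5, ..., 8.0
nonlinear = [nonlinear_reward(h, k) for k in magnitudes]
linear = [linear_reward(h, k) for k in magnitudes]

# Magnitude with the highest nonlinear reward: the one that lands on the setpoint.
best_k = magnitudes[nonlinear.index(max(nonlinear))]
print(best_k, nonlinear[-1] < 0)
```

In this sketch the nonlinear reward peaks at the magnitude that exactly closes the deficit and turns negative for large outcomes, while the linear reward keeps increasing with magnitude.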
A more recent framework that also uses a multiplicative form of deprivation-modulated reward is the incentive salience theory (Berridge, 2012; Zhang et al., 2009). However, in contrast to the previous models and our framework, this model assumes that the rewarding value of outcomes and conditioned stimuli is learned as if the animal is in a reference internal state ($\psi =1$). Let us denote this reward by $r(s,\psi =1)$ for state s. At the time of encountering state s in the future, the animal uses a factor, ${\psi}_{t}$, related to its current internal state, to modulate its real-time motivation: $r\left(s,{\psi}_{t}\right)={\psi}_{t}\cdot r(s,\psi =1)$. In the case of conditioned tolerance to hypothermic agents, however, the heat-producing response is motivated at the time of cue presentation, when the hypothermic agent is not administered yet. At this time, the animal's internal state is not deviated and thus, the motivational element, ${\psi}_{t}$, in the incentive salience theory does not provoke the tolerance response. Therefore, in our reading and unlike our framework, the incentive salience theory cannot give a computational account of anticipatory responding.
Another approach to integrate responsiveness to both internal and external states appeals to approximate inference techniques from statistical physics. The free energy theory of the brain (Friston, 2010) proposes that organisms optimize their actions in order to minimize ‘surprise’. Surprise is an information-theoretic notion measuring how inconceivable it is to the organism to find itself in a certain state. Assume that evolutionary pressure has compelled a species to occupy a restricted set of internal states, and $p\left({H}_{t}\right)$ indicates the probability of occupying state ${H}_{t}$, after the evolution of admissible states has converged to an equilibrium density. Surprise is defined as the negative log-probability of ${H}_{t}$ occurring: $-\mathrm{ln}\text{\hspace{0.17em}}p\left({H}_{t}\right)$.
We propose that our notion of drive is equivalent to surprise as utilized in the free energy (Friston, 2010) and interoceptive inference (Seth, 2013) frameworks. In fact, we propose that an organism has an equilibrium density, $p(.)$, with the following functional form:
$p\left({H}_{t}\right)={e}^{-\sqrt[m]{{\sum}_{i=1}^{N}{\left|{h}_{i}^{*}-{h}_{i,t}\right|}^{n}}}$
In order to stay faithful to this probability density (and ensure the survival of genes by remaining within physiological bounds), the organism minimizes surprise, which is equal to $-\mathrm{ln}\text{\hspace{0.17em}}p\left({H}_{t}\right)=\sqrt[m]{{\sum}_{i=1}^{N}{\left|{h}_{i}^{*}-{h}_{i,t}\right|}^{n}}$. This specific form of surprise is equivalent to our definition of drive (Equation 1). The equivalency of reward maximization and physiological stability objectives in our model (Equation 5) shows that optimizing either homeostasis or the sum of discounted rewards corresponds to prescribing a principle of least action applied to the surprise function.
Although our homeostatic RL and the free-energy theory are similar in spirit, several major differences can be mentioned. Most importantly, the two frameworks should be understood at different levels of analysis (Marr, 1982): the free-energy theory is a computational framework, whereas our theory fits in the algorithmic/representational level. In the same line, the two theories use different mathematical tools as their optimization techniques. The free energy approach uses variational Bayes inference. Thus, rationality in that model is bounded by the simplifying assumptions for doing ‘approximate’ inference (namely, factorization of the variational distribution over some partition of the latent variables, Laplace approximation, etc.). Our approach, however, depends on tools from optimal control theory and thus, rationality is constrained by the capabilities and weaknesses of the variants of the RL algorithm being used (e.g., model-based vs. model-free RL). In this sense, while the notion of reward is redundant in the free energy formulation, and physiological stability is achieved through a gradient descent function, homeostasis in our model can only be achieved through computing reward. In fact, the associative learning component in our model critically depends on receiving the approximated reward from the upstream regulatory component. As a result, our model remains faithful to and exploits the well-developed conditioning literature in behavioral psychology, with its strengths and weaknesses.
A further approach toward adaptive homeostatic regulation is the predictive homeostasis (otherwise known as allostasis) model (Sterling, 2012), where the classical negative-feedback homeostatic model is coupled with an inference system capable of anticipating forthcoming demands. In this framework, anticipated demands increase current homeostatic deviation (by adjusting the setpoint level) and thus, prepare the organism to meet the predicted need. Again, the concept of reward is redundant in this model and motivated behaviors are directly controlled by homeostatic deviation, rather than by a priori computed and reinforced rewarding values.
As an alternative to the homeostatic regulation theories phrased around maintenance of setpoints, another theoretical approach toward modeling regulatory systems is the ‘settling point’ theory (Wirtshafter & Davis, 1977; Berridge, 2004; Müller et al., 2010; Speakman et al., 2011). According to this theory, by viewing organisms as dynamical systems, what looks like a homeostatic setpoint is just the stable state of the system caused by a balance of different opposing effectors on the internal variables. However, one should notice that mathematically, such dynamical systems can be reformulated as a homeostatically regulated system, by writing down a potential function (or an energy function) for the system. Such an energy function is equivalent to our drive function whose setpoint corresponds to the settling point of the dynamical system formulation. Thus, there is equivalence between the two methods, and the setpoint approach summarizes the outcome of the underlying dynamical system on the regulated variables. Note that nothing precludes our framework from treating the setpoint conceptually as maintained internally by an underlying system of effectors and regulators. However, the setpoint/drive-function formulation conveniently allows us to derive our normative theory.
Predictions
Here we list the testable predictions of our theory, some of which put our model to test against alternative proposals. Firstly, as mentioned before (Figure 9), our theory predicts that the oral vs. fistula proportion in the water self-administration task (McFarland, 1969) affects the speed of satiation: the higher the oral portion is, the faster the setpoint will be reached.
Secondly, as discussed before, our model predicts an inverted U-shaped utility function (Figure 11A,B). This is in contrast to the multiplicative formulations of deprivation-modulated reward.
Thirdly, our model predicts that if animals are offered two outcomes, where one outcome reduces the homeostatic deviation and the other increases it, the animal chooses to first take the deviation-reducing and then the deviation-increasing outcome (Figure 11C, green sequence), but not the other way around (Figure 11C, red sequence). This is due to the fact that future deviations (and rewards) are discounted. Thus, the animal tries to postpone further deviations and expedite drive-reducing outcomes.
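The ordering preference can be checked with a short sketch that compares the sum of discounted rewards for the two sequences. It assumes one regulated variable with setpoint 0, illustrative exponents `n=4`, `m=3`, a hypothetical starting state of `-5.0`, outcomes of `+2.0` and `-2.0`, and a discount factor of `0.9`; the names `green` and `red` simply mirror the sequence labels of Figure 11C.

```python
def drive(h, setpoint=0.0, n=4, m=3):
    """Drive for one regulated variable; the exponents are illustrative assumptions."""
    return abs(setpoint - h) ** (n / m)


def sum_discounted_rewards(h0, outcomes, gamma):
    """Sum over t of gamma^t * (D(H_t) - D(H_{t+1})), with H_{t+1} = H_t + K_t."""
    h, total = h0, 0.0
    for t, k in enumerate(outcomes):
        total += (gamma ** t) * (drive(h) - drive(h + k))
        h += k
    return total


h0 = -5.0             # hypothetical deprived starting state
green = [+2.0, -2.0]  # deviation-reducing outcome first, deviation-increasing second
red = [-2.0, +2.0]    # the reverse order

green_value = sum_discounted_rewards(h0, green, gamma=0.9)
red_value = sum_discounted_rewards(h0, red, gamma=0.9)
print(green_value > red_value)
```

Both sequences end in the same internal state, so without discounting (`gamma=1.0`) their values coincide; the preference for the green order in this sketch comes entirely from discounting.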
Fourthly, as explained earlier, we predict that animals are capable of learning not only Pavlovian, but also instrumental anticipatory responding. This is in contrast to the prediction of the predictive homeostasis theory (Woods & Ramsay, 2007; Sterling, 2012).
Finally, our theory predicts that upon reducing the magnitude of the outcome, a transitory burst of responding should be observed. We simulate both our model (Figure 12, left) and classical homeostatic regulation models (Figure 12, right) in an artificial environment where pressing a lever results in the agent receiving a big outcome (1 g) during the first hour, and a significantly smaller outcome (0.125 g) during the second hour of the experiment. According to the classical models, the corrective response (lever-press) is performed when the internal state drops below the setpoint. Thus, during the first hour, the agent responds with a stable rate (Figure 12E,F) in order to maintain the internal state above the setpoint (Figure 12D). Upon decreasing the dose, the agent waits until the internal state again drops below the setpoint. Thereafter, the agent presses the lever with a new rate, corresponding to the new dose. Therefore, according to this class of models, response rate switches from a stable low level to a stable high level, with no burst phase in between (Figure 12F).
According to our model, however, when the unit dose decreases from 1 g to 0.125 g, the agent requires at least some new experiences with the outcome in order to realize that this change has happened (i.e., in order to update the expected outcome associated with every action). Thus, right after the dose is decreased, the agent still expects to receive a big outcome upon pressing the lever. Therefore, as the objective is to minimize deviation from the setpoint (rather than staying above the setpoint), the agent waits for a period equal to the normal inter-infusion interval of the 1 g unit-dose. During this period, the internal state reaches the same lower bound as in previous trials (Figure 12A). Afterward, when the agent presses the lever for the first time, it receives an unexpectedly small outcome, which is not sufficient for reaching the setpoint. Thus, several further responses will be needed to reach the setpoint, resulting in a burst of responding after decreasing the unit dose (Figure 12B,C). After the setpoint is achieved, the agent presses the lever with a lower (than-burst) rate, in order to keep the internal state close to the setpoint. In sum, in contrast to the classical HR models, our theory predicts a temporary burst of self-administration after dose reduction (see Figure 12—source data 1 for simulation details).
Limitations and future directions
From an evolutionary perspective, physiological stability and thus survival may themselves be seen as means of guaranteeing reproduction. These intermediate objectives can even be violated in specific conditions and be replaced with parental sacrifice. Still, we believe that homeostatic maintenance can explain a significant proportion of motivated behaviors in animals. It is also noteworthy that our theory only applies to rewards that have a corresponding regulatory system. How to extend our theory to rewards without a corresponding homeostatic regulation system (e.g., social rewards, novelty-induced reward, etc.) remains a key challenge for the future.
To formulate our theory, we had to adopt several key constraints and assumptions. As future directions, one could relax several of them. For example, redesigning the model in a partially observable condition (as opposed to the fully-observable setup we used), where the internal state observation is susceptible to noise, could have important implications for understanding some psychiatric diseases and self-perception distortion disorders, such as anorexia nervosa. Also, relaxing the assumption that the setpoint is fixed and making it adaptive to the animal's experiences could explain tolerance (as elevated perception of the desired setpoint) and thus, drug addiction and obesity. Furthermore, relaxing the restrictive functional form of the drive function and introducing more general forms could explain behavioral patterns that our model does not yet account for, like asymmetric risk-aversion toward gains vs. losses (Kahneman & Tversky, 1979).
Conclusion
In a nutshell, our theory incorporates a formal physiological definition of primary rewards into a novel homeostatically regulated reinforcement learning theory, allowing us to prove that economically rational behaviors ensure physiological integrity. Being inspired by the classic drive-reduction theory of motivation, our mathematical treatment allows for quantitative results to be obtained, predictions that make the theory testable, and logical coherence. The theory, with its set of formal assumptions and proofs, does not purport to explain the full gamut of animal behavior, yet we believe it to be a credible step toward developing a coherent mathematical framework to understand behaviors that depend on motivations stemming from internal states and needs of the individual. Furthermore, this work puts forth a meta-hypothesis that a number of apparently irrational behaviors regain their rationality if the internal state of the individual is taken into account. Among others, the relationship between our learning-based theory and evolutionary processes that shape animal a priori preferences and influence behavioral patterns remains a key challenge.
Materials and methods
Rationality of the theory
Here we show analytically that maximizing rewards and minimizing deviations from the setpoint are equivalent objective functions.
Definition:
A ‘homeostatic trajectory’, denoted by $p=\{{K}_{0},{K}_{1},{K}_{2},\dots \}$, is an ordered sequence of transitions in the $v$-dimensional homeostatic space. Each ${K}_{i}$ is a $v$-dimensional vector, determining the length and direction of one transition. We also define $\mathcal{P}\left({H}_{0}\right)$ as the set of all trajectories that, if started from ${H}_{0}$, will end up at ${H}^{*}$.
Definition:
For each homeostatic trajectory p that starts from the initial motivational state ${H}_{0}$ and consists of $w$ elements, we define $SD{D}_{p}\left({H}_{0}\right)$ as the ‘sum of discounted drives’ through that trajectory:
Where $\gamma $ is the discount factor, and $D(.)$ is the drive function. Also, starting from ${H}_{0}$, the internal state evolves by ${H}_{t+1}={H}_{t}+{K}_{t}$.
Definition:
Similarly, for each homeostatic trajectory p that starts from the initial motivational state ${H}_{0}$ and consists of $w$ elements, we define $SD{R}_{p}\left({H}_{0}\right)$ as the ‘sum of discounted rewards’ through that trajectory:
Proposition:
For any initial state ${H}_{0}$, if $\gamma <1$, we will have:
Roughly, this means that a policy that minimizes deviation from the setpoint, also maximizes acquisition of reward, and vice versa.
Proof:
Assume that ${p}_{i}\in \mathcal{P}\left({H}_{0}\right)$ is a sample trajectory consisting of ${w}_{i}$ transitions. As a result of these transitions, the internal state will take a sequence like: $\{{H}_{i,0}={H}_{0},\hspace{0.17em}{H}_{i,1},\hspace{0.17em}{H}_{i,2},\dots ,{H}_{i,{w}_{i}}={H}^{*}\}$. Denoting $D\left({H}_{x}\right)$ by ${D}_{x}$ for the sake of simplicity in notation, the drive value will take the following sequence: $\{{D}_{i,0}={D}_{0},\hspace{0.17em}{D}_{i,1},\hspace{0.17em}{D}_{i,2},\dots ,{D}_{i,{w}_{i}}={D}^{*}=0\}$. We have:
We also have:
Since ${D}_{0}$ has a fixed value and $\gamma -1<0$, it can be concluded that if a certain trajectory from $\mathcal{P}\left({H}_{0}\right)$ maximizes $SDR\left({H}_{0}\right)$, it will also minimize $SDD\left({H}_{0}\right)$, and vice versa. Thus, the trajectories that satisfy these two objectives are identical.
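The display equations of this proof are not reproduced in the text above, so the following is a reconstruction sketch rather than the original typesetting. It assumes that the sum of discounted drives counts the drive of each state reached, $SD{D}_{p}\left({H}_{0}\right)={\sum}_{t=0}^{w-1}{\gamma}^{t}D\left({H}_{t+1}\right)$, and that the reward of each transition is the drive reduction ${r}_{t}=D\left({H}_{t}\right)-D\left({H}_{t+1}\right)$, as in Equation 2. Under those assumptions, the relation used in the last step follows by re-indexing the sums:

```latex
\begin{align}
SDR_p(H_0) &= \sum_{t=0}^{w-1}\gamma^{t}\,\bigl(D_t - D_{t+1}\bigr) \\
           &= D_0 + \sum_{t=1}^{w-1}\gamma^{t} D_t - \sum_{t=0}^{w-1}\gamma^{t} D_{t+1} \\
           &= D_0 + \gamma\sum_{t=0}^{w-2}\gamma^{t} D_{t+1} - SDD_p(H_0) \\
           &= D_0 + \gamma\bigl(SDD_p(H_0) - \gamma^{w-1} D_w\bigr) - SDD_p(H_0) \\
           &= D_0 + (\gamma - 1)\,SDD_p(H_0), \qquad \text{since } D_w = D^{*} = 0 .
\end{align}
```

Because ${D}_{0}$ is fixed by the starting state and $\gamma -1<0$, the sum of discounted rewards is a decreasing affine function of the sum of discounted drives over $\mathcal{P}\left({H}_{0}\right)$, so a trajectory that maximizes one minimizes the other.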
Hyperpalatability effect
For the especial case that m/n = 1, Equation 11 can be rewritten as follows:
This means that the effect of T is equivalent to having a simple HRL system (without term T) whose drive function is shifted such that the new setpoint is equal to ${H}^{*}+\frac{T}{2{K}_{t}}$, where ${H}^{*}$ is the setpoint of the original system. This predicts that the bigger the hyperpalatability factor T is, the higher the new steady state is; and the higher the real nutritional content ${K}_{t}$ of the food outcome is, the smaller the divergence of the new setpoint from the original setpoint is.
Equation 5 can also be rewritten as:
This can be interpreted as the effect of T being equivalent to a simple HRL system (without the term T) whose internal state ${H}_{t}$ is underestimated by $\frac{T}{2{K}_{t}}$ units. That is, hyperpalatability makes the subject behave as if it were hungrier than it really is.
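Both readings of the term T (a raised setpoint, or an underestimated internal state) can be compared numerically. The sketch below assumes a one-dimensional quadratic drive and a reward equal to the drive reduction plus an additive palatability term. The function names and the numbers are hypothetical.

```python
def drive(h, setpoint):
    """Assumed one-dimensional quadratic drive (illustrative choice)."""
    return (setpoint - h) ** 2

def reward(h, k, setpoint, palatability=0.0):
    """Drive reduction produced by outcome k, plus an additive palatability term."""
    return drive(h, setpoint) - drive(h + k, setpoint) + palatability

setpoint, h, k, big_t = 10.0, 6.0, 2.0, 4.0
offset = big_t / (2.0 * k)  # T / (2 K_t)

with_t = reward(h, k, setpoint, big_t)              # system with the term T
shifted_setpoint = reward(h, k, setpoint + offset)  # no T, setpoint raised
underestimated = reward(h - offset, k, setpoint)    # no T, state read too low

print(with_t, shifted_setpoint, underestimated)  # prints "16.0 16.0 16.0"
```

With these numbers all three rewards coincide, so under the assumed drive an observer could not distinguish a hyperpalatable outcome from a higher setpoint or a hungrier subject.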
Decision letter

Eve MarderReviewing Editor; Brandeis University, United States
eLife posts the editorial decision letter and author response on a selection of the published articles (subject to the approval of the authors). An edited version of the letter sent to the authors after peer review is shown, indicating the substantive concerns or comments; minor concerns are not usually shown. Reviewers have the opportunity to discuss the decision before the letter is sent (see review process). Similarly, the author response typically shows only responses to the major concerns raised by the reviewers.
[Editors’ note: this article was originally rejected after discussions between the reviewers, but the authors were invited to resubmit after an appeal against the decision.]
Thank you for choosing to send your work entitled “Collecting reward to defend homeostasis: A homeostatic reinforcement learning theory” for consideration at eLife. Your full submission has been evaluated by a Senior editor and 3 peer reviewers, and the decision was reached after discussions between the reviewers. We regret to inform you that your work will not be considered further for publication at this stage.
While the topic of your manuscript is potentially very interesting, the reviewers and BREs consulted and had a number of substantive issues that enter into this decision.
During our initial BRE discussion, one pointed out that humans (and many animals) indulge in many behaviors that are not in the animal's best interest, and violate the premises of physiological homeostasis. For example, obesity and drug-taking behavior, and many of our other activities are clearly not physiologically homeostatic. This appears to be an issue that calls into question some of the fundamental assumptions of this work? If not, how does this come into play?
eLife does not allow supplemental material, as is common at many journals. Instead, all material of substance should be incorporated into the main text, and less important material omitted. eLife has no specific length limitations, and this policy is to support papers that present an integrated story.
We are including the reviews in entirety below for your information. eLife welcomes theoretical work when it can add new insight into interesting biological problems, which is why we chose to review it. We only recommend revision when there is a straightforward path that we foresee could lead to a successful outcome, which is not obvious in the case of this manuscript. Consequently, we are returning it to you so that you can submit it elsewhere, either as is, or benefiting from this review. Because the reviewers find this work potentially interesting, if you feel you can successfully craft a new manuscript that addresses the issues raised in this review, we would be willing to consider it as an entirely new submission, which would be reviewed either by the same or different reviewers, and which would not be guaranteed to be successful.
Reviewer #1:
I enjoyed reading this interesting discussion of reinforcement learning in the setting of homeostasis. I thought your treatment was both formal and scholarly. It usefully highlights the fact that reinforcement learning or optimal control can be applied to homeostatic regulation. Having said this, as the author of the free energy principle, I find the notion that optimal control (e.g. dynamic programming or reinforcement learning) can be applied to physiological homeostasis a little self-evident. I think the deeper challenge is to provide a principled explanation for why optimal control emerges from, or is mandated by, homeostasis.
If I understand your idea correctly, you are saying that applying optimal control to the deviation of homeostatic variables from their set point accounts for some key empirical findings in behavioural psychology. I think this is a perfectly fine thing to say and a useful contribution. However, your focus on reinforcement learning is a bit colloquial. One could equally suggest that applying predictive coding to homeostatic deviations (prediction errors about their set point) is a plausible explanation for the empirical findings. This is because one can formulate optimal control as a Kalman filter (exactly for linear quadratic control) and predictive coding is a biologically plausible implementation of Bayesian filtering.
This is important because unlike reinforcement learning, predictive coding provides a process theory or mechanistic explanation for how the brain works. In other words, it is not just a normative model but makes mechanistic predictions. I stress predictive coding because there is a lot of work on homeostasis and interoceptive inference based upon predictive coding (within the larger setting of minimising variational free energy). For example, Anil Seth has several articles that you might want to refer to. I think you also need to discuss the intellectual background to your work in cybernetics (and more recently synergetics). A key example here would be the good regulator theorem stemming from the work of Ross Ashby on self organisation and his notion of a homeostat. A more recent (and possibly heuristic) formulation of these ideas can be found in the literature on allostasis. I notice that you refer to allostasis in the last sentence of your paper (and also cite Peter Sterling). However, you never define the distinction between allostasis and homeostasis and how this relates to the anticipatory aspects of your formulation (e.g., temporal discounting within control theory).
The other issue I think you need to discuss and qualify is the status of your normative model. Temporal discounting coefficients and other arbitrary parameters (for example the n and m in Equation 1) are characteristic of reinforcement learning models, which undermines their normative status. I am assuming here that normative means that one can describe a process in terms of optimising an objective function. However, adding ad hoc parameters to the function destroys any uniqueness or optimality properties of the normative explanation. An example of this is the temporal discounting factor that your treatment emphasises. In normative Bayesian accounts, this reflects the precision of random fluctuations in hierarchical or volatility models of contingencies. Crucially, for any given outcomes, there is a Bayes-optimal temporal discounting that renders the Bayesian description truly normative.
My main point here is that I think you need to discuss the broader church of theoretical approaches to homeostasis and self organisation and highlight why reinforcement learning might provide a useful focus. I would not be shy about emphasising its shortcomings and pointing to outstanding conceptual challenges. In one sense, making reinforcement learning accountable to homeostatic imperatives is one step in this direction, as illustrated by the importance of temporal discounting and converting a homeostasis into allostasis (and highlighting the fact that reinforcement learning is not context sensitive). As a minimum, I think you should discuss the good regulator theorem and early cybernetic formulations. I think you should also mention the notion of interoceptive inference or prediction as a relevant example of more generic Bayesian approaches. You may find useful references in the following:
Seth AK. Interoceptive inference, emotion, and the embodied self. Trends Cogn Sci. 2013 Nov;17(11):565–573.
http://en.wikipedia.org/wiki/The_free_energy_principle
http://en.wikipedia.org/wiki/Allostasis
http://en.wikipedia.org/wiki/Homeostat
1) You introduce reinforcement learning. It might be useful to comment upon its biological plausibility. Are there any detailed models of how neuronal circuits would implement reinforcement learning in the context of homoeostasis?
2) In Equation 1, you should highlight the fact that n and m are free parameters. It would be useful to say something like:
Note that m and n (both greater than 2) are free parameters that have an important nonlinear effect on the mapping between homeostatic deviations and their motivational consequences. Later, we will consider this mapping in terms of risk and classical utility theory (risk aversion).
3) I thought that your argument was confused (or written in a confusing way). You seem to imply that in the absence of discounting, the value of a policy depends only on the initial and final states regardless of its trajectory. This is not the case in standard applications of optimal control or reinforcement learning. The quantity that is optimised is the expected reward at each point (in free energy formulations this would be the path integral). This means that the path becomes important in determining the expected reward or value. Furthermore, there is no final state unless you are considering finite horizon problems. Perhaps you meant there is no temporal integration into the future at all?
Reviewer #2:
The authors describe a model in which reward is operationally defined and computed according to the degree to which an outcome reduces the distance between the animal's current state and an idealized homeostatic state (setpoint). The authors characterize this as a drive-reduction model and revisit several points of controversy from the 1940s/50s surrounding Hull's drive-reduction theory. The authors seem to suggest that (a) their model addresses all the criticism of drive-reduction and (b) that more contemporary views of motivation, which have largely rejected Hull's model, have been hasty and that we should reconsider the drive-reduction model: to 'rescue Hull', as the authors put it.
Kudos to the authors for tackling an interesting and really important problem: how is a stimulus evaluated to determine whether it is rewarding (or aversive) in the first place, an issue almost entirely ignored in contemporary reinforcement learning models. To address the question of how are stimuli evaluated and to place that into a homeostatic framework in which reward is contingent upon the state of the animal is an important endeavor and I was excited about the authors' model.
However, several issues severely temper my enthusiasm, as follows:
1) Framing the entire model, not just the Introduction and Discussion but the Results, around revisiting and apparently rekindling a half-century-old debate between drive and incentive theories of motivation seems unfortunate. The result is that the manuscript reads more like a review, mostly comprised of arguments that border on polemical. This leads to several problems:
(a) The Results section does not read like a Results section but like a Discussion section. For each of the subsections in the Results I would expect the question/problem to be defined clearly, then the details of how this was addressed in a simulation and then the results of the simulation (for example, one expects results to be substantively organized around/referencing figures). The 'results', as the manuscript stands, are essentially presented in brief in the figure legends. Less argument and 'defending' the model and more illustrating its function.
(b) Instead of framing this in a half century old debate, why not anchor it in contemporary scientific issues and questions? How are TD models being developed and applied, what are their shortcomings in particular applications, how does this model improve upon that state of the art?
(c) The manuscript ends up with a cavalier quality of 'our model has just solved all these problems' but the solution is superficial (see below).
(d) The manuscript is too ambitious, attempting to implement a simple computational model of drive-reduction and then, in effect, wanting to 'put to rest' what comprised decades of (much still unresolved) controversy. Again, the result reads more like a polemical review, both cavalier and superficial. A more focused, more empirical, less argumentative approach might be more successful.
2) Several arguments are not convincing. Example: extinction burst. The intuitive explanation for extinction bursts is that the animal expects a reward and when it does not arrive, continues to work/press because prior learning does not unravel (or update) that quickly, especially if reward was stochastic in the first place (and, well, the animal is still hungry). The suggestion that a mouse that does not get a reward after pressing the lever 10 times would, as a result, be in a physiological state so much further from the setpoint that this increases its responding seems unlikely. If the mouse presses the lever 10x in 2 minutes (slow for a well-trained mouse), then the mouse's hunger state would only be increased by 2 min of metabolism, but mouse behavior is not that regimented, nor does receiving a pellet significantly reduce pressing when not in extinction (would you expect to see increased responding if you put the mouse in 2 min later today than yesterday?).
3) Confusion on drive-reduction vs. incentive. In the anticipatory responding section, the authors argue that their model explains why an animal would respond in the absence of a deficit state: the animal has previously learned that a cue or action is rewarding, i.e., that it is associated with relieving a deficit state. Thus, the authors argue, that learning induces the animal to respond even when not in a deficit state. But this is, in fact, precisely what incentive salience is: that animals learn the value of stimuli and that this learned value induces responding in the absence of an actual deficit state. And, in fact, when animals respond for such learned value, they are not motivated by a current deficit state (as drive-reduction would have it) but by learned incentive, which is causing the behavior in the absence of such need. The authors have merely reinvented the incentive-salience wheel and don't appear to realize it.
4) Anatomical substrates. There is much data suggesting that metabolic state and nutritional information is signaled in the brain (how could it not be?) and, moreover, that this impacts the dopamine system. However, the authors make a jump here and assume, without evidence, that this is a homeostatic system. Because metabolic information is signaled does not necessitate homeostasis. Equally important, as the authors are discussing the effect of metabolic state on dopamine signaling, a distinction between tonic and phasic needs to be made. So for example, tonic dopamine goes up and remains up for a period of time during a meal, how does this fit in with what should be decreased signaling as the state of the system moves toward the setpoint (i.e., with each bite)? In extinction bursts, the authors seem to suggest that this sort of micro, moment to moment variation in state is signaled. Is there evidence to support that hunger will change the magnitude of the reward?
Requested brevity of review limits detailed critique, but each and every section essentially presents the same sorts of problems (do not make a compelling case, don't actually solve the problem). More generally, what have the authors really done? They have added into a TD model a computational mechanism for determining what is and is not rewarding. That, I think, could represent a substantial contribution if the authors could focus specifically on that and develop it in the context of contemporary models/problems. When they attempt to resolve one of the largest, most complex controversies in all of the history of psychology in one short manuscript, they fall into trouble. From the point of view of computational modeling, is this not simply a form of utility? How is it different from other models that have incorporated utility? This seems the more relevant context for developing and presenting the model. As for drive-reduction and rescuing Hull, surely Hull and his followers meant 'ideal_state MINUS current_state = motivation'? Do the authors really believe that if only Hull had had RL learning theory in which to embed his drive-reduction, the history of psychology would have been different? And now it can be corrected?
I think the effort and intent is to be commended: how reward is evaluated and how learning and motivation is state-dependent is a critical question not well developed in RL models. And starting with a simple drive-reduction model is a reasonable starting point, but I would suggest that rather than implementing a drive reduction model and saying 'voilà, a half century of controversy resolved', the authors might make a far better contribution by asking what problems are solved and what problems remain or are created; that is, to see this model as a starting point of something really important, not an end point.
Reviewer #3:
The authors implement, in a simple mathematical manner, an elaboration of Hull's “drive reduction theory” (1943), a theory that reinforces a behavior if it reduces any deviation of an organism from homeostasis. As far as I can tell, the specific elaboration is to incorporate more modern theories of reinforcement learning as the mechanism to reduce the “drive” and in so doing manage to alleviate a couple of criticisms of Hull's work (e.g., how secondary reinforcers such as money can operate) as well as account for a number of behavioral phenomena. The impressive aspect of this paper is that it provides a unifying theory of motivation that explains several behavioral phenomena; some of them the authors claim have no other explanation to date.
Unfortunately I find the paper rather abstract, with the concrete details that are presented being a “convenient fiction” (such as “distance of the internal state from the set point” in a fictional “homeostatic space”) that do not directly correspond to any biological property. I focus on one particular result by way of example:
The authors claim that “temporal discounting” (the reduced value of a behavior as its benefit is less immediate) is explained within their theory (for the first time) as “in order to maintain internal stability, it is necessary to discount future rewards.” However, the derivation underlying this argument seems to rely on the assumption that all paths in homeostatic space are possible – e.g., one could raise one's body temperature to 45C then return it back to 37C, so in order to avoid such a path the positive value of returning from 45C to 37C should be discounted compared to the preceding negative value of rising from 37C to 45C. In my mind such an argument seems bizarre, since once reaching 45C the organism can never return to its “sweet spot”; mathematically the “homeostasis space” defined by the authors is not really “curlfree” or without bifurcations, because routes matter, homeostatic fixed points change etc. So the solution seems to explain a problem that only arises in the proposed formalism. Alternative factors – such as the environment being less and less predictable as one moves to the future and that uncertainties in the consequences of behaviors increase as one projects into the future – seem a much more plausible explanation of temporal discounting.
In general, I question whether the theory is falsifiable as I do not see specific testable predictions and the quantitative results presented appear to be produced purely by fine tuning of unconstrained parameters. Also, the authors make claims such as “the major result of our theory, which is that the rationality of behavioral patterns is geared toward maintaining physiological stability”, but that was Hull's theory and is hardly novel or a surprising idea. Since the authors add a mathematical framework to prior theories and their combination, the paper would greatly benefit (and in my view could be acceptable) if it could make some concrete quantitative predictions of the results of behavioral experiments that could be tested, or at a minimum, when numbers in their model are fit to one data set, they show they can reproduce other data without extra fitting.
[Editors' note: further revisions were requested prior to acceptance, as described below.]
Thank you for sending your work entitled “Collecting reward to defend homeostasis: A homeostatic reinforcement learning theory” for consideration at eLife. Your article has been favorably evaluated by Eve Marder (Senior editor) and 3 reviewers.
As you will see below, all of the reviewers find this version vastly improved over your previous submission, and all of the reviewers are quite positive about this work at this point. Nonetheless, each of them has some specific suggestions for editorial revision, mostly to do with emphasis and presentation. Because these are well-articulated in the actual reviews, I am taking the unusual tactic of enclosing these reviews in their entirety, as they were meant constructively. I hope that you will find them helpful in making a considerably improved piece of work more transparent and accessible.
Reviewer #1:
The authors have embarked on the valuable task of producing a computational framework that combines theories of reinforcement learning with those of homeostasis and drive reduction. This is a worthwhile goal and the authors have several examples of behaviors that arise within their framework as well as predictions. I do think the manuscript reads a bit as though some of the ideas of combining reinforcement learning and homeostasis are novel to the authors, whereas in reality their contribution is to add a mathematical/computational framework which allows for quantitative predictions to be made and suggests what could/should be observed in any neural mechanism.
While overall the writing is very clear, I think the manuscript would be served by the authors being more careful to tone down statements that suggest the idea of combining homeostasis and reinforcement learning is their own. After all, everyone knows that when one is out on a cold winter's day a hot drink is rewarding, whereas in the middle of a hot summer's day a cold drink has greater rewarding value. The authors deserve credit for developing a mathematical scheme (the first I think?) where such results fall out, and I think they now have enough quantitative results and predictions that make the scheme testable.
In a similar vein, some statements to motivate the work are exaggerated, for example in the first line of Discussion the authors state:
“Theories of conditioning are founded on the argument that animals seek reward, while reward is defined as what animals seek.”
I think that while these definitions can be found, to state simply “reward is defined” without adding “by some” or “can be defined” or “has been defined by some” is too bold and general. One can find plenty of definitions of reward, in which “primary reward” is “that which aids survival” or “helps propagate the species” or simply in general English, reward is something that is good for you!
In a couple of places (including the Abstract) the authors state that they:
“prove analytically that reward-seeking and physiological stability are two sides of the same coin” and “Our theory mathematically proves that seeking rewards is equivalent to the fundamental objective of physiological stability”, whereas in fact, through their definition of drive (“we define the ‘drive’ as the distance of the internal state from the setpoint”), the authors assume this to be the case and develop a mathematical theory where this result is true. One must be careful in mathematical proofs as to what are the premises. Since the rewards associated with sexual desire are outside the model (as the authors comment) it is clear that it is only within their theory that the mathematical “proof” holds.
Reviewer #2:
The authors have dramatically transformed this paper from the last submission. I like it a lot and think it has some important things to say. It feels much more anchored in the modern RL literature and the discussion of Hull is much more nuanced and realistic. However, there are still some comments:
1) The authors make a comment early on that equates reward/reinforcer/utility. Given the obvious sophistication of the authors, this is unfortunate. In particular, to make clear the relationship between prior treatments of utility and the authors’ proposal would be helpful. Notably, the authors do describe other approaches to this, but even a sentence or two early on that clarifies, rather than lumps together, the difference between reinforcer and utility would help, specifically because the authors are essentially arguing that homeostatic utility determines reinforcement properties.
2) The authors make a comment about 'erroneous estimation of error' and later in the manuscript talk at length about, essentially, taste serving as cues. Three lines of investigation that the authors might find useful in this discussion: (1) Beeler et al Eur J Neuroscience 2012 'taste uncoupled from nutrition fails to sustain the rewarding properties of . . . ' (2) the work of Swithers with artificial sweeteners:
Swithers, S.E. & Davidson, T.L. (2008) A role for sweet taste: calorie predictive relations in energy regulation by rats. Behav. Neurosci., 122, 161–173.
Swithers, S.E., Baker, C.R. & Davidson, T.L. (2009) General and persistent effects of high-intensity sweeteners on body weight gain and caloric compensation in rats. Behav. Neurosci., 123, 772–780.
Swithers, S.E., Martin, A.A. & Davidson, T.L. (2010) High-intensity sweeteners and energy balance. Physiol. Behav., 100, 55–62.
Finally, the authors cite one paper by de Araujo, but he has significantly developed the notion that the DA cells specifically serve as a metabolic sensor.
Other than that, I think there are many things that one could nitpick about, especially with regards to the endless details and nuances of the model (e.g., I am not sure the authors have fully addressed the question the other reviewer had regarding the 'shortest distance between two points' idea). However, I think the paper is interesting, brings up some very good points, is well done and, as the authors point out, targets the mutual weakness of HR and RL models and brings them together nicely.
Reviewer #3:
This is an improved version of a previous submission. I see merit in the ideas behind this work. However, I think the authors still could communicate their thoughts in a more structured way, and have made some suggestions below.
This is a much improved version of a previous submission to eLife. It basically connects homeostatic imperatives with classical (utilitarian and control theoretic) formulations of adaptive behaviour. There is a central technical result that links homeostasis to discounted future reward, which the authors exploit to explain a number of phenomena in the reinforcement learning literature. The authors have contextualised their contribution in relation to other (theoretical) frameworks. There are some outstanding issues with the way that the authors structure their paper.
Major points:
1) Scientifically, I think you need to highlight and unpack the major result in the appendix. At an appropriate point in the main text, I would include a paragraph of the following sort:
“In summary, we have established a formal link between the homeostatic imperatives to keep physiological states near some set point and the maximisation of temporally discounted reward (or minimisation of some loss function). This is an important and nontrivial result. The appendix provides a formal proof; however, the underlying idea is fairly simple. Imagine you had to plan a hill walk, during which you wanted to maximise the height (altitude or reward) averaged over the path you take. If someone dropped you at the bottom of the hill, the optimum path would be to ascend the hill and spend as long as possible at the top before returning to your pick up point. Notice that this entails ascending the hill (reward function) before descending. Implicit in this strategy is a maximisation of temporally discounted reward. In other words, going up the hill first and then coming down is better than going down and then coming back up. It is this fundamental (variational) phenomenon that connects homeostasis with classical temporal discounting.
Furthermore, as indicated above, if the homeostatic cost (negative reward) is cast as a log probability then it can be treated as (free) energy. Crucially, the time average or path integral of energy is called action. This means that both the homeostasis and temporally discounted reward are ways of prescribing a principle of least action. From this perspective, one can regard the adaptive behaviours that we are trying to link as necessary and emergent properties of all dynamical systems that comply with (Hamilton's) principle of least action. We will return to this perspective in the Discussion.”
2) The second major point is about the format of your paper. It is still unclear where the reader can find the details of your simulations. I also note that you have included supplementary figures. Can I suggest that you remove all supplementary material and place it in the main text (or discard it and refer to it as results not shown). I think you should prepare the reader for the slightly unusual scientific presentation with a paragraph at the beginning of the paper along the following lines:
“We will develop our theoretical results by appealing to simulations. These simulations are described in figures (and accompanying tables) and are called upon when necessary. All the simulations in this paper followed the same procedure: first we define a model that captures the problem of interest in terms of a Markov decision process. The ensuing behaviour is then optimised using classical reinforcement learning procedures (Q-learning) to define a value function. Actions are then selected using a softmax function of the value of allowable actions or choices. For each simulation we present the graphical model or Markov decision process in the figures, along with the ensuing behaviour. Each figure is accompanied by a table specifying the parameters of the Markovian process, the Q-learning and softmax functions used to simulate behaviour.”
Note that I am suggesting, for every simulation you present, a figure and table. Whenever you refer to results that are not presented in this format I would say so explicitly so the reader does not have to wonder whether they have missed something.
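The procedure described in this comment (a Markov decision process, Q-learning to obtain a value function, and softmax action selection) can be illustrated with a minimal sketch. It is not the authors' simulation code: the three-state chain, the function names (softmax, q_learning, chain_step) and all parameter values (alpha, gamma, beta, episode counts) are assumptions chosen only for illustration.

```python
import math
import random

def softmax(values, beta=1.0):
    """Softmax choice probabilities over action values (beta = inverse temperature)."""
    m = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(beta * (v - m)) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def q_learning(n_states, n_actions, step, episodes=500, horizon=20,
               alpha=0.1, gamma=0.9, beta=2.0, seed=0):
    """Tabular Q-learning with softmax action selection.

    `step(s, a)` is a hypothetical environment function returning
    (next_state, reward); it stands in for the MDP of a given figure.
    """
    rng = random.Random(seed)
    q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(episodes):
        s = 0  # every episode starts in state 0
        for _ in range(horizon):
            probs = softmax(q[s], beta)
            a = rng.choices(range(n_actions), weights=probs)[0]
            s_next, r = step(s, a)
            # Temporal-difference error for the Q-learning update
            td_error = r + gamma * max(q[s_next]) - q[s][a]
            q[s][a] += alpha * td_error
            s = s_next
    return q

def chain_step(s, a):
    """Toy MDP (assumed): action 1 moves right along states 0 -> 1 -> 2,
    action 0 stays put; entering state 2 pays reward 1, everything else 0."""
    if a == 1 and s < 2:
        s_next = s + 1
        return s_next, (1.0 if s_next == 2 else 0.0)
    return s, 0.0

q = q_learning(n_states=3, n_actions=2, step=chain_step)
```

In this toy chain the learned values favour moving toward the rewarded state; the table of parameters the reviewer asks for would correspond to the keyword arguments of q_learning.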
3) You might want to refer to the notion of beliefs or probability distributions over excursions. In other words, the risk-sensitive behaviour can also be interpreted in terms of the probability of extreme events that render the beliefs sub-Gaussian; assuming the homeostatic deviation is interpreted as a log probability.
https://doi.org/10.7554/eLife.04811.031
Author response
Animals indulge in many behaviors that violate the premises of physiological homeostasis, like obesity and drug-taking behavior. This appears to be an issue that calls into question some of the fundamental assumptions of this work.
We thank the BREs for prompting us to address this issue. In fact, we had already performed simulations showing how irrational behaviors might arise within our theory, but did not include them in the previous version of the manuscript because we felt that a full-blown treatment of irrational behaviors was beyond the scope of this paper and would merit a further publication. To address the BREs’ concern, in the present manuscript we added a subsection titled “Irrational behavior: the case of overeating” to illustrate (with simulations) one of the points of vulnerability of our theory that can induce irrational choices. Moreover, in the subsection “Limitations and future works” we discuss how to approach other pathologies, including drug addiction and anorexia, as results of other mechanisms of our framework going awry. Also, as communicated to the editor previously, modeling drug addiction within our “Homeostatic Reinforcement Learning” framework has been the topic of another entire line of research in our group, and we are preparing a further publication on that.
Reviewer 1:
The relation of the theory to the free-energy framework, as well as to allostasis and the good regulator theorem, is to be explained. Also, the advantage of using optimal control as the optimization technique is to be discussed.
We agree with the reviewer that our theory has significant connections with the free-energy framework. We added a subsection titled “Previous theoretical models” in order to discuss all these and other issues in detail.
Limitations of the theory to be discussed.
We added a subsection titled “Limitations and future works”, and detailed several limitations of the model, as well as constraining assumptions that could be eventually relaxed.
Reviewer 2:
Framing the entire model around rekindling Hull’s theory is unfortunate. Instead of framing this in a half-century-old debate, why not anchor it in contemporary scientific issues and questions? The results section does not read like a results section but like a discussion section.
We thank the reviewer for this suggestion. While we stand behind our theory’s explanatory power, we do agree that the present manuscript is only a first step in addressing the modern literature on the links between motivation and the internal physiological states. Following the reviewer’s advice, we restructured the manuscript to make it less polemic and more result-oriented, and tried to clearly delimit its scope. Rather than framing the whole manuscript around Hull’s theory, we now only discuss that theory and its relevance to our model in a subsection of the Discussion titled “Relationship to classical drive-reduction theory”. Also, in that section, we withdrew the claim of “rescuing Hull”, and instead discuss the differences between our formal elaboration and the original theory (namely, orosensory-based approximation of drive-reduction, and integration with an RL module). Also, for the sake of caution, we only claim that our model addresses “a number of significant criticisms”, rather than “all” criticisms, against Hull’s theory.
The Results section does not read like a Results section but like a Discussion section. For each of the subsections in the Results I would expect the question/problem to be defined clearly, then the details of how this was addressed in a simulation and then the results of the simulation (for example, one expects results to be substantively organized around/referencing figures). The 'results', as the manuscript stands, are essentially presented in brief in the figure legends. Less argument and 'defending' the model and more illustrating its function.
In response to this concern, we transferred all the other discussion-like issues to the Discussion section (i.e., Neural substrate, predictions, previous models) and greatly expanded the Results section. In the rest of the manuscript, we only focus on the behavioral/neurobiological pattern being addressed, and the mechanisms by which our model explains them. We tried to explain the replicated experiments, our simulation results, and the mechanisms of the model in much more detail and with more clarity. Furthermore, we included the full proof of the rationality and the normativity of temporal discounting in the Methods section. In fact, we considered including the full proof in the main body of the paper but felt that the paper might become too cumbersome. This is a point we are ready to discuss; if eLife allows a “mathbox” to be embedded within the paper, that could be a good solution.
The manuscript ends up with a cavalier quality of 'our model has just solved all these problems' but the solution is superficial. The manuscript is too ambitious, attempting to implement a simple computational model of drive-reduction and then, in effect, wanting to 'put to rest' what comprised decades of (much still unresolved) controversy.
Again, the result reads more like a polemical review, both cavalier and superficial. A more focused, more empirical, less argumentative approach might be more successful.
In general, following the points raised by the reviewer, we rewrote the paper in a more empirical and less argumentative way. We further pointed out in the manuscript which issues are resolved by our model and what its limitations are (new sections added). Furthermore, we tried to clarify in the manuscript that our modeling framework is not only a simple model of drive-reduction, but gives a normative computational theory for the interplay between the former and reinforcement learning theories of motivation. Indeed, the present manuscript is only a starting point for a further development of the theory, and we tried to point this out in the manuscript. We sincerely hope that the cavalier quality of our paper has been rectified.
The explanation for extinction bursts is not convincing.
We see the point raised by the reviewer and understand that our suggestion is way too speculative. We decided to remove the “Extinction burst” section from the manuscript and reconsider this issue in the future.
With respect to anticipatory responding, the proposed model is equivalent to the “incentive salience” theory.
We thank the reviewer for pointing out this apparent lack of clarity in our paper. In the new manuscript, we discuss the differences between the two models in detail, in the subsection “Relationship to other theoretical models”.
We would also like to briefly address the specific comment of the reviewer. In our theory the value of a response depends on the internal state at the time of learning and is built from a reward definition that is based on the ability of the response to produce a drive-reducing outcome. We give a precise mathematical formulation of how this should be done (normative reinforcement learning framework). And indeed the response after learning is driven by the value as opposed to by the direct drive reduction. Previously it was controversially argued that “value” in the RL algorithms is equivalent to motivational “incentive salience” (1). However, as best we could understand, the recent computational model of incentive salience separates value learning from influences of the internal state. The value is learned as in the standard RL algorithms (with respect to a reference state and based on externally defined rewards). The internal state at the time of the response then modifies the learned value. As we now argue in the manuscript, such a formulation differs from our framework and is unable to account for anticipatory responses.
Just because metabolic information is signaled into the hypothalamus does not necessitate that it is a homeostatic system.
We thank the reviewer for pointing out this issue. Although discussing the neural evidence for the hypothalamus being a homeostatic system is a topic of merit, we feel that in this manuscript it is better to limit the discussion to the neural evidence that is relevant to the novel contributions of our model (i.e. the “integration” of homeostatic and learning systems). We felt that a full discussion of the substrates of the two individual systems is beyond the scope of our manuscript. However, in the new manuscript, we cited further recent review articles that point to the role of the hypothalamus in homeostatic regulation.
Furthermore, in the new manuscript, we have explained that from a mathematical point of view, any regulatory system can be formulated either as a dynamical system (interaction of many effectors) or as a homeostatic regulation system. We explain that these formulations can be readily transformed into one another; particularly, the stable equilibrium (settling point) of the dynamical system is equivalent to the setpoint of the homeostatic formulation. Thus, we have tried to make it clear that the setpoint vs. settling-point formulation is only a matter of the point of view.
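As a rough illustration of this equivalence, consider a single regulated variable h. The sketch below is an assumed toy example, not a model from the manuscript: a "settling-point" system with a constant gain and a loss proportional to h is integrated alongside a "setpoint" controller whose setpoint equals gain/loss. The function names and parameter values are hypothetical.

```python
def simulate_settling(h0, gain, loss, dt=0.01, steps=2000):
    """Euler integration of dh/dt = gain - loss * h (two opposing effectors)."""
    h = h0
    for _ in range(steps):
        h += dt * (gain - loss * h)
    return h

def simulate_setpoint(h0, setpoint, k, dt=0.01, steps=2000):
    """Euler integration of dh/dt = -k * (h - setpoint) (explicit error correction)."""
    h = h0
    for _ in range(steps):
        h += dt * (-k * (h - setpoint))
    return h

# Assumed values: gain / loss = 37, so the settling point coincides with a setpoint of 37.
h_settling = simulate_settling(h0=30.0, gain=37.0, loss=1.0)
h_setpoint = simulate_setpoint(h0=30.0, setpoint=37.0, k=1.0)
```

Because gain - loss * h can be rewritten as -loss * (h - gain / loss), the two simulations follow the same trajectory and end at the same value; only the description differs.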
Last but not least, one example where we could have discussed the evidence supporting the homeostatic role of the hypothalamus is in thermoregulation (see the text below). However, we did not see how to include it in the manuscript, without straying too far from issues relevant to the novel contributions of the paper.
A particularly prominent example of the role of the hypothalamus in homeostatic regulation has come over the years in the human and animal thermoregulation literature. Interestingly, the concept of an internally regulated set point appears prominently in that body of literature. The classical review by Benzinger (2) establishes experimental evidence for a thermal set point as a physiological property and points out the role of the “preoptic-supraoptic region of the hypothalamus” in central regulation of the body temperature. The suggestion that hypothalamic circuits play a role in maintenance of thermal homeostasis by translating sensed temperature into neural activity was formalized by Hammel (3) in a model proposing hypothalamic circuitry where integration of thermosensitive and thermoinsensitive neuronal activities could lead to dynamic encoding of the thermal setpoint. Populations of heat-sensitive neurons have been identified in the hypothalamus (4, 5): they increase firing rate with increasing body temperature. The heat-sensitive (HS) and heat-insensitive (HI) neurons synaptically innervate two sets of effector neurons. Heat-loss effectors are excited by the heat-sensitive cells and inhibited by the heat-insensitive cells in a manner that balances these inputs at 37 °C body temperature. Heat-production effector neurons are in turn inhibited by the HS neurons and excited by the HI neurons. Over the years, heat-loss/production effector neurons have been electrophysiologically identified (6–8) and anatomically mapped. The regulatory loop is closed by the thermosensory afferents from the periphery to the HS (but not the HI) neurons (e.g. see (9)). Manipulations of the preoptic area (POA) induce body temperature changes (e.g. see (10)) and the effector neurons have been implicated in control of organismal thermoregulatory responses (e.g. as reviewed in Morrison and Nakamura (11) and Boulant (12)) including shivering (13).
Furthermore, there is experimental evidence that the regulated temperature set point is influenced (or dynamically set) by signals that are not in themselves directly related to temperature (e.g. hormonal levels, inputs from joint mechanoreceptors) and varies from individual to individual (see (12) for review).
A distinction between tonic and phasic dopamine activity needs to be made.
To make a distinction between tonic and phasic dopamine, we clearly stated in the new manuscript that our model, as in the classical RL model, only addresses the burst (i.e. phasic) activity pattern of dopamine neurons. How changes in the tonic DA levels might be incorporated into our theory is an active topic of current research in our group.
Reviewer 3:
The theory is based on abstracted concepts like homeostatic set point and distances in the homeostatic space that do not directly correspond to any biological properties.
Indeed, our mathematical theory, like any mathematical theory, requires several constructs to be defined. Above, in the response to reviewer 2, we discussed how the concepts used in our framework relate to ideas of dynamically maintained internal equilibrium and potentials of dynamical systems. The functional equivalence of the two approaches establishes a correspondence between their neurobiological implementations. Thus, we respectfully differ with the opinion of the reviewer: we believe that the concepts we use do have connections to biological properties. For example, the homeostatic space is simply a coordinate system where the various physiologically regulated quantities are represented: temperature, glucose levels, etc. Also, the setpoint is just equivalent to the stable equilibrium of the underlying dynamical system.
Let us take the example of temperature. Without going into details, as discussed in the text in response to the second reviewer, there are multiple classes of temperature receptors peripherally, and temperature-sensitive neurons in the hypothalamus. There have been data-driven suggestions in the literature that the activity of such neurons, informed by peripheral afferents, together with temperature-insensitive neurons in the hypothalamus, encodes the thermal setpoint (approx. 37 degrees) (3–5). There is further evidence that inputs from such neurons create cold-producing and heat-producing effector neurons (6–8). Modern work on human thermoregulation experimentally suggests the existence of a temperature space and an “energy functional”, or in our terms, a drive function (14).
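To make the constructs in this reply concrete, the following minimal sketch treats the homeostatic space as a list of regulated variables, the drive as a distance from the setpoint, and the reward of an outcome as the reduction in drive it produces. The exponents n and m, the two chosen variables and all numbers are assumptions for illustration only, not values taken from the manuscript.

```python
def drive(state, setpoint, n=2.0, m=2.0):
    """Distance-like drive: (sum_i |h*_i - h_i| ** n) ** (1 / m)."""
    total = sum(abs(h_star - h) ** n for h, h_star in zip(state, setpoint))
    return total ** (1.0 / m)

def reward(state, outcome, setpoint):
    """Reward of an outcome = drive before it minus drive after it."""
    next_state = [h + k for h, k in zip(state, outcome)]
    return drive(state, setpoint) - drive(next_state, setpoint)

# Hypothetical two-dimensional homeostatic space: (body temperature, glucose level).
setpoint = [37.0, 5.0]
cold_state = [35.0, 5.0]
hot_state = [39.0, 5.0]
warm_drink = [1.0, 0.0]  # assumed effect: raises temperature by one degree

r_cold = reward(cold_state, warm_drink, setpoint)  # drive falls from 2 to 1
r_hot = reward(hot_state, warm_drink, setpoint)    # drive rises from 2 to 3
```

With these assumed numbers the same outcome is rewarding in the cold state and punishing in the hot state, which is the state dependence of reward that the theory formalizes.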
The model assumes that organisms experience all paths in the homeostatic space, and only then can choose the shortest path. However, once the animal reaches an extreme homeostatic deviation, it can never return (due to death).
Indeed we thank the reviewer for pointing out that we needed to clarify this point. We added a new subsection titled “Stepping back from the brink”, and addressed this issue in detail. In fact our model predicts that animals should learn to act preventively to avoid states with drastic deviations (even without experiencing them directly).
The authors claim to provide a normative explanation for temporal discounting for the first time. However, alternative factors such as the environment being less and less predictable as one moves to the future seem to be a plausible explanation of temporal discounting.
Indeed we agree with the reviewer that temporal discounting makes intuitive sense for a number of reasons, including uncertainty of outcomes in the future, changing environments, etc. However, the point we attempted to make was more formal and mathematical: if we were not to include discounting, behavioral policies that maximized rewards would not necessarily minimize the total deviation from homeostasis and hence could endanger the animal. Hence, without discounting, reward maximization is not equivalent to homeostatic defense. Temporal discounting ensures that this does not happen and ensures the rationality of defending homeostasis. In view of the reviewer’s comments, we realized that our claim was overreaching and withdrew the claim that our normative explanation for temporal discounting is the only possible explanation. We have not, however, been able to find any alternative formal mathematical explanation.
Quoting from the review: I question whether the theory is falsifiable, as I do not see specific testable predictions.
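The formal point can be illustrated numerically. Assuming, as in the theory, that the reward at each step is the reduction in drive, the undiscounted sum of rewards along any path telescopes to the initial drive minus the final drive, so it cannot distinguish a direct path from one that first moves further from the setpoint. The two drive sequences and the discount factor below are hypothetical values chosen for illustration.

```python
def discounted_return(drives, gamma):
    """Sum over t of gamma**t * (D_t - D_{t+1}) along a sequence of drive values."""
    return sum((gamma ** t) * (drives[t] - drives[t + 1])
               for t in range(len(drives) - 1))

# Two hypothetical paths with the same start (drive 4) and the same end (drive 0).
direct = [4.0, 2.0, 0.0, 0.0, 0.0]   # heads straight to the setpoint
detour = [4.0, 6.0, 4.0, 2.0, 0.0]   # deviates further before returning

undiscounted_direct = discounted_return(direct, gamma=1.0)
undiscounted_detour = discounted_return(detour, gamma=1.0)
discounted_direct = discounted_return(direct, gamma=0.9)
discounted_detour = discounted_return(detour, gamma=0.9)
```

Without discounting (gamma = 1) both paths earn the same total reward; with gamma = 0.9 the direct path earns more, so only the discounted objective penalizes the excursion away from the setpoint.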
We thank the reviewer for pointing us toward this point. We added a subsection titled “predictions”, and listed five testable predictions of the model.
Throughout the results I would have preferred either more incorporation of the equations or stronger references to the methods.
We tried to incorporate more formal details in the text, particularly in the development of the theory. At the same time, we felt that including the full mathematical proofs in the main text of the paper would make it too cumbersome. These are now in the Materials and Methods section.
When free parameters in the model are fit to one data set, they should show they can reproduce other data without extra fitting.
We thank the reviewer for this comment. It should be mentioned that the different experimental data we have replicated in the paper come from different species (rat in the anticipatory responding task, and pigeon in the oral/fistula water-seeking task). Thus, it is not surprising that the free parameters have different values for different experimental data sets. For every individual dataset, however, the values of the free parameters are chosen to replicate the first part of the data, and then the same values have successfully predicted the second part. That is, for the case of anticipatory responding simulations, the free parameters are derived according to the training days (the first 8 days of the experiment), and then are used for predicting the extinction days, as well as the reacquisition day. Similarly, for the case of the oral/fistula water-seeking experiment, the free parameters are chosen to best explain the reinforcement experiment (Figure 7), and are then used for predicting the satiation experiment (Figure 8).
It is also noteworthy that although free parameters are different across different experiments (different species), the essential patterns of simulation results hold for a wide range of free parameters, and the specific values used in every experiment are only to replicate that specific data.
References:
1) McClure SM, Daw ND, Montague PR (2003) A computational substrate for incentive salience. Trends in Neurosciences 26:423–428.
2) Benzinger TH (1961) The diminution of thermoregulatory sweating during cold-reception at the skin. Proceedings of the National Academy of Sciences of the United States of America 47:1683–8.
3) Hammel H (1965) in Physiological Controls and Regulations, eds Yamamoto W, Brobeck J (Saunders, Philadelphia, PA), pp 71–97.
4) Nakayama T, Eisenman JS, Hardy JD (1961) Single unit activity of anterior hypothalamus during local heating. Science 134:560–1.
5) Griffin JD, Kaple ML, Chow AR, Boulant JA (1996) Cellular mechanisms for neuronal thermosensitivity in the rat hypothalamus. The Journal of physiology 492 (Pt 1):231–42.
6) Edinger HM, Eisenman JS (1970) Thermosensitive neurons in tuberal and posterior hypothalamus of cats. The American journal of physiology 219:1098–103.
7) Curras MC, Kelso SR, Boulant JA (1991) Intracellular analysis of inherent and synaptic activity in hypothalamic thermosensitive neurones in the rat. The Journal of physiology 440:257–71.
8) Dean JB, Boulant JA (1989) Effects of synaptic blockade on thermosensitive neurons in rat diencephalon in vitro. The American journal of physiology 257:R65–73.
9) Cliffer KD, Burstein R, Giesler GJ (1991) Distributions of spinothalamic, spinohypothalamic, and spinotelencephalic fibers revealed by anterograde transport of PHA-L in rats. The Journal of neuroscience: the official journal of the Society for Neuroscience 11:852–68.
10) Chen XM, Hosono T, Yoda T, Fukuda Y, Kanosue K (1998) Efferent projection from the preoptic area for the control of non-shivering thermogenesis in rats. The Journal of physiology 512 (Pt 3):883–92.
11) Morrison SF, Nakamura K (2011) Central neural pathways for thermoregulation. Frontiers in bioscience (Landmark edition) 16:74–104.
12) Boulant JA (2006) Neuronal basis of Hammel’s model for set-point thermoregulation. Journal of applied physiology (Bethesda, Md: 1985) 100:1347–54.
13) Zhang YH, Yanase-Fujiwara M, Hosono T, Kanosue K (1995) Warm and cold signals from the preoptic area: which contribute more to the control of shivering in rats? The Journal of physiology 485 (Pt 1):195–202.
14) Kingma BR, Frijns AJ, Schellen L, Van Marken Lichtenbelt WD (2014) Beyond the classic thermoneutral zone: Including thermal comfort. Temperature 1:142–149.
[Editors' note: further revisions were requested prior to acceptance, as described below.]
Reviewer #1:
The authors have embarked on the valuable task of producing a computational framework that combines theories of reinforcement learning with those of homeostasis and drive reduction. This is a worthwhile goal and the authors have several examples of behaviors that arise within their framework as well as predictions. I do think the manuscript reads a bit as though some of the ideas of combining reinforcement learning and homeostasis are novel to the authors, whereas in reality their contribution is to add a mathematical/computational framework which allows for quantitative predictions to be made and suggests what could/should be observed in any neural mechanism.
While overall the writing is very clear, I think the manuscript would be served by the authors being more careful to tone down statements that suggest the idea of combining homeostasis and reinforcement learning is their own. After all, everyone knows that when one is out on a cold winter's day a hot drink is rewarding, whereas in the middle of a hot summer's day a cold drink has greater rewarding value. The authors deserve credit for developing a mathematical scheme (the first I think?) where such results fall out, and I think they now have enough quantitative results and predictions that make the scheme testable.
In response to the reviewer’s suggestion, we added to the paragraph where we first talk about the contributions of the paper (in Introduction):
Given this evident coupling of homeostatic and learning processes, here, we propose a formal hypothesis for what computations, at an algorithmic level, may be performed in this biological integration of the two systems. More precisely, inspired by previous descriptive hypotheses on the interaction between motivation and learning (Hull, 1943; Mowrer, 1960; Spence, 1956), we suggest a principled model for how the rewarding value of outcomes is computed as a function of the animal’s internal state, and of the approximated need-reduction ability of the outcome…
Also, we added the below sentence to the conclusion section:
Being inspired by the classic drive-reduction theory of motivation, our mathematical treatment allows for quantitative results to be obtained, predictions that make the theory testable, and logical coherence.
In a similar vein, some statements to motivate the work are exaggerated, for example in the first line of Discussion the authors state:
“Theories of conditioning are founded on the argument that animals seek reward, while reward is defined as what animals seek.”
I think that while these definitions can be found, to state simply “reward is defined” without adding “by some” or “can be defined” or “has been defined by some” is too bold and general. One can find plenty of definitions of reward, in which “primary reward” is “that which aids survival” or “helps propagate the species” or simply in general English, reward is something that is good for you!
In response to the reviewer’s concern, we added the phrase “at least in the behaviorist approach” to the mentioned sentence:
Theories of conditioning are founded on the argument that animals seek reward, while reward is defined, at least in the behaviorist approach, as what animals seek.
In a couple of places (including the Abstract) the authors state that they:
“prove analytically that reward-seeking and physiological stability are two sides of the same coin” and “Our theory mathematically proves that seeking rewards is equivalent to the fundamental objective of physiological stability” whereas in fact through their definition of drive; “we define the “drive” as the distance of the internal state from the setpoint”, the authors assume this to be the case and develop a mathematical theory where this result is true. One must be careful in mathematical proofs as to what are the premises. Since the rewards associated with sexual desire are outside the model (as the authors comment) it is clear that it is only within their theory that the mathematical “proof” holds.
In response to the reviewer’s concern, we added the phrase “Within this framework,” in the Abstract:
Within this framework, we mathematically prove that seeking rewards is equivalent to the fundamental objective of physiological stability, defining the notion of physiological rationality of behavior.
Furthermore we added the phrase “On the basis of the proposed computational integration of the two systems” into the sentence below, in the Introduction section:
On the basis of the proposed computational integration of the two systems, we prove analytically that reward-seeking and physiological stability are two sides of the same coin, and also provide a normative explanation for temporal discounting of reward.
Reviewer #2:
1) The authors make a comment early on that equates reward/reinforcer/utility. Given the obvious sophistication of the authors, this is unfortunate. In particular, it would be helpful to make clear the relationship between prior treatments of utility and the authors’ proposal. Notably, the authors do describe other approaches to this, but even a sentence or two early on that clarifies, rather than lumps together, the difference between reinforcer/utility would help, specifically because the authors are essentially arguing that homeostatic utility determines reinforcement properties.
We thank the reviewer for pointing out this issue. By “utility”, we mean “economic utility” (as it is defined in Economics) rather than “homeostatic utility”. In economics, the utility of a commodity is a fixed value, without taking the internal state of individuals into account. This is the same problem as with reinforcer/reward value in psychology. In order to resolve this misunderstanding, we now use the term “economic utility” rather than “utility” in the manuscript.
2) The authors make a comment about 'erroneous estimation of error' and later in the manuscript talk at length about, essentially, taste serving as cues. Three lines of investigation that the authors might find useful in this discussion: (1) Beeler et al Eur J Neuroscience 2012 'taste uncoupled from nutrition fails to sustain the rewarding properties of . . . ' (2) the work of Swithers with artificial sweeteners:
Swithers, S.E. & Davidson, T.L. (2008) A role for sweet taste: calorie predictive relations in energy regulation by rats. Behav. Neurosci., 122, 161–173.
Swithers, S.E., Baker, C.R. & Davidson, T.L. (2009) General and persistent effects of high-intensity sweeteners on body weight gain and caloric compensation in rats. Behav. Neurosci., 123, 772–780.
Swithers, S.E., Martin, A.A. & Davidson, T.L. (2010) High-intensity sweeteners and energy balance. Physiol. Behav., 100, 55–62.
Finally, the authors cite one paper by de Araujo, but he has significantly developed the notion that the DA cells specifically serve as a metabolic sensor.
We found these references very helpful in supporting some aspects of our theory. Accordingly, we added the paragraph below to the end of the subsection “Neural substrates”:
Such orosensory-based approximation of nutritional content could have been obtained through evolutionary processes (Breslin, 2013), as well as through prior learning (Beeler et al., 2012; Swithers et al., 2009, 2010). In the latter case, approximations based on orosensory or contextual cues can be updated so as to match the true nutritional value, resulting in a rational neural/behavioral response to food stimuli (De Araujo et al., 2008).
The last sentence suggests a probable mechanism for the taste-independent adaptation of dopamine response to the true caloric value of food.
Other than that, I think there are many things that one could nitpick about, especially with regards to the endless details and nuances of the model (e.g., I am not sure the authors have fully addressed the question the other reviewer had regarding the 'shortest distance between two points' idea). However, I think the paper is interesting, brings up some very good points, is well done and, as the authors point out, targets the mutual weakness of HR and RL models and brings them together nicely.
Reviewer #3:
1) Scientifically, I think you need to highlight and unpack the major result in the appendix. At an appropriate point in the main text, I would include a paragraph of the following sort:
“In summary, we have established a formal link between the homeostatic imperatives to keep physiological states near some set point and the maximisation of temporally discounted reward (or minimisation of some loss function). This is an important and nontrivial result. The appendix provides a formal proof; however, the underlying idea is fairly simple. Imagine you had to plan a hill walk, during which you wanted to maximise the height (altitude or reward) averaged over the path you take. If someone dropped you at the bottom of the hill, the optimum path would be to ascend the hill and spend as long as possible at the top before returning to your pick up point. Notice that this entails ascending the hill (reward function) before descending. Implicit in this strategy is a maximisation of temporally discounted reward. In other words, going up the hill first and then coming down is better than going down and then coming back up. It is this fundamental (variational) phenomenon that connects homeostasis with classical temporal discounting.
Furthermore, as indicated above, if the homeostatic cost (negative reward) is cast as a log probability then it can be treated as (free) energy.
Thanks to the reviewer’s suggestion, we now explain the importance of temporal discounting more clearly by adding the paragraph below (modified version of the paragraph suggested by the reviewer) in the middle of the section “Normative role of temporal discounting”:
Imagine you had to plan a 1-hr hill walk from a drop-point toward a pickup point, during which you wanted to minimize the height (equivalent to drive) summed over the path you take. In this summation, if you give higher weights to your height in the near future as compared to later times, the optimum path would be to descend the hill and spend as long as possible at the bottom (i.e. homeostatic setpoint) before returning to the pickup point. Equation 5 shows that this optimization is equivalent to optimizing the total discounted rewards along the path, given that descending and ascending steps are defined as being rewarding and punishing, respectively (equation 2).
In contrast, if at all points in time you give equal weights to your height, then the summed height over the path depends only on the drop and pickup points, since every ascent can be compensated by a descent at any time.
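The contrast drawn in the hill-walk example can be checked numerically. The following is a minimal sketch, not the authors' code: it assumes integer heights, a reward per step equal to the drop in height (so that descending is rewarding and ascending is punishing), and two hypothetical walks that share the same drop-point and pickup-point height. The function name, the paths, and the discount factor 0.9 are illustrative assumptions.

```python
def discounted_return(heights, gamma):
    """Sum of gamma**t * r_t, where r_t = h_t - h_{t+1} (a descent is rewarding)."""
    rewards = [h_now - h_next for h_now, h_next in zip(heights, heights[1:])]
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Two hypothetical walks, both starting and ending at height 3.
descend_early = [3, 2, 1, 0, 0, 0, 1, 2, 3]   # go to the bottom first, stay there
descend_late = [3, 3, 3, 2, 1, 0, 1, 2, 3]    # linger at the top before descending

# Equal weights (gamma = 1): the rewards telescope to h_start - h_end for any path.
undiscounted = (discounted_return(descend_early, 1.0),
                discounted_return(descend_late, 1.0))

# Discounting (gamma < 1): early descents count more, so the paths now differ.
discounted = (discounted_return(descend_early, 0.9),
              discounted_return(descend_late, 0.9))

print("gamma = 1.0:", undiscounted)
print("gamma = 0.9:", discounted)
```

With equal weights both walks return zero because their endpoints coincide; with discounting, the walk that reaches the bottom sooner collects the larger return.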
We chose not to include the second part of the suggested paragraph: with all due gratitude for the reviewer's support of our work and appreciation for the reviewer's efforts to help us improve the clarity of the paper, we felt that launching into a short discussion of the free-energy principle early in our manuscript, before we had set out the major results of the paper, would be distracting to the reader. We give ample discussion of the relationship between our theory and the free-energy principle in the Discussion, where we point out exactly what the reviewer urges us to highlight.
Crucially, the time average or path integral of energy is called action. This means that both the homeostasis and temporally discounted reward are ways of prescribing a principle of least action. From this perspective, one can regard the adaptive behaviours that we are trying to link as necessary and emergent properties of all dynamical systems that comply with (Hamilton's) principle of least action. We will return to this perspective in the Discussion.”
We thank the reviewer for the suggested texts to be added to the manuscript. We used some of the notions mentioned by the reviewer (particularly the principle of least action), and discussed them in the manuscript. For example we added the below text after equation 14:
The equivalency of reward maximization and physiological stability objectives in our model (equation 5) shows that optimizing either homeostasis or sum of discounted rewards corresponds to prescribing a principle of least action applied to the surprise function.
2) The second major point is about the format of your paper. It is still unclear where the reader can find the details of your simulations. I also note that you have included supplementary figures. Can I suggest that you remove all supplementary material and place it in the main text (or discard it and refer to it as results not shown). I think you should prepare the reader for the slightly unusual scientific presentation with a paragraph at the beginning of the paper along the following lines:
“We will develop our theoretical results by appealing to simulations. These simulations are described in figures (and accompanying tables) and are called upon when necessary. All the simulations in this paper followed the same procedure: first we define a model that captures the problem of interest in terms of a Markov decision process. The ensuing behaviour is then optimised using classical reinforcement learning procedures (Q-learning) to define a value function. Actions are then selected using a softmax function of the value of allowable actions or choices. For each simulation we present the graphical model or Markov decision process in the figures, along with the ensuing behaviour. Each figure is accompanied by a table specifying the parameters of the Markovian process, the Q-learning and softmax functions used to simulate behaviour.”
Note that I am suggesting, for every simulation you present, a figure and table. Whenever you refer to results that are not presented in this format I would say so explicitly so the reader does not have to wonder whether they have missed something.
In order to give a better outline of the structure of the paper, we changed the last paragraph of the Introduction section to this:
The paper is structured as follows: After giving a heuristic sketch of the theory, we show several analytical, behavioral, and neurobiological results. On the basis of the proposed computational integration of the two systems, we prove analytically that reward-seeking and physiological stability are two sides of the same coin, and also provide a normative explanation for temporal discounting of reward. Behaviorally, the theory gives a plausible unified account for anticipatory responding and the rise-fall pattern of the response rate. We show that the interaction between the two systems is critical in these behavioral phenomena and thus, neither classical RL nor classical HR theories can account for them. Neurobiologically, we show that our model can shed light on recent findings on the interaction between the hypothalamus and the reward-learning circuitry, namely, the modulation of dopaminergic activity by hypothalamic signals.
Furthermore, we show how orosensory information can be integrated with internal signals in a principled way, resulting in accounting for experimental results on consummatory behaviors, as well as the pathological condition of overeating induced by hyperpalatability.
Finally, we discuss limitations of the theory, compare it with other theoretical accounts of motivation and internal state regulation, and outline testable predictions and future directions.
Furthermore, we moved “Figure 4–figure supplements 2, 3 and 4” in the previous manuscript into the main text in the current version of the manuscript (merged together in Figure 4).
Also, in order to provide more details of the simulations and to have the same format for all presented results (i.e., problem definition, simulation results, simulated environment (MDP), free parameters of the model), we added four tables (Figure 5–figure supplement 1; Figure 6–figure supplement 2; Figure 10–figure supplement 1; Figure 12–figure supplement 1) and one Markov Decision Process (Figure 12–figure supplement 2) in the figure supplements.
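The simulation format referred to above (a Markov decision process, Q-learning, and softmax action selection) can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not the authors' simulation code: the five-level internal state, the setpoint, the two actions, the absolute-distance drive function, and all parameter values are hypothetical and chosen only for illustration. The reward is taken to be the reduction in drive produced by a transition.

```python
import math
import random

SETPOINT = 2        # hypothetical homeostatic setpoint
N_STATES = 5        # internal state h takes values 0..4
ACTIONS = (0, 1)    # 0 = wait (h decreases by 1), 1 = eat (h increases by 1)

def drive(h):
    """Simplified drive: distance of the internal state from the setpoint."""
    return abs(h - SETPOINT)

def step(h, action):
    """Deterministic transition; reward is the reduction in drive."""
    h_next = min(h + 1, N_STATES - 1) if action == 1 else max(h - 1, 0)
    return h_next, drive(h) - drive(h_next)

def softmax_choice(q_values, beta, rng):
    """Sample an action index with probability proportional to exp(beta * Q)."""
    q_max = max(q_values)  # subtract the maximum for numerical stability
    weights = [math.exp(beta * (q - q_max)) for q in q_values]
    return rng.choices(range(len(q_values)), weights=weights)[0]

def q_learning(n_steps=50000, alpha=0.2, gamma=0.9, beta=2.0, seed=0):
    """Tabular Q-learning on a continuing task with softmax exploration."""
    rng = random.Random(seed)
    q = [[0.0 for _ in ACTIONS] for _ in range(N_STATES)]
    h = rng.randrange(N_STATES)
    for _ in range(n_steps):
        a = softmax_choice(q[h], beta, rng)
        h_next, r = step(h, a)
        # Standard Q-learning update toward r + gamma * max_a' Q(h', a').
        q[h][a] += alpha * (r + gamma * max(q[h_next]) - q[h][a])
        h = h_next
    return q

q_table = q_learning()
greedy = [max(ACTIONS, key=lambda a: q_table[h][a]) for h in range(N_STATES)]
print("greedy action per internal state:", greedy)
```

In this toy setting the learned greedy policy moves the internal state toward the setpoint from either side; at the setpoint itself the two actions are equivalent by symmetry, so the greedy choice there is arbitrary.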
https://doi.org/10.7554/eLife.04811.032
Article and author information
Author details
Funding
Gatsby Charitable Foundation
 Mehdi Keramati
National Research University Higher School of Economics (Basic Research Program)
 Boris Gutkin
Institut national de la santé et de la recherche médicale (INSERM U960, France)
 Boris Gutkin
Center for Research and Interdisciplinary (Frontiers du Vivant)
 Mehdi Keramati
Agence Nationale de la Recherche (ANR-10-LABX-0087 IEC, France)
 Boris Gutkin
Agence Nationale de la Recherche (ANR-10-IDEX-0001-02 PSL, France)
 Boris Gutkin
The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.
Acknowledgements
We thank Peter Dayan, Amir Dezfouli, Serge Ahmed, and Mathias Pessiglione for critical discussions, and Peter Dayan and Oliver Hulme for commenting on the manuscript. The authors acknowledge partial funding from ANR-10-LABX-0087 IEC (BSG), ANR-10-IDEX-0001-02 PSL* (BSG), CNRS (BSG), INSERM (BSG), and FRM (MK). Support from the Basic Research Program of the National Research University Higher School of Economics is gratefully acknowledged by BSG.
Reviewing Editor
 Eve Marder, Reviewing Editor, Brandeis University, United States
Publication history
 Received: September 18, 2014
 Accepted: November 3, 2014
 Version of Record published: December 2, 2014 (version 1)
Copyright
© 2014, Keramati and Gutkin
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.