A stochastic world model on gravity for stability inference
Abstract
The fact that objects without proper support will fall to the ground is not only a natural phenomenon but also part of our common sense. Previous studies suggest that humans may infer objects’ stability through a world model that performs mental simulations with a priori knowledge of gravity acting upon the objects. Here we measured participants’ sensitivity to gravity to investigate how the world model works. We found that the world model on gravity was not a faithful replica of the physical laws, but instead encoded gravity’s vertical direction as a Gaussian distribution. The world model with this stochastic feature fit nicely with participants’ subjective sense of objects’ stability and explained the illusion that taller objects are perceived as more likely to fall. Furthermore, a computational model with reinforcement learning revealed that the stochastic characteristic likely originated from experience-dependent comparisons between predictions formed by internal simulations and the realities observed in the external world, which illustrated the ecological advantage of stochastic representation in balancing accuracy and speed for efficient stability inference. The stochastic world model on gravity provides an example of how a priori knowledge of the physical world is implemented in the mind to help humans operate flexibly in open-ended environments.
eLife assessment
In this valuable study, the authors present findings that suggest that people do not faithfully replicate the physics of the real world but rather have a stochastic world model, specifically a stochastic representation of gravity. This contrasts with prior accounts that suggested a potentially noisy Newtonian model where the noise arises from perceptual uncertainty or (inferred) external perturbations. The experimental evidence is generally solid, with all experiments and model simulations being consistent with the proposed account. In the revision, the authors also added a number of control experiments that address some of the most pressing concerns of the original submission.
https://doi.org/10.7554/eLife.88953.3.sa0
Introduction
About 2000 years ago, Confucius warned his disciples that a wise man should not stand next to a collapsing wall. We, wise or not, can easily judge whether a wall is stable or collapsing in a fraction of a second (Battaglia et al., 2013; Kubricht et al., 2017; McCloskey, 1983). This astonishing performance is unlikely to have been achieved by previous visual experience alone. Taking a stack consisting of 10 blocks as an example (Figure 1), we can quickly report its stability with a satisfactory accuracy of 70% on average (Bear et al., 2021; Zhang et al., 2016), but the universal cardinality of possible configurations is at least 1.14 × 10^{50} (Figure 1—figure supplement 1), which is much larger than the total number of sand grains on Earth (est. 7.5 × 10^{18}) (Blatner, 2013). Consistent with this view, 4-month-old infants, who have little visual experience of the physical world, expect a box to fall if it loses contact with a support platform (Baillargeon, 2004; Baillargeon, 1994). Our minds may therefore have devised a mechanism that differs from the widely used discriminative approach in artificial neural networks, which relies on extensive visual experience of objects and feedback about their stability (Bear et al., 2021; Li et al., 2016; Zhang et al., 2016).
Indeed, both behavioral and neuroimaging studies have suggested that humans possess a priori knowledge of Newton’s laws of physics in the mind. For example, infants as young as 7 months expect a downward-moving object to accelerate and an upward-moving object to decelerate (Friedman, 2002; Kim and Spelke, 1999), and adults can estimate the remaining time to catch a moving ball (McIntyre et al., 2001; Zago and Lacquaniti, 2005) even in the absence of visual information (Lacquaniti and Maioli, 1989; Zago et al., 2009). Further fMRI studies have revealed the parieto-insular vestibular cortex in the brain as the neural basis for gravity-based stability inference, suggesting that this knowledge is encapsulated as a cognitive module (Fischer et al., 2016; Indovina et al., 2005; Pramod et al., 2022). Accordingly, our brain has been proposed to act as a set of generative machines that actively predict future events of the ever-changing physical world through mental simulation with a priori knowledge acting upon the world (Battaglia et al., 2013; Hegarty, 2004; Huang and Rao, 2011; Tenenbaum et al., 2011; Ullman et al., 2017). For this reason, the generative machine is also called the world model (Land, 2014; Tenenbaum et al., 2011).
Recently, the idea of the world model has become popular to explain the predictive nature of the brain (Friston et al., 2021) and improve the generality and robustness of artificial neural networks (Matsuo et al., 2022). However, how a priori knowledge is implemented in the world model remains to be determined. A prevailing theory suggests that the world model in the brain accurately mirrors the physical laws of the world (Allen et al., 2020; Battaglia et al., 2013; Zhou et al., 2022). For example, the direction of gravity encoded in the world model, a critical factor in stability inference, is assumed to be straight downward, aligning with its manifestation in the physical world. To explain the phenomenon that tall and thin objects are subjectively perceived as more unstable compared to short and fat ones (Figure 1—figure supplement 2), external noise, such as imperfect perception and assumed external forces, is introduced to influence the output of the model. However, when the brain actively transforms sensory data into cognitive understanding, these data can become distorted (Kriegeskorte and Douglas, 2019; Naselaris et al., 2011), thereby introducing uncertainty into the representation of gravity’s direction. In this scenario, the world model inherently incorporates uncertainty, eliminating the need for additional external noise to explain the inconsistency between subjective perceptions of stability and the actual stability of objects. Note that the distinction between these two theories is nontrivial: the former model implies a deterministic representation of the external world, while the latter suggests a stochastic approach. Here, we investigated these two alternative hypotheses regarding the construction of the world model in the brain by examining how gravity’s direction is represented in the world model when participants judged object stability.
To do this, we measured participants’ sensitivity to gravity’s direction in a stability inference task (Battaglia et al., 2013) and found that gravity’s direction was encoded in a Gaussian distribution, with the vertical direction as the maximum likelihood. This stochastic parameter was then built into the world model to simulate the displacement of blocks in a stack under the force of gravity. The simulation results fit nicely with participants’ judgment of stacks’ stability and explained the daily illusion that taller objects are perceived as more likely to fall. A computational model with a reinforcement learning (RL) algorithm was devised to reveal how this stochastic feature may originate through interactions with the physical world. Finally, we explored the ecological advantage of the stochastic feature of the world model.
Results
The direction of gravity in the world model
The direction of gravity is perpendicular to the ground surface. Here, we first tested humans’ sensitivity to gravity’s direction to investigate how faithfully gravity is represented in the world model compared to gravity in the physical world. To do this, we used PyBullet (Coumans and Bai, 2016), a forward physics simulator, to manipulate gravity’s direction. Then, we asked the participants to judge whether the collapse trajectories of unstable stacks were normal (Figure 1a, Figure 1—video 1). The direction of simulated gravity was specified by a parameter pair $\left(\theta ,\phi \right)$ (Figure 1b), which determines the deviation of the direction of simulated gravity from the direction of gravity in the physical world. Specifically, $\theta $ is the vertical component of the direction that affects the degree of collapse, and $\phi$ is the horizontal component that determines the orientation of collapse. We collected participants’ judgment of the normality of collapse trajectories while varying $\theta $ from 0° to 45° and $\phi $ from 0° to 360° across the force space, and the confidence level of the judgment for each angle pair was used to index participants’ sensitivity to gravity’s direction (Figure 1c). As expected, when $\theta $ is equal to 0 (i.e., the direction of the simulated gravity is the direction of the natural gravity), the participants were likely to report that the collapse trajectory was normal (accuracy: 91.0%, STD: 8.0%). The critical question, then, is how participants’ subjective sense of the normality of collapse trajectories changes as a function of $\theta $. If our world model on gravity is a faithful replica of the physical reality, we should expect the immediate detection of abnormality when $\theta $ is away from 0.
Contrary to this intuition, the subjective sense of the abnormality was not immediately apparent as $\theta $ moved away from 0; instead, the confidence level of reporting normal trajectories decreased gradually as a function of $\theta $, which was best fit by a Gaussian function with $\sigma =19.9$ (Figure 1d, left). That is, the participants were 50.9% confident in reporting a normal collapse trajectory when the vertical offset of $\theta $ was 19.9°. In addition, accuracy in detecting the abnormality was not affected by $\phi $ (Figure 1—figure supplement 3), consistent with the uniformly distributed gravitational field in the physical world. This pattern was observed for all participants tested, with $\sigma $ varying from 11.1 to 37.1 (Figure 1—figure supplement 3), and remained unchanged with the addition of a wall on one side to block potential external disturbances from wind (Figure 1—figure supplement 4). Therefore, the world model on gravity is unlikely to be a faithful replica of the physical world; instead, it encodes gravity’s direction as a Gaussian distribution with the vertical direction as the maximum likelihood (Figure 1d, right).
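To make the role of $\sigma $ concrete, the sketch below fits the width of an unnormalized Gaussian, centered on $\theta $ = 0, to confidence values by a least-squares grid search. It is only an illustration: the fitting procedure, the function names, and the noise-free data points are assumptions made here, not the analysis reported in the study.

```python
import math

def gaussian_confidence(theta_deg, sigma_deg, peak=1.0):
    """Unnormalized Gaussian centered on theta = 0 (vertical gravity)."""
    return peak * math.exp(-theta_deg ** 2 / (2.0 * sigma_deg ** 2))

def fit_sigma(thetas, confidences, peak=1.0, sigma_grid=None):
    """Least-squares grid search over sigma (degrees); returns the best value."""
    if sigma_grid is None:
        sigma_grid = [s / 10.0 for s in range(10, 601)]  # 1.0 to 60.0 degrees

    def squared_error(sigma):
        return sum((gaussian_confidence(t, sigma, peak) - c) ** 2
                   for t, c in zip(thetas, confidences))

    return min(sigma_grid, key=squared_error)

# Hypothetical, noise-free data generated with sigma = 19.9, for illustration only.
thetas = [0, 5, 10, 15, 20, 25, 30, 35, 40, 45]
confidences = [gaussian_confidence(t, 19.9) for t in thetas]
sigma_hat = fit_sigma(thetas, confidences)
```

With real judgments, the peak would typically be fitted as well, since confidence at $\theta $ = 0 need not equal 1.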
To further test whether the world model on gravity, once established, is encapsulated from visual experience and task context, we inverted the virtual environment upside down with gravity’s direction pointing upward, and then asked the same group of participants to judge whether collapse trajectories were normal (Figure 1e, Figure 1—video 2). We found that the confidence level also decreased gradually as a function of $\theta $ (Figure 1f, $\sigma $ = 17.2; see Figure 1—figure supplement 5 for each participant), which was not significantly different from that in the environment with gravity pointing downward. Indeed, each participant’s $\sigma $ in the upright condition was in high agreement with the $\sigma $ in the upside-down condition (r = 0.91, p<0.01). That is, the visual experience and task context apparently did not cognitively penetrate humans’ world model on gravity, suggesting that it is likely encapsulated as a cognitive module.
How does the stochastic representation of gravity’s direction in the world model affect our inference on objects’ stability? To answer this question, we recruited an independent group of participants to estimate the stability of 60 stacks of different configurations (Figure 2a), half of which were stable. During the experiment, the participants were required to judge how stable each stack was on a 0–7 scale without feedback, which was used to index their subjective sense about stacks’ stability. Two world models were constructed for comparison. One world model was equipped with a vertically downward direction of gravity without any stochastic variance. This deterministic model is intended to simulate how the stacks fell in the real world, and is therefore called a natural gravity simulator (NGS) (Figure 2b, top). The other model is the same as the NGS, except that the deterministic direction of gravity in the NGS was replaced by the stochastic direction obtained from the previous psychophysical experiment. This model is thus called the mental gravity simulator (MGS, Figure 2c, top). Both models estimated the stability of a stack as the proportion of blocks that remained unmoved after the simulation.
NGS-estimated stability was significantly correlated with participants’ subjective sense (Figure 2b, bottom; r = 0.70, p<0.01), consistent with previous findings (Battaglia et al., 2013). However, the participants showed an obvious bias towards predicting a collapse for stacks regardless of their actual stability, as the dots in Figure 2b are more concentrated on the lower side of the diagonal line. This phenomenon is referred to as the inference bias, which was indexed as the difference in stability estimates between the participants and the NGS (inference bias = –0.31, p<0.01; see ‘Methods’). In other words, the participants were unlikely to infer stacks’ stability from simulations with a deterministic direction of gravity pointing vertically downward. In contrast, the MGS randomly sampled pairs of $({\theta}_{s},{\phi}_{s})$ from the Gaussian distribution as gravity’s directions 100 times, and the estimated stability of a stack was the averaged stability of simulations with different angle pairs. Aside from a similar magnitude of the correlation in the stability estimates between the participants and the MGS (Figure 2c, bottom; r = 0.75, p<0.01), the MGS, in contrast to the NGS, more precisely reflected participants’ judgments of stability because the points were evenly distributed along the diagonal line (inference bias = 0.04, p>0.05; see Figure 2—figure supplement 1 for the agreement when the MGS was implemented with different Gaussian functions). In other words, the magnitude of the correlation coefficient is not the only indicator of a model’s fitness. In short, the world model that represents gravity’s direction as a Gaussian distribution around the vertical direction properly explains our tendency to judge stacks as more prone to collapse.
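The sampling-and-averaging step of the MGS can be sketched as follows. The sampling scheme (the magnitude of a zero-mean Gaussian draw for ${\theta}_{s}$, a uniform draw for ${\phi}_{s}$) and the `toy_simulate` function are assumptions introduced here for illustration; in the study, each sampled direction drives a full rigid-body simulation.

```python
import random

def sample_direction(sigma_deg, rng):
    """Draw one (theta_s, phi_s) pair in degrees.

    Assumption: theta_s is the magnitude of a zero-mean Gaussian draw and
    phi_s is uniform over the full circle.
    """
    theta_s = abs(rng.gauss(0.0, sigma_deg))
    phi_s = rng.uniform(0.0, 360.0)
    return theta_s, phi_s

def mgs_stability(simulate, sigma_deg=19.9, n_samples=100, seed=0):
    """Average the stability returned by simulate(theta, phi) over samples."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        theta_s, phi_s = sample_direction(sigma_deg, rng)
        total += simulate(theta_s, phi_s)
    return total / n_samples

def toy_simulate(theta_deg, phi_deg):
    """Hypothetical stand-in for a physics run: the stack survives small tilts."""
    return 1.0 if theta_deg < 10.0 else 0.0

ngs_estimate = toy_simulate(0.0, 0.0)       # deterministic, vertical gravity
mgs_estimate = mgs_stability(toy_simulate)  # averaged over sampled directions
```

For a stack that is stable under vertical gravity, the averaged estimate falls below the deterministic one, which is the direction of the inference bias described above.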
The stochastic world model illustrated by the MGS that led to participants’ inference bias may explain the daily illusion that we perceive taller objects to be more unstable than shorter ones (Figure 2d, left). An intuitive explanation from physics is that a tall object has a higher center of gravity, and thus an external perturbation makes it more likely to collapse. Our stochastic world model, on the other hand, provides an alternative explanation without introducing external perturbations because the center of gravity in taller objects is more susceptible to influence when gravity deviates slightly from a strictly downward direction during humans’ internal simulations. To test this conjecture, we constructed a set of stacks with different heights and estimated the degree of stacks’ stability with the MGS and the NGS, respectively. Because the MGS was considered to be the world model implemented in the brain, the inference bias here was calculated as the difference in stability estimates between the MGS and the NGS, with negative values indicating a tendency to judge a stable stack as an unstable one. Consistent with the inference bias found in humans, the MGS found stacks of all heights to be more prone to collapse (Figure 2d, right; inference bias <0, p<0.01 for all heights). Critically, the bias increased monotonically with increasing height, consistent with the illusion that taller objects are considered more prone to collapse (see Figure 2—figure supplement 2 for the inference bias when the MGS was equipped with different levels of deviation). In short, the stochastic world model on gravity provides a more concise explanation for the daily illusion that taller objects are perceived as more likely to collapse, without assuming external perturbations.
The origin of the stochastic feature of the world model
A deterministic model that combines gravity’s veridical direction with external perturbations, such as an external force or perceptual uncertainty (Allen et al., 2020; Battaglia et al., 2013; Lake et al., 2017; Smith and Vul, 2013), is theoretically equivalent to our stochastic model that represents gravity’s direction in a Gaussian distribution; therefore, it also fits well with humans’ inference on stability by fine-tuning the parameters of external perturbations. While the cognitive impenetrability and the self-consistency observed in this study, without resorting to an external perturbation, favor the stochastic model over the deterministic one, the origin of this stochastic feature of the world model is unclear.
Here we used an RL framework to unveil this origin because our intelligence emerges and evolves under the constraints of the physical world. Therefore, the stochastic feature may emerge as a biological agent interacts with the environment, where the mismatches between external feedback from the environment and internal expectations from the world model are in turn used to fine-tune the world model (Friston et al., 2021; MacKay, 1956; Matsuo et al., 2022). Note that a key aspect of the framework is determining whether the stochastic nature of the world model on gravity emerges through this interaction, even in the absence of external noise. To simulate this process, we designed an RL framework that illustrates how the world model on gravity evolves (Figure 3a). Specifically, an agent perceived a stack in the environment, which was then acted upon by a simulated gravity with direction parameters (i.e., $\theta $ and $\phi $) sampled from a spherical direction space. The initial probabilities for the sampling directions were identical (Figure 3b, left). The final state of the stack served as the agent’s expectation under the effect of the simulated gravity. The mismatch between the expectation and the observed final state of the stack under the natural gravity was used to update the sampling probability of the direction space, with a larger discrepancy leading to a larger decrease in probabilities through RL. Within this RL framework, we constructed abundant stacks of 2–15 blocks to train the world model on gravity. As the training progressed, the probabilities of the direction space gradually converged downward (Figure 3b, middle; see Figure 3—figure supplement 1 for the training trajectory).
Although gravity’s direction in the environment was vertical, the distribution of updated probabilities in the direction space was gradational ($\sigma $ = 21.6; Figure 3b, right), which is close to gravity’s direction represented in the world model derived from the psychophysics experiment on human participants. Therefore, the world model representing gravity’s direction in a Gaussian distribution can emerge automatically as the agent interacts with the environment, without the need for any external perturbation.
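The update logic described above can be illustrated with a heavily simplified sketch: a discrete set of $\theta $ values starts with identical probabilities, a direction is sampled, and its probability is decreased in proportion to a mismatch score. The multiplicative update rule, the learning rate, and the `toy_mismatch` function are all assumptions made here; the study computes mismatches from physics simulations, and this toy is not expected to reproduce the reported $\sigma $.

```python
import random

def train_direction_probs(thetas, mismatch, n_steps=5000, lr=0.1, seed=0):
    """Toy probability update over candidate gravity tilts (degrees)."""
    rng = random.Random(seed)
    probs = [1.0 / len(thetas)] * len(thetas)  # identical initial probabilities
    for _ in range(n_steps):
        # Sample one candidate direction according to the current probabilities.
        i = rng.choices(range(len(thetas)), weights=probs)[0]
        # A larger mismatch leads to a larger decrease in probability.
        probs[i] *= 1.0 - lr * mismatch(thetas[i])
        total = sum(probs)
        probs = [p / total for p in probs]     # renormalize to sum to 1
    return probs

def toy_mismatch(theta_deg):
    """Hypothetical mismatch in [0, 1] that grows with the tilt from vertical."""
    return min(1.0, theta_deg / 45.0)

thetas = list(range(0, 50, 5))  # 0, 5, ..., 45 degrees
probs = train_direction_probs(thetas, toy_mismatch)
```

In this toy setting, probability mass concentrates on the vertical direction because only that direction never produces a mismatch.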
To further illustrate the idea that the environment constrains the form of intelligence, we systematically manipulated the appearance of the physical world while holding the natural gravity constant. Specifically, we constructed 14 worlds, each containing stacks of the same number of blocks, but with different configurations. The number of blocks ranged from 2 to 15. We trained the world model on gravity under the same RL framework for each world and found that all world models represented gravity’s direction in a Gaussian distribution (Figure 3c, left; see Figure 3—figure supplement 2 for all world models). However, the width of the distribution, indexed by the parameter $\sigma $, decreased monotonically as the number of blocks increased (Figure 3c, right). This occurred because, in general, stacks containing more blocks were more likely to be affected by forces whose directions were not perpendicular to the ground surface, which provided more information about gravity and thus resulted in a more accurate representation of gravity’s direction in the world model. In short, the world model on gravity resonates with not only the physical law governing the environment, but also the specific regularities of the environment the agent encountered.
The ecological advantage of the stochastic world model
When passing a cliff face, we have to be constantly aware of the stability of the rocks on the cliff. The ideal response would be both accurate and fast, but accuracy and speed are often difficult to achieve simultaneously. Here we investigated how the world model on gravity balances these two factors with its stochastic feature. To answer this question, we used a linear classifier (i.e., logistic regression) to model humans’ decision-making behavior at different stages of the mental simulation. Specifically, we collected all the position coordinates of a stack’s blocks at different stages of the simulation. The position difference between the intermediate states of the stack and the initial state provides information about the stability of the stack. For example, a stable stack should have no difference in the positions of the component blocks at all simulation stages, and an unstable stack should have a gradually increasing position difference. If the linear classifier detected the difference in positions sufficient for the classification at any stage, it classified the stack as unstable, otherwise stable (Figure 4a). The classification accuracy gradually increased as the simulation progressed until it reached the asymptote.
As expected, for the NGS (i.e., the world model with the deterministic direction of gravity), the accuracy at the plateau was close to 100% (95.3% on average, Figure 4b, top red box), significantly higher than that for the MGS (80.1% on average, Figure 4b, top blue box) (t = 19.59, p<0.001), simply because of the stochastic feature of gravity’s direction. However, while the initial growth rates of both models were comparable, the MGS reached the plateau crucial for decision-making sooner than the NGS (response time, indexed by the ratio between the time to reach the plateau and the time to reach the final stage: 27.1% vs 75.2%, t = 15.58, p<0.001) (Figure 4b, middle). The same pattern was also observed with different variances of the Gaussian distribution (Figure 4—figure supplement 1). That is, the stochastic world model prioritized speed over accuracy, echoing the basic principle of survival: fleeing potential danger as quickly as possible, rather than making a perfect decision with a dreadful delay. In addition, by integrating the prediction accuracy and the response time as a measure of efficiency, we found that the stochastic world model provided a better balance between accuracy and speed, with an efficiency significantly higher than that provided by the NGS (3.49 vs 1.32, t = 9.12, p<0.001; Figure 4b, bottom).
On the other hand, if time permits, multiple simulations with the MGS can significantly reduce the variance introduced by the stochastic representation of gravity’s direction (Figure 4c). To explore whether humans adopted this strategy of performing multiple simulations before making a decision, we ran the MGS different numbers of times and then matched the results with humans’ performance. We found that the variance of humans’ inference on stability best matched that of the MGS after three simulations (Figure 4d; see Figure 4—figure supplement 2 for the model-behavior correspondence under different numbers of simulations). Therefore, humans are likely to run simulations a limited number of times to infer stacks’ stability.
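The variance reduction from repeated simulations can be illustrated with a toy calculation in which each simulated run returns 1 if the sampled tilt is small and 0 otherwise, and a judgment is the mean of several runs. The 10° threshold and the number of repeats are hypothetical values chosen here, not parameters from the study.

```python
import random
import statistics

def judgment_variance(n_sims, sigma_deg=19.9, n_repeats=2000, seed=0):
    """Variance across judgments, each averaging n_sims stochastic runs."""
    rng = random.Random(seed)
    judgments = []
    for _ in range(n_repeats):
        # Toy outcome per run: the stack counts as stable if the tilt is < 10 deg.
        runs = [1.0 if abs(rng.gauss(0.0, sigma_deg)) < 10.0 else 0.0
                for _ in range(n_sims)]
        judgments.append(sum(runs) / n_sims)
    return statistics.pvariance(judgments)

# Variance of the judgment for 1, 3, and 10 simulations per judgment.
variances = {n: judgment_variance(n) for n in (1, 3, 10)}
```

Under these assumptions the variance shrinks roughly in proportion to the number of simulations, so a small number of runs already removes most of it.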
Discussion
In this study, we investigated how the physical law of gravity is embodied in the brain as a world model that guides inferences on objects’ stability. A series of psychophysics experiments showed that the world model on gravity is not a faithful replica of the physical world, but rather a stochastic model that captures the essence of the vertically downward direction of gravity as the maximum likelihood of a Gaussian distribution. The stochastic feature of the world model not only fits humans’ stability inference behavior better than the deterministic model, but also provides new insight into the daily illusion that taller objects are perceived as more likely to collapse. We further illustrated how the stochastic feature evolved through interactions with the environment using RL, and how it balanced accuracy and speed to produce an ecological advantage for our survival in the physical world.
About 300 years ago, the philosopher Immanuel Kant proposed the intuition of space and time as a priori knowledge in the mind for us to understand the physical world (Kant, 1781), but only recently have researchers investigated how the intuition is implemented in the brain as intuitive physics (Kubricht et al., 2017; McCloskey, 1983). In the Noisy Newtonian Framework, intuitive physics is depicted as a combination of Newtonian physics and uncertainty generated by noise (Battaglia et al., 2013; Kubricht et al., 2017; Sanborn et al., 2013). The introduction of uncertainty helps to reconcile the misconception occurring under unfavorable conditions, such as unfamiliar events or static scenes (Kaiser et al., 1992; Kaiser et al., 1986; Kim and Spelke, 1999; McCloskey, 1983; Smith and Vul, 2013), which was once thought to support Aristotelian physics (Disessa, 1982; Halloun and Hestenes, 1985). The noise in previous studies was thought to originate from sources such as perceptual uncertainty or external perturbations of forces, rather than from the intuitive physical engine itself, which is thought to be a deterministic system. Our study extends these deterministic models by showing that the noise instead arises within the world model itself, which represents gravity’s direction as a Gaussian distribution. The inherent stochastic feature of gravity’s direction did not need to rely on external noise to explain the illusory instability of taller objects. In addition, it was also confirmed by the cognitive impenetrability of the Gaussian distribution of gravity’s direction when gravity’s direction in the physical world was reversed (Pylyshyn, 1980).
With an RL framework, we further proposed a possible origin of the stochastic feature of the world model through interactions with the physical world. In contrast to summarizing statistical patterns from experience (Bear et al., 2021; Li et al., 2016; Zhang et al., 2016), this framework was designed to simulate how an agent constructed the world model on gravity through agent-environment interactions. Specifically, a world model with undifferentiated directions of gravity generated a prediction on the stability of an object, and the mismatches between the prediction and the observation of the object from the physical world were used to fine-tune the distribution of the directions in the world model. This process is similar to how humans update their internal knowledge by comparing simulated expectations (Hegarty, 2004; Ullman et al., 2017) with actual observations (Baillargeon, 2004; Baillargeon, 1994; Kotovsky and Baillargeon, 2000). After several generations of error minimization, a Gaussian distribution of gravity’s direction emerged, with the vertically downward direction as the maximum likelihood, that was similar to the one observed in the human world model. Interestingly, when the physical worlds that the agent interacted with changed their appearance with stacks of different heights, the world models maintained their general patterns, but the stochastic representation of gravity’s direction changed accordingly. This finding not only supports the robustness of the active inference (Hegarty, 2004; Ullman et al., 2017), which efficiently encodes critical features under different physical worlds, but also resonates with the idea that intelligence develops under the constraints of the physical world. Taken together, the finding from the RL framework implies that the world model on gravity in humans may also be constructed in the same way, possibly through the mechanism of the predictive coding in a generative process (Friston, 2018; Huang and Rao, 2011).
Our world model on gravity provides an example of the world model theory that emphasizes the predictive nature of generative neural networks implemented with a priori knowledge of the physical world (Friston et al., 2021; Land, 2014; Matsuo et al., 2022). In contrast to traditional discriminative neural networks that learn statistical patterns for stability from gigantic amounts of labeled stacks, generative models equipped with the physical laws governing the world rely much less on experience. Importantly, the stochastic feature of the model further enhances the efficiency by balancing accuracy and speed, which improves our chances of survival (Cosmides and Tooby, 1997) and adaptation to novel environments (e.g., astronauts in outer space; Wang et al., 2022). Indeed, the close link between human cognition and the physical world through interaction may shed light on the development of a new generation of AI with human-like intelligence that can work flexibly in open-ended environments (Marcus, 2020; Marcus, 2018).
Methods
Creating stacks with different configurations
We designed a block-stacking procedure in a physical simulation platform (PyBullet) to generate stacks with different configurations. All stacks used in this study were generated using this procedure with the same parameters listed below.
The block-stacking procedure includes three steps (Figure 1—figure supplement 1a): (1) defining the designated area, (2) stacking blocks, and (3) fine-tuning block positions. The first step is to designate a restricted placement area. All blocks of a stack were required to be placed within the designated area. The designated area controls the aggregation level of blocks, with a small area clustering blocks closer than a large area. The designated area is determined by two horizontal parameters x and y, which separately represent the size of the area in two horizontal directions. Therefore, when the block number is fixed, a smaller area in general constructs a higher stack. After designating the area, in step 2 we stacked blocks in random horizontal positions within the area one by one. If no block was positioned under a new block, the new block would be directly placed on the ground; otherwise, it would stack on the positioned block. The horizontal position of each block was independently sampled from a uniform distribution, with lower and upper bounds being −x and +x, or −y and +y, respectively (x and y were both independently sampled from a uniform distribution $U\left(0.2,\,2.0\right)$). The first two steps allow us to generate a large number of configurations within the designated area, which is the only restriction of the block-stacking procedure. To better control the physical stability of each stack, in step 3 we fine-tuned blocks in the stack by adjusting the overlap between every pair of neighboring blocks, which was randomly sampled from a uniform distribution $U\left(0.2,\,0.8\right)$. Smaller overlap between neighboring blocks is more likely to construct unstable stacks, whereas more extensive overlap results in more stable stacks. The overlap of neighboring blocks without contact is set to 0.
Note that the overlap between neighboring blocks is not the only factor determining a stack’s stability, and step 3 is used to generate stacks without consuming too many computational resources.
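The random draws in the three steps above can be sketched as follows. This covers only the sampling of the area, the horizontal positions, and the overlaps, assuming the position bounds are −x to +x and −y to +y; the function name, the returned dictionary, and the choice of one overlap per consecutively placed pair of blocks are assumptions made here, and the geometric placement itself is omitted.

```python
import random

def sample_stack_layout(n_blocks, seed=None):
    """Sample the random quantities used to build one stack (geometry omitted)."""
    rng = random.Random(seed)
    # Step 1: designated area, with x and y each drawn from U(0.2, 2.0).
    half_x = rng.uniform(0.2, 2.0)
    half_y = rng.uniform(0.2, 2.0)
    # Step 2: horizontal position of each block, uniform within the area.
    positions = [(rng.uniform(-half_x, half_x), rng.uniform(-half_y, half_y))
                 for _ in range(n_blocks)]
    # Step 3: overlap for neighboring blocks, drawn from U(0.2, 0.8).
    # Assumption: one value per consecutively placed pair.
    overlaps = [rng.uniform(0.2, 0.8) for _ in range(max(n_blocks - 1, 0))]
    return {"area": (half_x, half_y), "positions": positions, "overlaps": overlaps}

layout = sample_stack_layout(10, seed=0)
```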
The size of each block has a 3D aspect ratio of 3:1:1 (length:width:height), with an arbitrary unit of 1.2:0.4:0.4. This constitutes three types of blocks (length, width, or height is 1.2, respectively, see Figure 1—figure supplement 1b). Each block of a stack was randomly selected as one of the three types of blocks. The mass of each block is set to 0.2 kg, and the friction coefficients and the coefficients of restitution between blocks are set to 1 and 0, respectively.
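The first two steps of the procedure can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: blocks are axis-aligned, a new block rests on the highest previously placed block whose horizontal footprint overlaps its own, and the overlap fine-tuning of step 3 is omitted. The names `BLOCK_TYPES`, `footprints_overlap`, and `stack_blocks` are hypothetical.

```python
import random

# Block dimensions (length, width, height) for the three orientations,
# in the arbitrary units given in the text (3:1:1 aspect ratio).
BLOCK_TYPES = [(1.2, 0.4, 0.4), (0.4, 1.2, 0.4), (0.4, 0.4, 1.2)]


def footprints_overlap(a, b):
    """True if the horizontal footprints of two axis-aligned blocks overlap."""
    return (abs(a["cx"] - b["cx"]) < (a["lx"] + b["lx"]) / 2 and
            abs(a["cy"] - b["cy"]) < (a["ly"] + b["ly"]) / 2)


def stack_blocks(n_blocks, rng):
    # Step 1: designate the area; x and y are sampled from U(0.2, 2.0).
    x, y = rng.uniform(0.2, 2.0), rng.uniform(0.2, 2.0)
    blocks = []
    for _ in range(n_blocks):
        lx, ly, lz = rng.choice(BLOCK_TYPES)
        # Step 2: sample the horizontal position uniformly within the area.
        new = {"cx": rng.uniform(-x, x), "cy": rng.uniform(-y, y),
               "lx": lx, "ly": ly, "lz": lz}
        # Assumption: the block rests on the ground (height 0) or on top of
        # the highest already-placed block beneath its footprint.
        new["bottom"] = max((b["bottom"] + b["lz"] for b in blocks
                             if footprints_overlap(new, b)), default=0.0)
        blocks.append(new)
    return (x, y), blocks
```

Calling `stack_blocks(10, random.Random(0))` returns the sampled area parameters and a list of ten block records; a smaller sampled area makes footprint overlaps more frequent and therefore tends to yield taller stacks.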
Estimating the stability of a stack
The stability of a stack was obtained by a rigid-body forward simulation under the natural gravity environment (i.e., NGS). The direction of natural gravity points downward (i.e., $\overrightarrow{G}$ = (0, 0, −9.8)), and all blocks of a stack are affected by the same gravity. Gravity is the only factor that changes the state of each block, and no external force is added during the simulation. Within each simulation, we recorded 500 simulation stages. In each stage, the center position of each block was collected to measure the stability of the stack. If no block changes position during the simulation, the stack is considered stable; otherwise, it is considered unstable. We formulate the stack’s state according to the following criterion:
where $t$ is a simulation stage, $m$ indexes the blocks of a stack, ${P}_{tm}$ is the position of block $m$ at stage $t$, and $\epsilon$ is the just noticeable difference (j.n.d.) of perception, which is set to 0.01.
The stability of a stack is further calculated by measuring the proportion of displaced blocks, which is formulated as follows:
where M is the total number of blocks of a stack, and T is the final stage of the simulation (i.e., T = 500). $\mathbb{I}(\cdot )=1$ when $\left|{P}_{Tb}-{P}_{0b}\right|<\epsilon$ (with b indexing the blocks), which denotes that the stack is stable.
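The two measures described above can be sketched as follows. This is an illustrative reading of the text rather than the authors' code: it assumes that the position difference is the Euclidean distance between block centers and that a trajectory is stored as a list of stages, each a list of (x, y, z) centers. The function names are hypothetical.

```python
import math

EPSILON = 0.01  # just noticeable difference (j.n.d.) stated in the text


def is_stable(trajectory, eps=EPSILON):
    """Stable only if every block stays within eps of its start at every stage."""
    start = trajectory[0]
    return all(math.dist(stage[m], start[m]) < eps
               for stage in trajectory
               for m in range(len(start)))


def stability(trajectory, eps=EPSILON):
    """Proportion of blocks whose final position is within eps of the start."""
    start, final = trajectory[0], trajectory[-1]
    in_place = sum(math.dist(final[m], start[m]) < eps
                   for m in range(len(start)))
    return in_place / len(start)
```

With this convention a value of 1 means that no block moved by the end of the simulation, and lower values mean that a larger share of the blocks was displaced.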
Measuring participants’ sensitivity to gravity’s direction
We decomposed gravity’s direction into three independent components (Figure 1b):
where g is the magnitude of gravity (g = 9.8), which was fixed in this study. $\theta $ represents the vertical component, $\phi $ represents the horizontal component, and x, y, and z are three mutually perpendicular axes. The direction of the gravity was determined by the angle pair $(\theta ,\phi )$, where $\theta $ affects the extent of the collapse, and $\phi $ affects the orientation of the collapse. When $\theta $ is 0, gravity’s direction is vertical.
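One spherical parameterization consistent with this description is $\overrightarrow{G} = g(\sin\theta\cos\phi,\ \sin\theta\sin\phi,\ -\cos\theta)$. The sketch below uses that form as an assumption; the sign and axis conventions may differ from the authors' own, and the function name `gravity_vector` is hypothetical. It reduces to the natural gravity (0, 0, −9.8) when $\theta$ is 0.

```python
import math


def gravity_vector(theta_deg, phi_deg, g=9.8):
    """Gravity vector for tilt theta (from vertical) and azimuth phi, in degrees."""
    theta, phi = math.radians(theta_deg), math.radians(phi_deg)
    return (g * math.sin(theta) * math.cos(phi),   # x component
            g * math.sin(theta) * math.sin(phi),   # y component
            -g * math.cos(theta))                  # z component (downward)
```

Under this parameterization the magnitude is always g, so only the direction of the force changes with the angle pair.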
We performed a psychophysics experiment to measure humans’ sensitivity to gravity’s direction. In this experiment, 10 participants (five females, age range: 21–28 y) from Tsinghua University were recruited to finish four runs of the behavioral experiment, which measured their ability to detect the abnormality of stacks’ collapse trajectories. The experiment was approved by the Institutional Review Board of Tsinghua University (2022 no. 34), and informed consent was obtained from all participants before the experiment.
The collapse trajectory of a stack was solely determined by gravity with different directions, where larger values of $\theta$ and $\phi$ made the trajectories more abnormal. A pilot experiment showed that almost all values of $\theta$ greater than 45° made the collapse trajectory appear abnormal to most participants; therefore, in the experiment, $\theta$ ranged from 0° to 45° in steps of 3°, and $\phi$ ranged from 0° to 360° in steps of 24°. Thus, $\theta$ and $\phi$ each took 16 values, which were randomly combined into 96 pairs of $(\theta, \phi)$, with each value repeated six times in each run. In a trial, an unstable stack was constructed, and then the camera rotated one full circle to show the 3D configuration of the stack to participants (Figure 1—video 1). The configuration was randomly selected from a dataset of more than 2000 unstable stacks, which was generated with the block-stacking procedure before the experiment. Each stack in the dataset was constructed with 10 blocks, and the color of each block was randomly assigned. There was a 1 s delay after the rotation, during which the participants were instructed to infer the collapse trajectory based on the configuration. Then, simulated gravity with a direction determined by an angle pair $(\theta, \phi)$ was applied to the stack, and the stack started to collapse. If the collapse trajectory met participants’ expectations, they were instructed to choose ‘normal,’ otherwise ‘abnormal.’ Once the judgment was made, the subsequent trial started immediately. Each trial lasted about 10 s, so a run took about 16 min.
In addition, to test whether participants’ sensitivity to gravity’s direction is encapsulated from visual experience and task context, we flipped gravity’s direction upside down by inverting the camera’s view, and the rest of the procedure remained the same.
To calculate participants’ sensitivity to gravity’s direction, we converted their behavioral judgments into confidence levels about the normal trajectory, defined as the percentage of trials in which a trajectory was judged as normal and calculated as follows:
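The trial list for one run can be sketched as below. This is a minimal illustration that assumes the two value lists are each repeated six times, shuffled independently, and paired position by position; the exact randomization scheme used in the experiment is not specified in the text, and `make_angle_pairs` is a hypothetical name.

```python
import random


def make_angle_pairs(rng, repeats=6):
    """Return 96 (theta, phi) pairs with every value used `repeats` times."""
    thetas = list(range(0, 46, 3)) * repeats    # 0, 3, ..., 45 (16 values)
    phis = list(range(0, 361, 24)) * repeats    # 0, 24, ..., 360 (16 values)
    rng.shuffle(thetas)
    rng.shuffle(phis)
    return list(zip(thetas, phis))
```

At roughly 10 s per trial, the 96 trials produced this way account for the stated run length of about 16 min.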
where ${n}_{\theta ,\phi}$ is the number of trajectories that were judged as ‘normal’ with the angle pair ($\theta ,\phi $), and ${N}_{\theta ,\phi}$ is the total number of trajectories with the same angle pair. Because the angle pairs tested were a subset of all possible angle pairs, we used the average ratio along $\phi $ as the ratio of angle pairs untested (Figure 1c) to acquire each participant’s tuning curve. Finally, we calculated participants’ sensitivity by fitting their confidence levels at different $\theta $ to a Gaussian distribution.
where ${Ratio}_{\theta}$ is the confidence level of $\theta$, which was calculated by averaging the confidence level across all values of $\phi$, A is the magnitude of the Gaussian curve, and $\sigma$ is the variance of the Gaussian curve. The best-fitting $\sigma$ was used to index participants’ sensitivity to gravity’s direction, and a larger $\sigma$ indicates a lower sensitivity.
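The ratio and the curve fit can be illustrated with a small, dependency-free sketch. The functional form used here, a zero-centered curve $A\exp(-\theta^2/2\sigma^2)$ with $\sigma$ acting as the width parameter, is an assumption, as is the fitting method (a grid search over $\sigma$ with the least-squares amplitude solved in closed form); the authors may have used a different form or optimizer. The function names are hypothetical.

```python
import math


def normal_ratio(n_normal, n_total):
    """Share of trials with a given angle pair that were judged 'normal'."""
    return n_normal / n_total


def fit_gaussian(thetas, ratios, sigmas=None):
    """Least-squares fit of A * exp(-theta^2 / (2 sigma^2)); returns (A, sigma)."""
    sigmas = sigmas or [s / 10 for s in range(10, 901)]  # 1.0 to 90.0 degrees
    best = None
    for sigma in sigmas:
        basis = [math.exp(-t ** 2 / (2 * sigma ** 2)) for t in thetas]
        # For a fixed sigma the optimal amplitude has a closed form.
        amp = (sum(r * b for r, b in zip(ratios, basis)) /
               sum(b * b for b in basis))
        sse = sum((r - amp * b) ** 2 for r, b in zip(ratios, basis))
        if best is None or sse < best[0]:
            best = (sse, amp, sigma)
    return best[1], best[2]
```

A wider fitted curve means that participants accepted trajectories produced by more strongly tilted gravity as normal, which is why a larger $\sigma$ is read as lower sensitivity.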
Measuring participants’ ability on stability inference
Another group of 11 participants (five females, age range: 21–32 y) from Tsinghua University completed a behavioral experiment for judging the stability of 60 stacks. The experiment was approved by the Institutional Review Board of Tsinghua University, and informed consent was obtained from all participants before the experiment. One male participant (age: 25 y) was excluded from further analyses because his judgments showed an extremely weak correlation with the actual stability of stacks ($r_s$ < 0.30 for all experimental runs) compared to the rest of the participants.
The stacks comprised 26 unstable and 34 stable stacks, which were randomly interleaved in each run. The participants were instructed to judge stacks’ stability on an 8-point Likert scale, with 0 referring to ‘definitely unstable’ and 7 to ‘definitely stable.’ There was no feedback after each judgment. The participants completed six runs, within which the same group of stacks was presented but the sequence, blocks’ colors, and camera’s perspective were all randomized. After the experiment, only two participants reported that they suspected a few stacks were repeated in different runs, but they could not identify the stacks they suspected. In addition, their behavioral performance was not significantly different from that of the other participants.
Participants’ stability judgments were rescaled to the range from 0 to 1 to match the scale of the stacks’ stability. The participants’ inference bias (IB) was indexed as the difference in stability judgment between the participants and the NGS, shown as
Negative IB indicates that participants tended to consider a stable stack as an unstable one.
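As a minimal sketch of this index, assuming a linear rescaling of the 0 to 7 ratings and a simple per-stack difference (participant judgment minus NGS stability), the computation could look like this. Both function names are hypothetical, and any averaging across stacks or runs is left out.

```python
def rescale_likert(rating, low=0, high=7):
    """Map a Likert rating in [low, high] linearly onto [0, 1]."""
    return (rating - low) / (high - low)


def inference_bias(judged_stability, actual_stability):
    """Negative values mean the stack was judged less stable than it is."""
    return judged_stability - actual_stability
```

For example, a stack that the NGS finds fully stable (1.0) but that a participant rates 0 yields an IB of −1, the most negative bias possible on this scale.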
Estimating the stability of stacks based on the stochastic world model on gravity
The actual stability of a stack can be calculated with a single simulation of the NGS ($\overrightarrow{G}$ = (0, 0, −9.8)). In contrast, the stochastic nature of mental gravity requires multiple simulations with different gravity directions. Specifically, we first randomly sampled several angle pairs $({\theta}_{s},{\phi}_{s})$ from the Gaussian distribution of gravity’s directions in humans. The distribution was the average of two distributions acquired from the real world (i.e., gravity’s direction is downward) and the inverted world (the direction is upward), with angles having larger confidence levels being more likely to be sampled. We then applied simulated gravity with these sampled directions to the stack and used the stability averaged across these directions as the stability of the stack estimated by the MGS. Similar to the IB between the participants and the NGS, the IB between the MGS and NGS was calculated as the difference in estimated stability between the MGS and the NGS.
Stacks of different heights were created to investigate whether the stochastic world model on gravity results in the illusion that tall objects are considered less stable than short ones. The height of a stack was correlated with the size of the designated area, with a smaller area size corresponding to taller stacks. Therefore, we designated several square areas with different sizes. The side length of the squares ranged from 0.2 to 2.0 in steps of 0.1. For each square, we used the block-stacking procedure to generate 100 stable and 100 unstable stacks consisting of 10 blocks. The height of each stack was the height of the highest block.
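The multiple-simulation estimate described in this section can be sketched as a simple Monte Carlo average. The sketch assumes that $\theta$ is drawn from a zero-centered Gaussian folded onto non-negative angles, that $\phi$ is uniform, and that a physics routine `simulate(stack, theta, phi)` returning a stability value in [0, 1] is supplied by the caller. All three are illustrative assumptions; the authors sampled from the empirically measured distribution.

```python
import random


def sample_direction(rng, sigma_deg):
    """Draw one (theta, phi) pair; small tilts are the most likely."""
    theta = abs(rng.gauss(0.0, sigma_deg))
    phi = rng.uniform(0.0, 360.0)
    return theta, phi


def mgs_stability(stack, simulate, rng, sigma_deg, n_samples=10):
    """Average the simulated stability over several sampled gravity directions."""
    total = 0.0
    for _ in range(n_samples):
        theta, phi = sample_direction(rng, sigma_deg)
        total += simulate(stack, theta, phi)
    return total / n_samples
```

Because even a stable stack can topple under a tilted force, averaging over sampled directions can only lower the estimate relative to a single vertical-gravity simulation of a stable stack, which is the direction of the bias discussed in the text.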
Investigating the origin of the stochastic world model on gravity
An RL framework was used to simulate the development of the stochastic nature of the world model on gravity. To do this, we first created stacks whose block number ranged from 2 to 15 with the block-stacking procedure, and initialized a spherical force space, where $\theta$ ranged from 0° to 180° and $\phi$ from 0° to 360°; each range was divided into 61 sampling angles (i.e., the angle density). The spherical space covered all possible force directions, each of which initially had the same probability of being sampled by the MGS. During training, three angle pairs $({\theta}_{s},{\phi}_{s})$ were sampled according to the probabilities over the spherical space and then applied to a stack to simulate its collapse trajectory, which was divided into 500 stages. We optimized the sampling probability of gravity’s direction by comparing the estimated stability (i.e., expectation) with the actual stability (i.e., observation) as a Q value, with a higher Q value suggesting that the sampled gravity direction was more likely to mismatch the actual gravity direction. The Q value was calculated as
where ${P}_{m,(\theta ,\phi )}$ is the final position of block m with gravity’s direction $(\theta ,\phi )$, ${P}_{m}$ is the final position of block m with NGS, M is the block number of the stack, and the j.n.d. $\epsilon $ is set to 0.01. The mismatch between the expectation and the observation was used to update the sampling probability of the angle pair using a temporal difference optimization
where $\gamma $ = 0.15 is the learning rate. This process was iterated to update the sample probability of angle pairs $({\theta}_{s},{\phi}_{s})$ until the training stopped. The angle density and learning rate are two factors that affect the learning speed. A larger angle density prolongs the time to reach convergence but enables a more detailed force space; a higher learning rate accelerates convergence but incurs larger variance during training. To balance speed and convergence, we utilized 100,000 configurations for the training.
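The update loop can be illustrated with the following sketch. Every formula in it is an assumption made for illustration: the mismatch Q is taken as the share of blocks whose final position under the sampled direction differs from the NGS outcome by more than $\epsilon$, and the update moves an unnormalized weight toward `1 - Q` with learning rate 0.15 before renormalizing. The authors' exact Q value and temporal difference rule may differ, and the function names are hypothetical.

```python
import math


def q_value(final_sampled, final_ngs, eps=0.01):
    """Share of blocks whose final position mismatches the NGS outcome."""
    mismatched = sum(math.dist(p, q) > eps
                     for p, q in zip(final_sampled, final_ngs))
    return mismatched / len(final_ngs)


def td_update(weights, angle_pair, q, lr=0.15):
    """Move the weight of the sampled direction toward (1 - q)."""
    updated = dict(weights)
    updated[angle_pair] += lr * ((1.0 - q) - updated[angle_pair])
    return updated


def to_probabilities(weights):
    """Normalize the weights into a sampling distribution."""
    total = sum(weights.values())
    return {pair: w / total for pair, w in weights.items()}
```

Under this rule, directions that repeatedly produce outcomes unlike those observed under natural gravity lose sampling probability, so the distribution gradually concentrates around the vertical direction.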
Evaluating the ecological advantage of the model
To investigate how the world model on gravity balances response accuracy and speed, we trained a linear classifier (i.e., logistic regression) to model humans’ decision-making process at different simulation stages. During the simulation, the same stack was separately simulated using the NGS and MGS, and we collected the position coordinates of all blocks at each stage. Differences in the positions of the blocks between the intermediate stage and the initial stage provided information about the stability of a stack, with more displaced blocks suggesting lower stability of the stack. As the simulation proceeded, differences in position gradually accumulated for unstable stacks but remained unchanged for stable stacks. The linear classifier was trained to judge whether a stack is stable with differences in position as inputs.
We used the block-stacking procedure to create stacks consisting of 2–10 blocks and estimated their stabilities with the NGS for simulation in 500 stages. For each block number, there were 100 stable and 100 unstable stacks to train the linear classifier, and its prediction accuracy was measured with another group of 100 stable and 100 unstable stacks at every simulation stage.
The difference in positions of each block between the intermediate and initial stages was used as the input of the linear classifier. Specifically, we collected all vertex positions of a block during the simulation to acquire the difference in position, which included eight coordinate points for each block in each stage. We did not collect the central position as previously used in the stability estimation simply because it did not provide information on the shape and size of the block. We separately performed the simulation using the MGS and NGS, calculated the difference in position between the intermediate stage and the initial stage, and then flattened the difference to generate 24 position features for each block (i.e., eight positions per block in three-dimensional space). Therefore, taking a 10-block stack as an example, 240 position features were prepared as the input of the linear classifier.
Prediction accuracy at each stage was estimated by evaluating whether a stack tested was stable with the MGS or with the NGS. The highest accuracy across all simulation stages was used as the prediction accuracy. Accordingly, the first simulation stage to reach the maximum accuracy provided information on response speed: reaching the maximum accuracy with a smaller number of stages indicates the classifier model accomplishes stability inference in a shorter amount of time (i.e., quick response). Therefore, we measured the response speed by estimating the steps to reach the accuracy plateau
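The feature construction can be sketched as follows. For simplicity the helper `box_vertices` generates the eight corners of an axis-aligned block, which ignores rotation; in an actual simulation the vertex positions would be read from the physics engine at each stage. Both function names are hypothetical.

```python
from itertools import product


def box_vertices(center, size):
    """Eight corner points of an axis-aligned block (rotation ignored)."""
    cx, cy, cz = center
    lx, ly, lz = size
    return [(cx + sx * lx / 2, cy + sy * ly / 2, cz + sz * lz / 2)
            for sx, sy, sz in product((-1, 1), repeat=3)]


def displacement_features(initial_vertices, current_vertices):
    """Flatten per-vertex displacements: 8 vertices x 3 axes = 24 per block."""
    features = []
    for block_start, block_now in zip(initial_vertices, current_vertices):
        for v_start, v_now in zip(block_start, block_now):
            features.extend(now - start for start, now in zip(v_start, v_now))
    return features
```

For a 10-block stack this yields a vector of 240 numbers per stage, all zero for a stack that has not moved, which is the input format described above for the linear classifier.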
where ${Accuracy}_{t}$ is the accuracy of stage t, $\hat{t}$ is the stage that a linear classifier acquires the maximum accuracy for the first time, and T is the total stage number of each simulation (T = 500). Higher values indicate longer response time (i.e., slower response). Finally, the efficiency of the stability inference, which is the balance between accuracy and speed, was found by dividing the prediction accuracy by the response time.
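A minimal sketch of these two measures is given below. It assumes that the response time is the first stage at which the maximum accuracy is reached, normalized by the total number of stages T, and that efficiency is the maximum accuracy divided by that normalized time; the normalization is an assumption, and the function names are hypothetical.

```python
def response_time(accuracies):
    """First stage reaching the maximum accuracy, as a fraction of all stages."""
    best = max(accuracies)
    t_hat = accuracies.index(best) + 1  # stages are counted from 1
    return t_hat / len(accuracies)


def efficiency(accuracies):
    """Prediction accuracy divided by response time."""
    return max(accuracies) / response_time(accuracies)
```

A classifier that reaches the same peak accuracy at an earlier stage therefore receives a higher efficiency score.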
Appendix 1
Estimate the lower bound of the possible number of configurations
A configuration is a structure composed of several blocks in contact. To simplify the estimation of the number of possible configurations, we constrained the shape of the blocks and the positions where they could be placed.
The shape constraint: the blocks used to form a configuration are all uniform rectangular blocks with the same aspect ratio.
The position constraint: only one block is allowed to be placed on the same layer of the configuration.
Thus, the problem is simplified to estimating the possible number of configurations when only one rectangular block with the aspect ratio of $\alpha :\beta :\gamma$ (i.e., the shape constraint) is allowed to be placed in each layer (i.e., the position constraint). Note that the constraints significantly reduce the number of estimated configurations.
We illustrated our solution by starting with a simple case: the aspect ratio of blocks is $\alpha :\alpha :\alpha$.
The condition when the aspect ratio of blocks is $\alpha :\alpha :\alpha$
The block with the aspect ratio of $\alpha :\alpha :\alpha$ is a cube (Appendix 1—figure 1a). The side length of the cube is defined as α. Consider a configuration with two stacked blocks: the upper block needs to be placed within a $3\alpha \times 3\alpha$ area to ensure contact with the bottom block (Appendix 1—figure 1b). To estimate the possible number of configurations in this simple situation, we defined a visual acuity ν, which is the minimum resolution needed to distinguish two stacks (i.e., j.n.d.). Note that ν is a small value, and here we set it as ν = 0.01 to match the minimal position difference for stability estimation in the simulation platform (please see Methods). Therefore, the possible number of configurations containing two cubic blocks is
where ${N}_{C2}$ indicates the possible number of configurations containing two cubic blocks.
We further consider the situation with more cubic blocks. A stack that contains three cubic blocks can be viewed as a cubic block placed on a two-block stack (Appendix 1—figure 1c). Therefore, the total possible number of configurations is the product of two two-block configuration counts, which is formulated as
Similarly, the possible number of configurations for stacks containing four cubic blocks is
Accordingly, the possible number of configurations with M cubic blocks is
We have now introduced the basic idea of calculating the number of configurations using a block with an $\alpha :\alpha :\alpha$ aspect ratio as a special case. We then generalize the idea to estimate the possible number when the block is rectangular with an aspect ratio of $\alpha :\beta :\beta$.
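The counting argument for cubic blocks can be illustrated numerically. The sketch assumes that the number of distinguishable placements of the upper cube is the $3\alpha \times 3\alpha$ area divided into cells of side ν, that is $(3\alpha/\nu)^2$, and that each added layer multiplies the count by the same factor, giving ${N}_{C2}^{M-1}$ for M cubes. These closed forms are a reading of the verbal description, not a quotation of the appendix equations, and the function names are hypothetical.

```python
def count_two_cubes(alpha, nu):
    """Distinguishable placements of one cube on another at resolution nu."""
    positions_per_axis = round(3 * alpha / nu)
    return positions_per_axis ** 2


def count_cube_stacks(n_blocks, alpha, nu):
    """Each additional layer multiplies the count by the two-cube count."""
    return count_two_cubes(alpha, nu) ** (n_blocks - 1)
```

With α = 0.4 and ν = 0.01 this gives 14,400 two-cube configurations, and the count grows exponentially with the number of layers.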
The condition when the aspect ratio of blocks is $\alpha :\beta :\beta$
A block with the aspect ratio of $\alpha :\beta :\beta$ has three types, corresponding to whether its length, width, or height is α while the remaining sides are β ($\alpha :\beta :\beta$, $\beta :\alpha :\beta$, and $\beta :\beta :\alpha$; see Appendix 1—figure 2a). For simplicity, we label the three basic blocks as A, B, and C. The three types of blocks can generate 9 (i.e., $3^2$) two-block configurations in total (Appendix 1—figure 2b). We calculate the possible number of each two-block configuration below.
The possible number of configurations for stacks containing two rectangular blocks with the aspect ratio of $\alpha :\beta :\beta$ is
A configuration containing three blocks can be viewed as a block stacked on a two-block stack (Appendix 1—figure 2c). Therefore,
where ${N}_{\cdot \cdot A}$ indicates the possible number when block A is stacked at the upper layer, and each term can be expanded as below.
Combining Equations A4–A6, we have
And
Therefore,
Following a similar logic, the possible number of configurations containing M blocks with an aspect ratio of $\alpha :\beta :\beta$ is
The aspect ratio of blocks is $\alpha :\beta :\gamma$
We further generalize the problem by considering the aspect ratio of blocks as $\alpha :\beta :\gamma$. This forms six different types: $\alpha :\beta :\gamma$, $\alpha :\gamma :\beta$, $\beta :\alpha :\gamma$, $\beta :\gamma :\alpha$, $\gamma :\alpha :\beta$, and $\gamma :\beta :\alpha$, where for each type the three proportional values correspond to length, width, and height, respectively. We label the six types of blocks as A, B, C, D, E, and F for simplicity.
Following the same logic as above, the different types of blocks generate 36 (i.e., $6^2$) two-block configurations in total, and the possible number of each two-block configuration is
The possible number of configurations for stacks with M blocks with an aspect ratio $\alpha :\beta :\gamma$ is
Therefore, we can estimate the possible number of configurations when only one rectangular block with the aspect ratio of $\alpha :\beta :\gamma$ is allowed to be placed in each layer using Equations A9 and A10.
Finally, in this study we chose blocks with an aspect ratio of 3:1:1 as building blocks for the stacks whose stability was evaluated. Specifically, for stacks consisting of 10 blocks and a j.n.d. of ν = 0.01, the number of configurations estimated with Equation A10 is $1.14 \times 10^{50}$.
Data availability
All analyses are included in the manuscript. The data and code are freely available from Figshare (https://doi.org/10.6084/m9.figshare.25591104.v1).

Article and author information
Funding
Beijing Municipal Science & Technology Commission and Administrative Commission of Zhongguancun Science Park (Z221100002722012): Jia Liu
Tsinghua University Guoqiang Institute (2020GQG1016): Jia Liu
Tsinghua University Qiyuan Laboratory: Jia Liu
Beijing Academy of Artificial Intelligence: Jia Liu
The Shimu Tsinghua Scholar Program: Taicheng Huang
The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.
Acknowledgements
We thank all the members of the Liu Lab for their valuable comments.
Ethics
Human subjects: The experiment was approved by the Institutional Review Board of Tsinghua University (2022 No. 34), and informed consent was obtained from all participants before the experiment.
Cite all versions
You can cite all versions using the DOI https://doi.org/10.7554/eLife.88953. This DOI represents all versions, and will always resolve to the latest one.
Copyright
© 2023, Huang and Liu
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.