Q-Learning to navigate turbulence without a map

Marco Rando; Martin James; Alessandro Verri; Lorenzo Rosasco; Agnese Seminara

doi:10.7554/eLife.102906.1

Not revised: This Reviewed Preprint includes the authors’ original preprint (without revision), an eLife assessment, public reviews, and a provisional response from the authors.

Reviewing Editor
Gordon Berman
Emory University, Atlanta, United States of America
Senior Editor
Aleksandra Walczak
CNRS, Paris, France

Reviewer #1 (Public review):

Overall I found the approach taken by the authors to be clear and convincing. It is striking that the conclusions are similar to those obtained in a recent study using a different computational approach (finite state controllers), and lend confidence to the conclusions about the existence of an optimal memory duration. There are a few points or questions that could be addressed in greater detail in a revision:

(1) Discussion of spatial encoding

The manuscript contrasts the approach taken here (reinforcement learning in a grid world) with strategies that involve a "spatial map" such as infotaxis. The authors note that their algorithm contains "no spatial information." However, I wonder if further degrees of spatial encoding might be delineated to better facilitate comparisons with biological navigation algorithms. For example, the gridworld navigation algorithm seems to have an implicit allocentric representation, since movement can be in one of four allocentric directions (up, down, left, right). I assume this is how the agent learns to move upwind in the absence of an explicit wind direction signal. However, not all biological organisms likely have this allocentric representation. Can the agent learn the strategy without wind direction if it can only go left/right/forward/back/turn (in egocentric coordinates)? In discussing possible algorithms, and the features of this one, it might be helpful to distinguish
(1) those that rely only on egocentric computations (run and tumble),
(2) those that rely on a single direction cue such as wind direction,
(3) those that rely on allocentric representations of direction, and
(4) those that rely on a full spatial map of the environment.

(2) Recovery strategy on losing the plume

While the approach to encoding odor dynamics seems highly principled and reaches appealingly intuitive conclusions, the approach to modeling the recovery strategy seems to be more ad hoc. Early in the paper, the recovery strategy is defined to be path integration back to the point at which odor was lost, while later in the paper, the authors explore Brownian motion and a learned recovery based on multiple "void" states. Since the learned strategy works best, why not first consider learned strategies, and explore how lack of odor must be encoded or whether there is an optimal division of void states that leads to the best recovery strategies? Also, although the authors state that the learned recovery strategies resemble casting, only minimal data are shown to support this. A deeper statistical analysis of the learned recovery strategies would facilitate comparison to those observed in biology.

(3) Is there a minimal representation of odor for efficient navigation?

The authors suggest (line 280) that the number of olfactory states could potentially be reduced to reduce computational cost. This raises the question of whether there is a maximally efficient representation of odors and blanks sufficient for effective navigation. The authors choose to represent odor by 15 states that allow the agent to discriminate different spatial regimes of the stimulus, and later introduce additional void states that allow the agent to learn a recovery strategy. Can the number of states be reduced or does this lead to loss of performance? Does the optimal number of odor and void states depend on the spatial structure of the turbulence as explored in Figure 5?

https://doi.org/10.7554/eLife.102906.1.sa1

Reviewer #2 (Public review):

Summary:

The authors investigate the problem of olfactory search in turbulent environments using artificial agents trained using tabular Q-learning, a simple and interpretable reinforcement learning (RL) algorithm. The agents are trained solely on odor stimuli, without access to spatial information or prior knowledge about the odor plume's shape. This approach makes the emergent control strategy more biologically plausible for animals navigating exclusively using olfactory signals. The learned strategies show parallels to observed animal behaviors, such as upwind surging and crosswind casting. The approach generalizes well to different environments and effectively handles the intermittency of turbulent odors.

Strengths:

(1) The use of numerical simulations to generate realistic turbulent fluid dynamics sets this paper apart from studies that rely on idealized or static plumes.

(2) A key innovation is the introduction of a small set of interpretable olfactory states based on moving averages of odor intensity and sparsity, coupled with an adaptive temporal memory.

(3) The paper provides a thorough analysis of different recovery strategies when an agent loses the odor trail, offering insights into the trade-offs between various approaches.

(4) The authors provide a comprehensive performance analysis of their algorithm across a range of environments and recovery strategies, demonstrating the versatility of the approach.

(5) Finally, the authors list an interesting set of real-world experiments based on their findings, that might invite interest from experimentalists across multiple species.

Weaknesses:

(1) The inclusion of Brownian motion as a recovery strategy, seems odd since it doesn't closely match natural animal behavior, where circling (e.g. flies) or zigzagging (ants' "sector search") could have been more realistic.

(2) Using tabular Q-learning is both a strength and a limitation. It's simple and interpretable, making it easier to analyze the learned strategies, but the discrete action space seems somewhat unnatural. In real-world biological systems, actions (like movement) are continuous rather than discrete. Additionally, the ground-frame actions may not map naturally to how animals navigate odor plumes (e.g. insects often navigate based on their own egocentric frame).

(3) The lack of accompanying code is a major drawback since nowadays open access to data and code is becoming a standard in computational research. Given that the turbulent fluid simulation is a key element that differentiates this paper, the absence of simulation and analysis code limits the study's reproducibility.

https://doi.org/10.7554/eLife.102906.1.sa0

Author response:

We thank the Editor and Reviewers for their work on our manuscript, and are happy to receive their positive comments, as well as their questions and suggestions. We are currently revising the manuscript and are planning to de-emphasize Brownian recovery as a simple yet biologically irrelevant benchmark and include comparisons with other biologically inspired strategies suggested by the reviewers. As for sharing the code and data: we completely agree: dataset 1 is already public and we will share the other dataset as well as the code. In a nutshell, we will be addressing the referee’s suggestions as follows:

(1) As Referee 1 points out, even if the algorithm does not require a map of space, the agent is still required to tell apart North, East, South and West relative to the wind direction which is implicitly assumed known. We will better clarify the spatial encoding required to implement these strategies.

(2) Referee 1 remarks that the learned recovery strategy works best and suggests to give it a more prominent role and better characterize it. We agree that what is done in the void state is definitely key and more work is needed to understand it. In the revised manuscript, we are planning to further substantiate the statistics of the learned recovery by repeating training several times and comparing several trajectories. Note that this strategy is much more flexible than the others and could potentially mix aspects of recovery to aspects of exploitation: we defer a more in-depth analysis that disentangles these two aspects elsewhere.

(3) Referee 1 asks whether an optimal, minimal representation of the olfactory states exists. Q learning defines the olfactory states prior to training and does not allow to systematically optimize odor representation for the task. Given the odor features, we can however discretize them in more or less olfactory states. We expect that decreasing the number of olfactory states provides less positional information and potentially degrades performance, although loss in performance may be overshadowed by noise or by efficient recovery. We are planning to re-train our model with a smaller numer of non-void states and will provide the comparison. The number of void states does not need further testing: we chose 50 void states because it matches the time agents typically remain in the void and indeed achieves very high performance (less than 50 void states results in no convergence and more than 50 introduces states that are rarely visited)

(4) Both reviewers correctly remark that Brownian motion is not biologically relevant. We will make sure to further clarify that this is a rather simple --but biologically irrelevant-- benchmark. We are planning to include results with both circling and zigzaging as biologically inspired recovery strategies.

(5) We agree with reviewer 2 that animal locomotion does not look like a series of discrete displacements on a checkerboard. However, to overcome this limitation, one has to first focus on a specific system to define actions in a way that best adheres to a species’ motor controls. Second, these actions are likely continuous, which makes reinforcement learning notoriously more complex. While we agree that more realistic models are definitely needed for a comparison with real systems, this remains outside the scope of the current work.

(6) We agree with the referees and editor that it is important to publish the code and data alongside with the manuscript. It was already planned and we will make sure to share the links within the revised version of the manuscript.

https://doi.org/10.7554/eLife.102906.1.sa3

Q-Learning to navigate turbulence without a map

Peer review process

Editors

Be the first to read new articles from eLife