(A) Summary of the algorithm used by the actor. (B) Identifying an action from the gradient of the objective function. The panel shows an example of how the objective depends on the action, and we wish to select the action value that maximizes the objective. To find this action, we let the action change over time in proportion to the gradient of the objective with respect to the action (Equation 4.2, where the dot over the action variable denotes the derivative with respect to time). For example, if the action is initialized to .5, the gradient of the objective at this point is positive, so the action is increased (Equation 4.2), as indicated by the green arrow on the x-axis. The action keeps changing until the gradient is no longer positive, i.e. until the action is at the maximum of the objective. Analogously, if the action is initialized at a point where the gradient of the objective is negative, the action is decreased until it reaches the maximum of the objective.
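The gradient-following dynamics described in panel B can be illustrated with a minimal numerical sketch. Everything specific in it is an assumption made for illustration and is not taken from the figure: the objective is a hypothetical concave function with its maximum at `a_peak = 2.0`, the names `objective_gradient` and `ascend` are invented here, the time derivative is approximated with simple Euler steps, and the second starting value (3.5) is chosen only so that it lies on the side of the maximum where the gradient is negative.

```python
def objective_gradient(a, a_peak=2.0):
    """Gradient of a hypothetical objective F(a) = -(a - a_peak)**2.

    This quadratic is an assumed stand-in for the curve in the panel;
    it has a single maximum at a = a_peak.
    """
    return -2.0 * (a - a_peak)


def ascend(a_init, step=0.01, n_steps=2000):
    """Euler approximation of da/dt = dF/da, starting from a_init."""
    a = a_init
    for _ in range(n_steps):
        # Move the action in proportion to the local gradient:
        # positive gradient -> increase a, negative gradient -> decrease a.
        a += step * objective_gradient(a)
    return a


# Starting on the side where the gradient is positive: the action increases.
a_from_left = ascend(0.5)
# Starting on the side where the gradient is negative: the action decreases.
a_from_right = ascend(3.5)

print(a_from_left, a_from_right)
```

Under these assumptions both runs settle at the same action, the maximum of the assumed objective, regardless of which side they start from. This holds here because the assumed objective has a single peak; an objective with several local maxima would make the end point depend on the starting value.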