niv2009reinforcement: Reinforcement learning in the brain

\( \newcommand{\states}{\mathcal{S}} \newcommand{\actions}{\mathcal{A}} \newcommand{\observations}{\mathcal{O}} \newcommand{\rewards}{\mathcal{R}} \newcommand{\traces}{\mathbf{e}} \newcommand{\transition}{P} \newcommand{\reals}{\mathbb{R}} \newcommand{\naturals}{\mathbb{N}} \newcommand{\complexs}{\mathbb{C}} \newcommand{\field}{\mathbb{F}} \newcommand{\numfield}{\mathbb{F}} \newcommand{\expected}{\mathbb{E}} \newcommand{\var}{\mathbb{V}} \newcommand{\by}{\times} \newcommand{\partialderiv}[2]{\frac{\partial #1}{\partial #2}} \newcommand{\defineq}{\stackrel{{\tiny\mbox{def}}}{=}} \newcommand{\defeq}{\stackrel{{\tiny\mbox{def}}}{=}} \newcommand{\eye}{\Imat} \newcommand{\hadamard}{\odot} \newcommand{\trans}{\top} \newcommand{\inv}{{-1}} \newcommand{\argmax}{\operatorname{argmax}} \newcommand{\Prob}{\mathbb{P}} \newcommand{\avec}{\mathbf{a}} \newcommand{\bvec}{\mathbf{b}} \newcommand{\cvec}{\mathbf{c}} \newcommand{\dvec}{\mathbf{d}} \newcommand{\evec}{\mathbf{e}} \newcommand{\fvec}{\mathbf{f}} \newcommand{\gvec}{\mathbf{g}} \newcommand{\hvec}{\mathbf{h}} \newcommand{\ivec}{\mathbf{i}} \newcommand{\jvec}{\mathbf{j}} \newcommand{\kvec}{\mathbf{k}} \newcommand{\lvec}{\mathbf{l}} \newcommand{\mvec}{\mathbf{m}} \newcommand{\nvec}{\mathbf{n}} \newcommand{\ovec}{\mathbf{o}} \newcommand{\pvec}{\mathbf{p}} \newcommand{\qvec}{\mathbf{q}} \newcommand{\rvec}{\mathbf{r}} \newcommand{\svec}{\mathbf{s}} \newcommand{\tvec}{\mathbf{t}} \newcommand{\uvec}{\mathbf{u}} \newcommand{\vvec}{\mathbf{v}} \newcommand{\wvec}{\mathbf{w}} \newcommand{\xvec}{\mathbf{x}} \newcommand{\yvec}{\mathbf{y}} \newcommand{\zvec}{\mathbf{z}} \newcommand{\Amat}{\mathbf{A}} \newcommand{\Bmat}{\mathbf{B}} \newcommand{\Cmat}{\mathbf{C}} \newcommand{\Dmat}{\mathbf{D}} \newcommand{\Emat}{\mathbf{E}} \newcommand{\Fmat}{\mathbf{F}} \newcommand{\Gmat}{\mathbf{G}} \newcommand{\Hmat}{\mathbf{H}} \newcommand{\Imat}{\mathbf{I}} \newcommand{\Jmat}{\mathbf{J}} \newcommand{\Kmat}{\mathbf{K}} \newcommand{\Lmat}{\mathbf{L}} \newcommand{\Mmat}{\mathbf{M}} \newcommand{\Nmat}{\mathbf{N}} \newcommand{\Omat}{\mathbf{O}} \newcommand{\Pmat}{\mathbf{P}} \newcommand{\Qmat}{\mathbf{Q}} \newcommand{\Rmat}{\mathbf{R}} \newcommand{\Smat}{\mathbf{S}} \newcommand{\Tmat}{\mathbf{T}} \newcommand{\Umat}{\mathbf{U}} \newcommand{\Vmat}{\mathbf{V}} \newcommand{\Wmat}{\mathbf{W}} \newcommand{\Xmat}{\mathbf{X}} \newcommand{\Ymat}{\mathbf{Y}} \newcommand{\Zmat}{\mathbf{Z}} \newcommand{\Sigmamat}{\boldsymbol{\Sigma}} \newcommand{\identity}{\Imat} \newcommand{\epsilonvec}{\boldsymbol{\epsilon}} \newcommand{\thetavec}{\boldsymbol{\theta}} \newcommand{\phivec}{\boldsymbol{\phi}} \newcommand{\muvec}{\boldsymbol{\mu}} \newcommand{\sigmavec}{\boldsymbol{\sigma}} \newcommand{\jacobian}{\mathbf{J}} \newcommand{\ind}{\perp!!!!\perp} \newcommand{\bigoh}{\text{O}} \)

tags
Brain, Reinforcement Learning in the Brain, Reinforcement Learning
source
link
authors
Niv, Y.
year
2009

Quotes

Most notably, much evidence suggests that the neuromodulator dopamine provides basal ganglia target structures with phasic signals that convey a reward prediction error that can influence learning and action selection, particularly in stimulus-driven habitual instrumental behavior.

That is, RL models (1) generate predictions regarding the molar and molecular forms of optimal behavior, (2) suggest a means by which optimal prediction and action selection can be achieved, and (3) expose explicitly the computations that must be realized in the service of these.

Specifically, extracellular recordings in behaving animals and functional imaging of human decision-making have revealed in the brain the existence of a key RL signal, the temporal difference reward prediction error. In this review we will focus on these links between the theory of reinforcement learning and its implementation in animal and human neural processing.

Rescorla and Wagner postulated that the associative strength of each of the conditional stimuli \(V(CS_i)\) will change according to \[ V_{new}(CS_i) = V_{old}(CS_i) + \eta \left[ \lambda_{US} - \sum_i V_{old}(CS_i) \right]. \] In this error-correcting learning rule, learning is driven by the discrepancy between what was predicted (\(\sum_i V(CS_i)\), where \(i\) indexes all the CSs present in the trial) and what actually happened (\(\lambda_{US}\), whose magnitude is related to the worth of the unconditional stimulus, and which quantifies the maximal associative strength that the unconditional stimulus can support).
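A minimal sketch of this update (the trial loop, variable names, and learning-rate value are my own illustrative choices, not from the paper). Because the error is computed from the summed prediction of all CSs present on a trial, a pre-trained CS "blocks" learning about a newly added one:

```python
import numpy as np

def rescorla_wagner_update(V, present, lambda_us, eta=0.1):
    """One Rescorla-Wagner trial: update V(CS_i) for every CS presented on the trial."""
    prediction = V[present].sum()      # summed prediction of all CSs present in the trial
    error = lambda_us - prediction     # lambda_US minus the total prediction
    V = V.copy()
    V[present] += eta * error          # only the presented CSs change
    return V

# Blocking: train CS A alone, then A and B in compound with the same US.
V = np.zeros(2)                                                   # [V(A), V(B)]
for _ in range(100):
    V = rescorla_wagner_update(V, np.array([True, False]), lambda_us=1.0)
for _ in range(100):
    V = rescorla_wagner_update(V, np.array([True, True]), lambda_us=1.0)
print(np.round(V, 3))   # V(A) is near 1, V(B) stays near 0: A already predicts the US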

At the basis of the Rescorla-Wagner model are two important (and innovative) assumptions or hypotheses: (1) learning happens only when events are not predicted, and (2) predictions due to different stimuli are summed to form the total prediction in a trial.

Issues with the Rescorla-Wagner model:

  1. By treating the conditional and unconditional stimuli as qualitatively different, it does not extend to the important phenomenon of second-order conditioning: if stimulus B predicts an affective outcome and stimulus A predicts stimulus B, then stimulus A also gains reward-predictive value (see the sketch after this list).
  2. The basic unit of learning is a conditioning trial as a discrete temporal object. Not only does this impose an experimenter-oriented parsing of otherwise continuous events, but it also fails to account for the sensitivity of conditioning to the different temporal relations between the conditional and the unconditional stimuli within a trial.
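A toy illustration of how the temporal-difference view described in the following quotes handles second-order conditioning, which the trial-level Rescorla-Wagner rule cannot. The state names, parameters, and the simplification of updating only stimulus A in the second phase are my own assumptions:

```python
# Second-order conditioning under a TD-style update (toy sketch).
alpha, gamma = 0.2, 0.9
V = {"A": 0.0, "B": 0.0}

def td_step(s, r, v_next):
    """One TD update for state s given the sampled reward and the next state's value."""
    delta = r + gamma * v_next - V[s]
    V[s] += alpha * delta

# Phase 1: B is repeatedly paired with reward.
for _ in range(200):
    td_step("B", r=1.0, v_next=0.0)        # B -> reward -> end of trial

# Phase 2: A is followed by B, with no reward on these trials.
# (For simplicity only A is updated here; in a full simulation B would slowly extinguish.)
for _ in range(200):
    td_step("A", r=0.0, v_next=V["B"])     # A inherits value through the A -> B transition

print({k: round(v, 2) for k, v in V.items()})   # A ends near gamma * V(B) despite never being rewarded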

(Werbos 1977), in his “heuristic dynamic programming” methods, and later (Barto, Sutton, and Watkins 1989), suggested that in a “model-free” case in which we cannot assume knowledge of the dynamics of the environment, the environment itself can supply this information stochastically and incrementally.
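A minimal sketch of that model-free idea, with an invented two-state chain and illustrative parameters: each sampled transition (s, r, s') stands in for the unknown dynamics, and a tabular TD(0) update consumes it stochastically and incrementally:

```python
import random

def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.95):
    """Tabular TD(0): update V(s) from one sampled transition, with no model of the dynamics."""
    delta = r + gamma * V[s_next] - V[s]    # temporal-difference prediction error
    V[s] += alpha * delta
    return delta

# Hypothetical chain s0 -> s1 -> end whose reward is only available by sampling.
V = {"s0": 0.0, "s1": 0.0, "end": 0.0}
for _ in range(2000):
    r = random.choice([0.0, 1.0])           # stochastic reward, unknown to the learner
    td0_update(V, "s0", 0.0, "s1")
    td0_update(V, "s1", r, "end")
print({k: round(v, 2) for k, v in V.items()})   # V(s1) approaches the mean reward, V(s0) approaches gamma * V(s1)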

Contrary to the “dopamine equals reward” hypothesis, the disappearance of the dopaminergic response to reward delivery did not accompany extinction, but rather it followed acquisition of the conditioning relationship—as the cells ceased to respond to reward the monkeys began showing conditioned responses of anticipatory licking and arm movements to the reward-predictive stimulus.

The close correspondence between the phasic dopaminergic firing patterns and the characteristics of a temporal difference prediction error led Montague et al. (1996) to suggest the reward prediction error hypothesis of dopamine. Within this theoretical framework, it was immediately clear why dopamine is necessary for reward-mediated learning in the basal ganglia.
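A toy simulation of a Pavlovian trial as a chain of CS-locked time steps (my own construction, with illustrative parameters) makes the correspondence concrete: early in training the TD error occurs at reward delivery, and after learning it migrates to the onset of the reward-predictive stimulus, mirroring the phasic dopamine recordings:

```python
import numpy as np

# One trial: the CS appears unpredictably at t = 0 and reward arrives at t = T - 1,
# so the moment before CS onset carries no prediction (value 0 by construction).
T = 10                       # CS-locked time steps per trial
alpha, gamma = 0.2, 1.0      # illustrative parameters
V = np.zeros(T + 1)          # V[T] = 0 is the post-reward terminal state

def run_trial(V):
    """Run one trial, updating V in place, and return the TD error at each moment."""
    deltas = np.zeros(T + 1)
    deltas[0] = gamma * V[0] - 0.0              # error when the unpredicted CS appears
    for t in range(T):
        r = 1.0 if t == T - 1 else 0.0          # reward only at the final step
        delta = r + gamma * V[t + 1] - V[t]     # TD prediction error
        V[t] += alpha * delta
        deltas[t + 1] = delta
    return deltas

first = run_trial(V)
for _ in range(300):
    last = run_trial(V)
print("early training:", np.round(first, 2))    # error peaks at reward delivery
print("after training:", np.round(last, 2))     # error has moved to CS onset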

While the reward prediction error hypothesis is precise, there are many open questions and challenges.

  1. Dopaminergic Neurons do not seem to be involved in the signaling of prediction errors for aversive outcomes (Mirenowicz and Schultz 1996), (Tobler, Dickinson, and Schultz 2003), (Ungless, Magill, and Bolam 2004).
    • Dopaminergic Neurons do signal negative prediction errors due to the absence of appetitive outcomes.
  2. Dopaminergic Neurons fire to stimuli not clearly related to reward prediction, specifically novel stimuli (Schultz 1998), even though these are not (yet) predictive of any outcome, aversive or appetitive.
    • (Kakade and Dayan 2002) addressed this possibility directly and suggested that the novelty responses can function as ‘novelty bonuses’—quantities that are added to other available reward (\(r^{new}_t = r_t + \text{novelty}(S_t)\)) and enhance exploration of novel stimuli (see the sketch after this list).
    • This has since been supported by fMRI evidence in humans (Wittmann et al. 2008).
  3. Another challenge is how hierarchical structure plays a role in learning behaviors. A quintessential example of this is the everyday task of making coffee, which comprises several high-level ‘modules’ such as ‘grind beans’, ‘pour water’, ‘add sugar’, each of which, in turn, comprises many lower-level motor actions.
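A minimal sketch of such a novelty bonus. The count-based form that decays with familiarity is my own illustrative assumption; Kakade and Dayan (2002) propose the general idea of adding a bonus to the reward, not this specific functional form:

```python
from collections import defaultdict

counts = defaultdict(int)

def novelty_bonus(s, scale=1.0):
    """Illustrative bonus: large for an unfamiliar stimulus, shrinking as it becomes familiar."""
    counts[s] += 1
    return scale / counts[s]

r_t = 0.0                                           # the novel stimulus carries no extrinsic reward
for step in range(5):
    r_new = r_t + novelty_bonus("novel_stimulus")   # r_new_t = r_t + novelty(S_t)
    print(step, round(r_new, 2))                    # 1.0, 0.5, 0.33, ... back toward the bare reward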

References

Barto, A. G., R. S. Sutton, and C. J. C. H. Watkins. 1989. “Sequential Decision Problems and Neural Networks.” In Advances in Neural Information Processing Systems. Morgan-Kaufmann.
Kakade, Sham, and Peter Dayan. 2002. “Dopamine: Generalization and Bonuses.” Neural Networks.
Mirenowicz, Jacques, and Wolfram Schultz. 1996. “Preferential Activation of Midbrain Dopamine Neurons by Appetitive Rather than Aversive Stimuli.” Nature. Nature Publishing Group.
Schultz, Wolfram. 1998. “Predictive Reward Signal of Dopamine Neurons.” Journal of Neurophysiology. American Physiological Society.
Tobler, Philippe N., Anthony Dickinson, and Wolfram Schultz. 2003. “Coding of Predicted Reward Omission by Dopamine Neurons in a Conditioned Inhibition Paradigm.” Journal of Neuroscience. Society for Neuroscience.
Ungless, Mark A., Peter J. Magill, and J. Paul Bolam. 2004. “Uniform Inhibition of Dopamine Neurons in the Ventral Tegmental Area by Aversive Stimuli.” Science. American Association for the Advancement of Science.
Werbos, P. 1977. “Advanced Forecasting Methods for Global Crisis Warning and Models of Intelligence.” General System Yearbook.
Wittmann, Bianca C., Nathaniel D. Daw, Ben Seymour, and Raymond J. Dolan. 2008. “Striatal Activity Underlies Novelty-Based Choice in Humans.” Neuron.

Links to this note: