vanhasselt2015: Learning to Predict Independent of Span

\( \newcommand{\states}{\mathcal{S}} \newcommand{\actions}{\mathcal{A}} \newcommand{\rewards}{\mathcal{R}} \newcommand{\transition}{P} \newcommand{\reals}{\mathbb{R}} \newcommand{\naturals}{\mathbb{N}} \newcommand{\expected}{\mathbb{E}} \newcommand{\by}{\times} \newcommand{\partialderiv}[2]{\frac{\partial #1}{\partial #2}} \newcommand{\defineq}{\stackrel{\mathclap{\tiny\mbox{def}}}{=}} \newcommand{\defeq}{\stackrel{\mathclap{\tiny\mbox{def}}}{=}} \newcommand{\eye}{\Imat} \newcommand{\hadamard}{\odot} \newcommand{\trans}{\top} \newcommand{\inv}{{-1}} \newcommand{\argmax}{\operatorname{argmax}} \newcommand{\Prob}{\mathbb{P}} \newcommand{\avec}{\mathbf{a}} \newcommand{\bvec}{\mathbf{b}} \newcommand{\cvec}{\mathbf{c}} \newcommand{\dvec}{\mathbf{d}} \newcommand{\evec}{\mathbf{e}} \newcommand{\gvec}{\mathbf{g}} \newcommand{\hvec}{\mathbf{h}} \newcommand{\lvec}{\mathbf{l}} \newcommand{\mvec}{\mathbf{m}} \newcommand{\nvec}{\mathbf{n}} \newcommand{\pvec}{\mathbf{p}} \newcommand{\qvec}{\mathbf{q}} \newcommand{\rvec}{\mathbf{r}} \newcommand{\svec}{\mathbf{s}} \newcommand{\uvec}{\mathbf{u}} \newcommand{\vvec}{\mathbf{v}} \newcommand{\wvec}{\mathbf{w}} \newcommand{\xvec}{\mathbf{x}} \newcommand{\yvec}{\mathbf{y}} \newcommand{\zvec}{\mathbf{z}} \newcommand{\Amat}{\mathbf{A}} \newcommand{\Bmat}{\mathbf{B}} \newcommand{\Cmat}{\mathbf{C}} \newcommand{\Dmat}{\mathbf{D}} \newcommand{\Emat}{\mathbf{E}} \newcommand{\Fmat}{\mathbf{F}} \newcommand{\Imat}{\mathbf{I}} \newcommand{\Pmat}{\mathbf{P}} \newcommand{\Umat}{\mathbf{U}} \newcommand{\Wmat}{\mathbf{W}} \newcommand{\Xmat}{\mathbf{X}} \newcommand{\thetavec}{\boldsymbol{\theta}} \newcommand{\muvec}{\boldsymbol{\mu}} \newcommand{\sigmavec}{\boldsymbol{\sigma}} \newcommand{\jacobian}{\mathbf{J}} \)

Reinforcement Learning, Theory, General Value Functions
van Hasselt, Hado, & Sutton, R. S.

This paper is dedicated to deriving algorithms which learning to make prediction “independent of span”

Span: The span of a multi-step prediction is the number of steps elapsing between when the prediction is made and when its target or ideal value is known. For example:

  • If we make predictions on each day and predict the stock market’s price in 30 days, then the span=30
  • If we make predictions on every hour and the prediction is made in 30 days, the span=24 * 30=720.

The span of a prediction may also very over time. The prediction made in 30 days has a shorter span than the prediction made several months into the future. We will consider the overall span of the prediction to be the maximal span. This can occur if predictions are made a long time-scales or with a short time between predictions.

Main Goal: This paper focuses on the construction of learning algorithms whose computational complexity per time step (in both time and memory) is constant (i.e. doesn’t scale with time) and independent of span.

  1. Derive a span-independent algorithm to update the prediction for single final outcome sec:vanhasselt2015:traces
  2. Derive span-independent updates that update the predictions online towards interim targets that temporarily stand in for the final outcome sec:vanhasselt2015:online
  3. Consider how to do these algorithms efficiently sec:vanhasselt2015:efficiency
  4. Formalize these ideas and show they lead naturally to a form of TD(\(\lambda\)) sec:vanhasselt2015:tdlambda
  5. Show how to combine all prior ideas into a single algorithm sec:vanhasselt2015:onetorulethem
  6. Two important generalizations: cumulative returns, and soft terminations sec:vanhasselt2015:generalizations
  7. Convergence of the new algorithm sec:vanhasselt2015:convergence
  8. Discussion/Conclusions sec:vanhasselt2015:discussion

Independence of span and the emergence of traces {#sec:vanhasselt2015:traces}


  • \(t\) time step with \(t=0\) as the initial point
  • \(\boldsymbol{\theta}_t\) for the feature vector produced at time step \(t\)
  • \(Z\) the final outcome occurring at time step \(T\)

Here we are considering the general case of multi-step predictions (\(T>1\)) where predictions are made at each time step. In the supervised learning case the predictor will only make a single prediction for the entire episode (or time sequence \(t=0…T\)).

  • Predictions are linear with learned weight vector at time t \(\boldsymbol{\theta}_t\) and prediction at time t \(\boldsymbol{\theta}_t^\top \boldsymbol{\phi}_t\)

When the final time is reached we can make an update assuming the LMS loss function:

\[\boldsymbol{\theta}_{t+1} \doteq \boldsymbol{\theta}_{t}+\alpha_{t} \boldsymbol{\phi}_{t}\left(Z-\boldsymbol{\phi}_{t}^{\top} \boldsymbol{\theta}_{t}\right), \quad t=0, \ldots, T-1\]

This is called the forward view as you have to look forward to the end of the episode to update the weight vector. All resources (memory and computation) scale with the length of the episode as you have to store each feature vector, and calculate each weight at the end of the episode.

With some careful analysis we can find a computationally spread way of dealing with this update. See paper for a fuller explanation. In the end we get to

\[\boldsymbol{\theta}_T=\underbrace{\mathbf{F}_{T-1} \mathbf{F}_{T-2} \cdots \mathbf{F}_{0} \boldsymbol{\theta}_{0}}_{\dot{=} \boldsymbol{a}_{T-1}}+\underbrace{\left(\sum_{t=0}^{T-1} \mathbf{F}_{T-1} \mathbf{F}_{T-2} \cdots \mathbf{F}_{t+1} \alpha_{t} \phi_{t}\right)}_{\dot{\doteq} \boldsymbol{e}_{T-1}} Z\]

where \(\mathbf{F}_{t} \doteq \mathbf{I}-\alpha_{t} \boldsymbol{\phi}_{t} \boldsymbol{\phi}_{t}^{\top}\).

We can then define both updates to \(\boldsymbol{a}_t\) and \(\boldsymbol{e}_t\) in incremental terms where the eligibility vector is analogous to the conventional eligibility trace using dutch traces! This suggests that eligibility traces are not specific to TD learning at all; and are more fundamental to predictions over time. This span-independent algorithm is denoted as the backwards view and can be summarized as:

\begin{align} {\mathbf{a}_{0} \doteq \theta_{0}, \quad \text { then } a_{t+1} \doteq a_{t}+\alpha_{t} \phi_{t}\left(0-\phi_{t}^{\top} a_{t}\right), \quad t=1, \ldots, T-1} \nonumber \\\
{e_{-1} \doteq 0, \quad \text { then } e_{t} \doteq e_{t-1}+\alpha_{t} \phi_{t}\left(1-\phi_{t}^{\top} e_{t-1}\right), \quad t=0, \ldots, T-1} \\\
{\boldsymbol{\theta}_{T} \doteq \boldsymbol{a}_{T-1}+Z \boldsymbol{e}_{T-1}} \nonumber \end{align}

While the overall computation (\(O(Tn)\)) is still the same, the amount of memory has been significantly reduced \(O(n)\), meaning the memory is constant wrt the span of the prediction.

Online Updating and the emergence of TD errors {#sec:vanhasselt2015:online}

Unifying online and offline learning and the emergence of averaging {#sec:vanhasselt2015:efficiency}

Bootstrapping {#sec:vanhasselt2015:tdlambda}

Combining two notions of trust and the emergence of averaged TD(\(\lambda\)) {#sec:vanhasselt2015:onetorulethem}

Generalizing to cumulative returns and soft terminations {#sec:vanhasselt2015:generalizations}

Convergence Analysis {#sec:vanhasselt2015:convergence}

Discussion {#sec:vanhasselt2015:discussion}

This is the discussion section