module IOT_quantization
using Distributions, OffsetArrays, LinearAlgebra

function solve(; λ=100.0, B=10.0, σ=0.5, T=20, n=51)
    dist_W = Normal(0, σ)
    A = 0:1

    # Uniform grid: n cells with midpoints as representative points
    boundaries = range(-B, B, length=n+1)
    S_hat = [(boundaries[i] + boundaries[i+1])/2 for i in 1:n]

    c(s, a) = λ * a + (1 - a) * s^2

    # 1. Construct the quantized model (P̂, ĉ) using Gaussian CDFs
    P_hat = OffsetArray(zeros(n, n, length(A)), 1:n, 1:n, A)
    c_hat = OffsetArray(zeros(n, length(A)), 1:n, A)
    for a in A
        for i in 1:n
            si = S_hat[i]
            c_hat[i, a] = c(si, a)
            s_post = (a == 0) ? si : 0.0
            for j in 1:n
                P_hat[i, j, a] = cdf(dist_W, boundaries[j+1] - s_post) -
                                 cdf(dist_W, boundaries[j] - s_post)
            end
            # Probability mass clipped at ±B is assigned to the boundary cells
            P_hat[i, 1, a] += cdf(dist_W, boundaries[1] - s_post)
            P_hat[i, n, a] += 1.0 - cdf(dist_W, boundaries[end] - s_post)
        end
    end

    # 2. Backward induction
    V = [zeros(n) for t in 1:T+1]
    π = [zeros(Int, n) for t in 1:T]
    for t in T:-1:1
        for i in 1:n
            Q0 = c_hat[i, 0] + dot(P_hat[i, :, 0], V[t+1])
            Q1 = c_hat[i, 1] + dot(P_hat[i, :, 1], V[t+1])
            (V[t][i], π[t][i]) = (Q0 <= Q1) ? (Q0, 0) : (Q1, 1)
        end
    end
    return V, π, S_hat
end
end
11 Continuous state spaces
In this section, we discuss various approximation schemes for continuous state spaces. For simplicity of exposition, we assume that the action space is finite.
To fix ideas, let’s consider a variation of the model considered in Exercise 5.4 where the state space is a subset of \(\reals\) rather than \(\integers\).
Example 11.1 (Internet of things) Assume that \(\ALPHABET S = [-B, B]\), \(\ALPHABET A = \{0, 1\}\) and the dynamics are given by \[ S_{t+1} = \begin{cases} [S_t + W_t]_{-B}^B, & \text{if }A_t = 0 \\ [W_t]_{-B}^B, & \text{if }A_t = 1 \end{cases} \] where \([x]_{-B}^B = \text{clip}(x,-B,B)\). We assume that \(\{W_t\}_{t \ge 1}\) is an i.i.d. process with known distribution \(F_W\). The per-step cost is given by \[ c(s,a) = \lambda a + (1-a)s^2. \] Find the policy that minimizes the expected total cost over a finite horizon.
As usual, the dynamic program for Example 11.1 is given by \(V_{T+1}(s) = 0\) for all \(s \in \ALPHABET S\) and for \(t \in \{T, \dots, 1\}\), \[ V_{t}(s) = \min\biggl\{ s^2 + \int V_{t+1}([s + w]_{-B}^B) F_W(dw),\ λ + \int V_{t+1}([w]_{-B}^B) F_W(dw) \biggr\}, \quad \forall s \in \ALPHABET S. \]
Since the state space is continuous, the DP cannot be solved exactly (unless there exists a closed-form solution, as in the optimal gambling problem). For a numerical solution, we need some form of approximation that tackles two conceptual difficulties:
- Solve the DP recursion for all \(s \in \ALPHABET S\)
- Compute the expectation with respect to a continuous random variable.
We will touch upon how to handle both of these difficulties. But before that, we start with the simplest approximation scheme.
11.1 State discretization or quantization
In low dimensions, the simplest approximation scheme is state discretization or quantization: partition the state space \(\ALPHABET S\) into a partition \(\{\ALPHABET C_1, \dots, \ALPHABET C_n\}\) with \(n\) “cells”, where each cell \(\ALPHABET C_i\) is represented by a single element \(\hat s_i\). This gives a finite state space \(\hat {\ALPHABET S} \coloneqq \{ \hat s_1, \dots, \hat s_n \}\).
For example, in Example 11.1 we can define a partition using boundary points \[ -B = b_1 < b_2 < \cdots < b_{n+1} = B \] and set the representative points as \(\hat s_i = (b_i + b_{i+1})/2\), \(i \in \{1, \dots, n\}\).
The next step is to define a quantized model with dynamics \(\hat P\) and per-step cost \(\hat c\) on \(\hat {\ALPHABET S} × \ALPHABET A\). The simplest way to define such a quantized model is: \[\begin{align*} \hat c(\hat s_i, a) &= c(\hat s_i,a) \\ \hat P(\hat s_j | \hat s_i, a) &= P(\ALPHABET C_j | \hat s_i, a) = \int_{\ALPHABET C_j} P(ds'|\hat s_i, a). \end{align*}\]
Let \(\{ (\hat V_t, \hat π_t) \}_{t=1}^T\) denote the solution of the DP for the model \((\hat P, \hat c)\). Note that \(\hat π_t\) is a policy on the state space \(\hat {\ALPHABET S}\). To obtain a policy defined on \(\ALPHABET S\), define the quantization function \(φ \colon \ALPHABET S \to \hat {\ALPHABET S}\) to be the function that maps any state \(s\) to the representative point of the cell containing it. Define \(\bar V_t = \hat V_t \circ φ\) and \(\bar π_t = \hat π_t \circ φ\), i.e., \[ \bar V_t(s) = \hat V_t(φ(s)) \quad\text{and}\quad \bar π_t(s) = \hat π_t(φ(s)). \] Then \(\{(\bar V_t, \bar π_t) \}_{t=1}^T\) may be viewed as an approximate value function and approximate policy for the model \((P,c)\).
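For concreteness, the quantization function \(φ\) and the lifted policy can be sketched as follows, assuming the uniform midpoint grid used above (the names `quantize` and `lifted_policy` are illustrative, not part of the module):

```julia
# Map a state s ∈ [-B, B] to the index of the cell containing it, assuming
# a uniform grid with n cells; the representative point is the cell midpoint,
# matching the construction of S_hat in the module above.
function quantize(s, B, n)
    Δ = 2B / n
    i = clamp(floor(Int, (s + B) / Δ) + 1, 1, n)  # cell index in 1:n
    s_hat = -B + (i - 0.5) * Δ                    # cell midpoint ŝ_i
    return i, s_hat
end

# Lift a grid policy to the continuous state space: π̄(s) = π̂(φ(s))
lifted_policy(π_hat, s, B, n) = π_hat[quantize(s, B, n)[1]]
```

Note that states at the clipped boundary \(s = \pm B\) are mapped to the first and last cells, respectively.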
As an illustration, we solve Example 11.1 using such a quantization scheme.
We solve this model with \(n = 2^k + 1\) for different values of \(k\). This sequence is chosen so that the number of quantization points in \((0,B]\) doubles each time \(k\) is incremented. The plots below show the resulting approximate value function and policy.
Observe that the distance between the points is \(Δ = 2B/2^{k}\), so for \(k = 10\), we have \(Δ \approx 0.0195\), and the approximation looks almost continuous. Thus, we expect the corresponding approximate value function and policy to be close to the optimal value function and policy.
11.2 First-order hold
The above approximation can be thought of as a “zero-order hold” method of doing digital to analog conversion. A better option is to do “first-order hold”, where we do linear interpolation between the discrete points.
To do so, we start with a discretized state space \(\hat {\ALPHABET S} = \{\hat s_1, \dots, \hat s_n\}\) where \[ -B = \hat s_1 < \hat s_2 < \cdots < \hat s_n = B. \] The approximate per-step cost \(\hat c(\hat s, a)\) is also constructed as before. But we construct the approximate transition matrix differently.
Suppose the system transitions to a continuous state \(s'\). If \(s'\) falls between two grid points \(\hat s_i\) and \(\hat s_{i+1}\), the zero-order hold would force us to round \(s'\) to whichever is closer. In contrast, first-order hold represents \(s'\) as a convex combination of its neighbors:
\[ s' = (1 - λ) \hat s_i + λ \hat s_{i+1}, \quad \text{where } λ = \frac{s' - \hat s_i}{\hat s_{i+1} - \hat s_i}. \]
This linear interpolation implies that a transition to the exact state \(s'\) is equivalent to transitioning to \(\hat s_i\) with “weight” \((1-λ)\) and to \(\hat s_{i+1}\) with “weight” \(λ\).
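The interpolation step above can be sketched as follows, assuming a uniform grid (the name `interp_weights` is illustrative):

```julia
# Locate the two neighbouring grid points of s on a uniform grid S_hat and
# return their indices together with the convex-combination weights (1-λ, λ).
function interp_weights(s, S_hat)
    Δ = step(S_hat)
    i = clamp(floor(Int, (s - first(S_hat)) / Δ) + 1, 1, length(S_hat) - 1)
    λ = (s - S_hat[i]) / Δ
    return (i, i + 1), (1 - λ, λ)
end
```

By construction, the weighted combination of the two neighbours recovers \(s'\) exactly.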
To formalize this, we define a set of triangular interpolation kernels \(K_i(s)\) for each representative point \(\hat s_i\). Assuming a uniform grid spacing of \(\Delta\), these are defined as:
\[ K_i(s) = \max\left(0, 1 - \frac{|s - \hat s_i|}{\Delta}\right). \]
If the grid is not uniform, the slopes on the left and right sides are determined by the distance to the specific neighbors:
\[ K_i(s) = \begin{cases} \frac{s - \hat s_{i-1}}{\hat s_i - \hat s_{i-1}}, & \text{if } \hat s_{i-1} \le s < \hat s_i, \\ \frac{\hat s_{i+1} - s}{\hat s_{i+1} - \hat s_i}, & \text{if } \hat s_i \le s < \hat s_{i+1}, \\ 0, & \text{otherwise}. \end{cases} \]
For the boundary points (\(i=1\) and \(i=n\)), we simply omit the undefined side (e.g., \(K_1\) only has the right-side slope descending to \(\hat s_2\)).
A key property of these kernels is that, for any \(s \in \ALPHABET S\), \(\sum_i K_i(s) = 1\). This ensures that the interpolation is a valid convex combination of the grid points.
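The non-uniform kernel, with the boundary convention above, can be sketched as follows; the partition-of-unity property is then easy to check numerically (the name `K_nonuniform` is illustrative):

```julia
# Triangular interpolation kernel on a (possibly non-uniform) grid S_hat.
# Boundary points omit the undefined side, as described above.
function K_nonuniform(i, s, S_hat)
    n = length(S_hat)
    if i > 1 && S_hat[i-1] <= s < S_hat[i]
        return (s - S_hat[i-1]) / (S_hat[i] - S_hat[i-1])
    elseif i < n && S_hat[i] <= s < S_hat[i+1]
        return (S_hat[i+1] - s) / (S_hat[i+1] - S_hat[i])
    elseif i == n && s == S_hat[n]
        return 1.0
    else
        return 0.0
    end
end
```

The kernels are also nodal, i.e., \(K_i(\hat s_j) = 1\) if \(i = j\) and \(0\) otherwise, a property used later in the discussion of linear function approximation.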
The approximate transition matrix is now constructed as \[ \hat P(\hat s_j | \hat s_i, a) = \EXP\bigl[K_j(S_{t+1}) \mid S_t = \hat s_i, A_t = a\bigr] = \int_{\ALPHABET S} K_j(s') P(ds' | \hat s_i, a). \]
In contrast to the zero-order hold, where we could simply compute \(\EXP[ \IND_{\ALPHABET C_j}(S_{t+1}) | \hat s_i, a]\) using CDFs, in this case we need to resort to a numerical method to compute the integral. The most common approach is Monte Carlo estimation, where we approximate the expectation using \(M\) samples. For each starting state \(\hat s_i\) and action \(a\):
- Sample \(M\) next states \(s^{(1)}, \dots, s^{(M)}\) according to the dynamics \(s^{(m)} \sim P(\cdot | \hat s_i, a)\).
- Evaluate the kernel \(K_j\) at each sample state.
- Average the results: \[ \hat P_{ij}(a) \approx \frac{1}{M} \sum_{m=1}^M K_j(s^{(m)}). \]
There are other methods to compute such expectations, including Gaussian quadrature and specialized calculations for specific noise distributions.
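As an example of a deterministic alternative to Monte Carlo, the expected kernel weight can be approximated by a simple Riemann-sum quadrature over the (truncated) support of \(W\). This is only a sketch; the name `expected_weight` and the six-standard-deviation truncation are illustrative choices:

```julia
using Distributions

# Approximate E[K_j([s_post + W]_{-B}^{B})] by a Riemann sum over a
# truncated support of W, using the density of W instead of samples.
function expected_weight(Kj, s_post, dist_W, B; L=4000)
    lo = mean(dist_W) - 6 * std(dist_W)
    hi = mean(dist_W) + 6 * std(dist_W)
    grid = range(lo, hi, length=L)
    h = step(grid)
    return h * sum(pdf(dist_W, w) * Kj(clamp(s_post + w, -B, B)) for w in grid)
end
```

Unlike the Monte Carlo estimate, this computation is deterministic, but it scales poorly to multi-dimensional noise.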
As an illustration, we solve Example 11.1 using this linear interpolation scheme.
The code below implements the construction of the transition matrix using Monte Carlo sampling. For each state and action, we simulate \(M\) transitions, calculate the interpolation weights for the neighboring grid points, and average them.
module IOT_interpolation_MC
using Distributions, OffsetArrays, LinearAlgebra

# Triangular interpolation kernel centered at `center` on a uniform grid
function K(s, center, Δ)
    return max(0.0, 1.0 - abs(s - center) / Δ)
end

function solve(; λ=100.0, B=10.0, σ=0.5, T=20, n=51, M=1000)
    dist_W = Normal(0, σ)
    A = 0:1

    S_hat = range(-B, B, length=n)
    Δ = step(S_hat)

    c(s, a) = λ * a + (1 - a) * s^2

    # 1. Construct transition matrix via Monte Carlo + interpolation weights
    P_hat = OffsetArray(zeros(n, n, length(A)), 1:n, 1:n, A)
    c_hat = OffsetArray(zeros(n, length(A)), 1:n, A)
    for a in A
        for i in 1:n
            si = S_hat[i]
            c_hat[i, a] = c(si, a)
            s_post = (a == 0) ? si : 0.0
            W_samples = rand(dist_W, M)
            for w in W_samples
                s_next = clamp(s_post + w, -B, B)
                for j in 1:n
                    weight = K(s_next, S_hat[j], Δ)
                    if weight > 0
                        P_hat[i, j, a] += weight / M
                    end
                end
            end
        end
    end

    # 2. Backward induction
    V = [zeros(n) for t in 1:T+1]
    π = [zeros(Int, n) for t in 1:T]
    for t in T:-1:1
        for i in 1:n
            Q0 = c_hat[i, 0] + dot(P_hat[i, :, 0], V[t+1])
            Q1 = c_hat[i, 1] + dot(P_hat[i, :, 1], V[t+1])
            (V[t][i], π[t][i]) = (Q0 <= Q1) ? (Q0, 0) : (Q1, 1)
        end
    end
    return V, π, S_hat
end
end

We solve this for \(n = 2^k+1\) for different values of \(k\). Observe that in this case, we get a good approximation of the value function already at \(k = 8\).
We can go beyond linear interpolation by using B-splines. For an illustration using cubic splines, see Johnson et al. (1993).
11.3 Linear function approximation
State quantization can be alternatively viewed as a linear function approximation.
As an illustration, we first consider zero-order hold. Consider the following piecewise linear model defined on the original state space. \[\begin{align*} \bar c(s,a) &= \sum_{i=1}^n \IND_{\ALPHABET C_i}(s) c(\hat s_i, a) \\ \bar P(ds'|s,a) &= \sum_{i=1}^n \IND_{\ALPHABET C_i}(s) \left[ \sum_{j=1}^n \hat P(\hat s_j | \hat s_i, a) \delta_{\hat s_j}(ds') \right] \end{align*}\] where \(\delta_{\hat s}(s)\) denotes a Dirac-delta measure at \(\hat s\).
Lemma 11.1 Let \(\{ \bar V_t\}_{t =1}^T\) and \(\{\bar Q_t\}_{t=1}^T\) be the solution of the piecewise linear model defined above. Then, for each \(t\): \[ \bar Q_t(s,a) = \sum_{i=1}^n \hat Q_t(\hat s_i,a) \IND_{\ALPHABET C_i}(s) \quad\text{and}\quad \bar V_t(s) = \sum_{i=1}^n \hat V_t(\hat s_i) \IND_{\ALPHABET C_i}(s) \]
As usual, we proceed by backward induction. The result holds trivially for \(t = T\). This forms the basis of induction. Assume that the result holds for \(t+1\) and consider time \(t\). \[\begin{align*} \bar Q_t(s,a) &= \bar c(s,a) + \int_{\ALPHABET S} \bar V_{t+1}(s') \bar P(ds'|s,a) \\ &= \sum_{i=1}^n \hat c(\hat s_i,a) \IND_{\ALPHABET C_i}(s) + \sum_{i=1}^n \left( \sum_{j=1}^n \hat P(\hat s_j | \hat s_i, a) \int_{\ALPHABET S} \hat V_{t+1}(s') \delta_{\hat s_j}(ds') \right) \IND_{\ALPHABET C_i}(s) \\ &= \sum_{i=1}^n \hat c(\hat s_i,a) \IND_{\ALPHABET C_i}(s) + \sum_{i=1}^n \left( \sum_{j=1}^n \hat P(\hat s_j | \hat s_i, a) \hat V_{t+1}(\hat s_j) \right) \IND_{\ALPHABET C_i}(s) \\ &= \sum_{i=1}^n \left( \hat c(\hat s_i,a) + \sum_{j=1}^n \hat P(\hat s_j | \hat s_i, a) \hat V_{t+1}(\hat s_j) \right) \IND_{\ALPHABET C_i}(s) \\ &= \sum_{i=1}^n \hat Q_{t}(\hat s_i,a) \IND_{\ALPHABET C_i}(s). \end{align*}\]
Now consider the value function. \[\begin{align*} \bar V_t(s) &= \min_{a \in \ALPHABET A} \bar Q_t(s,a) \\ &= \min_{a \in \ALPHABET A} \sum_{i=1}^n \hat Q_t(\hat s_i, a) \IND_{\ALPHABET C_i}(s) \\ &\stackrel{(a)}= \sum_{i=1}^n \left( \min_{a \in \ALPHABET A} \hat Q_t(\hat s_i, a) \right) \IND_{\ALPHABET C_i}(s) \\ &= \sum_{i=1}^n \hat V_{t}(\hat s_i) \IND_{\ALPHABET C_i}(s) \end{align*}\] where \((a)\) follows from the fact that for a particular \(s\), \(\IND_{\ALPHABET C_i}(s)\) is non-zero for exactly one \(i\).
Thus, the result holds for \(t\).
Therefore, we can view the solution of zero-order hold as a function in a vector space. In particular, define \[ ψ_i(s) = \IND\{ s \in \ALPHABET C_i \}, \quad i \in \{1, \dots, n\}. \] Let \(\ALPHABET V_n = \text{span}(ψ_1, \dots, ψ_n)\). Then Lemma 11.1 implies that \(\bar V_t \in \ALPHABET V_n\) and for each \(a\), \(\bar Q_t(\cdot, a) \in \ALPHABET V_n\).
General linear function approximation generalizes this idea to any basis functions. But we need one additional ingredient. To illustrate that, consider first-order hold. The approximate model is now constructed as \[\begin{align*} \bar c(s,a) &= \sum_{i=1}^n K_i(s) c(\hat s_i, a) \\ \bar P(ds'|s,a) &= \sum_{i=1}^n K_i(s) \left[ \sum_{j=1}^n \hat P(\hat s_j | \hat s_i, a) \delta_{\hat s_j}(ds') \right] \end{align*}\] where \(K_i(s)\) are the triangular interpolation kernels. As in the proof of Lemma 11.1, we can show that if \[ \bar V_{t+1}(s) = \sum_{i=1}^n \hat V_{t+1}(\hat s_i) K_i(s) \] then \(\bar Q_t(\cdot, a) \in \ALPHABET V_n \coloneqq \text{span}(K_1, \dots, K_n)\) for each \(a\). However, in general, a vector space is not closed under pointwise minimization. So, if we define \(\bar V_t(s) = \min_{a \in \ALPHABET A} \bar Q_t(s,a)\), then in general \(\bar V_t \not\in \ALPHABET V_n\).
One way to ensure that \(\bar V_t \in \ALPHABET V_n\) is to define \[ \bar V_t(s) = \sum_{i=1}^n \hat V_t(\hat s_i) K_i(s). \] Thus, rather than obtaining \(\bar V_t(s)\) at each state \(s\), we obtain \(\bar V_t(\hat s_i)\) for \(\hat s_i \in \hat {\ALPHABET S}\) and then interpolate to get \(\bar V_t(s)\). However, the interpolation step is specific to first-order hold because the hat functions are nodal (i.e., \(K_i(\hat s_j) = \delta_{ij}\)). For general linear function approximation with arbitrary basis functions \(\{ψ_i\}_{i=1}^n\), such a direct mapping does not exist.
Instead, we view this step more generally as a projection. The simplest way to do so is as follows. Pick \(K\) collocation points \(\{\hat s_k\}_{k=1}^K\) and construct the collocation matrix \(Ψ \in \reals^{K \times n}\) as \[ Ψ_{ki} = ψ_i(\hat s_k). \] Then, construct a target \(Y_t \in \reals^{K}\) as \[ [Y_t]_k = \min_{a \in \ALPHABET A} \bar Q_t(\hat s_k, a). \]
Then set \[ \bar V_t(s) = \sum_{i=1}^n α_i ψ_i(s) \] where the weights \(α = (α_1, \dots, α_n)\) are the solution to the following least-squares (or regression) problem: \[ \min_{\alpha \in \mathbb{R}^n} \| Y_t - Ψ \alpha \|_2^2. \]
The solution is given by the normal equations: \[ α = (Ψ^\TRANS Ψ)^{-1} Ψ^\TRANS Y_t. \] The matrix \(Ψ^\dagger = (Ψ^\TRANS Ψ)^{-1} Ψ^\TRANS\) is the Moore-Penrose pseudoinverse of \(Ψ\) (assuming \(Ψ\) has full column rank). In the special case of interpolation where \(K=n\) and the points are chosen such that \(Ψ\) is invertible (e.g., nodal basis), this simplifies to \(α = Ψ^{-1} Y_t\).
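A minimal sketch of this projection step, using a plain even monomial basis so the snippet stays dependency-free (the module below uses even Legendre polynomials instead; the names `ψ`, `Ψ`, `α` mirror the notation above):

```julia
using LinearAlgebra

# Collocation points and an even monomial basis (1, s², s⁴) on [-1, 1]
s_colloc = range(-1.0, 1.0, length=20)
ψ = [s -> s^(2 * (i - 1)) for i in 1:3]
Ψ = [ψ[i](s) for s in s_colloc, i in 1:3]   # collocation matrix Ψ_{ki}

# Target values at the collocation points (an even function, for illustration)
Y = [1 + 2s^2 for s in s_colloc]

# Projection: α = Ψ⁺ Y, where the pseudoinverse can be precomputed
Ψ_pinv = pinv(Ψ)
α = Ψ_pinv * Y
```

Since the target lies in the span of the basis, the least-squares fit recovers it exactly; in general, the projection returns the best approximation in \(\ALPHABET V_n\) at the collocation points.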
The following two observations can be used to improve implementation efficiency.
- The Moore-Penrose pseudoinverse \(Ψ^\dagger\) can be precomputed.
- Given any \(v(s) = \sum_{i=1}^n α_i ψ_i(s)\), \[ \EXP[v(S_{t+1}) \mid S_t = \hat s_k, A_t = a ] = \sum_{i=1}^n α_i \underbrace{\EXP[ ψ_i(S_{t+1}) \mid S_t = \hat s_k, A_t = a]}_{\eqqcolon E_{ki}(a)}. \] We can pre-compute the matrix \(E(a) \in \reals^{K × n}\) (using, for example, Monte Carlo simulation).
As an illustration, we solve Example 11.1 using Legendre polynomials as the basis functions. Recall from Exercise 9.8 that the value function is even; so we only consider even Legendre polynomials. Furthermore, we choose Chebyshev-Lobatto points as the collocation points to prevent oscillations (Gibbs phenomenon).
module IOT_LFA
using Distributions, LinearAlgebra, SpecialPolynomials

function solve(; λ=100.0, B=10.0, σ=0.5, T=20, n=5, K=50, K_eval=500, M=2000)
    # Collocation points (where we solve): Chebyshev-Lobatto on [-B, B]
    cheb_nodes = K > 1 ? [-cos(π * (k - 1) / (K - 1)) for k in 1:K] : [0.0]
    S_hat = B .* cheb_nodes

    # Evaluation points (where we return V, π; can differ from K)
    S_eval = range(-B, B, length=K_eval)

    scale(s) = clamp(s / B, -1.0, 1.0)

    # Using even degrees (0, 2, 4, ...) because the value function is even
    psi = [basis(Legendre, 2*(i-1)) for i in 1:n]
    Psi = zeros(K, n)
    for k in 1:K, i in 1:n
        Psi[k, i] = psi[i](scale(S_hat[k]))
    end

    # Pre-compute the pseudo-inverse for the projection step
    PsiInv = pinv(Psi)

    # Basis at evaluation points (for the second pass)
    Psi_eval = zeros(K_eval, n)
    for k in 1:K_eval, i in 1:n
        Psi_eval[k, i] = psi[i](scale(S_eval[k]))
    end

    # Pre-compute expected basis features (Monte Carlo)
    E0 = zeros(K, n)
    E1 = zeros(K, n)
    dist_W = Normal(0, σ)
    for k in 1:K
        s = S_hat[k]
        for _ in 1:M
            w = rand(dist_W)
            s_next_0 = scale(s + w)
            s_next_1 = scale(w)
            for i in 1:n
                E0[k, i] += psi[i](s_next_0)
                E1[k, i] += psi[i](s_next_1)
            end
        end
    end
    E0 ./= M
    E1 ./= M

    # Expected basis features at evaluation points (Monte Carlo)
    E_eval0 = zeros(K_eval, n)
    E_eval1 = zeros(K_eval, n)
    for k in 1:K_eval
        s = S_eval[k]
        for _ in 1:M
            w = rand(dist_W)
            s_next_0 = scale(s + w)
            s_next_1 = scale(w)
            for i in 1:n
                E_eval0[k, i] += psi[i](s_next_0)
                E_eval1[k, i] += psi[i](s_next_1)
            end
        end
    end
    E_eval0 ./= M
    E_eval1 ./= M

    # Costs at collocation and evaluation points
    C0 = [s^2 for s in S_hat]
    C1 = [λ for s in S_hat]
    C0_eval = [s^2 for s in S_eval]
    C1_eval = [λ for s in S_eval]

    # Backward pass: compute α only (no storage of V, π)
    α = [zeros(n) for t in 1:T+1]
    for t in T:-1:1
        Q0 = C0 + E0 * α[t+1]
        Q1 = C1 + E1 * α[t+1]
        V_colloc = min.(Q0, Q1)
        α[t] = PsiInv * V_colloc
    end

    # Second pass: recompute V and π at evaluation points from α for the plots
    V_eval = [zeros(K_eval) for t in 1:T]
    π_eval = [zeros(Int, K_eval) for t in 1:T]
    for t in 1:T
        V_eval[t] = Psi_eval * α[t]
        Q0_eval = C0_eval + E_eval0 * α[t+1]
        Q1_eval = C1_eval + E_eval1 * α[t+1]
        for k in 1:K_eval
            π_eval[t][k] = Q0_eval[k] <= Q1_eval[k] ? 0 : 1
        end
    end
    return V_eval, π_eval, S_eval
end
end

We solve this for different numbers \(n\) of basis vectors (recall that we are only choosing even Legendre polynomials). Observe that in this case we get a very good approximation with \(n=10\) basis functions. Contrast this with the state quantization methods, where we needed about \(1000\) grid points to obtain a good approximation.
Exercises
Exercise 11.1 Consider an inventory management problem with lost sales and finite capacity, where the state space is \(\ALPHABET S = [0, C]\) and the dynamics are given by \[ S_{t+1} = [S_t + A_t - W_t]_{0}^C \] where \([x]_{0}^C = \text{clip}(x,0,C)\). The per-step cost is given by \[ p A_t + h(S_t + A_t - W_t) \] where \[ h(s) = \begin{cases} c_h s & \text{if } s \ge 0 \\ -c_s s & \text{if } s < 0. \end{cases}\]
Suppose that the demand is uniformly distributed on \([0,C/2]\). Numerically solve the problem for \(T = 15\), \(C = 50\), \(c_h = 2\), \(c_s = 5\), and \(p=1\) using linear function approximation with Legendre polynomials as the basis functions.
Compute the solution by taking \(n=5\) and \(n=10\) basis functions and plot the value function and the optimal policy at \(t=1\).
Notes
There is a huge literature on approximation methods for continuous state spaces. We have only presented a brief discussion from a pedagogical point of view. For a more detailed discussion, see Bryson (1999) and Rust (1996).
Our discussion on state quantization assumed uniform grids. We can also employ non-uniform schemes, such as quadrature grids (Tauchen and Hussey 1991) or successive discretization (Chow and Tsitsiklis 1991). Generally, deterministic discretization methods suffer from the curse of dimensionality. However, choosing random grid points can break this curse (Rust 1997).
The lecture notes of Pierre-Luc Bacon provide a comprehensive discussion of linear function approximation. See Schweitzer and Seidmann (1985) for different algorithms for linear function approximations for infinite horizon MDPs.