34  Designer’s Approach

Updated: January 8, 2024

In all the discussion so far, we have always made an implicit assumption: the decision maker has perfect recall, i.e., it remembers everything that it has seen and done in the past. This assumption is at the heart of the dynamic programming argument. For example, consider policy evaluation for a POMDP (using standard notation). The value function is given by
$$
\begin{aligned}
V^{\pi}_t(h_t) &= \mathbb{E}^{\pi}\Big[ c(S_t, A_t) + \sum_{\tau=t+1}^{T} c(S_\tau, A_\tau) \Bigm| H_t = h_t \Big] \\
&\overset{(a)}{=} \mathbb{E}^{\pi}\Big[ c(S_t, A_t) + \mathbb{E}^{\pi}\Big[ \sum_{\tau=t+1}^{T} c(S_\tau, A_\tau) \Bigm| H_{t+1} \Big] \Bigm| H_t = h_t \Big] \\
&= \mathbb{E}^{\pi}\big[ c(S_t, A_t) + V^{\pi}_{t+1}(H_{t+1}) \bigm| H_t = h_t \big],
\end{aligned}
$$
where step (a) relies on the smoothing property of conditional expectation, which only works if the information at time $t+1$ is a superset of the information at time $t$, i.e., the agent has perfect recall.

In this section, we start with the simplest model without perfect recall: a POMDP where actions are chosen based on just the current observation. How do we find the best policy? To avoid any confusion, let’s start with a formal description of the system model.

We assume that the system has a state $S_t \in \mathcal{S}$, a control input $A_t \in \mathcal{A}$, and a process noise $W_t \in \mathcal{W}$. The state evolves as
$$
S_{t+1} = f_t(S_t, A_t, W_t). \tag{1}
$$
The observation $Y_t \in \mathcal{Y}$ of the decision maker at time $t$ is given by
$$
Y_t = \ell_t(S_t, N_t), \tag{2}
$$
where $N_t \in \mathcal{N}$ is the observation noise. As for MDPs and POMDPs, we assume that the primitive random variables $(S_1, W_1, \dots, W_T, N_1, \dots, N_T)$ are defined on a common probability space and are mutually independent.

Unlike MDPs and standard POMDPs, it is assumed that the decision maker does not have access to past observations and actions and can choose the current action as a function of just the current observation, i.e.,
$$
A_t = \pi_t(Y_t).
$$
At each time, the system incurs a cost $c_t(S_t, A_t)$ which depends on the current state and the current action. The system operates for a finite horizon $T$ and incurs a total cost
$$
\sum_{t=1}^{T} c_t(S_t, A_t).
$$
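To make the model concrete, the following sketch simulates a small hypothetical instance (the arrays `P`, `Q`, `c`, `b1` and the function `evaluate` are illustrative stand-ins for the laws induced by $f_t$, $\ell_t$, $c_t$, and the distribution of $S_1$; none of them come from the notes themselves) and estimates the expected total cost of a memoryless policy by Monte Carlo.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical finite model: 2 states, 2 observations, 2 actions, horizon T = 3.
T = 3
P = np.array([[[0.9, 0.1], [0.4, 0.6]],    # P[a, s] = law of S_{t+1} given S_t = s, A_t = a
              [[0.2, 0.8], [0.7, 0.3]]])
Q = np.array([[0.8, 0.2], [0.3, 0.7]])     # Q[s] = law of Y_t given S_t = s
c = np.array([[1.0, 4.0], [3.0, 0.0]])     # c[s, a] = per-step cost
b1 = np.array([0.5, 0.5])                  # distribution of S_1

def evaluate(policy, num_runs=50_000):
    """Monte Carlo estimate of the expected total cost of a memoryless policy."""
    total = 0.0
    for _ in range(num_runs):
        s = rng.choice(2, p=b1)
        for t in range(T):
            y = rng.choice(2, p=Q[s])      # Y_t = l_t(S_t, N_t), sampled from its law
            a = policy[t][y]               # A_t = pi_t(Y_t): no access to the past
            total += c[s, a]
            s = rng.choice(2, p=P[a, s])   # S_{t+1} = f_t(S_t, A_t, W_t)
    return total / num_runs

# A time-varying memoryless policy: one decision rule (a map Y -> A) per time step.
pi = [[0, 1], [0, 1], [1, 1]]
print(evaluate(pi))
```

Evaluating every one of the finitely many memoryless policies this way is the brute-force approach that the rest of this section tries to improve upon.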

Given the above system model, we want to choose a control strategy $\pi = (\pi_1, \dots, \pi_T)$ to minimize the expected total cost
$$
J(\pi) := \mathbb{E}\Big[ \sum_{t=1}^{T} c_t(S_t, A_t) \Big].
$$
How should we proceed?

Note that the only difference from the POMDP model is that $A_t$ is a function of $Y_t$ rather than of $(Y_{1:t}, A_{1:t-1})$. Apart from this, the other modeling assumptions are the same. However, the lack of perfect recall implies that the entire machinery of dynamic programming, as presented for MDPs and POMDPs, cannot be applied directly.

An advantage of not having perfect recall is that the size of the decision rule $\pi_t$ does not increase with time. Therefore, the number of possible policies (there are $|\mathcal{A}|^{|\mathcal{Y}|}$ decision rules at each stage, hence $|\mathcal{A}|^{|\mathcal{Y}|\,T}$ policies in total) is only exponential in $T$ (rather than double exponential, as was the case for POMDPs). Nevertheless, a brute-force search is still impractical, and we want to find something similar to dynamic programming, where the search complexity is linear in the time horizon $T$.

34.1 The designer’s approach

The standard “dynamic-programming viewpoint” is to evaluate the value function (or the cost-to-go function) from the point of view of the agent at each time. Instead, consider the problem from the point of view of the system designer before the system starts operating. The system designer knows the system model and the statistics of the primitive random variables but does not know the observations of the agent. The designer is concerned with choosing the optimal decision rules for the agent before the system starts operating.

The designer may view the system as a stochastic input-output system: the stochastic inputs are the primitive random variables, the control inputs are the decision rules, and the output is the instantaneous cost. The input-output relationship can be described consistently by $(S_t, Y_t)$, which represents the state of the environment and the state of the agent. However, this state cannot be used for optimization because it is not observed by the system designer. So, the optimization problem faced by the system designer is conceptually equivalent to a POMDP.

1 In this particular example, we could have worked with just $S_t$ as the state sufficient for input-output mapping, but we carry the “state of the agent” so that the same approach continues to work for more general models.

Hence, the designer can obtain a sequential decomposition by forming a belief on the state (sufficient for input-output mapping) of the system, based on the history of observations and actions available to it (i.e., all the past decision rules, since the designer does not observe anything else). This belief can be described by
$$
b_t = \mathbb{P}(S_t, Y_t \mid \pi_{1:t-1}),
$$
which is the “conditional probability law” of the “state” conditioned on all the past observations and “control actions” of the designer. Technically, $b_t$ is not a conditional probability law but rather an unconditional probability law; this is a technicality which does not affect the solution methodology.

The belief state satisfies the usual properties of an information state.

Lemma 34.1 The belief state satisfies the following properties.

  1. Sufficient for predicting itself. There exists a linear transformation $H_t(\pi_t)$ such that $b_{t+1} = H_t(\pi_t)\, b_t$.

  2. Sufficient for performance evaluation. There exist functions $\bar{c}_t \colon \Delta(\mathcal{S} \times \mathcal{Y}) \times [\mathcal{Y} \to \mathcal{A}] \to \mathbb{R}$ such that the expected per-step cost can be expressed as $\mathbb{E}[c_t(S_t, A_t) \mid \pi_{1:t}] = \bar{c}_t(b_t, \pi_t)$.

Since the “belief” is an unconditional probability, it evolves like the probability distribution of a Markov chain. In particular,
$$
\begin{aligned}
b_{t+1}(s_{t+1}, y_{t+1}) &= \mathbb{P}(s_{t+1}, y_{t+1} \mid \pi_{1:t}) \\
&= \mathbb{P}(y_{t+1} \mid s_{t+1}) \sum_{s_t \in \mathcal{S}} \sum_{y_t \in \mathcal{Y}} \mathbb{P}(y_t \mid s_t)\, \mathbb{P}(s_{t+1} \mid S_t = s_t, A_t = \pi_t(y_t))\, \mathbb{P}(s_t \mid \pi_{1:t-1}) \\
&=: [H_t(\pi_t)\, b_t](s_{t+1}, y_{t+1}).
\end{aligned}
$$
This proves the first part of the result. Note that the transformation $H_t(\pi_t)$ is linear in the sense that for any belief $b_t$ given as a linear combination of two beliefs $b_t^{(1)}$ and $b_t^{(2)}$, i.e., $b_t = \lambda b_t^{(1)} + (1 - \lambda) b_t^{(2)}$, we have
$$
H_t(\pi_t)\, b_t = \lambda H_t(\pi_t)\, b_t^{(1)} + (1 - \lambda) H_t(\pi_t)\, b_t^{(2)}.
$$

For the second part, note that
$$
\mathbb{E}[c_t(S_t, A_t) \mid \pi_{1:t}] = \sum_{s_t \in \mathcal{S}} \sum_{y_t \in \mathcal{Y}} c_t(s_t, \pi_t(y_t))\, \mathbb{P}(s_t, y_t \mid \pi_{1:t-1}) =: \bar{c}_t(b_t, \pi_t).
$$
This proves the second part of the result.
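Both parts of the lemma are easy to compute explicitly. The sketch below (continuing the hypothetical toy model from the simulation sketch; `apply_H` and `cbar` are illustrative names) implements $H_t(\pi_t)$ exactly as in the displayed update and the cost $\bar{c}_t(b_t, \pi_t)$ as in the sum above, for a belief stored as a matrix $b(s, y)$.

```python
import numpy as np

# Toy model (same hypothetical numbers as in the simulation sketch).
P = np.array([[[0.9, 0.1], [0.4, 0.6]],
              [[0.2, 0.8], [0.7, 0.3]]])   # P[a, s, s'] : transition kernel
Q = np.array([[0.8, 0.2], [0.3, 0.7]])     # Q[s, y]     : observation kernel
c = np.array([[1.0, 4.0], [3.0, 0.0]])     # c[s, a]     : per-step cost
nS, nY = 2, 2

def apply_H(b, pi):
    """[H(pi) b](s', y') = Q[s', y'] * sum_{s, y} b[s, y] P[pi[y], s, s']."""
    next_S = np.zeros(nS)
    for s in range(nS):
        for y in range(nY):
            next_S += b[s, y] * P[pi[y], s, :]
    return next_S[:, None] * Q             # joint law of (S_{t+1}, Y_{t+1})

def cbar(b, pi):
    """Expected per-step cost: sum_{s, y} c[s, pi[y]] b[s, y]."""
    return sum(c[s, pi[y]] * b[s, y] for s in range(nS) for y in range(nY))

b1 = np.array([0.5, 0.5])[:, None] * Q     # designer's initial belief on (S_1, Y_1)
pi = [0, 1]                                # a decision rule Y -> A
print(cbar(b1, pi), apply_H(b1, pi))
```

Since `apply_H` touches `b` only through weighted sums of its entries, the map $b \mapsto H_t(\pi_t)\, b$ is linear, which is the first claim of the lemma.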

The decision rule $\psi_t \colon b_t \mapsto \pi_t$ followed by the designer is called a meta-policy (to distinguish it from the policy of the agent). The optimal meta-policy $\psi = (\psi_1, \dots, \psi_T)$ can be obtained by solving the following dynamic program.

Dynamic program

Initialize $V_{T+1}(b) \equiv 0$. Then, for all $t \in \{T, \dots, 1\}$, recursively define
$$
\begin{aligned}
Q_t(b, \pi) &= \bar{c}_t(b, \pi) + V_{t+1}\big(H_t(\pi)\, b\big), \\
V_t(b) &= \min_{\pi \in \Pi} Q_t(b, \pi).
\end{aligned}
$$

Let $\psi_t(b)$ denote the arg min of $Q_t(b, \pi)$. Then, the meta-strategy $\psi = (\psi_1, \dots, \psi_T)$ is optimal.
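A direct, brute-force rendering of this dynamic program for the toy example is shown below (it reuses `apply_H`, `cbar`, and the model arrays from the previous sketch, and simply enumerates every deterministic decision rule at every stage; it is meant to make the recursion concrete, not to be efficient).

```python
import itertools
import numpy as np

# Backward recursion for the designer's DP.  Reuses the toy model arrays and
# the functions apply_H (i.e., H_t(pi)) and cbar from the previous sketch.
T, nA, nY = 3, 2, 2
RULES = list(itertools.product(range(nA), repeat=nY))   # all deterministic maps Y -> A

def V(t, b):
    """Returns (V_t(b), psi_t(b)) by brute-force minimization over decision rules."""
    if t == T + 1:
        return 0.0, None
    q = {pi: cbar(b, pi) + V(t + 1, apply_H(b, pi))[0] for pi in RULES}  # Q_t(b, pi)
    best = min(q, key=q.get)
    return q[best], best

b1 = np.array([0.5, 0.5])[:, None] * Q      # initial belief on (S_1, Y_1)
print(V(1, b1))
```

As written, the recursion explores every sequence of decision rules; exploiting the structure of $V_t$ discussed next (piecewise linearity and convexity) is what makes the dynamic program usable beyond toy instances.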

Some properties of the solution
  1. The set $\Pi$ above may be restricted to all deterministic maps from $\mathcal{Y}$ to $\mathcal{A}$ without any loss of optimality.

  2. As for POMDPs, we can show that the value function $V_t$ is piecewise linear and convex.

How to determine the policy?

The dynamic program above determines a meta-policy $\psi$. The corresponding policy $\pi = (\pi_1, \dots, \pi_T)$ may be determined as follows (a code sketch follows the list).

  • Start with the initial belief $b_1$ (which is determined by the initial distribution of the state $S_1$). Choose $\pi_1 = \psi_1(b_1)$.
  • Let $b_2 = H_1(\pi_1)\, b_1$. Choose $\pi_2 = \psi_2(b_2)$.
  • Let $b_3 = H_2(\pi_2)\, b_2$. Choose $\pi_3 = \psi_3(b_3)$, and so on until time $T$.
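Assuming the recursive `V`, `apply_H`, `Q`, and `T` from the dynamic-programming sketch above are in scope, this forward pass can be written as:

```python
import numpy as np

# Forward pass extracting (pi_1, ..., pi_T) from the meta-policy.  Here
# psi_t(b) is simply the argmin that V(t, b) returns alongside the value.
b = np.array([0.5, 0.5])[:, None] * Q     # b_1: designer's belief on (S_1, Y_1)
policy = []
for t in range(1, T + 1):
    _, pi_t = V(t, b)                     # pi_t = psi_t(b_t)
    policy.append(pi_t)
    b = apply_H(b, pi_t)                  # b_{t+1} = H_t(pi_t) b_t
print(policy)                             # one decision rule Y -> A per time step
```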
Optimal policies for infinite horizon are not time-homogeneous

Although we presented the result for the finite-horizon setting, the argument naturally extends to the infinite-horizon setting as well. In particular, consider the infinite-horizon discounted-cost setting. Then, the optimal meta-policy is given by the unique bounded fixed point of the following equation:
$$
V(b) = \min_{\pi \in \Pi} \big\{ \bar{c}(b, \pi) + \gamma V(H(\pi)\, b) \big\}, \qquad \forall b \in \Delta(\mathcal{S} \times \mathcal{Y}).
$$
This implies that the meta-policy $\psi$ is time-homogeneous, but the policy $\pi = (\pi_1, \pi_2, \dots)$ (chosen according to the procedure described above) is not time-homogeneous.
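One illustrative way to approximate this fixed point numerically is fitted value iteration over a finite set of sampled beliefs (a rough sketch continuing the toy example and reusing `Q`, `apply_H`, `cbar`, and `RULES` from the sketches above; the discount factor, the belief sample, and the nearest-neighbour fit are ad hoc approximation choices, not part of the theory above):

```python
import numpy as np

# Approximate value iteration for V(b) = min_pi { cbar(b, pi) + gamma * V(H(pi) b) }.
gamma = 0.9
beliefs = [np.array([0.5, 0.5])[:, None] * Q]       # start from b_1 ...
for _ in range(3):                                  # ... and add beliefs reachable in 3 steps
    beliefs += [apply_H(b, pi) for b in list(beliefs) for pi in RULES]
beliefs = np.array(beliefs)

def nearest(b):
    """Index of the sampled belief closest to b (crude nearest-neighbour fit)."""
    return int(np.abs(beliefs - b).sum(axis=(1, 2)).argmin())

V_hat = np.zeros(len(beliefs))
for _ in range(100):                                # iterate the fixed-point operator
    V_hat = np.array([min(cbar(b, pi) + gamma * V_hat[nearest(apply_H(b, pi))]
                          for pi in RULES) for b in beliefs])

# The meta-policy is stationary, but the extracted decision rules still vary
# with t because the belief trajectory b_1, b_2 = H(pi_1) b_1, ... keeps moving.
b, rollout = beliefs[0], []
for _ in range(5):
    pi = min(RULES, key=lambda p: cbar(b, p) + gamma * V_hat[nearest(apply_H(b, p))])
    rollout.append(pi)
    b = apply_H(b, pi)
print(rollout)
```

The rollout at the end applies the same stationary rule at every step, yet typically produces different decision rules at different times, which is the sense in which the meta-policy is time-homogeneous while the policy is not.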

34.2 The general idea

Now we generalize the discussion above to a general multi-agent control problem. Suppose that the system has a state $S_t \in \mathcal{S}$. There are $N$ controllers, indexed by the set $\mathcal{N} := \{1, \dots, N\}$. Let $Y^n_t \in \mathcal{Y}$ and $A^n_t \in \mathcal{A}^n$ denote the observation and action, respectively, of agent $n \in \mathcal{N}$ at time $t$. We assume that the observations are given by
$$
Y^n_t = \ell^n_t(S_t, W^n_t), \qquad n \in \mathcal{N},
$$
and the state evolves as
$$
S_{t+1} = f^0_t(S_t, A_t, W^0_t),
$$
where $A_t$ denotes the vector $(A^1_t, \dots, A^N_t)$ of all actions. We assume that the noise processes $\{W^0_t\}_{t \ge 1}$ and $\{W^n_t\}_{t \ge 1}$, $n \in \mathcal{N}$, are independent across time and also mutually independent.

We assume that all agents use finite-state controllers. In particular, agent $n \in \mathcal{N}$ uses a local state $Z^n_t \in \mathcal{Z}^n$. At time $t$, agent $n$ first updates its local state using a state-update function $f^n_t$:
$$
Z^n_t = f^n_t(Z^n_{t-1}, Y^n_t),
$$
and then uses the updated state to choose an action:
$$
A^n_t = \pi^n_t(Z^n_t).
$$
We call $Z^n_t$ the agent state.
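In code, such a finite-state controller is just a state-update function and a control function carried along with the current agent state (a generic sketch; the class and method names below are illustrative, not part of the model description):

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class FiniteStateController:
    """Agent n's controller: Z_t = f_t(Z_{t-1}, Y_t), then A_t = pi_t(Z_t)."""
    update: Callable[[Any, Any], Any]   # state-update function f^n
    act: Callable[[Any], Any]           # control function pi^n
    z: Any = None                       # current agent state Z^n_t

    def step(self, y):
        self.z = self.update(self.z, y) # Z_t = f_t(Z_{t-1}, Y_t)
        return self.act(self.z)         # A_t = pi_t(Z_t)
```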

A commonly used example of an agent state is
$$
Z^n_t = (Y^n_{t-d+1}, Y^n_{t-d+2}, \dots, Y^n_t),
$$
where the agent uses a sliding window of the last $d$ observations as its local state. The example presented in the previous section is a special case of a sliding-window controller with window size $d = 1$.
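Using the `FiniteStateController` sketch above, a sliding-window controller corresponds to a state-update function that appends the new observation and keeps only the last $d$ of them (the majority-vote control rule below is just a placeholder):

```python
def sliding_window_controller(d, act):
    """Special case: the agent state is the tuple of the last d observations."""
    def update(z, y):
        window = (() if z is None else z) + (y,)
        return window[-d:]              # keep only the last d observations
    return FiniteStateController(update=update, act=act)

# Example: window size d = 3 with a (placeholder) majority-vote control rule.
ctrl = sliding_window_controller(3, act=lambda z: int(sum(z) >= len(z) / 2))
print([ctrl.step(y) for y in [0, 1, 1, 0, 1]])   # d = 1 would recover Section 34.1
```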

Thus, we can again view the problem from the point of view of a system designer. The state sufficient for input-output mapping is $(S_t, Z^1_t, \dots, Z^N_t)$; the control inputs are $(f^1_t, \dots, f^N_t, \pi^1_t, \dots, \pi^N_t)$; and the output is the instantaneous cost. We separate the belief state before taking the actions from the belief state before updating the local memories:
$$
\begin{aligned}
b_t &= \mathbb{P}(S_t, Z^1_t, \dots, Z^N_t \mid f^{1:N}_{1:t}, \pi^{1:N}_{1:t-1}), \\
\bar{b}_t &= \mathbb{P}(S_{t+1}, Z^1_t, \dots, Z^N_t \mid f^{1:N}_{1:t}, \pi^{1:N}_{1:t}).
\end{aligned}
$$

As in Lemma 34.1, we can show the following:

Lemma 34.2 The belief state satisfies the following properties.

  1. Sufficient for predicting itself. There exist linear transformations $H_t(\pi^{1:N}_t)$ and $\bar{H}_t(f^{1:N}_{t+1})$ such that $\bar{b}_t = H_t(\pi^{1:N}_t)\, b_t$ and $b_{t+1} = \bar{H}_t(f^{1:N}_{t+1})\, \bar{b}_t$.

  2. Sufficient for performance evaluation. There exist functions $\bar{c}_t \colon \Delta(\mathcal{S} \times \mathcal{Z}^1 \times \cdots \times \mathcal{Z}^N) \times [\mathcal{Z}^1 \to \mathcal{A}^1] \times \cdots \times [\mathcal{Z}^N \to \mathcal{A}^N] \to \mathbb{R}$ such that the expected per-step cost can be expressed as $\mathbb{E}[c_t(S_t, A_t) \mid f^{1:N}_{1:t}, \pi^{1:N}_{1:t}] = \bar{c}_t(b_t, \pi^{1:N}_t)$.

Therefore, we can obtain the following dynamic programming decomposition.

Dynamic program

Initialize $V_{T+1}(b) \equiv 0$. Then, for all $t \in \{T, \dots, 1\}$, recursively define
$$
\begin{aligned}
Q_t(b, \pi^{1:N}) &= \bar{c}_t(b, \pi^{1:N}) + \bar{V}_t\big(H_t(\pi^{1:N})\, b\big), \\
\bar{Q}_t(\bar{b}, f^{1:N}) &= V_{t+1}\big(\bar{H}_t(f^{1:N})\, \bar{b}\big),
\end{aligned}
$$
and
$$
V_t(b) = \min_{\pi^{1:N}} Q_t(b, \pi^{1:N}), \qquad \bar{V}_t(\bar{b}) = \min_{f^{1:N}} \bar{Q}_t(\bar{b}, f^{1:N}).
$$

Let $\psi_t(b)$ and $\bar{\psi}_t(\bar{b})$ denote the arg min of $Q_t(b, \pi^{1:N})$ and $\bar{Q}_t(\bar{b}, f^{1:N})$, respectively. Then, the meta-strategy $(\psi_1, \bar{\psi}_1, \psi_2, \bar{\psi}_2, \dots, \psi_T)$ is optimal.

Simplifying the dynamic program

Instead of optimizing the dynamic program over all decision rules $f^{1:N}$ and $\pi^{1:N}$ at once, we can optimize over them one at a time. In particular, define the intermediate beliefs
$$
\begin{aligned}
b^1_t &= \mathbb{P}(S_t, Z^1_t, \dots, Z^N_t, A^1_t \mid f^{1:N}_{1:t}, \pi^1_{1:t}, \pi^{2:N}_{1:t-1}), \\
b^2_t &= \mathbb{P}(S_t, Z^1_t, \dots, Z^N_t, A^{1:2}_t \mid f^{1:N}_{1:t}, \pi^{1:2}_{1:t}, \pi^{3:N}_{1:t-1}), \\
&\;\;\vdots \\
b^N_t &= \mathbb{P}(S_t, Z^1_t, \dots, Z^N_t, A^{1:N}_t \mid f^{1:N}_{1:t}, \pi^{1:N}_{1:t}),
\end{aligned}
$$
and
$$
\begin{aligned}
\bar{b}^1_t &= \mathbb{P}(S_{t+1}, Z^1_{t+1}, Z^{2:N}_t \mid f^1_{1:t+1}, f^{2:N}_{1:t}, \pi^{1:N}_{1:t}), \\
\bar{b}^2_t &= \mathbb{P}(S_{t+1}, Z^{1:2}_{t+1}, Z^{3:N}_t \mid f^{1:2}_{1:t+1}, f^{3:N}_{1:t}, \pi^{1:N}_{1:t}), \\
&\;\;\vdots \\
\bar{b}^N_t &= \mathbb{P}(S_{t+1}, Z^{1:N}_{t+1} \mid f^{1:N}_{1:t+1}, \pi^{1:N}_{1:t}).
\end{aligned}
$$

Then, we can show that these beliefs update as
$$
b_t \xrightarrow{\;H^1_t(\pi^1_t)\;} b^1_t \xrightarrow{\;H^2_t(\pi^2_t)\;} \cdots \xrightarrow{\;H^N_t(\pi^N_t)\;} b^N_t \longrightarrow \bar{b}_t \xrightarrow{\;\bar{H}^1_t(f^1_{t+1})\;} \bar{b}^1_t \xrightarrow{\;\bar{H}^2_t(f^2_{t+1})\;} \cdots \longrightarrow \bar{b}^N_t = b_{t+1}.
$$
We can then decompose the dynamic program over these intermediate steps in a similar manner and optimize over only one of $\pi^n_t$ or $f^n_t$ at each step.
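The sketch below walks through this chain once for a hypothetical two-agent instance (all kernels, decision rules, and memory-update functions are made up for illustration): each block incorporates exactly one of $\pi^1_t$, $\pi^2_t$, $f^1_{t+1}$, $f^2_{t+1}$.

```python
import numpy as np

# Coordinate-wise belief updates for a hypothetical two-agent instance.  Each
# block below incorporates exactly one decision rule, matching the chain above.
rng = np.random.default_rng(0)
nS = nZ = nY = nA = 2
P = rng.dirichlet(np.ones(nS), size=(nS, nA, nA))   # P[s, a1, a2, :] = law of S_{t+1}
Q1 = np.array([[0.8, 0.2], [0.3, 0.7]])             # Q1[s', y1]: agent 1's observation kernel
Q2 = np.array([[0.6, 0.4], [0.1, 0.9]])             # Q2[s', y2]: agent 2's observation kernel

pi1, pi2 = [0, 1], [1, 0]                           # control rules Z^n -> A^n
f1 = lambda z, y: y                                 # agent 1 remembers its last observation
f2 = lambda z, y: max(z, y)                         # agent 2 remembers whether it ever saw a 1

b = rng.dirichlet(np.ones(nS * nZ * nZ)).reshape(nS, nZ, nZ)   # b_t(s, z1, z2)

# b_t -> b_t^1: attach agent 1's action A^1_t = pi^1_t(Z^1_t).
b_1 = np.zeros((nS, nZ, nZ, nA))
for z1 in range(nZ):
    b_1[:, z1, :, pi1[z1]] = b[:, z1, :]

# b_t^1 -> b_t^2: attach agent 2's action A^2_t = pi^2_t(Z^2_t).
b_2 = np.zeros((nS, nZ, nZ, nA, nA))
for z2 in range(nZ):
    b_2[:, :, z2, :, pi2[z2]] = b_1[:, :, z2, :]

# b_t^2 -> bbar_t: push the state through the transition kernel, drop the actions.
bbar = np.einsum('szwab,sabr->rzw', b_2, P)         # bbar_t(s', z1, z2)

# bbar_t -> bbar_t^1: agent 1 observes Y^1_{t+1}, sets Z^1_{t+1} = f^1_{t+1}(Z^1_t, Y^1_{t+1}).
bbar_1 = np.zeros((nS, nZ, nZ))
for sp in range(nS):
    for z1 in range(nZ):
        for y1 in range(nY):
            bbar_1[sp, f1(z1, y1), :] += Q1[sp, y1] * bbar[sp, z1, :]

# bbar_t^1 -> bbar_t^2 = b_{t+1}: the same memory update for agent 2.
b_next = np.zeros((nS, nZ, nZ))
for sp in range(nS):
    for z2 in range(nZ):
        for y2 in range(nY):
            b_next[sp, :, f2(z2, y2)] += Q2[sp, y2] * bbar_1[sp, :, z2]

assert np.isclose(b_next.sum(), 1.0)                # still a probability distribution
print(b_next)
```

Because each arrow involves only one decision rule, the corresponding dynamic program can minimize over $\pi^1_t$, then $\pi^2_t$, then $f^1_{t+1}$, and so on, instead of over the whole tuple at once.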

Notes

The presentation here is borrowed from Mahajan (). The general idea first appeared in Witsenhausen (), where it was called the standard form. The model in Witsenhausen () was fairly general. It was specialized to POMDPs with finite memory in Sandell (). Witsenhausen’s standard form was rediscovered in Dibangoye et al. (), where it was called the occupancy MDP.
