34 Designer’s Approach
In all the discussion so far, we have always made an implicit assumption: the decision maker has perfect recall, i.e., it remembers everything that it has seen and done in the past. This assumption is at the heart of the dynamic programming argument. For example, consider policy evaluation for a POMDP (using standard notation). The value function is given by
$$
V^{g}_t(h_t) = \mathbb{E}\bigl[ c(S_t, A_t) + V^{g}_{t+1}(H_{t+1}) \bigm| H_t = h_t, A_t = g_t(h_t) \bigr],
$$
where $h_t = (y_{1:t}, a_{1:t-1})$ denotes the history of observations and actions. The recursion is well defined only because the decision maker remembers the entire history $h_t$.
In this section, we start with the simplest model without perfect recall: a POMDP where actions are chosen based on just the current observation. How do we find the best policy? To avoid any confusion, let’s start with a formal description of the system model.
We assume that the system has a state $S_t \in \mathcal{S}$, which evolves in a controlled Markov manner under the actions $A_t \in \mathcal{A}$ of the decision maker, i.e.,
$$
\mathbb{P}(S_{t+1} = s_{t+1} \mid S_{1:t} = s_{1:t}, A_{1:t} = a_{1:t}) = \mathbb{P}(S_{t+1} = s_{t+1} \mid S_t = s_t, A_t = a_t).
$$
At each time, the decision maker observes $Y_t \in \mathcal{Y}$, a noisy observation of the current state generated according to the observation kernel $\mathbb{P}(Y_t = y_t \mid S_t = s_t)$, and incurs a per-step cost $c(S_t, A_t)$.
Unlike MDPs and standard POMDPs, it is assumed that the decision maker does not have access to past observations and actions and can choose the current action as a function of just the current observation, i.e.,
$$
A_t = g_t(Y_t).
$$
Given the above system model, we want to choose a control strategy $g = (g_1, \dots, g_T)$ to minimize the expected total cost
$$
J(g) = \mathbb{E}\Bigl[ \sum_{t=1}^{T} c(S_t, A_t) \Bigr].
$$
Note that the only difference from the POMDP model is that the decision rule $g_t$ maps just the current observation $Y_t$, rather than the entire history of observations and actions, to the action $A_t$.
An advantage of not having perfect recall is that the size of the decision rule $g_t \colon \mathcal{Y} \to \mathcal{A}$ does not grow with time.
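To fix ideas, here is a minimal simulation sketch of this model in Python. The particular matrices `P_trans`, `P_obs`, and `cost` and the decision rules below are illustrative placeholders rather than anything from the text; the only point is that the action at each time is a function of the current observation alone.

```python
# A minimal simulation sketch of the model above. The model arrays and the
# memoryless strategy are illustrative assumptions, not part of the text.
import numpy as np

rng = np.random.default_rng(0)

n_states, n_obs, n_actions = 3, 2, 2

# P_trans[a][s, s'] = P(S_{t+1} = s' | S_t = s, A_t = a)
P_trans = np.array([rng.dirichlet(np.ones(n_states), size=n_states)
                    for _ in range(n_actions)])
# P_obs[s, y] = P(Y_t = y | S_t = s)
P_obs = rng.dirichlet(np.ones(n_obs), size=n_states)
# cost[s, a] = c(s, a)
cost = rng.uniform(size=(n_states, n_actions))

def simulate(decision_rules, pi1, horizon):
    """Sample one trajectory; the action at time t is g_t(Y_t) only."""
    s = rng.choice(n_states, p=pi1)
    total = 0.0
    for t in range(horizon):
        y = rng.choice(n_obs, p=P_obs[s])
        a = decision_rules[t][y]        # A_t = g_t(Y_t): no memory of the past
        total += cost[s, a]
        s = rng.choice(n_states, p=P_trans[a][s])
    return total

# A fixed (arbitrary) memoryless strategy: g_t maps each observation to an action.
horizon = 5
g = [np.array([0, 1]) for _ in range(horizon)]
pi1 = np.ones(n_states) / n_states
print(np.mean([simulate(g, pi1, horizon) for _ in range(1000)]))
```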
34.1 The designer’s approach
The standard “dynamic-programming viewpoint” is to evaluate the value function (or the cost-to-go function) from the point of view of the agent at each time. Instead, consider the problem from the point of view of the system designer before the system starts operating. The system designer knows the system model and the statistics of the primitive random variables but does not know the observation of the agent. The designer is concerned with choosing the optimal decision rules for the agent before the system starts operating.
The designer may view the system as a stochastic input-output system: the stochastic inputs are the primitive random variables, the control inputs are the decision rules, and the output is the instantaneous cost. The input-output relationship can be consistently described in terms of the state $S_t$ of the system.
1 In this particular example, we could have worked with just
Hence the designer can obtain a sequential decomposition by forming a belief on the state (sufficient for input-output mapping) of the system, based on the history of observations and actions available to it (i.e., all the past decision rules, since the designer doesn’t observe anything else). This belief can be described by
$$
\pi_t(s) = \mathbb{P}^{(g_1, \dots, g_{t-1})}(S_t = s), \qquad s \in \mathcal{S},
$$
where the superscript indicates that the distribution depends on the choice of the past decision rules.
The belief state satisfies the usual properties of an information state.
Lemma 34.1 The belief state satisfies the following properties.
- Sufficient for predicting itself. There exists a linear transformation $\varphi(\cdot, g_t)$ such that
$$
\pi_{t+1} = \varphi(\pi_t, g_t).
$$
- Sufficient for performance evaluation. There exists a function $\hat c$ such that the expected per-step cost can be expressed as
$$
\mathbb{E}[c(S_t, A_t)] = \hat c(\pi_t, g_t).
$$
Since the “belief” is an unconditional probability, it evolves similarly to the probability distribution of a Markov chain. In particular,
$$
\pi_{t+1}(s') = \sum_{s \in \mathcal{S}} \sum_{y \in \mathcal{Y}} \pi_t(s)\, \mathbb{P}(Y_t = y \mid S_t = s)\, \mathbb{P}(S_{t+1} = s' \mid S_t = s, A_t = g_t(y)),
$$
which is linear in $\pi_t$. For the second part, note that
$$
\mathbb{E}[c(S_t, A_t)] = \sum_{s \in \mathcal{S}} \sum_{y \in \mathcal{Y}} \pi_t(s)\, \mathbb{P}(Y_t = y \mid S_t = s)\, c(s, g_t(y)) =: \hat c(\pi_t, g_t).
$$
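For a finite model, the two objects in Lemma 34.1 are straightforward to compute. The following sketch assumes the model is given as numpy arrays `P_trans[a][s, s']`, `P_obs[s, y]`, and `cost[s, a]` (these array names are assumptions for the sketch, not notation used in the text).

```python
# A sketch of the two functions in Lemma 34.1 for a finite model.
import numpy as np

def phi(pi, g, P_trans, P_obs):
    """Belief update pi_{t+1} = phi(pi_t, g_t); linear in pi_t.
    g is a deterministic decision rule given as an array: g[y] = action."""
    n_states = len(pi)
    pi_next = np.zeros(n_states)
    for s in range(n_states):
        for y in range(P_obs.shape[1]):
            pi_next += pi[s] * P_obs[s, y] * P_trans[g[y]][s]
    return pi_next

def c_hat(pi, g, P_obs, cost):
    """Expected per-step cost E[c(S_t, A_t)] = c_hat(pi_t, g_t)."""
    return sum(pi[s] * P_obs[s, y] * cost[s, g[y]]
               for s in range(len(pi))
               for y in range(P_obs.shape[1]))
```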
The decision rule $g_t$ can therefore be chosen based on the designer’s belief $\pi_t$, which gives the following dynamic programming decomposition.

- Initialize $V_{T+1}(\pi) = 0$ for all $\pi \in \Delta(\mathcal{S})$ (the simplex of probability distributions on $\mathcal{S}$).
- For $t = T, \dots, 1$, let
$$
V_t(\pi) = \min_{g \in \mathcal{G}} \Bigl[ \hat c(\pi, g) + V_{t+1}\bigl(\varphi(\pi, g)\bigr) \Bigr], \qquad \pi \in \Delta(\mathcal{S}),
$$
where $\mathcal{G}$ denotes the set of (possibly randomized) decision rules from $\mathcal{Y}$ to $\Delta(\mathcal{A})$.
The set $\mathcal{G}$ above may be restricted to the set of all deterministic maps from $\mathcal{Y}$ to $\mathcal{A}$ without any loss of optimality. As for POMDPs, we can show that the value function $V_t(\pi)$ is piecewise linear and convex in $\pi$.
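As a concrete (if brute-force) illustration, the sketch below evaluates the dynamic program for a tiny model by enumerating the deterministic decision rules at each stage; it reuses the `phi` and `c_hat` helpers from the previous sketch and scales as $(|\mathcal{A}|^{|\mathcal{Y}|})^T$, so it is only meant to make the recursion explicit.

```python
# Brute-force sketch of the dynamic program: enumerate all deterministic
# decision rules g: Y -> A at each stage and recurse on the updated belief.
import itertools
import numpy as np

def solve(pi, t, horizon, P_trans, P_obs, cost):
    """Return (V_t(pi), optimal decision rule at time t for belief pi)."""
    if t > horizon:
        return 0.0, None
    n_obs = P_obs.shape[1]
    n_actions = cost.shape[1]
    best_value, best_rule = np.inf, None
    # all deterministic maps from observations to actions
    for g in itertools.product(range(n_actions), repeat=n_obs):
        value = c_hat(pi, g, P_obs, cost) + \
            solve(phi(pi, g, P_trans, P_obs), t + 1, horizon,
                  P_trans, P_obs, cost)[0]
        if value < best_value:
            best_value, best_rule = value, g
    return best_value, best_rule
```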
The dynamic program above determines a meta-policy $\psi = (\psi_1, \dots, \psi_T)$, which maps the designer’s belief to a decision rule, i.e., $g_t = \psi_t(\pi_t)$. The optimal strategy is then obtained as follows:
- Start with the initial belief $\pi_1$ (which is the same as the initial distribution of the state $S_1$). Choose $g_1 = \psi_1(\pi_1)$.
- Let $\pi_2 = \varphi(\pi_1, g_1)$. Choose $g_2 = \psi_2(\pi_2)$.
- Let $\pi_3 = \varphi(\pi_2, g_2)$. Choose $g_3 = \psi_3(\pi_3)$.
- …
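In code, this rollout might look as follows (again a sketch building on the `solve` and `phi` functions above, with `solve` playing the role of the meta-policy $\psi$).

```python
# Rollout of the meta-policy: at each time, compute the decision rule for the
# current (unconditional) belief and update the belief with it.
def rollout(pi1, horizon, P_trans, P_obs, cost):
    pi, rules = pi1, []
    for t in range(1, horizon + 1):
        _, g = solve(pi, t, horizon, P_trans, P_obs, cost)  # g_t = psi_t(pi_t)
        rules.append(g)
        pi = phi(pi, g, P_trans, P_obs)                     # pi_{t+1} = phi(pi_t, g_t)
    return rules
```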
Although we presented the result for the finite-horizon setting, the argument naturally extends to the infinite-horizon setting as well. In particular, consider the infinite-horizon discounted cost setting with discount factor $\beta \in (0,1)$. Then, the optimal meta-policy is given by the unique bounded fixed point of the following equation:
$$
V(\pi) = \min_{g \in \mathcal{G}} \Bigl[ \hat c(\pi, g) + \beta\, V\bigl(\varphi(\pi, g)\bigr) \Bigr].
$$
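For a sense of what solving this fixed-point equation might look like numerically, here is an approximate value-iteration sketch that discretizes the belief simplex by a set of sampled belief points and uses nearest-neighbour lookup. The discretization is purely an illustration device (an assumption of this sketch), not part of the argument above; it again reuses the `phi` and `c_hat` helpers.

```python
# Approximate value iteration for the discounted fixed-point equation,
# over a finite set of sampled belief points.
import itertools
import numpy as np

def value_iteration(beliefs, beta, P_trans, P_obs, cost, iters=200):
    """beliefs: array of shape (m, |S|) of sampled belief points."""
    n_obs, n_actions = P_obs.shape[1], cost.shape[1]
    rules = list(itertools.product(range(n_actions), repeat=n_obs))
    V = np.zeros(len(beliefs))

    def lookup(pi):
        # value at the nearest sampled belief point
        return V[np.argmin(np.linalg.norm(beliefs - pi, axis=1))]

    for _ in range(iters):
        V = np.array([
            min(c_hat(pi, g, P_obs, cost)
                + beta * lookup(phi(pi, g, P_trans, P_obs))
                for g in rules)
            for pi in beliefs
        ])
    return V
```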
34.2 The general idea
Now we generalize the discussion above to a general multi-agent control problem. Suppose that the system has a state $S_t \in \mathcal{S}$ which evolves in a controlled Markov manner, and there are $n$ agents. At each time, agent $i$ observes $Y^i_t \in \mathcal{Y}^i$ and chooses an action $A^i_t \in \mathcal{A}^i$, and the system incurs a per-step cost $c(S_t, A^{1:n}_t)$.
We assume that all agents use finite state controllers. In particular, agent $i$ maintains a local state $Z^i_t \in \mathcal{Z}^i$ (its agent state), chooses its action as $A^i_t = g^i_t(Z^i_t)$, and updates its agent state as $Z^i_{t+1} = f^i_t(Z^i_t, Y^i_{t+1})$.
A commonly used example of agent state is a finite window of recent observations, e.g., $Z^i_t = Y^i_{t-k:t}$.
Thus, we can view the problem from the point of view of a system designer. The state sufficient for input-output mapping is $(S_t, Z^{1:n}_t)$, i.e., the state of the environment together with the agent states of all agents, and the designer forms a belief
$$
\pi_t(s, z^{1:n}) = \mathbb{P}\bigl(S_t = s, Z^{1:n}_t = z^{1:n}\bigr)
$$
based on the decision rules chosen in the past.
As for Lemma 34.1, we can show the following:
Lemma 34.2 The belief state satisfies the following properties.
- Sufficient for predicting itself. There exists a linear transformation $\varphi(\cdot, g^{1:n}_t)$ such that
$$
\pi_{t+1} = \varphi(\pi_t, g^{1:n}_t).
$$
- Sufficient for performance evaluation. There exists a function $\hat c$ such that the expected per-step cost can be expressed as
$$
\mathbb{E}\bigl[c(S_t, A^{1:n}_t)\bigr] = \hat c(\pi_t, g^{1:n}_t).
$$
Therefore, we can obtain the following dynamic programming decomposition.
- Initialize $V_{T+1}(\pi) = 0$.
- For $t = T, \dots, 1$, let
$$
V_t(\pi) = \min_{g^{1:n}} \Bigl[ \hat c\bigl(\pi, g^{1:n}\bigr) + V_{t+1}\bigl(\varphi(\pi, g^{1:n})\bigr) \Bigr].
$$
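To illustrate Lemma 34.2 concretely, here is a sketch of the designer’s belief update for two agents with finite state controllers. The array names, the assumption that the agents’ observations are conditionally independent given the state, and the timing convention (each agent observes the next state and then updates its agent state) are assumptions made for this sketch rather than part of the model above.

```python
# Sketch of the designer's belief update over (S_t, Z^1_t, Z^2_t) for two
# agents with finite state controllers.
import itertools
import numpy as np

def phi_general(pi, g1, g2, f1, f2, P_trans, P_obs1, P_obs2):
    """pi[s, z1, z2] -> next belief, given decision rules g_i[z_i] = a_i and
    agent-state updates f_i[z_i, y_i] = z_i'."""
    nS, nZ1, nZ2 = pi.shape
    nY1, nY2 = P_obs1.shape[1], P_obs2.shape[1]
    pi_next = np.zeros_like(pi)
    for s, z1, z2 in itertools.product(range(nS), range(nZ1), range(nZ2)):
        if pi[s, z1, z2] == 0:
            continue
        a1, a2 = g1[z1], g2[z2]
        for s_next in range(nS):
            p_s = pi[s, z1, z2] * P_trans[a1, a2, s, s_next]
            for y1, y2 in itertools.product(range(nY1), range(nY2)):
                pi_next[s_next, f1[z1, y1], f2[z2, y2]] += \
                    p_s * P_obs1[s_next, y1] * P_obs2[s_next, y2]
    return pi_next
```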
Instead of optimizing the dynamic program over all policies
Then, we can show that these beliefs update as:
Notes
The presentation here is borrowed from Mahajan (2008). The general idea first appeared in Witsenhausen (1973), where it was called the standard form. The model in Witsenhausen (1973) was fairly general. It was specialized to POMDPs with finite memory in Sandell (1974). Witsenhausen’s standard form was rediscovered in Dibangoye et al. (2016), where it was called the occupancy MDP.