Theory of Bellman Equation

There is a point that confuses me in the simple environment used for the Bellman equation in the course book ‘Reinforcement Learning for Robotics’. The code is:

import numpy as np

S = [0, 0, 0]  # state values of the 3 states A, B, C
gamma = 1      # discount factor

# transition probabilities to the 3 next states, one list per action
a0p = [0.2, 0.5, 0.3]
a1p = [0.3, 0.2, 0.5]

# rewards for entering each of the 3 next states, one list per action
a0r = [0.0, -1.0, 1.0]
a1r = [0.0, -2.0, 1.0]

for s in range(len(S)):
    # action values: expected reward plus discounted value of the next state
    a_0 = a0p[0] * (a0r[0] + gamma * S[0]) + a0p[1] * (a0r[1] + gamma * S[1]) + a0p[2] * (a0r[2] + gamma * S[2])
    a_1 = a1p[0] * (a1r[0] + gamma * S[0]) + a1p[1] * (a1r[1] + gamma * S[1]) + a1p[2] * (a1r[2] + gamma * S[2])
    S[s] = round(np.maximum(a_0,a_1),2)  # keep the larger action value

For now my understanding is:
The initial state is A, and there are 3 possible next states: A, B, C, reached under different actions with different probabilities. However, these 3 possible next states are parallel (peers of one another), if I don’t misunderstand. But in the code shown above, the state value S[s] is updated as the loop runs, i.e. over time. In the 1st iteration S[0] is updated, i.e. the state value of state A; in the 2nd iteration S[1] (the state value of state B) is updated, and the same for state C. The question is: when state B is updated, we use the updated value of state A from the 1st iteration, and when state C is updated, both A and B have already been updated. As a result, the update of B depends on A, and the update of C depends on both A and B. Since these 3 states are parallel, why does one depend on another one or two? Why must we update in the order A-B-C, rather than B-C-A or some other order?
Action value is different from state value. In this line:

S[s] = round(np.maximum(a_0,a_1),2)

S[s] is a state value, while a_0 and a_1 are action values. Why can we simply keep the larger action value as the state value? What happened to the policy probability pi(a|s)?
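To make the distinction concrete, here is a small sketch I put together (the names P, R, pi, q, v_pi are mine, not from the book) showing both the Bellman expectation form, which uses pi(a|s), and the Bellman optimality form, which replaces it with a max:

```python
import numpy as np

# Illustrative sketch -- the names P, R, pi, q are mine, not from the book.
# One backup for a single state, using the same numbers as the snippet above.
S = [0.0, 0.0, 0.0]                       # current state-value estimates
gamma = 1
P = [[0.2, 0.5, 0.3], [0.3, 0.2, 0.5]]    # P(next state | action)
R = [[0.0, -1.0, 1.0], [0.0, -2.0, 1.0]]  # reward on entering each next state

# action values q(s, a): expected reward plus discounted next-state value
q = [sum(p * (r + gamma * v) for p, r, v in zip(P[a], R[a], S))
     for a in range(2)]

# Bellman expectation equation: v_pi(s) = sum_a pi(a|s) * q(s, a)
pi = [0.5, 0.5]                           # e.g. a uniform random policy
v_pi = sum(pi[a] * q[a] for a in range(2))

# Bellman optimality equation: v*(s) = max_a q(s, a).
# This is what S[s] = round(np.maximum(a_0,a_1),2) computes: the max equals
# the expectation under the greedy policy, which puts pi(a*|s) = 1 on the
# best action, so no explicit pi(a|s) factor appears.
v_star = float(np.maximum(q[0], q[1]))

print(q, v_pi, v_star)
```

With these numbers, q = [-0.2, 0.1], so the uniform policy yields v_pi = -0.05 while the optimality backup keeps v_star = 0.1.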


  • Question 1:

In the given implementation, the state value function is updated using the value iteration algorithm, which is a dynamic programming method for finding the optimal value function in reinforcement learning. The value iteration algorithm updates the state value function iteratively by considering the Bellman equation.

The order of the updates is not what guarantees convergence. The algorithm sweeps over the states and, for each state, computes the expected return of every available action and keeps the maximum. Because the code reuses values that were already updated earlier in the same sweep, this is the in-place (asynchronous, or Gauss-Seidel) variant of value iteration.

The specific order of updating the state value function can indeed vary: in-place value iteration converges to the same optimal values for any sweep order, A-B-C, B-C-A, or any other, as long as every state keeps being updated. Iterating A-B-C is simply the natural order when looping over the state list, and reusing freshly updated values typically speeds up convergence rather than biasing the result.

The value iteration algorithm converges when the updates to the state value function become smaller than a predefined threshold or when a maximum number of iterations is reached. By iteratively applying the Bellman equation, the algorithm gradually improves the state value function estimates until they converge to the optimal values.
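As a sketch of this point (my own code, not the course book's): running in-place sweeps in several different state orders on the same toy problem reaches the same fixed point, as does a fully synchronous sweep. gamma = 0.9 is assumed here, since with gamma = 1 and these rewards the repeated sweeps would grow without bound rather than settle.

```python
import numpy as np

# My own sketch, not the course book's code.  Same toy problem: 3 states,
# 2 actions, identical transitions/rewards from every state.  gamma = 0.9
# is assumed here so the iteration has a finite fixed point.
P = np.array([[0.2, 0.5, 0.3],    # P(next state | action 0)
              [0.3, 0.2, 0.5]])   # P(next state | action 1)
R = np.array([[0.0, -1.0, 1.0],   # reward on entering each next state (a=0)
              [0.0, -2.0, 1.0]])  # ... (a=1)
gamma = 0.9

def sweep(V, order, in_place=True):
    """One value-iteration sweep over the states in the given order."""
    V = V.copy()
    frozen = V.copy()                  # snapshot for the synchronous variant
    for s in order:
        src = V if in_place else frozen
        q = [float(np.dot(P[a], R[a] + gamma * src)) for a in range(2)]
        V[s] = max(q)                  # Bellman optimality backup
    return V

results = {}
for order in [(0, 1, 2), (2, 1, 0), (1, 2, 0)]:
    V = np.zeros(3)
    for _ in range(200):               # sweep until numerically converged
        V = sweep(V, order)
    results[order] = V

V_sync = np.zeros(3)
for _ in range(200):                   # synchronous: no in-sweep reuse at all
    V_sync = sweep(V_sync, (0, 1, 2), in_place=False)

for order, V in results.items():
    print(order, np.round(V, 4))       # every order gives the same values
print("sync  ", np.round(V_sync, 4))   # ... and so does the synchronous sweep
```

Each in-place order reaches the same fixed point, which is what the convergence result for asynchronous dynamic programming guarantees as long as no state is starved of updates.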

It’s important to note that the value iteration algorithm assumes complete knowledge of the environment dynamics and rewards, represented here by the transition probabilities and the reward values. In practice these may not be known, and other algorithms, such as model-free reinforcement learning methods, estimate value functions directly from sampled experience.
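As a hedged illustration of that last point, here is a minimal tabular Q-learning sketch (my own, not from the book) on the same three-state environment, where the transition probabilities are used only to simulate the environment and never appear in the update rule:

```python
import numpy as np

rng = np.random.default_rng(0)

# Environment simulator: the probabilities below are hidden from the learner
# and only used to draw samples.  (Dynamics are state-independent in this toy
# example, as in the course snippet.)
P = np.array([[0.2, 0.5, 0.3], [0.3, 0.2, 0.5]])   # P(next state | action)
R = np.array([[0.0, -1.0, 1.0], [0.0, -2.0, 1.0]]) # reward on arrival
gamma, alpha, eps = 0.9, 0.01, 0.1                 # assumed hyperparameters

Q = np.zeros((3, 2))                               # action-value estimates
s = 0
for _ in range(100_000):
    # epsilon-greedy action selection
    a = int(rng.integers(2)) if rng.random() < eps else int(np.argmax(Q[s]))
    s_next = int(rng.choice(3, p=P[a]))            # sample a transition
    r = R[a, s_next]
    # Q-learning backup: bootstrap from max_a' Q(s_next, a'); no P or R
    # appears here -- only the sampled (s, a, r, s_next)
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
    s = s_next

print(np.round(Q, 2))  # each row should approach q* = (0.7, 1.0)
```

After enough samples the learner recovers roughly the same action values that value iteration computes from the known model, without ever seeing P or R in its update.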


I don’t understand your point. Can you please explain it in more detail?