Hi,
there is a confusing point in the simple Bellman equation environment in the course book 'Reinforcement Learning for Robotics'. The code is:
import numpy as np

S = [0, 0, 0]            # state values of the three states, all initialised to 0
gamma = 1                # discount factor
a0p = [0.2, 0.5, 0.3]    # transition probabilities under action 0
a1p = [0.3, 0.2, 0.5]    # transition probabilities under action 1
a0r = [0.0, -1.0, 1.0]   # rewards under action 0
a1r = [0.0, -2.0, 1.0]   # rewards under action 1

for s in range(len(S)):
    # expected return of each action, computed from the current values in S
    a_0 = a0p[0] * (a0r[0] + gamma * S[0]) + a0p[1] * (a0r[1] + gamma * S[1]) + a0p[2] * (a0r[2] + gamma * S[2])
    a_1 = a1p[0] * (a1r[0] + gamma * S[0]) + a1p[1] * (a1r[1] + gamma * S[1]) + a1p[2] * (a1r[2] + gamma * S[2])
    # keep the larger of the two action values as the new state value
    S[s] = round(np.maximum(a_0, a_1), 2)

print(S)
My understanding so far is:
1. The initial state is A, and there are 3 possible next states, A, B, and C, reached by different actions with different probabilities. If I don't misunderstand, these 3 possible next states are parallel. But in the code above, the state value S[s] is updated as the loop proceeds, i.e. over time: in the 1st iteration S[0] (the state value of state A) is updated, in the 2nd iteration S[1] (the state value of state B), and likewise for state C. The question is: when state B is updated, we use the value of state A that was already updated in the 1st iteration, and when state C is updated, both A and B have already been updated. As a result, the update of B depends on A, and the update of C depends on both A and B. Since these 3 states are parallel, why does one update depend on another one or two? Why must we update in the order A-B-C, rather than B-C-A or some other order? (See the sketch right below for what I mean by updating every state from the same old values.)
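To make the question concrete, here is a minimal sketch of what I would have expected instead. It is my own code, not from the book (the snapshot list S_old is my own name); it reuses the same probabilities and rewards but updates every state from a frozen copy of the old values, so the update order cannot matter:

import numpy as np

# My own sketch, not the book's code: a "synchronous" sweep that updates every
# state from a frozen copy of the old values, so the update order cannot matter.
S = [0, 0, 0]
gamma = 1
a0p = [0.2, 0.5, 0.3]
a1p = [0.3, 0.2, 0.5]
a0r = [0.0, -1.0, 1.0]
a1r = [0.0, -2.0, 1.0]

S_old = list(S)  # snapshot taken before the sweep; all updates read from it
for s in range(len(S)):
    a_0 = sum(a0p[j] * (a0r[j] + gamma * S_old[j]) for j in range(len(S)))
    a_1 = sum(a1p[j] * (a1r[j] + gamma * S_old[j]) for j in range(len(S)))
    S[s] = round(np.maximum(a_0, a_1), 2)

print(S)

With this version all three states get the same value in the first sweep, because every update uses the same probabilities, rewards, and snapshot, while the book's in-place loop produces three different numbers that depend on the order A-B-C. That difference is exactly what my question is about.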
2. Action value is different from state value. In the line S[s] = round(np.maximum(a_0, a_1), 2), S[s] is a state value, while a_0 and a_1 are action values. Why can we just keep the larger action value as the state value? What about the probability part, pi(a|s)? (A sketch of the alternative I have in mind follows right after this question.)
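What I imagined instead is something like the following sketch. Again it is my own code, not from the book, and the policy probabilities pi_a0 = 0.4 and pi_a1 = 0.6 are made-up numbers purely to illustrate the question: the state value is a pi(a|s)-weighted average of the action values rather than their maximum.

# My own sketch, not the book's code: a made-up policy pi(a|s) that picks
# action 0 with probability 0.4 and action 1 with probability 0.6 in every state.
pi_a0 = 0.4
pi_a1 = 0.6

S = [0, 0, 0]
gamma = 1
a0p = [0.2, 0.5, 0.3]
a1p = [0.3, 0.2, 0.5]
a0r = [0.0, -1.0, 1.0]
a1r = [0.0, -2.0, 1.0]

for s in range(len(S)):
    a_0 = sum(a0p[j] * (a0r[j] + gamma * S[j]) for j in range(len(S)))
    a_1 = sum(a1p[j] * (a1r[j] + gamma * S[j]) for j in range(len(S)))
    # expectation over the actions under pi(a|s) instead of taking the maximum
    S[s] = round(pi_a0 * a_0 + pi_a1 * a_1, 2)

print(S)

So my question is: does the max in the book's line replace this pi(a|s)-weighted sum, and if so, why is the max the right choice here?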