I implemented a way of using the angle of the pole to help train the neural network for the cart-pole game, and I recorded the average episode length for each algorithm. I also learned about the Temporal Difference Learning and Q-Learning algorithms for training models on an MDP.
HandsOnMachineLearningWithScikitLearnAndTensorFlow/.ipynb_checkpoints/homl_ch18_Reinforcement-learning-checkpoint.ipynb
/opt/anaconda3/envs/daysOfCode-env/lib/python3.7/site-packages/gym/logger.py:30: UserWarning: WARN: You are calling 'step()' even though this environment has already returned done = True. You should always call 'reset()' once you receive 'done = True' -- any further steps are undefined behavior.
I trained the model for 200 iterations on the HMS computing cluster.
The results of that model are shown below.
## Temporal Difference (TD) Learning
When training a model, however, the states, transition probabilities, and rewards are not known.
Therefore, the model must explore the environment to learn about the possible states.
The *Temporal Difference Learning* (TD Learning) algorithm is similar to the Value Iteration algorithm, but it begins under the assumption that the agent knows only the possible states and actions, and nothing more.
Thus, the agent uses an *exploration policy* (e.g. randomly making decisions) to explore the MDP.
As it learns, it updates the estimates of the state values based on the transitions and rewards that are actually observed:

$
V_{k+1}(s) \leftarrow (1-\alpha) \cdot V_k(s) + \alpha \cdot \left( r + \gamma \cdot V_k(s') \right)
$

* $r + \gamma \cdot V_k(s')$ is called the *TD target*
* $\delta_k(s, r, s') = r + \gamma \cdot V_k(s') - V_k(s)$ is called the *TD error*
+
566
+
For each state $s$, the algorithm keeps a running average of the immediate rewards the agent gets upon choosing an action plus the rewards it expects to get later.
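As a sketch of the idea, the following toy example runs TD Learning with a random exploration policy. The 3-state MDP below (its states, rewards, and transitions) is entirely made up for illustration and does not come from the book's code:

```python
import random

random.seed(0)

n_states = 3

# Hidden dynamics the agent does NOT know: (state, action) -> (next_state, reward).
# These transitions and rewards are invented purely for this sketch.
def step(state, action):
    if action == 0:
        return state + 1, 1.0   # move "right", collect a small reward
    return 0, 0.0               # move "left" back to the start, no reward

V = [0.0] * n_states            # state-value estimates, initialized to zero
gamma = 0.95                    # discount factor
alpha = 0.1                     # learning rate

state = 0
for _ in range(10_000):
    action = random.choice([0, 1])          # exploration policy: act randomly
    next_state, reward = step(state, action)
    td_target = reward + gamma * V[next_state]
    td_error = td_target - V[state]
    V[state] += alpha * td_error            # running average toward the TD target
    # restart the episode when we walk off the right edge (state 2)
    state = next_state if next_state != 2 else 0

print([round(v, 2) for v in V])
```

Because the agent restarts whenever it reaches state 2, $V(2)$ is never updated and stays at its initial value, while the visited states accumulate positive value estimates.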
## Q-Learning
The Q-Learning algorithm is an adaptation of the Q-Value Iteration algorithm to the situation where the transition probabilities and rewards of the MDP are initially unknown.
Q-Learning watches an agent play (e.g. randomly) and gradually improves its estimates of the Q-Values.
Once the estimates are good enough, the optimal policy is to choose the action with the highest Q-Value (i.e. a greedy policy).
$
Q(s, a) \xleftarrow[\alpha]{} r + \gamma \cdot \max_{a'} Q(s', a')
$
where $a \xleftarrow[\alpha]{} b \,$ means $\, a_{k+1} \leftarrow (1-\alpha) \cdot a_k + \alpha \cdot b_k$.
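The running-average notation $a \xleftarrow[\alpha]{} b$ can be illustrated numerically. In this sketch the constant target `10.0` is an arbitrary value chosen for the demonstration:

```python
# a <-α- b  means  a_{k+1} = (1 - alpha) * a_k + alpha * b_k.
# With a fixed target b, the running average converges toward b.
alpha = 0.1
a = 0.0
for _ in range(100):
    b = 10.0                       # arbitrary constant target for illustration
    a = (1 - alpha) * a + alpha * b
print(round(a, 4))                 # close to 10.0 after 100 updates
```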
For each state-action pair $(s, a)$, the algorithm records a running average of the rewards $r$ the agent receives upon leaving state $s$ with action $a$, plus the sum of discounted expected future rewards.
The maximum Q-Value of the next state is taken as this value because it is assumed the agent will act optimally from then on.
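The update rule above can be sketched on another toy MDP. As before, the states, actions, rewards, and transitions below are invented for illustration only; the agent explores randomly and then extracts a greedy policy from the learned Q-Values:

```python
import random

random.seed(42)

n_states, n_actions = 3, 2

# Hidden dynamics the agent must discover by exploring (made up for this sketch):
# action 0 cycles forward through the states and pays a reward when leaving state 1;
# action 1 always jumps back to state 0 with no reward.
def step(state, action):
    if action == 0:
        return (state + 1) % n_states, 1.0 if state == 1 else 0.0
    return 0, 0.0

Q = [[0.0] * n_actions for _ in range(n_states)]   # Q-Value estimates
gamma, alpha = 0.95, 0.1

state = 0
for _ in range(20_000):
    action = random.randrange(n_actions)           # exploration policy: random
    next_state, reward = step(state, action)
    # Q(s, a) <-α- r + γ max_a' Q(s', a')
    target = reward + gamma * max(Q[next_state])
    Q[state][action] = (1 - alpha) * Q[state][action] + alpha * target
    state = next_state

# Greedy policy once the estimates are good enough: pick the argmax action per state.
policy = [max(range(n_actions), key=lambda a: Q[s][a]) for s in range(n_states)]
print(policy)
```

In states 0 and 1 the greedy policy learns to keep cycling forward (action 0), since that path leads to the only reward in this toy environment.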