
Commit fe44938

124 of 100 Days of Python
I implemented a way of using the angle of the pole to help train the NN for the cart-pole game. I also recorded the average length of the episode on each algorithm. I also learned about the Temporal Difference learning and Q-Learning algorithms for training models using MDP.
1 parent 788bf7c commit fe44938

File tree

6 files changed (+269, −55 lines)


HandsOnMachineLearningWithScikitLearnAndTensorFlow/.ipynb_checkpoints/homl_ch18_Reinforcement-learning-checkpoint.ipynb

Lines changed: 103 additions & 23 deletions
Large diffs are not rendered by default.

HandsOnMachineLearningWithScikitLearnAndTensorFlow/homl_ch18_Reinforcement-learning.ipynb

Lines changed: 103 additions & 23 deletions
Large diffs are not rendered by default.

HandsOnMachineLearningWithScikitLearnAndTensorFlow/homl_ch18_Reinforcement-learning.md

Lines changed: 58 additions & 9 deletions
@@ -58,7 +58,7 @@ obs
-array([-0.03100318, -0.01310485, -0.00181168, -0.00497128])
+array([-0.03726314,  0.01048634, -0.01756487,  0.01227879])
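The four numbers in the changed `obs` arrays are the CartPole observation vector; a small annotated sketch of its layout (per the OpenAI Gym CartPole docs), using the new array values from this diff:

```python
# CartPole observation layout: [cart position, cart velocity,
# pole angle (radians), pole angular velocity].
# The sample values are copied from the updated `obs` array in this diff.
obs = [-0.03726314, 0.01048634, -0.01756487, 0.01227879]
cart_position, cart_velocity, pole_angle, pole_angular_velocity = obs
print(pole_angle < 0)  # True: the pole is currently leaning left
```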

@@ -109,7 +109,7 @@ obs
-array([-0.03126528, -0.20820077, -0.0019111 ,  0.28713949])
+array([-0.03705342, -0.18437937, -0.0173193 ,  0.29936845])

@@ -205,7 +205,7 @@ plt.show()
 ```
-![png](homl_ch18_Reinforcement-learning_files/homl_ch18_Reinforcement-learning_17_0.png)
+![png](homl_ch18_Reinforcement-learning_files/homl_ch18_Reinforcement-learning_18_0.png)

@@ -216,7 +216,7 @@ np.mean(totals), np.median(totals), np.std(totals), np.min(totals), np.max(totals)
-(42.02, 41.0, 9.141094026428128, 24.0, 70.0)
+(42.388, 40.0, 8.824367172777887, 24.0, 68.0)
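These `totals` statistics (mean episode length of roughly 42 steps) are typically produced by the book's hard-coded baseline policy, which pushes the cart toward the side the pole is leaning. A minimal sketch, with the function name and sample observations assumed for illustration:

```python
def basic_policy(obs):
    """Hard-coded baseline policy: push left (action 0) when the pole
    leans left, push right (action 1) otherwise."""
    angle = obs[2]  # obs = [cart pos, cart vel, pole angle, pole angular vel]
    return 0 if angle < 0 else 1

# Illustrative observations (not taken from a live environment):
print(basic_policy([-0.031, -0.013, -0.0018, -0.0050]))  # pole leaning left  -> 0
print(basic_policy([-0.037, 0.010, 0.0123, 0.0123]))     # pole leaning right -> 1
```

This policy ignores everything except the pole angle, which is why its episodes rarely last beyond ~70 steps and a learned policy is needed.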

@@ -403,6 +403,8 @@ Here is how it works:
 ```python
+avg_episode_lengths = []
+
 for iteration in range(n_iterations):
     print(f"iteration {iteration}")
     all_rewards, all_grads = play_multiple_episodes(env,
@@ -423,6 +425,8 @@ for iteration in range(n_iterations):
                                                     )
     all_mean_grads.append(mean_grads)
     optimizer.apply_gradients(zip(all_mean_grads, model.trainable_variables))
+
+    avg_episode_lengths.append(np.mean([len(x) for x in all_rewards]))
 ```

     iteration 0
@@ -438,7 +442,19 @@ for iteration in range(n_iterations):
 ```python
-if True:
+plt.plot(avg_episode_lengths, 'b-')
+plt.xlabel('iteration', fontsize=15)
+plt.ylabel('average reward', fontsize=15)
+plt.show()
+```
+
+
+![png](homl_ch18_Reinforcement-learning_files/homl_ch18_Reinforcement-learning_43_0.png)
+
+
+```python
+if False:
     obs = env.reset()
     for _ in range(300):
         env.render()
@@ -450,10 +466,6 @@ if True:
     env.close()
 ```

-    /opt/anaconda3/envs/daysOfCode-env/lib/python3.7/site-packages/gym/logger.py:30: UserWarning: WARN: You are calling 'step()' even though this environment has already returned done = True. You should always call 'reset()' once you receive 'done = True' -- any further steps are undefined behavior.
-      warnings.warn(colorize('%s: %s'%('WARN', msg % args), 'yellow'))
-
-
 I trained the model on 200 iterations on the HMS computing cluster.
 The results of that model are shown below.

@@ -533,6 +545,43 @@ $

 ## Temporal Difference (TD) Learning

+When training a model, however, the states, transition probabilities, and rewards are not known.
+Therefore, the model must explore the environment to learn about the possible states.
+The *Temporal Difference Learning* (TD Learning) algorithm is similar to the Value Iteration algorithm, but it begins from the assumption that the model knows only the possible states and actions, and nothing more.
+Thus, the agent uses an *exploration policy* (e.g. randomly making decisions) to explore the MDP.
+As it learns, it updates the estimates of the state values based on the transitions and rewards that are actually observed.
+
+$
+V_{k+1}(s) \leftarrow (1-\alpha)V_{k}(s) + \alpha (r + \gamma \cdot V_k(s')) \\
+V_{k+1}(s) \leftarrow V_{k}(s) + \alpha \cdot \delta_k(s, r, s') \\
+\quad \text{where} \quad \delta_k(s,r,s') = r + \gamma \cdot V_k(s') - V_k(s)
+$
+
+where:
+
+* $\alpha$ is the learning rate
+* $r + \gamma \cdot V_k(s')$ is called the *TD target*
+* $\delta_k(s,r,s')$ is called the *TD error*
+
+For each state $s$, the algorithm keeps a running average of the immediate rewards the agent gets upon choosing an action plus the rewards it expects to get later.
+
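The TD update added here can be written as a one-line function; the dict-based value table and the toy transition below are invented for illustration:

```python
def td_update(V, s, r, s_next, alpha=0.1, gamma=0.95):
    """One TD Learning step on the state-value table V."""
    td_target = r + gamma * V[s_next]   # the TD target
    td_error = td_target - V[s]         # the TD error, delta_k(s, r, s')
    V[s] = V[s] + alpha * td_error      # same as (1-alpha)*V[s] + alpha*target
    return td_error

# Toy example: the agent observes transition s=0 -> s'=1 with reward r=1.
V = {0: 0.0, 1: 1.0}
delta = td_update(V, s=0, r=1.0, s_next=1)
print(round(V[0], 3), round(delta, 3))  # 0.195 1.95
```

Note that `V[0]` moves only a fraction $\alpha$ of the way toward the TD target, which is what makes the estimate a running average over noisy observed transitions.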
+## Q-Learning
+
+The Q-Learning algorithm is an adaptation of the Q-Value Iteration algorithm to the situation where the MDP's transition probabilities and rewards are unknown.
+Q-Learning watches an agent play (e.g. randomly) and gradually improves its estimates of the Q-Values.
+Once the estimates are good enough, the optimal policy is to choose the action with the highest Q-Value (i.e. a greedy policy).
+
+$
+Q(s, a) \xleftarrow[\alpha]{} r + \gamma \cdot \max_{a'} Q(s', a')
+$
+
+where $a \xleftarrow[\alpha]{} b \,$ means $\, a_{k+1} \leftarrow (1-\alpha) \cdot a_k + \alpha \cdot b_k$.
+
+For each state-action pair $(s,a)$, the algorithm records a running average of the rewards $r$ the agent receives upon leaving state $s$ with action $a$, plus the sum of discounted expected future rewards.
+The maximum Q-Value of the next state is used because it is assumed the agent will act optimally from then on.
+
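The update above can be sketched in tabular form on a toy MDP; the three-state chain, its rewards, and the hyperparameters here are all invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

n_states, n_actions = 3, 2
Q = np.zeros((n_states, n_actions))  # Q-Value table, initialized to zero
alpha, gamma = 0.1, 0.95

def step(state, action):
    """Toy deterministic MDP: action 1 moves right, action 0 stays put;
    reward 1 only for entering the terminal rightmost state."""
    next_state = min(state + action, n_states - 1)
    reward = 1.0 if (next_state == n_states - 1 and state != n_states - 1) else 0.0
    return next_state, reward

for episode in range(500):
    state = 0
    while state != n_states - 1:
        action = int(rng.integers(n_actions))  # random exploration policy
        next_state, reward = step(state, action)
        # Running average toward the TD target r + gamma * max_a' Q(s', a')
        Q[state, action] = ((1 - alpha) * Q[state, action]
                            + alpha * (reward + gamma * Q[next_state].max()))
        state = next_state

# Greedy policy for the two non-terminal states: always move right.
print(Q.argmax(axis=1)[:2])  # [1 1]
```

Even though the agent only ever acts randomly, the learned greedy policy is optimal for this chain, which is why Q-Learning is called an *off-policy* algorithm.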
+Now we can implement the Q-Learning algorithm.
+

 ```python


README.md

Lines changed: 5 additions & 0 deletions
@@ -549,3 +549,8 @@ I trained the model on the HMS computing cluster.

 **Day 123 - February 27, 2020:**
 We learned about Markov Decision Processes and how to estimate the optimal policy by modeling the system as a Markov process.
+
+**Day 124 - February 28, 2020:**
+I implemented a way of using the angle of the pole to help train the NN for the cart-pole game.
+I also recorded the average length of the episode on each algorithm.
+I also learned about the Temporal Difference Learning and Q-Learning algorithms for training models using MDPs.
