Commit 959d7ea

129 of 100 Days of Python
I learned about setting up a TF Agents environment and the common wrappers applied to Atari games.
1 parent 06d4a56 commit 959d7ea

File tree

8 files changed
+966 −81 lines changed

HandsOnMachineLearningWithScikitLearnAndTensorFlow/.ipynb_checkpoints/homl_ch18_Reinforcement-learning-checkpoint.ipynb

Lines changed: 336 additions & 17 deletions
Large diffs are not rendered by default.

HandsOnMachineLearningWithScikitLearnAndTensorFlow/homl_ch18_Reinforcement-learning.ipynb

Lines changed: 336 additions & 17 deletions
Large diffs are not rendered by default.

HandsOnMachineLearningWithScikitLearnAndTensorFlow/homl_ch18_Reinforcement-learning.md

Lines changed: 242 additions & 8 deletions
@@ -51,14 +51,10 @@ obs = env.reset()
 obs
 ```
-    /opt/anaconda3/envs/daysOfCode-env/lib/python3.7/site-packages/gym/logger.py:30: UserWarning: WARN: Box bound precision lowered by casting to float32
-      warnings.warn(colorize('%s: %s'%('WARN', msg % args), 'yellow'))
-    array([ 0.04916234, -0.01195665,  0.01878708, -0.03852683])
+    array([ 0.03024519, -0.03961465, -0.03585694, -0.00028358])

@@ -109,7 +105,7 @@ obs
-    array([ 0.04892321, -0.20734289,  0.01801655,  0.2600239 ])
+    array([ 0.02945289, -0.2342045 , -0.03586261,  0.2808739 ])

@@ -216,7 +212,7 @@ np.mean(totals), np.median(totals), np.std(totals), np.min(totals), np.max(totals)
-    (41.538, 40.0, 8.921466023025587, 24.0, 68.0)
+    (42.754, 41.0, 9.098872677425485, 24.0, 67.0)

@@ -922,7 +918,6 @@ for episode in range(600):
 To change all layers to have dtype float64 by default, call `tf.keras.backend.set_floatx('float64')`. To change just this layer, pass dtype='float64' to the layer constructor. If you are the author of this layer, you can disable autocasting by passing autocast=False to the base Layer constructor.
-    Stopping early at episode 477
@@ -1135,6 +1130,245 @@ model = keras.Model(inputs=[input_states], outputs=[Q_values])

## The TF-Agents library

The TF-Agents library is a reinforcement learning library built on TensorFlow.
It provides many environments, including wrappers around OpenAI Gym environments and physics engines, as well as the components needed to build and train agents.
We will use it to train a DQN to play the Atari game *Breakout*.

### TF-Agents environment

We can create a Breakout environment, which is a wrapper around the corresponding OpenAI Gym environment.
```python
from tf_agents.environments import suite_gym

breakout_env = suite_gym.load('Breakout-v4')
breakout_env
```

    <tf_agents.environments.wrappers.TimeLimit at 0x146ff3790>
There are some differences between the APIs of OpenAI Gym and TF-Agents.
For instance, calling the `reset()` method of an environment does not return just the observations, but a `TimeStep` object that bundles a lot of information.

```python
breakout_env.reset()
```
    TimeStep(step_type=array(0, dtype=int32), reward=array(0., dtype=float32), discount=array(1., dtype=float32), observation=array([[[0, 0, 0],
            [0, 0, 0],
            ...,
            [0, 0, 0]]], dtype=uint8))
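To make the `TimeStep` structure concrete, here is a minimal plain-Python stand-in (a sketch, not the TF-Agents class). In TF-Agents, `step_type` is 0 for the first step of an episode, 1 for intermediate steps, and 2 for the last step.

```python
from typing import Any, NamedTuple


class TimeStepSketch(NamedTuple):
    """Stand-in for TF-Agents' TimeStep, for illustration only.
    TF-Agents encodes StepType.FIRST as 0, MID as 1, LAST as 2."""
    step_type: int
    reward: float
    discount: float
    observation: Any

    def is_first(self) -> bool:
        return self.step_type == 0

    def is_last(self) -> bool:
        return self.step_type == 2


ts = TimeStepSketch(step_type=0, reward=0.0, discount=1.0, observation=[[0, 0, 0]])
print(ts.is_first(), ts.is_last())  # True False
```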
```python
breakout_env.step(1)
```
    TimeStep(step_type=array(1, dtype=int32), reward=array(0., dtype=float32), discount=array(1., dtype=float32), observation=array([[[0, 0, 0],
            [0, 0, 0],
            ...,
            [0, 0, 0]]], dtype=uint8))
We can also get the specifications of an environment through dedicated methods.

```python
breakout_env.observation_spec()
```

    BoundedArraySpec(shape=(210, 160, 3), dtype=dtype('uint8'), name='observation', minimum=0, maximum=255)
```python
breakout_env.action_spec()
```

    BoundedArraySpec(shape=(), dtype=dtype('int64'), name='action', minimum=0, maximum=3)
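A bounded spec like this one is essentially a shape/dtype/range contract. Checking a value against such a contract can be sketched as follows (`action_in_bounds` is a hypothetical helper, not a TF-Agents function):

```python
def action_in_bounds(action, minimum=0, maximum=3):
    """Check a discrete action against BoundedArraySpec-style bounds.
    Hypothetical helper for illustration; in TF-Agents the spec object
    itself carries the minimum and maximum."""
    return minimum <= action <= maximum


# Breakout's action spec is bounded to [0, 3].
print(action_in_bounds(2))  # True
print(action_in_bounds(4))  # False
```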
```python
breakout_env.time_step_spec()
```

    TimeStep(step_type=ArraySpec(shape=(), dtype=dtype('int32'), name='step_type'), reward=ArraySpec(shape=(), dtype=dtype('float32'), name='reward'), discount=BoundedArraySpec(shape=(), dtype=dtype('float32'), name='discount', minimum=0.0, maximum=1.0), observation=BoundedArraySpec(shape=(210, 160, 3), dtype=dtype('uint8'), name='observation', minimum=0, maximum=255))
```python
breakout_env.gym.get_action_meanings()
```

    ['NOOP', 'FIRE', 'RIGHT', 'LEFT']
### Environment wrappers and Atari preprocessing

TF-Agents includes *environment wrappers*: wrappers around environments that are invoked at every step and add some extra functionality.
Here are some that seem quite useful:

* `ActionClipWrapper`: Clips the actions to the action specification.
* `ActionDiscretizeWrapper`: If an environment has a continuous action space, this can quantize it into a specified number of discrete actions.
* `ActionRepeat`: Repeats each action over multiple steps, accumulating the rewards; this can speed up training in some environments.
* `RunStats`: Records environment statistics.
* `TimeLimit`: Interrupts the environment if it runs for longer than a maximum number of steps.
* `VideoWrapper`: Records a video of the environment.
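To illustrate the idea behind `ActionRepeat`, here is a rough plain-Python sketch of an action-repeat wrapper; the interface and names here are assumptions for illustration, not the TF-Agents API.

```python
class ActionRepeatSketch:
    """Rough sketch of an action-repeat wrapper: replays each action for
    `times` steps and accumulates the rewards. Illustration only, not
    the TF-Agents implementation."""

    def __init__(self, env, times=4):
        self.env = env
        self.times = times

    def step(self, action):
        total_reward = 0.0
        for _ in range(self.times):
            obs, reward, done = self.env.step(action)
            total_reward += reward
            if done:
                break  # stop repeating once the episode ends
        return obs, total_reward, done


class ToyEnv:
    """Tiny fake environment: each step yields a reward of 1 and the
    episode ends after 10 steps."""

    def __init__(self):
        self.t = 0

    def step(self, action):
        self.t += 1
        return self.t, 1.0, self.t >= 10


env = ActionRepeatSketch(ToyEnv(), times=4)
obs, reward, done = env.step(0)
print(obs, reward, done)  # 4 4.0 False
```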
The preprocessing applied to Atari environments is fairly standardized: grayscale conversion and downsampling of the observations, max pooling of the last two frames of the game using a 1x1 filter, frame skipping (by default, the agent only sees every fourth frame, and its action is repeated over the skipped frames), and end-of-life loss (whether or not to end an episode when the player loses a life).

We will not use frame skipping in this case, but will apply a wrapper that merges 4 frames into one (this helps the agent learn which direction the ball is moving in).
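The effect of a frame-stacking wrapper like `FrameStack4` can be sketched with a fixed-length deque. This is an illustration under assumed names only; the real wrapper stacks the frames along the channel axis of the observation array.

```python
from collections import deque


class FrameStackSketch:
    """Keeps the last `k` observations and returns them together, so the
    agent can infer motion across frames. Illustration only."""

    def __init__(self, k=4):
        self.k = k
        self.frames = deque(maxlen=k)

    def reset(self, first_frame):
        # Repeat the first frame so the stack is full from the start.
        for _ in range(self.k):
            self.frames.append(first_frame)
        return list(self.frames)

    def observe(self, frame):
        self.frames.append(frame)  # the oldest frame falls off automatically
        return list(self.frames)


stack = FrameStackSketch(k=4)
print(stack.reset('f0'))    # ['f0', 'f0', 'f0', 'f0']
print(stack.observe('f1'))  # ['f0', 'f0', 'f0', 'f1']
```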
```python
from tf_agents.environments import suite_atari
from tf_agents.environments.atari_preprocessing import AtariPreprocessing
from tf_agents.environments.atari_wrappers import FrameStack4

max_episode_steps = 27000
environment_name = 'BreakoutNoFrameskip-v4'

breakout_env = suite_atari.load(
    environment_name,
    max_episode_steps=max_episode_steps,
    gym_env_wrappers=[AtariPreprocessing, FrameStack4]
)
```
Lastly, we can wrap this environment in a `TFPyEnvironment` so it is usable from within a TensorFlow graph.

```python
from tf_agents.environments.tf_py_environment import TFPyEnvironment

tf_env = TFPyEnvironment(breakout_env)
```
### Training architecture
(binary files changed: −175 Bytes, −1.33 KB)
README.md

Lines changed: 3 additions & 0 deletions
@@ -570,3 +570,6 @@ Overall, we demonstrated just how finicky RL training can be.

 **Day 128 - March 04, 2020:**
 We learned about some more updates to RL made by DeepMind including Double DQN, PER, and Dueling DQN.
+
+**Day 129 - March 05, 2020:**
+I learned about setting up a TF Agents environment and the common wrappers applied to Atari games.
