
Add tf/DQN with dueling support#582

Merged
ahtsan merged 19 commits into master from dqn
May 14, 2019
Conversation

@ahtsan
Contributor

@ahtsan ahtsan commented Mar 12, 2019

  • DQN implementation with garage.Model.
  • Currently, for pixel environments, we use a bunch of wrappers from baselines. Later we can set up a data processing pipeline to make that faster.
  • Removed the single-layer MLP in cnn. I think it makes more sense to separate them, so now cnn will return the flattened output. Also modified the unit test for cnn accordingly.
  • Used self.models to store all the models in QFunction. Sometimes there are multiple models and we need something to keep track of them. I assume the models are stored in order, so the input must be self.models[0].input and the output must be self.models[-1].output.
  • Added corresponding functions and properties in QFunction to make references easier, e.g. def q_vals() and the input property.
  • Clone method in QFunction to enable object copying (I think there should be better ways). Since all the necessary operations are done in object construction, we can simply create a new object with a different name (because we want the new object to have a different variable_scope).

Will post benchmark result soon, fixing the scaling.
Will add more tests later.
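The clone-by-reconstruction idea from the description above can be sketched in plain Python. This is a minimal sketch of the pattern, not garage's actual API: the class name, constructor arguments, and the `'dummy-spec'` value are illustrative.

```python
# Hypothetical sketch of clone-by-reconstruction: since all setup happens in
# __init__, cloning just means re-running the constructor with a new name.
class DiscreteMLPQFunction:
    def __init__(self, env_spec, hidden_sizes=(64, 64),
                 name='DiscreteMLPQFunction'):
        self._env_spec = env_spec
        self._hidden_sizes = hidden_sizes
        self.name = name  # used as the variable scope, so clones must differ

    def clone(self, name):
        # Copies the configuration only; learned parameters are NOT copied.
        return self.__class__(
            env_spec=self._env_spec,
            hidden_sizes=self._hidden_sizes,
            name=name)


qf = DiscreteMLPQFunction(env_spec='dummy-spec')
target_qf = qf.clone(name='target_qf')
print(target_qf.name)  # → target_qf
```

A target network for DQN would then be created as `qf.clone(name='target_qf')`, giving it a distinct variable scope while sharing the same architecture.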

@ahtsan ahtsan requested a review from a team as a code owner March 12, 2019 19:46
obs_ph = tf.placeholder(tf.float32, (None, ) + obs_dim, name="obs")

self.model.build(obs_ph)
with tf.variable_scope(self._variable_scope):
Member

with self._variable_scope

Contributor Author

@ahtsan ahtsan Mar 12, 2019

self._variable_scope is a VariableScope object, so we have to do

with tf.variable_scope(VariableScopeObject):

to re-enter the scope.
(see https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/ops/variable_scope.py#L1804).
Also, VariableScope is picklable; variable_scope is not.

out = model.build(out)

def q_vals(self):
return self.models[-1].networks['default'].outputs
Member

why not use a dict?

Contributor Author

Do you mean storing q_vals in a dict, or storing the models in a dict?

Member

The models. If the order of insertion changes at any time, your code will break.

You can also just keep the models in instance variables (self.model1, etc.) and also add them to the list self.models...
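The reviewer's suggestion — named instance attributes plus an ordered `self.models` list — can be sketched in plain Python. `Model`, `QFunction`, and the build logic below are stand-ins for illustration, not garage's real classes:

```python
# Hypothetical sketch: keep each model in a named attribute AND in an ordered
# list, so code gets stable references instead of relying on insertion order.
class Model:
    def __init__(self, name):
        self.name = name

    def build(self, inputs):
        # Stand-in for the real graph-building step: record which model ran.
        return inputs + [self.name]


class QFunction:
    def __init__(self):
        self.cnn_model = Model('cnn')
        self.mlp_model = Model('mlp')
        # The list preserves build order; the attributes give direct access.
        self.models = [self.cnn_model, self.mlp_model]

    def build(self, inputs):
        out = inputs
        for model in self.models:
            out = model.build(out)
        return out


qf = QFunction()
print(qf.build([]))  # → ['cnn', 'mlp']
```

With this layout, `qf.models[0].input` / `qf.models[-1].output` style access still works, but code that needs a specific model can use `qf.cnn_model` without assuming list order.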

Comment thread examples/tf/dqn_breakout.py Outdated
strides=(4, 2, 1),
dueling=False)

policy = DiscreteQfDerivedPolicy(env_spec=env, qf=qf)
Member

env.spec

algo.train(sess)


run_experiment(
Member

Please update this with LocalRunner

Comment thread examples/tf/dqn_cartpole.py Outdated
num_timesteps=num_timesteps,
qf_lr=1e-4,
discount=1.0,
min_buffer_size=1e3,
Member

Have you added an int() call to min_buffer_size?

Comment thread examples/tf/dqn_cartpole.py Outdated

replay_buffer = SimpleReplayBuffer(
env_spec=env.spec,
size_in_transitions=int(10000),
Member

int(1e4) or 10000

Comment thread examples/tf/dqn_cartpole.py Outdated
qf = DiscreteMLPQFunction(
env_spec=env.spec, hidden_sizes=(64, 64), dueling=False)

policy = DiscreteQfDerivedPolicy(env_spec=env, qf=qf)
Member

env.spec

Comment thread garage/tf/algos/dqn.py Outdated
qf_lr=0.001,
qf_optimizer=tf.train.AdamOptimizer,
discount=1.0,
name=None,
Member

name='DQN'

Comment thread garage/tf/models/cnn_model.py Outdated
num_filters,
strides,
name=None,
padding="SAME",
Member

I think it's better to use single quotes in new files.

filter_dims,
num_filters,
strides,
name=None,
Member

name='CNNModel'

Contributor Author

Model has a default name if name=None, which will be the class name. How should we do this? If we enforce that all derived model classes have a name, we don't need the default name at all.

Member

@ryanjulian ryanjulian Mar 12, 2019

Seems like you have already violated the Model interface by making name a kwarg:

def __init__(self, name):

Contributor Author

Oh yes, so a name is actually required.

Comment thread garage/tf/q_functions/discrete_cnn_q_function.py
for model in self.models:
out = model.build(out)

def q_vals(self):
Member

@property
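The reviewer is suggesting that `q_vals` be exposed as a read-only attribute rather than a method call. A minimal sketch of that pattern (the class and stored value here are illustrative, not garage's implementation):

```python
# Hypothetical sketch of exposing q_vals via @property: callers write
# qf.q_vals instead of qf.q_vals(), and the attribute is read-only.
class DiscreteQFunction:
    def __init__(self, outputs):
        self._outputs = outputs

    @property
    def q_vals(self):
        # In the real code this would return the network's output tensor.
        return self._outputs


qf = DiscreteQFunction(outputs=[0.1, 0.9])
print(qf.q_vals)  # → [0.1, 0.9]
```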



class TestDQN(TfGraphTestCase):
def test_dqn_cartpole(self):
Member

How long does it take to run this test?

Contributor Author

around 30s.

@CatherineSue CatherineSue changed the title DQN with dueling support Add tf/DQN with dueling support Mar 12, 2019
Comment thread garage/tf/algos/dqn.py Outdated
episode_rewards.append(0.)

for itr in range(self.num_timesteps):
with logger.prefix('Iteration #%d | ' % itr):
Member

'Timestep' sounds more appropriate to me.

Comment thread garage/tf/algos/dqn.py Outdated
self._dueling = dueling

obs_dim = self._env_spec.observation_space.shape
action_dim = env_spec.action_space.flat_dim
Member

self._env_spec.action_space.flat_dim

@@ -0,0 +1,190 @@
"""Discrete MLP QFunction."""
Member

We should state that this CNN network actually supports CNN2MLP. This is not the same as the CNN* primitive in the current garage. If you think the naming is OK, please add more details to this documentation.

out = model.build(out, name=name)
return out

def clone(self, name):
Member

Should clone interface be in the base class?

Contributor Author

Yes, it will eventually be an interface in Model too.

Member

Please update it into the base class.

Comment thread tests/garage/tf/algos/test_dqn.py Outdated
size_in_transitions=int(5000),
time_horizon=max_path_length)
qf = DiscreteMLPQFunction(env_spec=env.spec, hidden_sizes=(64, 64))
policy = DiscreteQfDerivedPolicy(env_spec=env, qf=qf)
Member

env_spec

Comment thread garage/tf/algos/dqn.py
Comment thread tests/garage/tf/algos/test_dqn.py Outdated
@@ -0,0 +1,55 @@
"""
This script creates a test that fails when garage.tf.algos.DDPG performance is
Member

garage.tf.algos.DQN

@ahtsan
Contributor Author

ahtsan commented Mar 29, 2019

Benchmark with 1M timesteps and random seed=26
[Benchmark plots: SpaceInvadersNoFrameskip-v4, SeaquestNoFrameskip-v4, QbertNoFrameskip-v4, PongNoFrameskip-v4, EnduroNoFrameskip-v4, BreakoutNoFrameskip-v4, BeamRiderNoFrameskip-v4]

Member

@ryanjulian ryanjulian left a comment

Excellent work. +1000

Please address other reviewers' comments, rebase, and submit!

@ahtsan ahtsan force-pushed the dqn branch 2 times, most recently from 3890c4d to 5b95308 Compare May 1, 2019 16:04
@codecov

codecov Bot commented May 1, 2019

Codecov Report

Merging #582 into master will increase coverage by 1.1%.
The diff coverage is 97%.

Impacted file tree graph

@@            Coverage Diff            @@
##           master     #582     +/-   ##
=========================================
+ Coverage   62.25%   63.35%   +1.1%     
=========================================
  Files         163      169      +6     
  Lines        9532     9786    +254     
  Branches     1267     1284     +17     
=========================================
+ Hits         5934     6200    +266     
+ Misses       3288     3269     -19     
- Partials      310      317      +7
Impacted Files Coverage Δ
garage/tf/core/cnn.py 100% <ø> (+3.12%) ⬆️
garage/tf/models/mlp_model.py 100% <ø> (ø) ⬆️
garage/tf/misc/tensor_utils.py 75.73% <100%> (+2.13%) ⬆️
garage/tf/algos/__init__.py 100% <100%> (ø) ⬆️
garage/tf/q_functions/discrete_mlp_q_function.py 100% <100%> (+57.89%) ⬆️
garage/tf/models/__init__.py 100% <100%> (ø) ⬆️
garage/tf/models/cnn_model_max_pooling.py 100% <100%> (ø)
garage/envs/wrappers/__init__.py 100% <100%> (ø) ⬆️
garage/tf/algos/ddpg.py 78.76% <100%> (-0.98%) ⬇️
garage/envs/wrappers/fire_reset.py 100% <100%> (ø) ⬆️
... and 32 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ee6e96c...ca08e16. Read the comment docs.

@ahtsan
Contributor Author

ahtsan commented May 1, 2019

Fixed all comments above. I will add more tests.

@ahtsan ahtsan force-pushed the dqn branch 2 times, most recently from b2983a1 to b5d8ff4 Compare May 9, 2019 07:06
Comment thread garage/tf/q_functions/base2.py
@ahtsan ahtsan force-pushed the dqn branch 3 times, most recently from 2dac995 to 4dc7abb Compare May 11, 2019 04:56
Comment thread garage/tf/algos/dqn.py Outdated
Comment thread garage/tf/algos/dqn.py Outdated
Comment thread garage/tf/misc/tensor_utils.py Outdated
Comment thread garage/tf/models/cnn_model_max_pooling.py Outdated
Comment thread garage/envs/wrappers/atari_env.py
Comment thread garage/misc/tensor_utils.py
Comment thread garage/tf/algos/dqn.py
Comment thread garage/tf/q_functions/base2.py Outdated
Comment thread garage/tf/q_functions/base2.py Outdated
# Select which episodes to use
time_horizon = buffer["action"].shape[1]
rollout_batch_size = buffer["action"].shape[0]
time_horizon = buffer['action'].shape[1]
Member

Is it still valid to have time_horizon since the replay buffer now stores variable-length episodes? It may be beyond the scope of this PR; I just think the interface seems a bit confusing now.

Contributor Author

@ahtsan ahtsan May 13, 2019

Is it because there is no max_path_length anymore in off policy algos?

Comment thread tests/garage/misc/test_tensor_utils.py
Comment thread garage/tf/algos/dqn.py
Member

@CatherineSue CatherineSue left a comment

I only have some minor comments.

@ahtsan ahtsan merged commit b96d154 into master May 14, 2019
@ahtsan ahtsan deleted the dqn branch May 14, 2019 05:30
@ryanjulian ryanjulian mentioned this pull request May 14, 2019
nish21 pushed a commit that referenced this pull request May 14, 2019
DQN implementation with garage.Model.

This is the first algorithm for pixel environments. This PR
adds the algorithm as well as the models, primitives and
environment wrappers required for training in pixel environments.

* Models
  * MLPDuelingModel
  * CNNModelWithMaxPooling
* Primitives
  * QFunction2 (base class, without parameterized)
  * DiscreteCNNQFunction
* Wrappers
  * AtariEnvWrapper (needed when using env wrappers from baselines)
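The MLPDuelingModel listed above splits the network into a state-value stream and an advantage stream. The standard dueling aggregation (from the dueling DQN architecture) recombines them by subtracting the mean advantage; here is a sketch of that arithmetic in plain Python, illustrating the idea rather than MLPDuelingModel's exact implementation:

```python
# Sketch of dueling aggregation: Q(s, a) = V(s) + A(s, a) - mean_a' A(s, a').
# Subtracting the mean advantage makes the V/A decomposition identifiable.
def dueling_q_values(state_value, advantages):
    mean_adv = sum(advantages) / len(advantages)
    return [state_value + a - mean_adv for a in advantages]


q = dueling_q_values(state_value=1.0, advantages=[0.5, -0.5])
print(q)  # → [1.5, 0.5]
```

Note that the mean of the resulting Q-values equals V(s), which is exactly what the mean-subtraction is designed to enforce.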

The eviction policy of the replay buffer used to be random. To
make experiments deterministic, it is changed to
First In First Out (FIFO). This was proven to be necessary in
order to achieve better results in complex environments for DQN.
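FIFO eviction as described above can be sketched with a bounded deque; this is a minimal illustration of the eviction policy, not garage's SimpleReplayBuffer API:

```python
from collections import deque


# Minimal sketch of FIFO eviction: once the buffer is full, the OLDEST
# transition is dropped first, unlike random eviction, so runs are repeatable.
class FIFOReplayBuffer:
    def __init__(self, size_in_transitions):
        # deque with maxlen automatically evicts from the left (oldest) end.
        self._transitions = deque(maxlen=size_in_transitions)

    def add_transition(self, transition):
        self._transitions.append(transition)

    def __len__(self):
        return len(self._transitions)


buf = FIFOReplayBuffer(size_in_transitions=3)
for t in range(5):
    buf.add_transition(t)
print(list(buf._transitions))  # → [2, 3, 4]
```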

Added corresponding properties in QFunction to make reference
easier, e.g. q_vals.

Added clone method in QFunction to enable copying configuration,
not including the parameters. Since all the necessary operations
will be done in object construction, we can simply create a new
object with a different name (because we want the new object
to have a different name for variable scope).
