
Refactor *MLPPolicy with model #574

Merged
ahtsan merged 10 commits into master from mlp_policy_with_model
Mar 26, 2019

Conversation

@ahtsan
Contributor

@ahtsan ahtsan commented Mar 7, 2019

Custom pickle logic is added: basically, remove all unpicklable objects in __getstate__ and reconstruct them in __setstate__.

Also, the default model (model.networks['default']) will be built in __setstate__, so the policy can be used right away after unpickling.
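The general pattern being described — drop unpicklable members in `__getstate__`, rebuild them in `__setstate__` — can be sketched with a toy class (names here are illustrative, not garage's actual code; a `threading.Lock` stands in for the unpicklable TF graph/session state):

```python
import pickle
import threading

class PolicyLike:
    """Toy object with an unpicklable member, mirroring the pattern above."""

    def __init__(self):
        self._lock = threading.Lock()  # unpicklable, like TF graph/session state
        self.weights = [1.0, 2.0]      # plain data, pickles fine

    def __getstate__(self):
        # Drop unpicklable members before pickling.
        state = self.__dict__.copy()
        del state['_lock']
        return state

    def __setstate__(self, state):
        # Restore the picklable state, then rebuild what was dropped,
        # so the object is usable immediately after unpickling.
        self.__dict__.update(state)
        self._lock = threading.Lock()

obj = pickle.loads(pickle.dumps(PolicyLike()))
assert obj.weights == [1.0, 2.0]
```

In the real policy, the analogue of rebuilding `_lock` is rebuilding `model.networks['default']` inside `__setstate__`.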

@ahtsan ahtsan requested a review from a team as a code owner March 7, 2019 21:46
Comment thread garage/tf/policies/base2.py Outdated
Comment thread garage/tf/policies/base2.py Outdated
@codecov

codecov Bot commented Mar 7, 2019

Codecov Report

Merging #574 into master will increase coverage by 0.38%.
The diff coverage is 88.88%.

Impacted file tree graph

@@            Coverage Diff            @@
##           master    #574      +/-   ##
=========================================
+ Coverage   60.51%   60.9%   +0.38%     
=========================================
  Files         141     142       +1     
  Lines        8833    8857      +24     
  Branches     1251    1239      -12     
=========================================
+ Hits         5345    5394      +49     
+ Misses       3155    3130      -25     
  Partials      333     333
Impacted Files Coverage Δ
garage/tf/core/mlp.py 100% <ø> (ø) ⬆️
garage/tf/models/__init__.py 100% <100%> (ø)
garage/tf/policies/deterministic_mlp_policy.py 80% <100%> (ø) ⬆️
garage/tf/models/mlp_model.py 100% <100%> (ø)
garage/tf/policies/__init__.py 100% <100%> (ø) ⬆️
garage/tf/q_functions/discrete_mlp_q_function.py 42.1% <27.27%> (-7.9%) ⬇️
garage/tf/policies/base2.py 62.79% <70%> (-10.74%) ⬇️
garage/tf/models/gaussian_mlp_model.py 96.2% <89.28%> (+20.2%) ⬆️
...tf/policies/deterministic_mlp_policy_with_model.py 92.85% <92.85%> (ø)
...e/tf/policies/categorical_mlp_policy_with_model.py 95.91% <95.91%> (ø)
... and 14 more

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e2d4f4f...8d1b53e. Read the comment docs.

Comment thread garage/tf/policies/continuous_mlp_policy_with_model.py Outdated
Comment thread garage/tf/policies/deterministic_mlp_policy_with_model.py Outdated
Comment thread garage/tf/policies/deterministic_mlp_policy_with_model.py Outdated
Comment thread garage/tf/policies/deterministic_mlp_policy_with_model.py Outdated
Comment thread tests/garage/tf/policies/test_categorical_mlp_policy_with_model.py Outdated
Comment thread tests/garage/tf/policies/test_deterministic_mlp_policy_with_model.py Outdated
Comment thread garage/tf/policies/continuous_mlp_policy_with_model.py Outdated
Comment thread garage/tf/policies/deterministic_mlp_policy_with_model.py Outdated
Comment thread garage/tf/policies/base2.py Outdated
Comment thread garage/tf/policies/base2.py Outdated
@CatherineSue
Member

We should probably test the refactored policies with the benchmark. Doing so now would save us from further debugging of the benchmark tests later.

@ryanjulian
Member

@CatherineSue I agree.

But I think if we can figure out good unit tests for policies, we won't worry so much about messing them up when we refactor.

@ryanjulian
Member

For instance, maybe we can define some unit tests which run basic supervised learning using each model and make sure it produces known-good results.

If GaussianMLPModel can properly fit a (state, action) dataset, then it seems unlikely that the optimization of GaussianMLPPolicy would get messed up.
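The proposed test could look roughly like this — fit a model on a known (state, action) dataset and assert it reaches a known-good fit. A plain-numpy linear regressor stands in for GaussianMLPModel here purely for illustration; the real test would train the TF model:

```python
import numpy as np

# Known dataset: actions are a fixed linear function of states.
rng = np.random.RandomState(0)
states = rng.uniform(-1, 1, size=(256, 3))
true_w = np.array([0.5, -1.0, 2.0])
actions = states @ true_w

# "Supervised learning" step: gradient descent on mean-squared error.
w = np.zeros(3)
for _ in range(500):
    grad = states.T @ (states @ w - actions) / len(states)
    w -= 0.5 * grad

# Known-good result: the model recovers the generating function.
mse = np.mean((states @ w - actions) ** 2)
assert mse < 1e-6
```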

Member

@CatherineSue CatherineSue left a comment


I agree we could come up with a good unit test to make sure the policies work well.

Comment thread garage/tf/models/continuous_mlp_model.py Outdated
Comment thread garage/tf/policies/categorical_mlp_policy_with_model.py Outdated
Comment thread garage/tf/policies/continuous_mlp_policy_with_model.py Outdated
Comment thread garage/tf/policies/continuous_mlp_policy_with_model.py Outdated
Comment thread garage/tf/policies/deterministic_mlp_policy_with_model.py
@ahtsan ahtsan force-pushed the mlp_policy_with_model branch 2 times, most recently from 80417c0 to 9c6b694 Compare March 13, 2019 06:38
Comment thread garage/experiment/local_tf_runner.py Outdated
Comment thread garage/tf/policies/base2.py Outdated
Comment thread tests/garage/tf/policies/test_gaussian_mlp_policy_with_model.py Outdated
Comment thread tests/garage/tf/policies/test_gaussian_mlp_policy_with_model.py Outdated
Comment thread tests/garage/tf/policies/test_gaussian_mlp_policy_with_model.py Outdated
Comment thread tests/garage/tf/policies/test_gaussian_mlp_policy_with_model.py Outdated
Comment thread garage/tf/policies/base2.py Outdated
Member

@CatherineSue CatherineSue left a comment


Are we not going to add the unit tests that assure the policies with models work?

@ryanjulian
Member

@CatherineSue it's your review, so you get to decide what the criteria are.

I think at this stage a couple simple unit tests would be reasonable:

  1. Initialize the network weights to a known state
  2. Known input/output pair
    Make sure that a known input produces the proper output
  3. Empirical mean/variance
    Make sure that a known input produces the expected empirical mean/variance (by sampling n times and calculating mean/variance)

I think a test where we attempt to train each model on a known dataset is a gold standard, but perhaps is too much for this PR.
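The three simple tests above can be sketched with a stand-in linear-Gaussian "policy" (the class and names here are illustrative, not garage's API — the real tests would exercise the TF policies):

```python
import numpy as np

class TinyGaussianPolicy:
    """Stand-in policy: Gaussian with a linear mean and fixed std."""

    def __init__(self, w, std):
        self.w = w        # (1) weights initialized to a known state
        self.std = std

    def mean(self, obs):
        return obs @ self.w

    def sample(self, obs, rng):
        return rng.normal(self.mean(obs), self.std)

policy = TinyGaussianPolicy(w=np.array([2.0, -1.0]), std=0.1)
obs = np.array([1.0, 3.0])

# (2) known input/output pair: 1*2.0 + 3*(-1.0) = -1.0
assert policy.mean(obs) == -1.0

# (3) empirical mean/variance: sample n times and compare to the
# known distribution parameters within a tolerance.
rng = np.random.RandomState(0)
samples = np.array([policy.sample(obs, rng) for _ in range(10000)])
assert abs(samples.mean() - (-1.0)) < 0.01
assert abs(samples.std() - 0.1) < 0.01
```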

@ryanjulian
Member

It does seem like the new version of this PR misses coverage on several significant branches:
https://codecov.io/gh/rlworkgroup/garage/compare/c4b49939df36487c49282ddb6bd9af647ea05b87...79f4770dcfd92294f026a12d16125e9e9b6dc601/diff

Comment thread garage/tf/models/gaussian_mlp_model.py
@ahtsan
Contributor Author

ahtsan commented Mar 15, 2019

My thought about adding unit tests is: if we have already tested MLPModel, then for any policy that uses MLPModel, we only need to test the policy-level functions, like get_action_sym or dist_info_sym, and make sure their output matches the model's output. The reason is that we already know MLPModel works properly.
To make sure policy.get_action works properly, we can directly compare

out = self.sess.run(policy.model.output, feed_dict={policy.model.input: [obs]})
policy.get_action(obs) == out

But then, it's not that necessary, since get_action is essentially the same as self.sess.run(policy.model.output, feed_dict={policy.model.input: [obs]}) in the code.


Then, since we have already tested mlp, and MLPModel basically wraps a single mlp, we already know it is working properly.

@ryanjulian
Member

I agree. You can implement this by mocking out the model.
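Mocking the model out lets a test cover only the policy-level glue. A minimal sketch, assuming a hypothetical `Policy` whose `get_action` batches the observation and unbatches the result (not garage's actual classes):

```python
from unittest import mock

import numpy as np

class Policy:
    """Illustrative policy whose only job is batching glue around a model."""

    def __init__(self, model):
        self.model = model

    def get_action(self, obs):
        # Policy-level logic under test: wrap obs in a batch of one,
        # then strip the batch dimension from the model output.
        return self.model.predict(np.asarray([obs]))[0]

# Replace the (already-tested) model with a mock returning a fixed batch.
fake_model = mock.Mock()
fake_model.predict.return_value = np.array([[0.5, -0.5]])

policy = Policy(fake_model)
action = policy.get_action(np.zeros(3))

fake_model.predict.assert_called_once()
assert np.array_equal(action, np.array([0.5, -0.5]))
```

The test then verifies only that the policy forwards inputs and unpacks outputs correctly, without re-testing the model itself.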

Comment thread tests/garage/tf/models/test_gaussian_mlp_model.py Outdated
Comment thread tests/garage/tf/models/test_gaussian_mlp_model.py Outdated
from tests.fixtures.envs.dummy import DummyDiscreteEnv


class SimpleMLPModel(Model):
Member


please put these fixtures into a test library

def test_output_values(self):
model = MLPModel(
output_dim=self.output_dim,
hidden_sizes=(2, ),
Member


what about hidden_size = 1?

in most software, bugs center around the values 0, 1, 2, and 3+. it's good to test all of those cases (or whichever ones are valid for your test)

Comment thread tests/garage/tf/policies/test_categorical_mlp_policy_with_model.py Outdated

def test_get_action(self):
action, _ = self.policy.get_action(self.obs)
assert self.env.action_space.contains(np.asarray([action]))
Member


why is np.asarray required here? why is get_action returning something that needs to be converted before being used with a space?

Contributor Author


I think it's actually an issue.
e.g. in akro.box, the contains function is as follows:

    def contains(self, x):
        """Return boolean specifying if x is a valid member of this space."""
        return x.shape == self.shape and (x >= self.low).all() and (
            x <= self.high).all()

It assumes the input is a numpy array.
Because the action returned by policy.get_action can be a scalar value, it won't work.
Should we force the output of policy.get_action to be a numpy array?
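The failure mode can be reproduced without akro — a toy sketch where `contains` mimics the box check quoted above (the bounds here are made up for illustration):

```python
import numpy as np

# Toy box space [-1, 1] with shape (1,), mimicking akro.box.contains.
low, high, shape = np.array([-1.0]), np.array([1.0]), (1,)

def contains(x):
    return x.shape == shape and (x >= low).all() and (x <= high).all()

scalar_action = 0.3  # what get_action may return for a 1-D action space
try:
    contains(scalar_action)
    raised = False
except AttributeError:
    raised = True  # a plain float has no .shape attribute
assert raised

# Normalizing the action to an ndarray of shape (1,) makes the check pass.
assert contains(np.asarray([scalar_action]))
```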

Member


it looks like you found an n = 1 vs n > 1 bug. good job!

i think that the output of get_action should always pass the test

a = foo.get_action(bar)
assert a in foo.action_space

bs = foo.get_actions(bars)
for b in bs:
    assert b in foo.action_space

the question is -- is this a garage bug or an akro bug?

i think it is a garage bug. the output type of a policy should be fixed -- it shouldn't be "a scalar sometimes and an np.ndarray other times." if the value is single-dimensional, i think it should output an ndarray with shape (1,).

of course, i could be persuaded otherwise if this made code elsewhere really nasty.

Comment thread tests/garage/tf/policies/test_gaussian_mlp_policy_with_model.py Outdated
class TestGaussianMLPPolicyWithModel(TfGraphTestCase):
def setUp(self):
super().setUp()
self.env = TfEnv(DummyBoxEnv())
Member


please test several output_dim

@ryanjulian
Member

These tests are looking great

Please see my comments about

  • Moving your test fixtures into libraries in tests/
  • Testing a variety of relevant values, especially for output_dim and hidden_sizes

Other things I noticed:

Comment thread garage/sampler/parallel_sampler.py Outdated

def _worker_set_seed(_, seed):
logger.log("Setting seed to %d" % seed)
# import here to avoid circular dependency
Member


uhh can we restructure this to avoid a circular dependency instead?

Member


i think @zhanpenghe fixed this?

Comment thread garage/tf/models/gaussian_mlp_model.py Outdated
p = tf.constant_initializer(self._init_std_param)
std_network = parameter(

def p():
Member


just use a lambda

@ahtsan ahtsan force-pushed the mlp_policy_with_model branch 2 times, most recently from 3fbe49d to 4f0930b Compare March 22, 2019 20:56
Member

@ryanjulian ryanjulian left a comment


Looks good.

What about ContinuousMLPQFunction?

Comment thread setup.cfg
not-context-manager,
c-extension-no-member
c-extension-no-member,
wrong-import-order
Member


i would prefer that we configure pylint properly rather than disable this. which package did it complain about?

-Observations are
after reset : np.zeros(self._shape).
action 1 (FIRE): np.ones(self._shape).
after reset : np.ones(self._shape).
Member


did this sneak in from a different PR?

Contributor Author


I changed it because we don't want the output to be all zeros.


Args:
env_spec: Environment specification.
name: variable scope of the mlp.
Member


Variable scope

@@ -88,6 +95,14 @@ def terminate(self):
"""Clean up operation."""
pass
Member


What's this function for? Why pass?

Contributor Author


I copied it from the existing policy base class in garage.tf.policies.base. If it is not used anywhere, we can remove it.

@ryanjulian
Member

Feel free to merge this and we will deal with pylint later

ahtsan added 10 commits March 26, 2019 11:19
Removed the test for GaussianMLPPolicyWithModel2, which uses
tfp.distributions; those should be introduced in a later part
of the refactoring. The interface get_regularizable_vars is
also removed for the same reason.

Pylint conflicts with flake8 regarding import-order checking,
so the pylint check is disabled and flake8 is relied on instead.