Mk v1.1 by mkao006 · Pull Request #44 · EST-Team-Adam/TheReadingMachine

mkao006 · 2017-10-27T18:36:48Z

No description provided.

…thon There were minor differences between the implementations. The difference was actually due to the default setting for normalisation of the response: it defaults to True in R but to False in Python. Normalisation usually improves the fit and prediction, which is also supported by examining the predictions. Thus, we have adjusted both to standardise the response.
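As a rough illustration of what standardising the response means here, the sketch below rescales a response vector to zero mean and unit variance and maps values back to the original units. The helper names and the sample values are hypothetical and are not taken from the PR.

```python
from statistics import mean, pstdev


def standardise(values):
    """Return (z_scores, centre, scale) for a sequence of numbers."""
    centre = mean(values)
    scale = pstdev(values)  # population standard deviation
    if scale == 0:
        raise ValueError("cannot standardise a constant response")
    return [(v - centre) / scale for v in values], centre, scale


def unstandardise(z_scores, centre, scale):
    """Map standardised values back to the original units."""
    return [z * scale + centre for z in z_scores]


# Hypothetical response values, for illustration only.
response = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
z, centre, scale = standardise(response)
restored = unstandardise(z, centre, scale)
```

Predictions made on the standardised scale need the same centre and scale applied in reverse before they are compared with the raw response.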

Note: the RNN implementation is very crude and requires further fine-tuning.

1. We extended the input data by the timestep size, so that the prediction starts on the same day as the other models. 2. The forecast input is padded so that the forecast gives the right length of output.
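A minimal sketch of the alignment issue, assuming each prediction consumes the `timestep` preceding observations. The function names, the day indices and the padding scheme (repeating the first row) are assumptions made for illustration; the PR does not show how the padding is done.

```python
def windows_with_targets(series, timestep):
    """Pair each run of `timestep` consecutive values with the index it predicts."""
    return [(series[i - timestep:i], i) for i in range(timestep, len(series))]


def pad_front(series, n_pad):
    """Prepend copies of the first value (an assumed padding scheme)."""
    return [series[0]] * n_pad + list(series)


timestep = 3
days = list(range(10))  # hypothetical day indices 0..9
model_start = 4         # day on which the other models start predicting

# Without extension, the first prediction lands `timestep` days late.
unextended = days[model_start:]
late = windows_with_targets(unextended, timestep)
first_late_day = unextended[late[0][1]]

# Extending the input back by `timestep` days aligns the first prediction.
extended = days[model_start - timestep:]
aligned = windows_with_targets(extended, timestep)
first_aligned_day = extended[aligned[0][1]]

# Padding the forecast input by `timestep` rows yields one output per forecast day.
forecast_period = 5
forecast_input = pad_front(list(range(100, 100 + forecast_period)), timestep)
n_outputs = len(windows_with_targets(forecast_input, timestep))
```

Under these assumptions, the unextended input first predicts day 7, the extended input first predicts day 4, and the padded forecast input produces exactly `forecast_period` outputs.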

marcosmilzo

Great job, man! Learning a lot from you

marcosmilzo · 2017-10-28T10:16:22Z

-from statsmodels.nonparametric.smoothers_lowess import lowess
+import pandas as pd
+
+forecast_period = 90


You have probably told me already, but have you tested different lags?
Should we run a proper test after the meeting to tune this?

I have, but the results are inconclusive. I guess the question is what mix of forecast horizon and level of uncertainty is appropriate for the purpose of the prediction.

But this is something that we do need to consider and adjust later.

marcosmilzo · 2017-10-28T10:17:39Z

-                             response_variable=response_variable))
+elasticnet_prediction = mbe.output()
+lstm_prediction = mlks.output()
+all_prediction = pd.concat([elasticnet_prediction, lstm_prediction])


Always clean code! Neat job!!

marcosmilzo · 2017-10-28T10:24:57Z

+def train_bag_elasticnet(complete_data, forecast_period, holdout_period,
+                         bootstrapIteration, topic_variables,
+                         response_variable, date_col='date'):
+    '''Function to bag and train the Elastic net iwth cross-validation


Just a small typo: "iwth".

marcosmilzo · 2017-10-28T10:25:31Z

+    total_variable_count = len(topic_variables)
+    predictions = np.zeros(shape=(observation_length, bootstrapIteration))
+    cv_min = np.zeros(bootstrapIteration)
+    # This is the of the implementation in R due to the specification of Python


I guess @marcogarieri's question was about the comment; please at least elaborate on it.

My comment was poorly phrased and hastily written. Basically, the exponential distribution is parameterised differently in R and in Python: by default, R uses the rate while Python uses the scale. Thus, the reciprocal of the value used in R is passed in Python so that the two implementations match.

Yeah, we were both perplexed by the comment.
That makes total sense. Maybe copy what you just wrote here into the code comment?
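For reference, the rate/scale relationship discussed in this thread can be sketched with the standard library alone. R's `rexp` takes a `rate`, whereas `numpy.random.exponential` takes a `scale`, and scale = 1 / rate. The function names and the value 4.0 below are hypothetical, and the sampler is a plain inverse-CDF draw rather than the code in this PR.

```python
import math
import random


def exponential_from_rate(rate, rng):
    """Inverse-CDF draw parameterised by rate (the convention of R's rexp)."""
    return -math.log(1.0 - rng.random()) / rate


def exponential_from_scale(scale, rng):
    """Same draw parameterised by scale = 1 / rate (numpy's convention)."""
    return -math.log(1.0 - rng.random()) * scale


r_rate = 4.0                 # hypothetical value used on the R side
python_scale = 1.0 / r_rate  # reciprocal passed on the Python side

# Two generators with the same seed produce the same uniform draws.
rng_a, rng_b = random.Random(42), random.Random(42)
rate_draws = [exponential_from_rate(r_rate, rng_a) for _ in range(1000)]
scale_draws = [exponential_from_scale(python_scale, rng_b) for _ in range(1000)]
sample_mean = sum(scale_draws) / len(scale_draws)
```

Both lists contain the same draws, and the sample mean sits near 1 / rate = 0.25. Passing the R value unchanged as a scale would instead give a mean near 4.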

marcosmilzo · 2017-10-28T10:34:53Z

+topicModelTable = 'TopicModel'
+model_start_date = datetime(2010, 1, 1).date()
+
+# Model parameters


I remember you mentioned that you were working on selecting the best hyperparameters. Did you use a Bayesian optimiser to find them?

Not for the Elasticnet; the model doesn't really have hyperparameters. The filter_coef is the amount of exponential smoothing, but after several rounds of trial and error, the value 1 performs significantly better (at which point it is actually a cumsum). For the bootstrapIteration, you generally don't need a large sample to reduce the variance.
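To make the cumsum remark concrete, here is a small sketch that assumes the smoothing is a first-order recursive filter of the form y[t] = x[t] + coef * y[t-1]. The exact filter used in the PR is not shown, so this form and the name `recursive_filter` are assumptions.

```python
from itertools import accumulate


def recursive_filter(values, coef):
    """Apply y[t] = x[t] + coef * y[t-1], starting from y[-1] = 0."""
    out = []
    previous = 0.0
    for value in values:
        previous = value + coef * previous
        out.append(previous)
    return out


x = [1.0, 2.0, 3.0, 4.0]
damped = recursive_filter(x, 0.5)      # older values are discounted
undamped = recursive_filter(x, 1.0)    # nothing is discounted
running_total = list(accumulate(x))    # plain cumulative sum
```

With a coefficient of 1 nothing is discounted, so the filter output coincides with the cumulative sum of the input.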

The selection of hyperparameters is much more relevant for the RNN, but currently we run through a grid of values and examine the results on Tensorboard.
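A grid run of this kind could be organised roughly as below. The hyperparameter names, their values and the `logs/` directory layout are hypothetical, and the training call itself is omitted; the sketch only enumerates the grid and gives each run its own log directory so that runs can be compared side by side.

```python
from itertools import product

# Hypothetical grid; the actual values used for the RNN are not shown in the PR.
grid = {
    "learning_rate": [0.01, 0.001],
    "hidden_units": [32, 64],
    "timestep": [7, 30],
}


def expand_grid(grid):
    """Return one parameter dict per combination of grid values."""
    keys = sorted(grid)
    combos = product(*(grid[key] for key in keys))
    return [dict(zip(keys, values)) for values in combos]


def run_name(params):
    """Build a readable, unique name for one combination."""
    return "_".join("{}={}".format(key, params[key]) for key in sorted(params))


runs = expand_grid(grid)
log_dirs = ["logs/" + run_name(params) for params in runs]
```

Each entry in `log_dirs` would hold the summaries written for one combination.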

There are methods designed specifically to optimise the hyperparameters of an RNN; please see the learning to learn approach. A Bayesian optimiser is another popular option we can explore.

P.S. This section actually outlines one of the main differences between the ML and DL approaches. For Elasticnet, I had to find some transformation of the data so that the model predicts well; for the RNN, on the other hand, I didn't have to transform the data (other than normalisation), but tuning the model requires significant work.

marcosmilzo · 2017-10-28T10:36:43Z

+    topic_modelled_article = pd.read_sql(
+        'SELECT * FROM {}'.format(topicModelTable), engine)
+    sclae_input = topic_modelled_article.drop('id', axis=1)
+    scaled_topic = pd.DataFrame(preprocessing.scale(sclae_input),


That's more of a question than a comment: wouldn't it be better to use preprocessing.StandardScaler(), so that the same transformation fitted on the training set is applied to the test set?

Not sure what you meant here, but the transformation is applied to the input set before the data is split into training and test sets, so the same transformation is applied to both. They also need to be scaled together, since we are working with time series where the scale should be the same across time.
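The two options discussed in this thread can be contrasted with a small standard-library sketch. The helper names and the toy series are hypothetical; `fit_scaler` and `apply_scaler` only mimic the fit-then-transform pattern of a scaler object and are not the scikit-learn API.

```python
from statistics import mean, pstdev


def fit_scaler(values):
    """Estimate the centre and scale from the given values."""
    return mean(values), pstdev(values)


def apply_scaler(values, params):
    """Apply previously estimated parameters to any set of values."""
    centre, scale = params
    return [(v - centre) / scale for v in values]


series = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]  # toy data
split = 6
train, test = series[:split], series[split:]

# Approach described in the reply: scale the whole series, then split.
scaled_all = apply_scaler(series, fit_scaler(series))
train_a, test_a = scaled_all[:split], scaled_all[split:]

# Approach raised in the question: fit on the training part, reuse on the test part.
train_params = fit_scaler(train)
train_b = apply_scaler(train, train_params)
test_b = apply_scaler(test, train_params)
```

Both approaches apply a single transformation to the training and test data. They differ in whether the test observations contribute to the estimated centre and scale.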

mrpozzi · 2017-10-29T21:31:56Z

LGTM

mrpozzi

Great Job! (as usual..)

mrpozzi · 2017-10-29T21:33:08Z

+        .where(lambda x: x != 'id')
+        .dropna()
+        .tolist())
+    cmoplete_topic_variable = [v + suffix


mrpozzi · 2017-10-29T21:33:53Z

+    total_variable_count = len(topic_variables)
+    predictions = np.zeros(shape=(observation_length, bootstrapIteration))
+    cv_min = np.zeros(bootstrapIteration)
+    # This is the of the implementation in R due to the specification of Python


mrpozzi · 2017-10-29T21:37:14Z

@@ -0,0 +1,505 @@
+import os


This file is huge. Any chance it could be split into subroutines?

The first 120 lines are data processing, which should have been in a separate data processing module.

I have not done so for two reasons: 1. I was in a hurry to get the model working, so I didn't want to work on two separate modules. 2. The whole data processing module may change, since phase two requires a completely different pipeline, and I didn't have a good idea of how I wanted to restructure it.

Eventually, we will definitely restructure the input pipeline for the RNN and remove the chunk.

I figured it was due to practical concerns, but maybe we can track it with a TODO..

mkao006 added 8 commits October 24, 2017 20:55

add RNN and restructured the module

312a61c

Note: the RNN implementation is very crude and requires further fine-tuning.

sandbox prototype R script

f039a1c

fixed time shifting issue,

fda5694

1. We extended the input data by the timestep size, so that the prediction starts on the same day as the other models. 2. The forecast input is padded so that the forecast gives the right length of output.

moved model parameters from controller to the modeller

c6536b1

extended the training data and added prediction plot to tensorboard

024b4a4

update for tensorflow

9abae51

update

4742047

marcosmilzo requested review from marcosmilzo and removed request for marcosmilzo October 28, 2017 10:01

marcosmilzo reviewed Oct 28, 2017


mrpozzi approved these changes Oct 29, 2017


fix issues from review

6fbc293

mrpozzi merged commit 778bfda into master Oct 30, 2017

mkao006 deleted the mk_v1.1 branch November 27, 2017 14:57

3 participants