Mk v1.1 #44

Merged
mrpozzi merged 9 commits into master from mk_v1.1 on Oct 30, 2017
Conversation

@mkao006 mkao006 commented Oct 27, 2017

Contributor

No description provided.

…thon

There were minor differences between the two implementations. They were due to the
default setting for normalisation of the response: it defaults to True in R but
False in Python. Normalisation usually improves the fit and prediction, which is
also supported by examining the predictions. Thus, we have adjusted both to
standardise the response.
Note: the RNN implementation is very crude and requires further fine-tuning.
1. We extended the input data by the timestep size, so the prediction starts on the same day as the other models.
2. The forecast input is padded so that the forecast gives the right length of output.
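
The two numbered points can be illustrated with a minimal sketch. It assumes the RNN consumes sliding windows of `timestep` consecutive rows and emits one prediction per window; the helper `make_windows`, the toy data, and the padding value are hypothetical and not the code in this PR. Whether the extension is `timestep` or `timestep - 1` rows depends on how the windows are aligned with their targets.

```python
import numpy as np


def make_windows(series, timestep):
    """Stack sliding windows of length `timestep`; one window per prediction."""
    n_windows = len(series) - timestep + 1
    return np.stack([series[i:i + timestep] for i in range(n_windows)])


timestep = 3
series = np.arange(10.0)  # toy daily series, days 0..9

# Without extension the first window only completes on day timestep - 1,
# so the first prediction is dated later than the first observation.
plain = make_windows(series, timestep)

# Extending the input by timestep - 1 earlier rows (padded here with the
# first value as an assumption; real history would be preferable) yields
# one window, and hence one prediction, per original day.
extended = np.concatenate([np.repeat(series[0], timestep - 1), series])
aligned = make_windows(extended, timestep)

print(plain.shape)    # (8, 3)
print(aligned.shape)  # (10, 3)
```

The same idea applies to the forecast input: padding it by the window length keeps the number of forecast rows equal to the forecast period.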
@marcosmilzo marcosmilzo requested review from marcosmilzo and removed request for marcosmilzo October 28, 2017 10:01

@marcosmilzo marcosmilzo left a comment

Contributor

Great job, man! Learning a lot from you

from statsmodels.nonparametric.smoothers_lowess import lowess
import pandas as pd

forecast_period = 90

Contributor

You probably already told me, but have you tested different lags?
Should we run a proper test after the meeting to tune this?

Contributor Author

I have, but it's inconclusive. I guess the question is how far into the future with what level of uncertainty is a good mix for the purpose of the prediction.

But this is something that we do need to consider and adjust later.

response_variable=response_variable))
elasticnet_prediction = mbe.output()
lstm_prediction = mlks.output()
all_prediction = pd.concat([elasticnet_prediction, lstm_prediction])

Contributor

Always clean code! Neat job!!

def train_bag_elasticnet(complete_data, forecast_period, holdout_period,
bootstrapIteration, topic_variables,
response_variable, date_col='date'):
'''Function to bag and train the Elastic net iwth cross-validation

Contributor

Just a small typo: "iwth".

total_variable_count = len(topic_variables)
predictions = np.zeros(shape=(observation_length, bootstrapIteration))
cv_min = np.zeros(bootstrapIteration)
# This is the of the implementation in R due to the specification of Python

Contributor

??

Contributor

???

Contributor Author

I guess @marcogarieri's question was about the comment, but at least elaborate on it....

My comment was poorly phrased and hastily written. Basically, the exponential distribution is parameterised differently in R and in Python. By default, R uses the rate while Python uses the scale. Thus the reciprocal of the R value is passed to Python so that the two implementations match.
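
A minimal sketch of the rate/scale difference, for reference. R's `rexp(n, rate)` takes a rate, while NumPy's `exponential` takes `scale = 1 / rate`. The value of `rate`, the seed, and the sample size below are hypothetical and not taken from the PR.

```python
import numpy as np

rate = 2.0  # hypothetical value, as it would be passed to R's rexp(n, rate)
rng = np.random.default_rng(0)

# NumPy parameterises the exponential by scale = 1 / rate, so the
# reciprocal is needed to draw from the same distribution as R.
draws = rng.exponential(scale=1.0 / rate, size=100_000)

# The mean of an exponential distribution is 1 / rate, so this should
# be close to 0.5 for rate = 2.
print(draws.mean())
```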

Contributor

yeah, we were both perplexed about the comment.
Makes total sense, maybe copy what you just wrote here in the comment?

topicModelTable = 'TopicModel'
model_start_date = datetime(2010, 1, 1).date()

# Model parameters

Contributor

I remember you mentioned that you were working on selecting the best hyperparameters. Did you use a Bayesian optimiser to find them?

Contributor Author

Not for the Elasticnet; the model doesn't really have hyperparameters. The filter_coef is the amount of exponential smoothing, but after some trial and error, the value 1 performs significantly better (at which point it is actually a cumsum). For the bootstrapIteration, you generally don't need a large sample to reduce the variance.

The selection of hyperparameters is much more relevant for the RNN, but currently we run through a grid of values and examine the results on Tensorboard.

There are methods designed specifically to optimise the hyperparameters of an RNN. Please see "learning to learn"; a Bayesian optimiser is another popular option we can explore.

P.S. This section actually outlines one of the main differences between the ML and DL approaches. For Elasticnet, I had to find some transformation of the data so that the model predicts well. For the RNN, on the other hand, I didn't have to transform the data (other than normalisation), but tuning the model requires significant work.
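
The remark that a `filter_coef` of 1 amounts to a cumulative sum can be checked with a small sketch. It assumes the smoothing is a first-order recursive filter, y[t] = x[t] + filter_coef * y[t-1]; that form and the helper `recursive_filter` are assumptions for illustration, not taken from the PR.

```python
import numpy as np


def recursive_filter(x, filter_coef):
    """First-order recursive filter: y[t] = x[t] + filter_coef * y[t-1]."""
    y = np.zeros(len(x))
    previous = 0.0
    for t, value in enumerate(x):
        previous = value + filter_coef * previous
        y[t] = previous
    return y


x = np.array([1.0, 2.0, 3.0, 4.0])

# A coefficient below 1 discounts older observations geometrically.
smoothed = recursive_filter(x, 0.5)

# A coefficient of exactly 1 keeps every past observation at full weight,
# which is the same as a cumulative sum.
accumulated = recursive_filter(x, 1.0)

print(smoothed)
print(accumulated)
```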

topic_modelled_article = pd.read_sql(
'SELECT * FROM {}'.format(topicModelTable), engine)
sclae_input = topic_modelled_article.drop('id', axis=1)
scaled_topic = pd.DataFrame(preprocessing.scale(sclae_input),

Contributor

That's a question more than a comment. Wouldn't it be better to use preprocessing.StandardScaler(), so that the transformation fitted on the training set is applied to the test set?

Contributor Author

Not sure what you meant here, but the transformation is applied to the input set before the data is split into training and test sets, so the same transformation is applied to both. They also need to be scaled together, since we are working with time series where the scale should be the same across time.
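
The two options in this exchange can be compared with a minimal sketch. Plain NumPy is used to keep it self-contained: `preprocessing.scale` and `StandardScaler` both standardise each column to zero mean and unit variance, and the options differ only in which rows supply the statistics. The toy data and the split point are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(loc=5.0, scale=2.0, size=(100, 3))  # toy topic matrix
split = 80  # hypothetical train/test boundary

# Option described in the reply: standardise the full series, then split.
scaled_all = (data - data.mean(axis=0)) / data.std(axis=0)
train_a, test_a = scaled_all[:split], scaled_all[split:]

# Option raised in the question: learn mean/std on the training rows only
# and reuse those statistics for the test rows.
mean = data[:split].mean(axis=0)
std = data[:split].std(axis=0)
train_b = (data[:split] - mean) / std
test_b = (data[split:] - mean) / std

print(np.abs(test_a - test_b).max())  # how far the two options differ
```

Both options put training and test rows on one common scale. The second avoids using test-period statistics when fitting, while the first matches what the snippet under review does.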

mrpozzi commented Oct 29, 2017

Contributor

LGTM

@mrpozzi mrpozzi left a comment

Contributor

Great Job! (as usual..)

.where(lambda x: x != 'id')
.dropna()
.tolist())
cmoplete_topic_variable = [v + suffix

Contributor

nit. typo

Contributor Author

corrected.

total_variable_count = len(topic_variables)
predictions = np.zeros(shape=(observation_length, bootstrapIteration))
cv_min = np.zeros(bootstrapIteration)
# This is the of the implementation in R due to the specification of Python

Contributor

???

@@ -0,0 +1,505 @@
import os

Contributor

This file is huge... any chance it could be split into subroutines?

Contributor Author

The first 120 lines are data processing, which should have been in a separate data processing module.

I have not done so for two reasons: 1. I was in a hurry to get the model working, so I didn't want to work on two separate modules. 2. The whole data processing module may change, since phase two requires a completely different pipeline, and I didn't have a good idea of how I wanted to restructure it.

Eventually, we will definitely restructure the input pipeline for the RNN and remove the chunk.

Contributor

I figured it was due to practical concerns, but maybe we can track it with a TODO..

@mrpozzi mrpozzi merged commit 778bfda into master Oct 30, 2017
@mkao006 mkao006 deleted the mk_v1.1 branch November 27, 2017 14:57