…thon There were minor differences between the two implementations. The difference was due to the default setting for normalisation of the response: it defaults to True in R but to False in Python. Normalisation usually improves the fit and the prediction, and examining the predictions supports this. We have therefore adjusted both implementations to standardise the response.
Note: the RNN implementation is very crude and requires further fine-tuning.
1. We extended the input data by the timestep size so that the prediction starts on the same day as the other models.
2. The forecast input is padded so that the forecast gives the right length of output.
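The effect of the timestep size on output length can be illustrated with a minimal sketch. The PR does not show how the extension or padding is done, so `make_windows`, `pad_front`, and the choice of repeating the first row are hypothetical and only demonstrate the arithmetic: a window of `timesteps` rows turns `n` input rows into `n - timesteps + 1` outputs, so `timesteps - 1` extra rows are needed to get `n` outputs back.

```python
import numpy as np


def make_windows(series, timesteps):
    """Stack overlapping windows; window i covers rows i .. i + timesteps - 1."""
    series = np.asarray(series)
    n_windows = len(series) - timesteps + 1
    if n_windows < 1:
        raise ValueError("need at least `timesteps` rows")
    return np.stack([series[i:i + timesteps] for i in range(n_windows)])


def pad_front(series, timesteps):
    """Prepend timesteps - 1 copies of the first row (an assumed padding rule)."""
    series = np.asarray(series)
    pad = np.repeat(series[:1], timesteps - 1, axis=0)
    return np.concatenate([pad, series])


data = np.arange(10, dtype=float).reshape(-1, 1)  # 10 days, 1 feature
timesteps = 4

plain = make_windows(data, timesteps)                         # loses the first 3 days
padded = make_windows(pad_front(data, timesteps), timesteps)  # one window per day

print(plain.shape, padded.shape)
```

In practice the extra rows would more likely come from real earlier observations than from repeated values; the padding rule here is only a placeholder.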
marcosmilzo left a comment
Great job, man! Learning a lot from you
| from statsmodels.nonparametric.smoothers_lowess import lowess | ||
| import pandas as pd | ||
|
|
||
| forecast_period = 90 |
You probably already told me, but have you tested different lags?
Should we run a proper test after the meeting to tune this?
I have, but the results are inconclusive. I guess the question is what combination of forecast horizon and level of uncertainty suits the purpose of the prediction.
But this is something that we need to consider and adjust later.
| response_variable=response_variable)) | ||
| elasticnet_prediction = mbe.output() | ||
| lstm_prediction = mlks.output() | ||
| all_prediction = pd.concat([elasticnet_prediction, lstm_prediction]) |
Always clean code! Neat job!!
| def train_bag_elasticnet(complete_data, forecast_period, holdout_period, | ||
| bootstrapIteration, topic_variables, | ||
| response_variable, date_col='date'): | ||
| '''Function to bag and train the Elastic net iwth cross-validation |
Just a small typo: "iwth".
| total_variable_count = len(topic_variables) | ||
| predictions = np.zeros(shape=(observation_length, bootstrapIteration)) | ||
| cv_min = np.zeros(bootstrapIteration) | ||
| # This is the of the implementation in R due to the specification of Python |
I guess @marcogarieri's question was about the comment, but it should at least be elaborated.
My comment was poorly phrased and hastily written. Basically, the exponential distribution is parameterised differently in R and in Python: by default, R uses the rate while Python uses the scale. The reciprocal of the value used in R is therefore passed in Python so that the two implementations match.
Yeah, we were both perplexed by the comment.
Makes total sense. Maybe copy what you just wrote here into the comment?
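For reference, the rate/scale relationship described in this thread can be checked with a short sketch. The value of `rate` is made up for illustration and is not taken from the PR; the only claim is that NumPy's `exponential` takes a scale, which is the reciprocal of the rate that R's `rexp` takes.

```python
import numpy as np

rate = 4.0          # R convention: rexp(n, rate = 4)
scale = 1.0 / rate  # NumPy convention: exponential(scale = 0.25)

rng = np.random.default_rng(0)
draws = rng.exponential(scale=scale, size=100_000)

# The mean of an exponential distribution is 1 / rate, i.e. the scale,
# so the sample mean should land close to 0.25.
sample_mean = draws.mean()
print(round(sample_mean, 3))
```

Something along these lines, condensed to a sentence, could serve as the code comment.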
| topicModelTable = 'TopicModel' | ||
| model_start_date = datetime(2010, 1, 1).date() | ||
|
|
||
| # Model parameters |
I remember you mentioned that you were working on selecting the best hyperparameters. Did you use a Bayesian optimiser to find them?
Not for the Elasticnet; the model doesn't really have hyperparameters. The filter_coef is the amount of exponential smoothing, but after some trial and error, the value 1 performs significantly better (at that value it is actually a cumsum). For bootstrapIteration, you generally don't need a large number of samples to reduce the variance.
The selection of hyperparameters is much more relevant for the RNN, but currently we run through a grid of values and examine the results on TensorBoard.
There are methods designed specifically to optimise the hyperparameters of an RNN. Please see the learning to learn; a Bayesian optimiser is another popular option we can explore.
P.S. This section actually outlines one of the main differences between the ML and DL approaches. For Elasticnet, I had to find a transformation of the data so that the model predicts well. For the RNN, on the other hand, I didn't have to transform the data (other than normalisation), but tuning the model requires significant work.
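To make the "it is actually a cumsum" remark concrete, here is a minimal sketch. The PR does not show how filter_coef is applied, so the recursive form y[t] = x[t] + coef * y[t-1] and the helper name `recursive_filter` are assumptions made for illustration.

```python
import numpy as np


def recursive_filter(x, coef):
    """Assumed form of the smoother: y[t] = x[t] + coef * y[t-1], with y[-1] = 0."""
    y = np.zeros(len(x), dtype=float)
    prev = 0.0
    for t, value in enumerate(x):
        prev = value + coef * prev
        y[t] = prev
    return y


x = np.array([1.0, 2.0, 3.0, 4.0])

smoothed = recursive_filter(x, 0.5)     # past values decay geometrically
accumulated = recursive_filter(x, 1.0)  # no decay: every past value is kept in full

print(smoothed)
print(accumulated, np.cumsum(x))
```

Under this assumed form, a coefficient of 1 removes the decay entirely, which is why the result coincides with a cumulative sum.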
| topic_modelled_article = pd.read_sql( | ||
| 'SELECT * FROM {}'.format(topicModelTable), engine) | ||
| sclae_input = topic_modelled_article.drop('id', axis=1) | ||
| scaled_topic = pd.DataFrame(preprocessing.scale(sclae_input), |
This is more a question than a comment: wouldn't it be better to use preprocessing.StandardScaler(), so that the transformation fitted on the training set is also applied to the test set?
Not sure what you meant here, but the transformation is applied to the input set before the data is split into training and test sets, so the same transformation is applied to both. They also need to be scaled together, since we are working with time series, where the scale should be the same across time.
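The two options discussed in this thread can be compared with a small sketch. It uses plain NumPy rather than scikit-learn so that it stays self-contained; the column-wise zero-mean, unit-variance transform (population standard deviation) is what both `preprocessing.scale` and `StandardScaler` compute. The synthetic trending series, the 150/50 split, and the helper `standardise` are made up for illustration.

```python
import numpy as np


def standardise(values, mean=None, std=None):
    """Column-wise (x - mean) / std; statistics are estimated unless supplied."""
    values = np.asarray(values, dtype=float)
    mean = values.mean(axis=0) if mean is None else mean
    std = values.std(axis=0) if std is None else std
    return (values - mean) / std, mean, std


rng = np.random.default_rng(1)
trend = np.linspace(0.0, 10.0, 200).reshape(-1, 1)
series = trend + rng.normal(scale=0.5, size=(200, 1))
train, test = series[:150], series[150:]

# Option A (as in the PR): scale the whole series, then split.
scaled_all, _, _ = standardise(series)
test_a = scaled_all[150:]

# Option B (the reviewer's suggestion): estimate on train, reuse on test.
_, train_mean, train_std = standardise(train)
test_b, _, _ = standardise(test, train_mean, train_std)

print(test_a.mean(), test_b.mean())
```

Both options apply a single transformation to the training and test periods, so the scale is consistent across time either way. They differ only in whether the test period contributes to the mean and standard deviation.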
LGTM
| .where(lambda x: x != 'id') | ||
| .dropna() | ||
| .tolist()) | ||
| cmoplete_topic_variable = [v + suffix |
| @@ -0,0 +1,505 @@ | |||
| import os | |||
This file is huge. Any chance it could be split into subroutines?
The first 120 lines are data processing, which should have been in a separate data processing module.
I have not done so for two reasons: 1. I was in a hurry to get the model working, so I didn't want to work on two separate modules. 2. The whole data processing module may change, since phase two requires a completely different pipeline, and I didn't have a good idea of how I wanted to restructure it.
Eventually, we will definitely restructure the input pipeline for the RNN and remove the chunk.
I figured it was due to practical concerns, but maybe we can track it with a TODO.