…ersion The old scon version is not available on PyPI.
Both noun removal and stemming reduce the dimension and detail of the data. We want to see if the current model can handle the complexity.
…inear interpolation
… of price for merge
* Removed redundant import
* Replaced inconsistent quotation marks
* Removed stemming and all text processing
* Changed topic name separator from " " to "_"
…nged normalisation method
Logging is the main reason why the scraper fails when it is executed by Airflow. The titles are non-Unicode objects, and attempting to write them to the log file raises an exception. This is not the case for standalone execution, as Scrapy logging has been disabled in the settings file.
The commodity tagging has not been functioning since the last integration. After a thorough investigation, it was found that this is due to a change in the input. Previously, each article was a list of tokens, but it is now a single string. We split the string into tokens to resolve this issue.
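A minimal sketch of the tokenisation fix described above, assuming Python 3 and plain whitespace splitting. The function names and the commodity list are hypothetical and are not taken from the repository.

```python
def ensure_tokens(article):
    """Return the article as a list of tokens.

    Accepts either the old input format (a list of tokens) or the
    new one (a single string), so the tagger works with both.
    """
    if isinstance(article, str):
        # Assumption: whitespace splitting is enough; punctuation is not handled.
        return article.split()
    return list(article)


def tag_commodities(article, commodities):
    """Return the sorted commodity names that occur as tokens in the article."""
    tokens = {token.lower() for token in ensure_tokens(article)}
    return sorted(name for name in commodities if name in tokens)


if __name__ == "__main__":
    known = ["wheat", "maize", "rice"]  # hypothetical commodity list
    print(tag_commodities("Wheat prices rise as maize falls", known))
    print(tag_commodities(["rice", "exports", "slow"], known))
```

Normalising the input at the start of the tagger keeps the rest of the tagging logic unchanged, whichever format the upstream step produces.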
…nment warning and avoid modification to the raw data
Changes are:
* Removed the hard-coded JSON file path name; it is now set by the environment variable `SCRAPER_FILE_PREFIX`.
* Removed stop-word removal from `SanitizeArticlePipeline`. Stop words are now processed by the article processing module. Further, this speeds up the scraper significantly.
* Commented out logging in `AmisJsonPipeline`. This is the source of the Unicode problem that terminated the Airflow run. When executed in standalone mode with the log turned off, the error is not triggered. However, when the log is created under Airflow, the error is triggered and terminates the process.
* Split the processor into processor and controller.
* Execute the processor directly in the scraper directory; this avoids copying files.
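Two of the changes above can be sketched as follows. This is an illustration under stated assumptions, not the code in this PR: only the variable name `SCRAPER_FILE_PREFIX` comes from the description, while the helper names, the default prefix, and the file naming scheme are hypothetical. The second helper shows one possible way to keep the log line without tripping over non-ASCII titles, as an alternative to commenting the logging out.

```python
import os


def output_path(spider_name, default_prefix="amis_articles"):
    """Build the JSON output path from the SCRAPER_FILE_PREFIX variable.

    Assumption: the prefix is joined to the spider name with an
    underscore; the naming scheme in the real pipeline may differ.
    """
    prefix = os.environ.get("SCRAPER_FILE_PREFIX", default_prefix)
    return "{0}_{1}.jsonl".format(prefix, spider_name)


def loggable(title):
    """Return an ASCII-only form of the title that is safe to log.

    Non-ASCII characters are replaced by backslash escapes, so a log
    handler with an ASCII encoding cannot raise an encoding error.
    """
    return title.encode("ascii", "backslashreplace").decode("ascii")


if __name__ == "__main__":
    print(output_path("demo"))
    print("Written Item: " + loggable("Caf\u00e9 prices"))
```

Escaping the title at the call site leaves the JSON output untouched, since only the string passed to the logger is changed.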
Restructure data processing
| raw_date = response.url.split("/")[-2] | ||
| item['link'] = response.url.replace( | ||
| 'http://', '').replace('https://', '') | ||
| raw_date = response.url.split('/')[-2] |
Gosh man!! You really formatted even the quotes properly everywhere!!!!
My editor has very little tolerance for inconsistencies.... :P
marcosmilzo left a comment
Great job man!!
I will now try to run Airflow to test it, and I will approve it as soon as I get it running.
Thanks!
| id_col='id'): | ||
| original_id = topic[id_col] | ||
| scored_topic = topic.drop(id_col, axis=1).apply(lambda x: x * sentiment) | ||
| # scored_topic = topic.drop(id_col, axis=1).apply(lambda x: x * sentiment) |
Why is it now split between positive and negative?
This is because the compound sentiment is 80%+ positive, so I decided to model the positive and negative sentiment separately. It also makes more sense, as the responses from the market are generally asymmetric.
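A rough sketch of the idea in plain Python, assuming each article has per-topic weights plus separate positive and negative sentiment scores. The function name, the dictionary layout, and the `_positive`/`_negative` suffixes are hypothetical; the code under review works on data frames instead.

```python
def score_topics(topic_weights, positive, negative):
    """Weight each topic by positive and negative sentiment separately.

    Every topic yields two features rather than one compound-scored
    feature, so a model can fit different responses to good and bad news.
    """
    scored = {}
    for name, weight in topic_weights.items():
        scored[name + "_positive"] = weight * positive
        scored[name + "_negative"] = weight * negative
    return scored


if __name__ == "__main__":
    article_topics = {"wheat": 0.5, "trade_policy": 0.25}  # hypothetical weights
    print(score_topics(article_topics, positive=0.5, negative=0.25))
```

Keeping the two scores apart avoids a feature dominated by the positive side of the compound score, at the cost of doubling the number of topic columns.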
| ## tagged and thus they are identical | ||
| ## variables. This results in unreliable | ||
| ## prediction. | ||
| subset(., select = -c(grep("contain", colnames(.)))) %>% |
if you want to keep in dplyr commands:
select(-contains("contain")) %>%
ah true! Will change that.
Without this change in the code, the entire pipeline stops at the topic_modelling step. I do not know whether this change causes any problems on Linux.
Add ch domain for noggers.
The blog is no longer updated, so extracting information from this source provides little value. Further, a comparison of the predictions shows almost no change.
…eaning of the topics is much clearer
| class SanitizeArticlePipeline(object): | ||
| def __init__(self): |
Are we dealing with stop words in the proper text analysis? There is a lot of garbage we could easily pre-filter...
| line = json.dumps(item_dict, ensure_ascii=False) + '\n' | ||
| self.datafiles[spider.name].write(line) | ||
| spider.logger.info("Written Item: " + item['title']) | ||
| # spider.logger.info('Written Item: ' + item['title']) |
These lines could be dropped instead of commented out.
| # Crawl responsibly by identifying yourself (and your website) on the user-agent | ||
| #USER_AGENT = 'amisSpider (+http://www.yourdomain.com)' | ||
| # Crawl responsibly by identifying yourself (and your website) on the |
we could remove this patronizing bullshit.
It's a leftover of the autogenerated code...
| os.makedirs(log_dir) | ||
| process = CrawlerProcess(get_project_settings()) | ||
| import controller as ctr |
This is awesome. Thanks for cleaning up.
… Airflow. … `requirements.txt` from scratch. This will be the basis of the final model of V1.