Dev #43

Merged
mkao006 merged 371 commits into master from dev
Oct 5, 2017

Conversation

@mkao006 mkao006 commented Oct 4, 2017

Contributor

No description provided.

albertomun and others added 30 commits April 26, 2017 19:49
This reverts commit 5311433, reversing
changes made to 2190732.
Updated README for both articles and twitter content.
We still need to integrate the sentiment data
A whole-corpus analysis is performed, producing two new outputs: a file containing the sentiment extracted for every sentence, and a file containing positive, neutral, negative and compound sentiment scores for each article.
Main file update for the whole corpus analysis
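
The per-article output described above could be derived from the sentence-level output roughly as follows. This is a minimal sketch, not the code in this pull request: the record layout (`article_id`, `pos`, `neu`, `neg`, `compound`), the function name and the use of a plain mean are all assumptions.

```python
from collections import defaultdict
from statistics import mean

# Assumed field names; the actual output files may use different keys.
SCORE_KEYS = ("pos", "neu", "neg", "compound")


def aggregate_article_sentiment(sentence_scores):
    """Average sentence-level sentiment scores into one record per article.

    `sentence_scores` is an iterable of dicts, each holding an `article_id`
    and one float per key in SCORE_KEYS.
    """
    grouped = defaultdict(list)
    for record in sentence_scores:
        grouped[record["article_id"]].append(record)

    # One dict of mean scores per article.
    return {
        article_id: {key: mean(r[key] for r in records) for key in SCORE_KEYS}
        for article_id, records in grouped.items()
    }
```

A mean is only one possible aggregation; a length-weighted average or a score computed on the full article text would give different numbers.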
mkao006 and others added 26 commits September 5, 2017 17:22
Logging is the main reason the scraper fails when it is executed by Airflow. The
titles are non-unicode objects, and attempting to write them to the log file raises an
exception. This does not happen in standalone execution because the Scrapy logs
have been disabled in the settings file.
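
An alternative to disabling the log is to normalise each title to text before it is logged. The sketch below is written for Python 3 and is only an illustration: `to_text` and `log_title` are hypothetical helpers, not functions from the scraper, and the UTF-8 encoding is an assumption.

```python
import logging

logger = logging.getLogger(__name__)


def to_text(value, encoding="utf-8"):
    """Return `value` as str, decoding bytes without raising on bad input."""
    if isinstance(value, bytes):
        # errors="replace" swaps undecodable bytes for U+FFFD instead of raising.
        return value.decode(encoding, errors="replace")
    return str(value)


def log_title(title):
    """Log an article title safely, whatever type the scraper produced."""
    text = to_text(title)
    logger.info("Scraped article: %s", text)
    return text
```

With this approach a malformed title degrades to a replacement character in the log rather than terminating the task.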
The commodity tagging has not been functioning since the last integration. After
a thorough investigation, it was found that this is due to a change in the input format.
Previously, each article was a list of tokens, but it is now a single
string. We split the string into tokens to resolve this issue.
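
The failure mode is easy to reproduce: iterating over a Python string yields single characters, so no keyword ever matches. A minimal sketch of a tagger that accepts both input formats is shown below; the function name, the keyword mapping and the punctuation stripping are assumptions made for illustration, not the project's actual tagging code.

```python
def tag_commodities(article, commodity_keywords):
    """Return the sorted commodities whose keywords occur in the article.

    `article` may be a single string (the new input format) or a list of
    tokens (the old one). `commodity_keywords` maps a commodity name to a
    set of lowercase keywords.
    """
    # Split a plain string on whitespace; pass a token list through unchanged.
    tokens = article.split() if isinstance(article, str) else list(article)
    # Lowercase and strip surrounding punctuation so "corn." matches "corn".
    token_set = {token.lower().strip(".,;:!?\"'()") for token in tokens}
    return sorted(
        commodity
        for commodity, keywords in commodity_keywords.items()
        if token_set & keywords
    )
```

For example, `tag_commodities("Wheat prices rose, corn fell.", {"wheat": {"wheat"}, "maize": {"maize", "corn"}})` returns `["maize", "wheat"]`.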
…nment warning and avoid modification to the raw data
Changes are:

* Removed the hard-coded JSON file path name; it is now set by the environment variable `SCRAPER_FILE_PREFIX`.
* Removed stop-word handling from `SanitizeArticlePipeline`. Stop words are now processed by the article processing
  module. Further, this speeds up the scraper significantly.
* Commented out logging in `AmisJsonPipeline`. This is the source of the unicode problem
  which resulted in the termination of Airflow. When executed in standalone mode, with the
  log turned off, the error is not triggered. However, when a log is created in Airflow,
  the error is triggered and terminates the process.
* Split the processor into processor and controller
* Execute the processor directly in the scraper directory, which avoids copying files.
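
The first change in the list above, reading the output location from `SCRAPER_FILE_PREFIX`, can be sketched as follows. Only the variable name comes from the commit message; the helper name, the fallback prefix and the date-stamped file name are assumptions.

```python
import os
from datetime import date


def scraper_output_path(day=None, default_prefix="articles"):
    """Build the scraper's JSON output path from SCRAPER_FILE_PREFIX."""
    # Fall back to a hypothetical default when the variable is unset.
    prefix = os.environ.get("SCRAPER_FILE_PREFIX", default_prefix)
    day = day or date.today()
    # Assumed naming scheme: <prefix>_<YYYY-MM-DD>.json
    return f"{prefix}_{day.isoformat()}.json"
```

Reading the variable at call time rather than at import time lets a scheduler such as Airflow set it per task.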
Without this change in the code, the entire pipeline stops at the topic_modelling step.
I do not know whether this change causes any problems on Linux.
Add ch domain for noggers.
The blog is no longer updated, so there is little value in extracting information from this source.
Further, a comparison of the predictions shows almost no change.
@mkao006 mkao006 requested review from marcosmilzo and mrpozzi October 4, 2017 18:12

@mrpozzi mrpozzi left a comment

Contributor

LGTM

@mkao006 mkao006 merged commit 83584bd into master Oct 5, 2017

4 participants