Restructure data processing #41

Merged: mkao006 merged 31 commits into mk from restructure_data_processing on Sep 14, 2017
@mkao006 mkao006 commented Sep 14, 2017

No description provided.

mkao006 added 30 commits August 26, 2017 13:53
* removed redundant import
* replaced inconsistent quotation marks
* removed stemming and all text processing
* changed topic name separator from " " to "_"
The logging is the main reason the scraper fails when it is executed by Airflow. The
titles are non-Unicode objects, and attempting to write them to the log file raises an
exception. This does not happen in standalone execution because Scrapy logging
has been disabled in the settings file.
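
One way to keep logging enabled without hitting the encoding error is to normalise titles to text before they reach the logger. The following is a minimal sketch under assumptions, not the code in this PR: the logger name and the helpers `to_text` and `log_title` are hypothetical, and it is written for Python 3, where the non-Unicode titles would arrive as `bytes`.

```python
import logging

# Hypothetical logger name; the project's real logger may differ.
logger = logging.getLogger("amis_scraper")


def to_text(value, encoding="utf-8"):
    """Return value as text, replacing bytes that cannot be decoded."""
    if isinstance(value, bytes):
        return value.decode(encoding, errors="replace")
    return str(value)


def log_title(title):
    """Log an article title after normalising it to text."""
    text = to_text(title)
    logger.info("Scraped article: %s", text)
    return text
```

Decoding with `errors="replace"` trades exactness for robustness: an undecodable byte becomes U+FFFD in the log instead of terminating the run.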
The commodity tagging has not been functioning since the last integration. A
thorough investigation found that this was caused by a change in the input:
previously each article was a list of tokens, but it is now a single
string. We split the string into tokens to resolve this issue.
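
The failure mode is easy to reproduce: iterating over a Python string yields single characters, so no multi-character commodity name can ever match. A minimal sketch of the fix, assuming whitespace tokenisation and a plain list of commodity names (the function `tag_commodities` and its signature are hypothetical, not taken from the repository):

```python
def tag_commodities(article, commodities):
    """Return the sorted commodity names mentioned in an article."""
    # Accept both the new input (a single string) and the old one (a list of tokens).
    tokens = article.split() if isinstance(article, str) else list(article)
    found = {token.lower() for token in tokens} & set(commodities)
    return sorted(found)
```

Accepting both input shapes keeps the tagger working if an upstream step reverts to emitting token lists.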
…nment warning and avoid modification to the raw data
Changes are:

* Removed the hard-coded JSON file path; the prefix now comes from the environment variable SCRAPER_FILE_PREFIX.
* Removed stop-word handling from `SanitizeArticlePipeline`. Stop words are now processed by the article processing
  module. Further, this speeds up the scraper significantly.
* Commented out logging in `AmisJsonPipeline`. This is the source of the Unicode problem
  that resulted in the termination of Airflow. When executed in standalone mode, with
  logging turned off, the error is not triggered. However, when a log is created in Airflow,
  the error is triggered and terminates the process.
* Split the processor into processor and controller
* Execute the processor directly in the scraper directory; this avoids copying files.
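
As an illustration of the first change, the output path can be derived from `SCRAPER_FILE_PREFIX` at run time. This is a sketch under assumptions: only the variable name comes from this PR, while the helper `output_path` and the `<prefix>_<date>.json` naming scheme are hypothetical.

```python
import os
from datetime import date


def output_path(day=None, environ=os.environ):
    """Build the JSON output path from SCRAPER_FILE_PREFIX."""
    prefix = environ.get("SCRAPER_FILE_PREFIX")
    if not prefix:
        # Fail early with a clear message rather than writing to an unexpected location.
        raise RuntimeError("SCRAPER_FILE_PREFIX is not set")
    day = day or date.today()
    # Hypothetical naming scheme: <prefix>_<ISO date>.json
    return "{}_{}.json".format(prefix, day.isoformat())
```

Passing `environ` as a parameter keeps the function testable without touching the real process environment.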
@mkao006 mkao006 merged commit 01b623b into mk Sep 14, 2017
@mkao006 mkao006 deleted the restructure_data_processing branch November 27, 2017 14:57