Dev by mkao006 · Pull Request #43 · EST-Team-Adam/TheReadingMachine

mkao006 · 2017-10-04T18:11:56Z

No description provided.

Work In Progress

This reverts commit 5311433, reversing changes made to 2190732.

Updated README for both articles and twitter content.

We still need to integrate the sentiment data

A whole-corpus analysis is performed, producing two new outputs: a file containing the sentiment extracted from every sentence, and a file containing positive, neutral, negative and compound sentiment scores for each article.

Main file update for the whole corpus analysis

Logging is the main reason the scraper fails when it is executed by Airflow. The titles are non-unicode objects, and attempting to write them to the log file raises an exception. This does not happen in standalone execution because Scrapy logging has been disabled in the settings file.
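The change in this pull request avoids the failure by turning the logging off. An alternative, sketched below, is to normalise each title to text before it reaches the logger. This is only an illustration under stated assumptions: it is written for Python 3, and `to_text`, `log_title` and the logger name `amis_scraper` are hypothetical names that do not come from the repository.

```python
import logging

logger = logging.getLogger("amis_scraper")  # hypothetical logger name


def to_text(value, encoding="utf-8"):
    """Return value as text, decoding byte strings without raising."""
    if isinstance(value, bytes):
        # errors="replace" substitutes U+FFFD for undecodable bytes instead
        # of raising UnicodeDecodeError in the middle of a pipeline run.
        return value.decode(encoding, errors="replace")
    return str(value)


def log_title(title):
    """Log an article title after normalising it to text."""
    text = to_text(title)
    logger.info("Scraped article: %s", text)
    return text
```

With this approach the log call receives the same text type whether the scraper runs standalone or under a scheduler, at the cost of replacing undecodable bytes with a placeholder character.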

The commodity tagging has not been functioning since the last integration. Investigation showed that this is due to a change in the input: previously each article was a list of tokens, but it is now a single string. We split the string into tokens to resolve the issue.
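The kind of fix described above can be sketched as follows. This is a minimal illustration rather than the repository's code: the function `tag_commodities`, the `COMMODITIES` set and the use of whitespace splitting are all assumptions made for the example.

```python
# Hypothetical commodity vocabulary; the project's real list is not shown here.
COMMODITIES = {"wheat", "maize", "rice", "soybean"}


def tag_commodities(article):
    """Return the sorted commodities mentioned in an article.

    Accepts either a single string (the new input format) or a list of
    tokens (the old format), so both kinds of caller keep working.
    """
    if isinstance(article, str):
        # Iterating over a str yields characters, not words, so membership
        # tests against whole commodity names would never match.
        tokens = article.lower().split()
    else:
        tokens = [token.lower() for token in article]
    return sorted(COMMODITIES.intersection(tokens))


print(tag_commodities("Wheat and maize prices rose"))
```

Accepting both input types is a design choice for the sketch; it avoids the silent failure mode where a string is iterated character by character and no commodity ever matches.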

…nment warning and avoid modification to the raw data

Changes are:
* Remove hard coding of the JSON file path name; it is now set by the environment variable SCRAPER_FILE_PREFIX.
* Removed stop-word handling from `SanitizeArticlePipeline`. Stop words are now processed by the article processing module. Further, this speeds up the scraper significantly.
* Commented out logging in `AmisJsonPipeline`. This is the source of the unicode problem which resulted in the termination of Airflow. When executed in standalone mode, with the log turned off, the error is not triggered. However, when the log is created in Airflow, the error is triggered and terminates the process.
* Split the processor into processor and controller.
* Execute the processor directory in the scraper directory; this avoids copying of files.
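As an illustration of the first change, the snippet below shows one way a file name could be derived from `SCRAPER_FILE_PREFIX`. Only the variable name comes from this pull request; the function `scraper_output_path`, the fallback prefix and the `.json` naming scheme are assumptions for the sketch.

```python
import os


def scraper_output_path(spider_name, environ=os.environ):
    """Build the JSON output path for a spider from SCRAPER_FILE_PREFIX."""
    # Fall back to a relative prefix when the variable is unset; this
    # default value is an assumption made for the example.
    prefix = environ.get("SCRAPER_FILE_PREFIX", "scraped_")
    return "{0}{1}.json".format(prefix, spider_name)


print(scraper_output_path("worldgrain", {"SCRAPER_FILE_PREFIX": "/tmp/amis_"}))
```

Passing the environment mapping as a parameter keeps the function easy to exercise without modifying the real process environment.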

Restructure data processing

Without this change in the code, the entire pipeline stops at the topic_modelling step. I do not know whether this change causes any problems on Linux.

Add ch domain for noggers.

The blog is no longer updated, so there is little value in extracting information from this source. Furthermore, a comparison of the predictions shows almost no change.

into mk2

…eaning of the topics is much clearer

Restructured pipeline

mrpozzi

LGTM

albertomun and others added 30 commits April 26, 2017 19:49

Dictionary Additions main file

9f03aff

Work In Progress

Dictionary Additions Readme (WIP)

6ecdf47

Merge remote-tracking branch 'origin/Alberto' into mk

5311433

Revert "Merge remote-tracking branch 'origin/Alberto' into mk"

d030cc4

This reverts commit 5311433, reversing changes made to 2190732.

From .ipynb to .py for the topic model

d791fcc

Merge remote-tracking branch 'origin/dev' into mk

c17faf8

moved file into data pipeline

b576ac3

moved file into data pipeline

2218dcc

Add files via upload

a2376f7

SBD Twitter Update

0ade996

Twitter Update

0134716

Sentences Selector Twitter Update

cd182ce

Sentence Selector and Sentiment Extraction

f05c305

Updated README for both articles and twitter content.

removed old redundant file

cd6a8ce

restructured geotagging

7cd7688

restructured geotagging

974b709

reformat

e53867f

update

b8ef165

minor correction and reformat

0cb4847

first commit

8800d99

restructured

3b63e61

restructure

7ffdc17

restructured

4a0c66c

no longer required, moved to sandbox

aa17c4e

first commit

b9b2292

We still need to integrate the sentiment data

added article dates

164ef90

rewrote the harmonisation

f8f8fa9

first commit of the initialisation scripts

bf5bc6e

Whole corpus analysis

b10fb56

A whole-corpus analysis is performed, producing two new outputs: a file containing the sentiment extracted from every sentence, and a file containing positive, neutral, negative and compound sentiment scores for each article.

Whole corpus analysis

8ed7f07

Main file update for the whole corpus analysis

mkao006 and others added 26 commits September 5, 2017 17:22

implemented stacked LASSO for improved stability and forecast

4395984

added option to specify the lasso and ridge alpha coefficient

e685d62

optimised the code and also added prediction beyond the price data

53b56de

added function for splitting data into train, test and prediction

594b391

fixed indentation

ca6571d

full implementation of the price model in Python

4c9047f

update

caaf874

fixed world grain incorrect variable reference and nogger domain issue

8adaceb

changed input table from Raw to Processed articles

dee0af6

updated from clean environment

7197b55

create a copy of the data for manipulation, this eliminates the assig…

3a82072

…nment warning and avoid modification to the raw data

removed redundant variables and added variable for scraper file naming

804f68c

ignore log files

098ad60

Merge branch 'mk' into restructure_data_processing

35bfaca

Merge pull request #41 from EST-Team-Adam/restructure_data_processing

01b623b

Restructure data processing

MacOSx problem

e5afa38

Without this change in the code, the entire pipeline stops at the topic_modelling step. I do not know whether this change causes any problems on Linux.

Update spiders.py

0ec6b7c

Add ch domain for noggers.

removed Noggers from scraper.

282708f

The blog is no longer updated, so there is little value in extracting information from this source. Furthermore, a comparison of the predictions shows almost no change.

Merge branch 'mk2' of https://github.com/EST-Team-Adam/TheReadingMachine

7c6ba92

into mk2

following dplyr convention as per suggestion of Marco

9e77b4d

corrected function

768c578

removed stemming, the prediction does not appear to change, and the m…

5f91fa1

…eaning of the topics is much clearer

Merge pull request #42 from EST-Team-Adam/mk2

265ec65

Restructured pipeline

mkao006 requested review from marcosmilzo and mrpozzi October 4, 2017 18:12

mrpozzi approved these changes Oct 4, 2017

mkao006 merged commit 83584bd into master Oct 5, 2017

mkao006 merged 371 commits into master from dev

4 participants
