Restructured pipeline by mkao006 · Pull Request #42 · EST-Team-Adam/TheReadingMachine

mkao006 · 2017-09-14T07:26:55Z

Fixed minor issues of the scraper with Airflow.
Updated the requirements.txt from scratch.
Enhanced price model with bagging.
Re-wrote the modelling with Python.
Fixed commodity tagging.

This will be the basis of the final model of V1.

Logging is the main reason the scraper fails when it is executed by Airflow. The titles are non-unicode objects, and attempting to write them to the log file raises an exception. This does not happen in standalone execution because the Scrapy logs are disabled in the settings file.
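A minimal sketch of one way to keep the log call without triggering the encoding failure described above, assuming Python 3 and the standard `logging` module rather than the project's actual Scrapy setup. The helper names `to_text` and `log_written_item`, the logger name, and the sample title are hypothetical.

```python
import logging


def to_text(value, encoding="utf-8"):
    """Return `value` as text, decoding bytes and replacing undecodable bytes."""
    if isinstance(value, bytes):
        return value.decode(encoding, errors="replace")
    return str(value)


def log_written_item(logger, title):
    """Log a scraped title and return the text that was logged."""
    text = to_text(title)
    # Passing the title as an argument leaves formatting to the logging module
    # instead of concatenating strings of possibly different types.
    logger.info("Written Item: %s", text)
    return text


logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("scraper_sketch")  # hypothetical logger name
logged = log_written_item(logger, "Prix du blé en hausse".encode("utf-8"))
```

Normalising the title before it reaches the logger means the same code path behaves the same way whether logging is enabled (as under Airflow) or disabled (as in standalone runs).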

The commodity tagging has not been functioning since the last integration. After a thorough investigation, it was found that this is due to a change in the input: previously, each article was a list of tokens, but it is now a single string. We split the string into tokens to resolve this issue.
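The fix described above can be sketched as follows. This is an illustration only: the function name `tag_commodities`, the keyword table, and the whitespace tokenisation are assumptions, not the project's actual tagging code.

```python
def tag_commodities(article, commodity_keywords):
    """Return the sorted commodities whose keywords appear among the tokens.

    `article` may be a single string (the new input format) or a list of
    tokens (the previous format).
    """
    if isinstance(article, str):
        # Split the string into tokens so keywords are matched against words.
        tokens = article.lower().split()
    else:
        tokens = [token.lower() for token in article]
    token_set = set(tokens)
    return sorted(
        commodity
        for commodity, keywords in commodity_keywords.items()
        if token_set & set(keywords)
    )


# Hypothetical keyword table, for illustration only.
keywords = {"wheat": ["wheat"], "maize": ["maize", "corn"], "rice": ["rice"]}
tags = tag_commodities("Corn and wheat futures rallied on Friday", keywords)
```

In Python, iterating over a string yields single characters, so code written for token lists can silently stop matching when handed a string. Whether that is the exact failure mode in this pipeline is an assumption.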

Changes are:
* Removed the hard-coded JSON file path name; it is now set by the environment variable SCRAPER_FILE_PREFIX.
* Removed stop-word handling from `SanitizeArticlePipeline`. Stop words are now processed by the article processing module. Further, this speeds up the scraper significantly.
* Commented out logging in `AmisJsonPipeline`. This is the source of the unicode problem which resulted in the termination of Airflow. When executed in standalone mode, with the log turned off, the error is not triggered. However, when the log is created in Airflow, the error is triggered and terminates the process.
* Split the processor into processor and controller.
* Execute the processor directly in the scraper directory; this avoids copying files.
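A small sketch of reading the output location from SCRAPER_FILE_PREFIX instead of hard coding it. Only the variable name comes from the change list above; the function name, the fallback prefix, and the `<prefix>_<spider>.jsonl` naming pattern are assumptions for illustration.

```python
import os


def scraper_output_path(spider_name, environ=None):
    """Build the JSON output path for a spider from SCRAPER_FILE_PREFIX."""
    environ = os.environ if environ is None else environ
    # The fallback prefix and the file naming pattern are assumptions.
    prefix = environ.get("SCRAPER_FILE_PREFIX", "scraped_articles")
    return "{}_{}.jsonl".format(prefix, spider_name)


path = scraper_output_path("example_spider", {"SCRAPER_FILE_PREFIX": "/tmp/amis"})
```

Accepting the environment as an optional argument keeps the function easy to test without modifying the real process environment.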

marcosmilzo · 2017-09-17T11:36:52Z

-            raw_date = response.url.split("/")[-2]
+            item['link'] = response.url.replace(
+                'http://', '').replace('https://', '')
+            raw_date = response.url.split('/')[-2]


Gosh man!! You really formatted even the quotes properly everywhere!!!!

My editor has very little tolerance for inconsistencies.... :P

+1 for that.

marcosmilzo

Great job man!!
I will now try to run Airflow to test it, and I will approve it as soon as I get it running.
Thanks!

marcosmilzo · 2017-09-17T11:44:51Z

+                        id_col='id'):
    original_id = topic[id_col]
-    scored_topic = topic.drop(id_col, axis=1).apply(lambda x: x * sentiment)
+    # scored_topic = topic.drop(id_col, axis=1).apply(lambda x: x * sentiment)


Why is it now split between positive and negative?

This is because the compound sentiment is 80%+ positive. So I decided to model the positive and negative separately. It also makes more sense, as the responses from the market are generally asymmetric.
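The idea in the reply above, weighting each topic by the positive and negative sentiment separately instead of by a single compound score, might look roughly like this. It is a plain-Python sketch: the function name, the column suffixes, and the sample values are made up, and the project's actual implementation may differ.

```python
def score_topics(topic_rows, sentiment, id_col="id"):
    """Weight each topic by positive and negative sentiment separately."""
    scored = []
    for row in topic_rows:
        article_id = row[id_col]
        positive, negative = sentiment[article_id]
        scored_row = {id_col: article_id}
        for name, weight in row.items():
            if name == id_col:
                continue
            # Two features per topic, so the model can fit an asymmetric response.
            scored_row[name + "_positive"] = weight * positive
            scored_row[name + "_negative"] = weight * negative
        scored.append(scored_row)
    return scored


topics = [{"id": 1, "topic_0": 0.5, "topic_1": 0.25}]
sentiment = {1: (0.5, 0.25)}  # (positive, negative) scores, made-up values
scored = score_topics(topics, sentiment)
```

Keeping two columns per topic lets a downstream model assign different coefficients to good and bad news on the same topic.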

marcosmilzo · 2017-09-17T11:56:53Z

+        ##                 tagged and thus they are identical
+        ##                 variables. This results in unreliable
+        ##                 prediction.
+        subset(., select = -c(grep("contain", colnames(.)))) %>%


if you want to keep in dplyr commands:
select(-contains("contain")) %>%

ah true! Will change that.

mrpozzi

Great job! LGTM

mrpozzi · 2017-10-04T18:00:58Z


 class SanitizeArticlePipeline(object):

-    def __init__(self):


Are we dealing with stop words in the proper text analysis? there is a lot of garbage we could easily pre-filter...

mrpozzi · 2017-10-04T18:01:54Z

+            line = json.dumps(item_dict, ensure_ascii=False) + '\n'
            self.datafiles[spider.name].write(line)
-            spider.logger.info("Written Item: " + item['title'])
+            # spider.logger.info('Written Item: ' + item['title'])


these lines could be dropped instead of commented out.
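For context on the `json.dumps(item_dict, ensure_ascii=False)` line in the diff above, here is a self-contained sketch of writing items as JSON lines with an explicit UTF-8 encoding. It assumes Python 3; the helper names and the file name are hypothetical and this is not the `AmisJsonPipeline` code itself.

```python
import io
import json
import os
import tempfile


def append_item(path, item):
    """Append one item to a JSON-lines file, keeping non-ASCII text readable."""
    line = json.dumps(item, ensure_ascii=False) + "\n"
    # An explicit encoding avoids depending on the platform default.
    with io.open(path, "a", encoding="utf-8") as handle:
        handle.write(line)
    return line


def read_items(path):
    """Read every item back from a JSON-lines file."""
    with io.open(path, "r", encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


tmp_dir = tempfile.mkdtemp()
out_path = os.path.join(tmp_dir, "articles.jsonl")  # hypothetical file name
written = append_item(out_path, {"title": "Prix du blé en hausse"})
items = read_items(out_path)
```

With `ensure_ascii=False` the accented characters are written as-is rather than as `\uXXXX` escapes, so the file handle must be able to encode them.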

mrpozzi · 2017-10-04T18:02:41Z


-# Crawl responsibly by identifying yourself (and your website) on the user-agent
-#USER_AGENT = 'amisSpider (+http://www.yourdomain.com)'
+# Crawl responsibly by identifying yourself (and your website) on the


we could remove this patronizing bullshit.
It's a leftover of the autogenerated code...

mrpozzi · 2017-10-04T18:03:22Z

-            raw_date = response.url.split("/")[-2]
+            item['link'] = response.url.replace(
+                'http://', '').replace('https://', '')
+            raw_date = response.url.split('/')[-2]


+1 for that.

mrpozzi · 2017-10-04T18:04:30Z

-os.makedirs(log_dir)
-
-process = CrawlerProcess(get_project_settings())
+import controller as ctr


this is awesome. thanks for cleaning up

mkao006 added 30 commits August 11, 2017 13:28

fixed undefined variable

a7c3046

corrected the data type and added length specification

490390d

removed redundant string and also corrected data type

d0e2fa8

remove redundant file as per conversation with Marco

ba30f29

kill all airflow instance rather than just the scheduler

cbdcdcd

added slimit for price extraction and also reverted Scrapy and scon v…

2a02a49

…ersion The old scon version is not available on PyPI.

changed default setting

edd6ef5

Both the noun-removal and stemming procedures reduce the dimension and detail of the data; we want to see whether the current model can handle the complexity.

moved aggregation and merge from price modeling to data harmonisa…

b67fa60

…tion

resolved conflict

36fcc9f

added missing slimit package for article scraping

24f03eb

convert the price time series to a regular spaced time series using l…

8ea4758

…inear interpolation
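A rough illustration of converting an irregular price series to a regularly spaced one by linear interpolation, as the commit above describes. This plain-Python sketch assumes a daily grid and sorted input with unique dates; the function name and sample prices are hypothetical, and the project's own implementation may use a different library or frequency.

```python
from datetime import date, timedelta


def interpolate_daily(observations):
    """Linearly interpolate (date, price) pairs onto a daily grid.

    `observations` must be sorted by date with no duplicate dates.
    """
    if not observations:
        return []
    result = []
    for (d0, p0), (d1, p1) in zip(observations, observations[1:]):
        gap = (d1 - d0).days
        for step in range(gap):
            # Fraction of the way from d0 to d1.
            fraction = step / gap
            result.append((d0 + timedelta(days=step), p0 + fraction * (p1 - p0)))
    # The final observation is not covered by the loop above.
    result.append(observations[-1])
    return result


prices = [(date(2017, 9, 1), 100.0), (date(2017, 9, 5), 108.0)]  # made-up prices
daily = interpolate_daily(prices)
```

A regular grid makes it straightforward to merge prices with articles by date, since every day then has a price value.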

removed the creation of lead response variable and use the full dates…

2a1fafd

… of price for merge

corrected reference before assignment

fb05884

removed the hack, and also removed noun and perform stemming

471bc58

Merge branch 'dev' into restructure_data_processing

fc8e1d4

Merge branch 'dev' into mk

ae40c0d

removed text processing

122fee0

* removed redundant import
* replaced inconsistent quotation marks
* removed stemming and all text processing
* changed topic name separator from " " to "_"

removed hardcoding of names

1f68390

removed renaming and also switch back to the original table

2967569

changed the source data table to the processed article

f9f9d87

changed source data table to processed article

74a52a4

removed stemming and also added back the 'id' column

c5f6b25

compute score based on both the negative and positive score, also cha…

598eed2

…nged normalisation method

changed argument based on the new implementation

cdd7746

implemented stacked LASSO for improved stability and forecast

4395984

added option to specify the lasso and ridge alpha coefficient

e685d62

optimised the code and also added prediction beyond the price data

53b56de

added function for splitting data into train, test and prediction

594b391
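One possible shape for the train, test and prediction split named in the commit above, sketched in plain Python. The assumptions here are that rows are in chronological order, that rows without an observed price (dates beyond the price data) form the prediction set, and that the most recent observed rows are held out for testing. The function name, field names, and the 20% default are hypothetical.

```python
def split_dataset(rows, response="price", test_fraction=0.2):
    """Split chronologically ordered rows into train, test and prediction sets.

    Rows whose response is None (no observed price yet) form the prediction
    set; the remaining rows are split in order, holding out the latest part.
    """
    observed = [row for row in rows if row[response] is not None]
    prediction = [row for row in rows if row[response] is None]
    n_test = int(round(len(observed) * test_fraction))
    split_at = len(observed) - n_test
    return observed[:split_at], observed[split_at:], prediction


rows = [{"day": i, "price": 100.0 + i} for i in range(8)]
rows += [{"day": 8, "price": None}, {"day": 9, "price": None}]
train, test, prediction = split_dataset(rows, test_fraction=0.25)
```

Splitting in time order rather than at random avoids training on observations that come after the test period.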

fixed indentation

ca6571d

full implementation of the price model in Python

4c9047f

mkao006 and others added 12 commits September 8, 2017 16:30

update

caaf874

fixed world grain incorrect variable reference and nogger domain issue

8adaceb

changed input table from Raw to Processed articles

dee0af6

updated from clean environment

7197b55

create a copy of the data for manipulation, this eliminates the assig…

3a82072

…nment warning and avoid modification to the raw data

removed redundant variables and added variable for scraper file naming

804f68c

ignore log files

098ad60

Merge branch 'mk' into restructure_data_processing

35bfaca

Merge pull request #41 from EST-Team-Adam/restructure_data_processing

01b623b

Restructure data processing

mkao006 requested review from marcosmilzo and mrpozzi September 14, 2017 07:26

marcosmilzo reviewed Sep 17, 2017

marcosmilzo and others added 7 commits September 18, 2017 14:00

MacOSx problem

e5afa38

Without this change in the code, the entire pipeline stops at the topic_modelling step. I do not know whether this change causes any problems on Linux.

Update spiders.py

0ec6b7c

Add ch domain for noggers.

removed Noggers from scraper.

282708f

The blog is no longer updated, so extracting information from this source provides little value. Further, a comparison of the predictions shows almost no change.

Merge branch 'mk2' of https://github.com/EST-Team-Adam/TheReadingMachine

7c6ba92

into mk2

following dplyr convention as per suggestion of Marco

9e77b4d

corrected function

768c578

removed stemming, the prediction does not appear to change, and the m…

5f91fa1

…eaning of the topics is much clearer

mrpozzi approved these changes Oct 4, 2017

mkao006 merged commit 265ec65 into dev Oct 4, 2017

mkao006 deleted the mk2 branch November 27, 2017 14:57
