Restructured pipeline #42

Merged: mkao006 merged 49 commits into dev from mk2 on Oct 4, 2017
mkao006 commented Sep 14, 2017

Contributor
  • Fixed minor issues of the scraper with Airflow.
  • Updated the requirements.txt from scratch.
  • Enhanced price model with bagging.
  • Rewrote the modelling in Python.
  • Fixed commodity tagging.

This will be the basis of the final model of V1.
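
The bagging mentioned in the description is not shown in this conversation. As a rough illustration of the general technique only, the sketch below fits a simple least-squares line on bootstrap resamples and averages the predictions. The base model, the function names `fit_line` and `bagged_predict`, and the toy data are all assumptions made for the example, not the price model used in this PR.

```python
import random
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b * x; returns (a, b)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        # Degenerate resample (all x equal): fall back to a constant model.
        return y_bar, 0.0
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    return y_bar - b * x_bar, b


def bagged_predict(xs, ys, x_new, n_models=50, seed=0):
    """Average the predictions of models fitted on bootstrap resamples."""
    rng = random.Random(seed)
    n = len(xs)
    predictions = []
    for _ in range(n_models):
        # Draw n indices with replacement to form one bootstrap sample.
        idx = [rng.randrange(n) for _ in range(n)]
        a, b = fit_line([xs[i] for i in idx], [ys[i] for i in idx])
        predictions.append(a + b * x_new)
    return mean(predictions)


# Hypothetical toy data following y = 2x + 1.
xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [2 * x + 1 for x in xs]
prediction = bagged_predict(xs, ys, x_new=10)
```

Averaging over resamples mainly reduces the variance of an unstable base model, so the benefit is larger for noisy data than for this clean toy series.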

mkao006 added 30 commits August 11, 2017 13:28
…ersion

The old scon version is not available on PyPI.
Both the noun-removal and stemming procedures reduce the dimension and detail of the data;
we want to see whether the current model can handle the complexity.
* removed redundant import
* replaced inconsistent quotation marks
* removed stemming and all text processing
* changed topic name separator from " " to "_"
mkao006 and others added 12 commits September 8, 2017 16:30
Logging is the main reason the scraper fails when it is executed by Airflow. The
titles are non-Unicode objects, and attempting to write them to the log file raises an
exception. This does not happen in standalone execution because Scrapy logging
has been disabled in the settings file.
The commodity tagging has not been functioning since the last integration. After
a thorough investigation, it was found that this is due to the change in input.
Previously, the articles were a list of tokens but it has now been changed to a single
string. We split the string into tokens to resolve this issue.
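
A minimal sketch of the fix described above, assuming the tagger matches keywords against tokens. The function name `tag_commodities` and the keyword map are hypothetical, since the real tag list is not shown in this PR. It also shows why the split matters: without it, a keyword test against the raw string is a substring match, so "rice" would match inside "prices".

```python
def tag_commodities(article, keywords):
    """Return the sorted commodities whose keywords occur as tokens in `article`."""
    # The article now arrives as one string, so split it into tokens first.
    tokens = set(article.lower().split())
    return sorted(
        commodity
        for commodity, words in keywords.items()
        if tokens & set(words)
    )


# Hypothetical keyword map used only for this illustration.
KEYWORDS = {
    "wheat": ["wheat"],
    "maize": ["maize", "corn"],
    "rice": ["rice"],
}

tags = tag_commodities("Corn prices rose while wheat fell", KEYWORDS)
# "rice" is not tagged even though it is a substring of "prices".
```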
…nment warning and avoid modification to the raw data
Changes are:

* Removed the hard-coded JSON file path name; it is now set through the environment variable SCRAPER_FILE_PREFIX.
* Removed stop-word handling from `SanitizeArticlePipeline`. Stop words are now processed by the article processing
  module. Further, this speeds up the scraper significantly.
* Commented out logging in `AmisJsonPipeline`. This is the source of the Unicode problem
  that terminated the Airflow run. When executed in standalone mode, with
  logging turned off, the error is not triggered. However, when a log is created in Airflow,
  the error is triggered and terminates the process.
* Split the processor into a processor and a controller.
* Execute the processor directly in the scraper directory; this avoids copying files.
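
For illustration, the first change in the list could look roughly like the sketch below: the output path is built from `SCRAPER_FILE_PREFIX`, and each item is written as a UTF-8 JSON line. The helper names, the fallback prefix, and the `<prefix>_<spider>.jsonl` naming scheme are assumptions for the example and are not taken from the PR.

```python
import io
import json
import os
import tempfile


def output_path(spider_name, default_prefix="amis_articles"):
    """Build the output file name from SCRAPER_FILE_PREFIX, with a fallback."""
    # The fallback prefix and the naming scheme are hypothetical.
    prefix = os.environ.get("SCRAPER_FILE_PREFIX", default_prefix)
    return "{}_{}.jsonl".format(prefix, spider_name)


def write_item(path, item):
    """Append one item to `path` as a UTF-8 encoded JSON line."""
    # ensure_ascii=False keeps non-ASCII titles readable in the file,
    # so the file must be opened with an explicit encoding.
    line = json.dumps(item, ensure_ascii=False) + "\n"
    with io.open(path, "a", encoding="utf-8") as f:
        f.write(line)


# A temporary directory stands in for the real data directory.
os.environ["SCRAPER_FILE_PREFIX"] = os.path.join(tempfile.mkdtemp(), "articles")
path = output_path("example_spider")
write_item(path, {"title": "Blé: prix en hausse"})
```

Reading the prefix from the environment lets Airflow and a standalone run write to different locations without editing the pipeline code.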
raw_date = response.url.split("/")[-2]
item['link'] = response.url.replace(
'http://', '').replace('https://', '')
raw_date = response.url.split('/')[-2]

Contributor

Gosh man!! You really formatted even the quotes properly everywhere!!!!

Contributor Author

My editor has very little tolerance for inconsistencies.... :P

Contributor

+1 for that.

marcosmilzo left a comment

Contributor

Great job man!!
I will now try to run Airflow to test it, and I will approve as soon as I get it running.
Thanks!

id_col='id'):
original_id = topic[id_col]
scored_topic = topic.drop(id_col, axis=1).apply(lambda x: x * sentiment)
# scored_topic = topic.drop(id_col, axis=1).apply(lambda x: x * sentiment)

Contributor

Why is it now split between positive and negative?

Contributor Author

This is because the compound sentiment is more than 80% positive, so I decided to model the positive and negative sentiment separately. It also makes more sense because market responses are generally asymmetric.
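
A rough sketch of what modelling the two sentiments separately could mean for the features, assuming each article has topic weights plus one positive and one negative score. The function name `score_topics` and the `_pos`/`_neg` suffixes are hypothetical; the actual code in this PR operates on data frames.

```python
def score_topics(topic_weights, positive, negative):
    """Weight each topic by the positive and negative scores separately."""
    scored = {}
    for topic, weight in topic_weights.items():
        # Two features per topic let a model fit a different
        # coefficient for good news and for bad news.
        scored[topic + "_pos"] = weight * positive
        scored[topic + "_neg"] = weight * negative
    return scored


# Hypothetical article: topic weights and sentiment scores are made up.
features = score_topics(
    {"topic_1": 0.5, "topic_2": 0.25}, positive=0.5, negative=0.25
)
```

With a single compound score, a topic can only push the prediction along one axis; separate features allow the asymmetric response mentioned above.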

Comment thread pipeline/price_modelling/controller.R Outdated
## tagged and thus they are identical
## variables. This results in unreliable
## prediction.
subset(., select = -c(grep("contain", colnames(.)))) %>%

Contributor

if you want to keep in dplyr commands:
select(-contains("contain")) %>%

Contributor Author

ah true! Will change that.

marcosmilzo and others added 7 commits September 18, 2017 14:00
Without this change in the code, the entire pipeline stops at the topic_modelling step.
I do not know whether this change causes any problems on Linux.
Add ch domain for noggers.
The blog is no longer updated and thus provides little value as a source of information;
further, a comparison of the predictions shows almost no change.

mrpozzi left a comment

Contributor

Great job! LGTM


class SanitizeArticlePipeline(object):

    def __init__(self):

Contributor

Are we dealing with stop words in the proper text analysis? There is a lot of garbage we could easily pre-filter...
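
For reference, a pre-filter of the kind asked about here can be very small. The sketch below is only an illustration: the stop-word list and the function name `remove_stop_words` are hypothetical, and the article processing module in this PR may handle stop words differently.

```python
# Hypothetical stop-word list; a real one would be much longer.
STOP_WORDS = frozenset({"a", "an", "and", "in", "of", "the", "to"})


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Return the lower-cased tokens of `text` that are not stop words."""
    return [token for token in text.lower().split() if token not in stop_words]


tokens = remove_stop_words("The price of wheat rose in the EU")
```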

line = json.dumps(item_dict, ensure_ascii=False) + '\n'
self.datafiles[spider.name].write(line)
spider.logger.info("Written Item: " + item['title'])
# spider.logger.info('Written Item: ' + item['title'])

Contributor

these lines could be dropped instead of commented out.


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'amisSpider (+http://www.yourdomain.com)'
# Crawl responsibly by identifying yourself (and your website) on the

Contributor

we could remove this patronizing bullshit.
It's a leftover of the autogenerated code...

os.makedirs(log_dir)

process = CrawlerProcess(get_project_settings())
import controller as ctr

Contributor

this is awesome. thanks for cleaning up

mkao006 merged commit 265ec65 into dev on Oct 4, 2017
mkao006 deleted the mk2 branch on November 27, 2017 14:57