[NH-026] - Extract content from several RSS feeds

We must extract the content of the RSS resources whenever possible.

**Expected Behavior**
RSS resources must be stored with the **content** and **tag** fields populated. At the moment, the list of affected resources is:

- 99designs
- Airbnb
- AirPair
- Alan Storm
- Alex Rogozhnikov
- Allegro.tech
- Andrew Brampton
- Antirez
- Appnexus
- Ariejan de Vroom
- Ariya Hidayat
- Auth0
- Axel Rauschmayer
- Babbel
- Badoo
- BenefitFocus
- Bitly
- Bjørn Johansen
- Carlos Becker
- Chen Hui Jing
- Chris Hager
- CloudBees
- CockroachDB
- Codemancers
- Codementor
- CodeName One
- Commercetools
- Condé Nast
- Crystal
- Curalate
- Daily JS
- Dan Luu
- DataFox
- Dennis Yurichev
- Dragan Djuric
- Dragan Gaic
- Drew DeVault
- Drivy
- Ebay
- Eddie Smith
- Elastic
- Elegant Code
- Engine Yard
- Eric Elliot
- Erik Runyon
- Evan Hahn
- Evan Miller
- Eventbrite
- Feedzai
- Findmypast
- Freek Van der Herten
- Gilt
- GO-JEK
- Guardian
- HackerEarth
- Haptik
- Hashrocket
- Hayden James
- HERE
- High Scalability
- HomeAway
- Housing.com
- Hypriot
- Ian Hummel
- IBM developerWorks
- Imaginea
- Instacart
- Instagram
- Jake Trent
- Jamis Buck
- Jane Street
- Jessie Frazelle
- Jobandtalent
- Joe Nelson
- Jonas Plum
- Jonathan Snook
- Josh Haberman
- Juri Strumpflohner
- K. Harrison
- Khan Academy
- Kinvolk
- Kogan.com
- Lambda the Ultimate
- Latacora
- Lazarus Lazaridis
- LINE
- Lyft
- Mallow Tech
- Mandrill
- MapTiler
- Marek Majkowski
- Mary Rose Cook
- Matt Might
- Medium
- Mike Fogus
- Milosz Galazka
- Miro Cupak
- MongoDB
- Monsanto
- Nate Berkopec
- Nelson Elhage
- New York Times
- Nic Raboy
- Nick Craver
- Nick Galbreath
- Nikola Brežnjak
- Nikolay Nemshilov
- Okta
- OLX
- Paul Graham
- Paul Lewis
- Paweł Chudzik
- Periscope Data
- Peter Norvig
- Philip Walton
- Piotr Pasich
- Pivotal
- Pony Foo
- PullReview
- Ray Wenderlich
- ReactJS News
- Redbubble
- Rightscale
- Riot Games
- RoseHosting
- Runtastic
- Secret Escapes
- Shape Security
- ShowMax
- SitePoint
- Slack
- Soundcloud
- Speedledger
- Srinivas Tamada
- Steve Bellovin
- Stitch Fix
- Stripe
- Sudhagar
- SurveyMonkey
- Teespring
- That Thing In Swift
- The Daily WTF
- Ticketmaster
- Tikhon Jelvis
- Toptal
- TrackMaven
- Trello
- Trivago
- Twilio
- Twitch
- Uber
- Una Kravets
- Vlad Mihalcea
- WalmartLabs
- Wayfair
- Wealthfront
- WePay
- Wilfred Hughes
- William Kennedy
- Wojtek Gawroński
- Wonga Technology
- Yelp
- Zulily

**Current Behavior**
The collecting process stores documents for feeds that do not provide the content and tags of the RSS item.

**Steps to reproduce**
To reproduce the current behavior:
- Start the docker-compose stack
- Run FeedCollectorApplication

**Steps to fix**
A good way to fix these errors could be:
1. In a unit test similar to **RssFeedListenerTest**, reproduce the parsing of a feed with the Rome library.
2. **Fix A**: The content appears in the feed, but the parser could not extract it. This may be a bug in the code.
3. **Fix B**: The content does not appear in the feed. We must add CSS selectors to the initial data file of the crawling process so that the content and tags can be taken from the HTML page and added to the Elasticsearch document.
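
Before choosing between Fix A and Fix B for a given feed, it can help to check what the raw feed actually carries. The following is a minimal sketch, not project code: it uses the JDK's built-in DOM parser instead of Rome so that the check is independent of the parser under suspicion. The class `FeedTriage`, the enum `Fix`, and the rule that `content:encoded` or a non-empty `description` counts as content and `category` counts as a tag are all assumptions made for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical helper: decides which fix applies to a feed by inspecting its raw XML. */
public final class FeedTriage {

    /** FIX_A: the data is in the feed (suspect the parser). FIX_B: the data must come from the HTML page. */
    public enum Fix { FIX_A, FIX_B }

    // Namespace of the RSS content module, which defines <content:encoded>.
    private static final String CONTENT_NS = "http://purl.org/rss/1.0/modules/content/";

    public static Fix triage(String rssXml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        // Feeds are untrusted input, so reject DOCTYPE declarations.
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(rssXml)));

        NodeList items = doc.getElementsByTagNameNS("*", "item");
        if (items.getLength() == 0) {
            throw new IllegalArgumentException("feed has no <item> elements");
        }
        // Assumption: the first item is representative of the whole feed.
        Element item = (Element) items.item(0);

        boolean hasContent = hasText(item.getElementsByTagNameNS(CONTENT_NS, "encoded"))
                || hasText(item.getElementsByTagName("description"));
        boolean hasTags = hasText(item.getElementsByTagName("category"));

        return (hasContent && hasTags) ? Fix.FIX_A : Fix.FIX_B;
    }

    /** True if at least one node in the list has non-whitespace text. */
    private static boolean hasText(NodeList nodes) {
        for (int i = 0; i < nodes.getLength(); i++) {
            String text = nodes.item(i).getTextContent();
            if (text != null && !text.trim().isEmpty()) {
                return true;
            }
        }
        return false;
    }
}
```

If `triage` returns `FIX_A` for a feed whose stored documents still lack content or tags, the feed is a candidate for the Rome-based unit test in step 1. A `FIX_B` result suggests the feed needs CSS selectors instead. Atom feeds use `entry` rather than `item` and are not covered by this sketch.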
