[NH-026] - Extract content from several RSS feeds #26

@ivangrod

Description

We must extract the content of RSS resources whenever possible.

Expected Behavior
RSS resources must be stored with the content and tag fields populated. At the moment, the affected resources are:

  • 99designs
  • Airbnb
  • AirPair
  • Alan Storm
  • Alex Rogozhnikov
  • Allegro.tech
  • Andrew Brampton
  • Antirez
  • Appnexus
  • Ariejan de Vroom
  • Ariya Hidayat
  • Auth0
  • Axel Rauschmayer
  • Babbel
  • Badoo
  • BenefitFocus
  • Bitly
  • Bjørn Johansen
  • Carlos Becker
  • Chen Hui Jing
  • Chris Hager
  • CloudBees
  • CockroachDB
  • Codemancers
  • Codementor
  • CodeName One
  • Commercetools
  • Condé Nast
  • Crystal
  • Curalate
  • Daily JS
  • Dan Luu
  • DataFox
  • Dennis Yurichev
  • Dragan Djuric
  • Dragan Gaic
  • Drew DeVault
  • Drivy
  • Ebay
  • Eddie Smith
  • Elastic
  • Elegant Code
  • Engine Yard
  • Eric Elliot
  • Erik Runyon
  • Evan Hahn
  • Evan Miller
  • Eventbrite
  • Feedzai
  • Findmypast
  • Freek Van der Herten
  • Gilt
  • GO-JEK
  • Guardian
  • HackerEarth
  • Haptik
  • Hashrocket
  • Hayden James
  • HERE
  • High Scalability
  • HomeAway
  • Housing.com
  • Hypriot
  • Ian Hummel
  • IBM developerWorks
  • Imaginea
  • Instacart
  • Instagram
  • Jake Trent
  • Jamis Buck
  • Jane Street
  • Jessie Frazelle
  • Jobandtalent
  • Joe Nelson
  • Jonas Plum
  • Jonathan Snook
  • Josh Haberman
  • Juri Strumpflohner
  • K. Harrison
  • Khan Academy
  • Kinvolk
  • Kogan.com
  • Lambda the Ultimate
  • Latacora
  • Lazarus Lazaridis
  • LINE
  • Lyft
  • Mallow Tech
  • Mandrill
  • MapTiler
  • Marek Majkowski
  • Mary Rose Cook
  • Matt Might
  • Medium
  • Mike Fogus
  • Milosz Galazka
  • Miro Cupak
  • MongoDB
  • Monsanto
  • Nate Berkopec
  • Nelson Elhage
  • New York Times
  • Nic Raboy
  • Nick Craver
  • Nick Galbreath
  • Nikola Brežnjak
  • Nikolay Nemshilov
  • Okta
  • OLX
  • Paul Graham
  • Paul Lewis
  • Paweł Chudzik
  • Periscope Data
  • Peter Norvig
  • Philip Walton
  • Piotr Pasich
  • Pivotal
  • Pony Foo
  • PullReview
  • Ray Wenderlich
  • ReactJS News
  • Redbubble
  • Rightscale
  • Riot Games
  • RoseHosting
  • Runtastic
  • Secret Escapes
  • Shape Security
  • ShowMax
  • SitePoint
  • Slack
  • Soundcloud
  • Speedledger
  • Srinivas Tamada
  • Steve Bellovin
  • Stitch Fix
  • Stripe
  • Sudhagar
  • SurveyMonkey
  • Teespring
  • That Thing In Swift
  • The Daily WTF
  • Ticketmaster
  • Tikhon Jelvis
  • Toptal
  • TrackMaven
  • Trello
  • Trivago
  • Twilio
  • Twitch
  • Uber
  • Una Kravets
  • Vlad Mihalcea
  • WalmartLabs
  • Wayfair
  • Wealthfront
  • WePay
  • Wilfred Hughes
  • William Kennedy
  • Wojtek Gawroński
  • Wonga Technology
  • Yelp
  • Zulily

Current Behavior
The collecting process stores documents for feeds that do not provide the content and tags of the RSS item.

Steps to reproduce
To reproduce the current behavior:

  • Start the docker-compose stack
  • Run FeedCollectorApplication

Steps to fix
A good way to fix these errors could be:

  1. In a unit test similar to RssFeedListenerTest, reproduce the parsing of a feed with the Rome library.
  2. Fix A: The content appears in the feed but the parser could not extract it. This may be a bug in the code.
  3. Fix B: The content does not appear in the feed. We must add CSS selectors to the initial data file of the crawling process so that content and tags can be added from the HTML page to the Elasticsearch document.
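To decide between Fix A and Fix B for a given feed, it can help to check first whether the raw feed XML carries the content and tags at all. The sketch below is only an illustration and not project code: the class FeedTriage, its ItemReport type, and the Verdict names are hypothetical. It deliberately uses the JDK's built-in XML parser instead of Rome, so it shows what is in the feed independently of the parser under test. It assumes RSS 2.0 feeds (item elements) and treats only content:encoded as full content; whether description should also count is a project decision.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: reports, per RSS item, whether content and tags are present in the feed XML.
final class FeedTriage {

    // Namespace of the RSS content module, which defines <content:encoded>.
    static final String CONTENT_NS = "http://purl.org/rss/1.0/modules/content/";

    // IN_FEED suggests Fix A (the parser should have found it); NEEDS_CRAWLING suggests Fix B.
    enum Verdict { IN_FEED, NEEDS_CRAWLING }

    static final class ItemReport {
        final String title;
        final boolean hasContent;
        final List<String> tags;

        ItemReport(String title, boolean hasContent, List<String> tags) {
            this.title = title;
            this.hasContent = hasContent;
            this.tags = tags;
        }

        Verdict verdict() {
            return hasContent ? Verdict.IN_FEED : Verdict.NEEDS_CRAWLING;
        }
    }

    static List<ItemReport> inspect(String rssXml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // required to look up content:encoded by namespace
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(rssXml)));

        List<ItemReport> reports = new ArrayList<>();
        NodeList items = doc.getElementsByTagName("item"); // RSS 2.0 only; Atom uses <entry>
        for (int i = 0; i < items.getLength(); i++) {
            Element item = (Element) items.item(i);
            String title = firstText(item.getElementsByTagName("title"));
            String encoded = firstText(item.getElementsByTagNameNS(CONTENT_NS, "encoded"));

            // Every <category> element of the item becomes one tag.
            List<String> tags = new ArrayList<>();
            NodeList categories = item.getElementsByTagName("category");
            for (int j = 0; j < categories.getLength(); j++) {
                tags.add(categories.item(j).getTextContent().trim());
            }
            reports.add(new ItemReport(title, !encoded.isEmpty(), tags));
        }
        return reports;
    }

    // Returns the trimmed text of the first node, or an empty string if the list is empty.
    private static String firstText(NodeList nodes) {
        return nodes.getLength() == 0 ? "" : nodes.item(0).getTextContent().trim();
    }

    public static void main(String[] args) throws Exception {
        String rss = "<rss version=\"2.0\" xmlns:content=\"" + CONTENT_NS + "\"><channel>"
                + "<item><title>Full</title><category>java</category>"
                + "<content:encoded><![CDATA[<p>Body</p>]]></content:encoded></item>"
                + "<item><title>Bare</title></item>"
                + "</channel></rss>";
        for (ItemReport report : inspect(rss)) {
            System.out.println(report.title + " -> " + report.verdict() + " tags=" + report.tags);
        }
    }
}
```

If an item is reported as IN_FEED but the stored document still lacks content, the problem is more likely on the parsing side (Fix A). If it is reported as NEEDS_CRAWLING, the feed itself does not provide the content and CSS selectors would be needed (Fix B).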
