This repository was archived by the owner on Feb 7, 2024. It is now read-only.



Sagrone scraper


Simple library to scrape web pages. Below you will find information on how to use it.


Installation

Add this line to your application's Gemfile:

gem 'sagrone_scraper'

And then execute:

$ bundle

Or install it yourself as:

$ gem install sagrone_scraper

Basic Usage

In order to scrape a web page you will need to:

  1. create a new scraper class by inheriting from SagroneScraper::Base,
  2. instantiate it with a url or page, and
  3. use the scraper instance to scrape the page and extract structured data.

More information can be found in the SagroneScraper::Base section below.

Modules

SagroneScraper::Agent

The agent is responsible for obtaining a page (a Mechanize::Page) from a URL. Here is how you create an agent:

require 'sagrone_scraper'

agent = SagroneScraper::Agent.new(url: 'https://twitter.com/Milano_JS')
agent.page
# => Mechanize::Page

agent.page.at('.ProfileHeaderCard-bio').text
# => "Javascript User Group Milano #milanojs"

SagroneScraper::Base

The scraper is responsible for extracting structured data from a page or a URL. The page can be obtained via the agent. As an example, below we define a TwitterScraper by inheriting from the SagroneScraper::Base class.

Public instance methods are used to extract data, whereas private instance methods are ignored (treated as helper methods). Most importantly, the self.can_scrape?(url) class method ensures that only a known subset of pages can be scraped for data.

Create a scraper class

require 'sagrone_scraper'

class TwitterScraper < SagroneScraper::Base
  TWITTER_PROFILE_URL = /^https?:\/\/twitter\.com\/(\w)+\/?$/i

  def self.can_scrape?(url)
    url.match(TWITTER_PROFILE_URL) ? true : false
  end

  # Public instance methods are used for data extraction.

  def bio
    text_at('.ProfileHeaderCard-bio')
  end

  def location
    text_at('.ProfileHeaderCard-locationText')
  end

  private

  # Private instance methods are not used for data extraction.

  def text_at(selector)
    page.at(selector).text if page.at(selector)
  end
end
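The URL check is plain Ruby, so it can be sketched and tried on its own, without the gem installed (note that the literal dot in the hostname must be escaped in the pattern):

```ruby
# Standalone sketch of the can_scrape? URL check (pure Ruby, no gem required).
TWITTER_PROFILE_URL = /^https?:\/\/twitter\.com\/\w+\/?$/i

def can_scrape?(url)
  url.match(TWITTER_PROFILE_URL) ? true : false
end

can_scrape?('https://twitter.com/Milano_JS')       # => true
can_scrape?('https://twitter.com/Milano_JS/lists') # => false
```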

Instantiate the scraper

# Instantiate the scraper with a "url".
scraper = TwitterScraper.new(url: 'https://twitter.com/Milano_JS')

# Instantiate the scraper with a "page" (Mechanize::Page).
agent = SagroneScraper::Agent.new(url: 'https://twitter.com/Milano_JS')
scraper = TwitterScraper.new(page: agent.page)

Scrape the page

scraper.scrape_page!

Extract the data

scraper.attributes
# => {bio: "Javascript User Group Milano #milanojs", location: "Milan, Italy"}
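Since public instance methods drive extraction, the attributes hash plausibly comes from calling each public method defined on the scraper subclass. A minimal standalone sketch of that idea (an assumption about the mechanism, not the gem's actual implementation):

```ruby
# Sketch: build an attributes hash from a class's own public instance methods.
# Private helpers (like text_at) are excluded automatically.
class MiniScraper
  def bio
    'Javascript User Group Milano #milanojs'
  end

  def location
    'Milan, Italy'
  end

  def self.attribute_names
    public_instance_methods(false)
  end

  private

  def text_at(selector)
    # helper method, invisible to attribute_names
  end
end

scraper = MiniScraper.new
attributes = MiniScraper.attribute_names.each_with_object({}) do |name, hash|
  hash[name] = scraper.public_send(name)
end
# attributes includes :bio and :location, but not the private :text_at
```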

SagroneScraper::Collection

This is the simplest way to scrape a web page:

require 'sagrone_scraper'

# 1) Define a scraper. For example, the TwitterScraper above.

# 2) Newly created scrapers are registered automatically.
SagroneScraper::Collection.registered_scrapers
# => ['TwitterScraper']

# 3) Here we use the collection to scrape data at a URL.
SagroneScraper::Collection.scrape(url: 'https://twitter.com/Milano_JS')
# => {bio: "Javascript User Group Milano #milanojs", location: "Milan, Italy"}
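Under the hood, Collection.scrape presumably picks the first registered scraper whose can_scrape? accepts the URL. A hedged, self-contained sketch of that dispatch (the class names here are made up for illustration):

```ruby
# Hypothetical dispatch: find the first registered scraper that can handle a URL.
class FakeTwitterScraper
  def self.can_scrape?(url)
    url.start_with?('https://twitter.com/')
  end
end

class FakeGithubScraper
  def self.can_scrape?(url)
    url.start_with?('https://github.com/')
  end
end

registered_scrapers = [FakeTwitterScraper, FakeGithubScraper]

def scraper_for(registered, url)
  registered.find { |klass| klass.can_scrape?(url) }
end

scraper_for(registered_scrapers, 'https://twitter.com/Milano_JS')
# => FakeTwitterScraper
```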

Contributing

  1. Fork it ( https://github.com/[my-github-username]/sagrone_scraper/fork )
  2. Create your feature branch (git checkout -b my-new-feature)
  3. Commit your changes (git commit -am 'Add some feature')
  4. Push to the branch (git push origin my-new-feature)
  5. Create a new Pull Request
