This repository was archived by the owner on Feb 7, 2024. It is now read-only.



Sagrone scraper


Simple library to scrape web pages. Below you will find information on how to use it.


Installation

Add this line to your application's Gemfile:

gem 'sagrone_scraper'

And then execute:

$ bundle

Or install it yourself as:

$ gem install sagrone_scraper

Basic Usage

In order to scrape a web page you will need to:

  1. create a new scraper class by inheriting from SagroneScraper::Base,
  2. instantiate it with a url or page, and
  3. use the scraper instance to scrape the page and extract structured data.

More information can be found in the SagroneScraper::Base section below.

Modules

SagroneScraper::Agent

The agent is responsible for obtaining a page (a Mechanize::Page) from a URL. Here is how you create an agent:

require 'sagrone_scraper'

agent = SagroneScraper::Agent.new(url: 'https://twitter.com/Milano_JS')
agent.page
# => Mechanize::Page

agent.page.at('.ProfileHeaderCard-bio').text
# => "Javascript User Group Milano #milanojs"

SagroneScraper::Base

The scraper is responsible for extracting structured data from a page or a URL. The page can be obtained via the agent. As an example, below we define a TwitterScraper by inheriting from the SagroneScraper::Base class.

Public instance methods are used to extract data, whereas private instance methods are ignored (treated as helper methods). Most importantly, the self.can_scrape?(url) class method ensures that only a known subset of pages can be scraped for data.

Create a scraper class

require 'sagrone_scraper'

class TwitterScraper < SagroneScraper::Base
  TWITTER_PROFILE_URL = /^https?:\/\/twitter\.com\/(\w)+\/?$/i

  def self.can_scrape?(url)
    url.match(TWITTER_PROFILE_URL) ? true : false
  end

  # Public instance methods are used for data extraction.

  def bio
    text_at('.ProfileHeaderCard-bio')
  end

  def location
    text_at('.ProfileHeaderCard-locationText')
  end

  private

  # Private instance methods are not used for data extraction.

  def text_at(selector)
    page.at(selector).text if page.at(selector)
  end
end
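The URL check is plain Ruby, so it can be sketched and tried on its own, without the gem installed (note that the literal dot in the hostname must be escaped in the pattern):

```ruby
# Standalone sketch of the can_scrape? URL check (pure Ruby, no gem required).
TWITTER_PROFILE_URL = /^https?:\/\/twitter\.com\/\w+\/?$/i

def can_scrape?(url)
  url.match(TWITTER_PROFILE_URL) ? true : false
end

can_scrape?('https://twitter.com/Milano_JS')       # => true
can_scrape?('https://twitter.com/Milano_JS/lists') # => false
```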

Instantiate the scraper

# Instantiate the scraper with a "url".
scraper = TwitterScraper.new(url: 'https://twitter.com/Milano_JS')

# Instantiate the scraper with a "page" (Mechanize::Page).
agent = SagroneScraper::Agent.new(url: 'https://twitter.com/Milano_JS')
scraper = TwitterScraper.new(page: agent.page)

Scrape the page

scraper.scrape_page!

Extract the data

scraper.attributes
# => {bio: "Javascript User Group Milano #milanojs", location: "Milan, Italy"}
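Since public instance methods drive extraction, the attributes hash plausibly comes from calling each public method defined on the scraper subclass. A minimal standalone sketch of that idea (an assumption about the mechanism, not the gem's actual implementation):

```ruby
# Sketch: build an attributes hash from a class's own public instance methods.
# Private helpers (like text_at) are excluded automatically.
class MiniScraper
  def bio
    'Javascript User Group Milano #milanojs'
  end

  def location
    'Milan, Italy'
  end

  def self.attribute_names
    public_instance_methods(false)
  end

  private

  def text_at(selector)
    # helper method, invisible to attribute_names
  end
end

scraper = MiniScraper.new
attributes = MiniScraper.attribute_names.each_with_object({}) do |name, hash|
  hash[name] = scraper.public_send(name)
end
# attributes includes :bio and :location, but not the private :text_at
```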

SagroneScraper::Collection

This is the simplest way to scrape a web page:

require 'sagrone_scraper'

# 1) Define a scraper. For example, the TwitterScraper above.

# 2) Newly created scrapers are registered automatically.
SagroneScraper::Collection.registered_scrapers
# => ['TwitterScraper']

# 3) Here we use the collection to scrape data at a URL.
SagroneScraper::Collection.scrape(url: 'https://twitter.com/Milano_JS')
# => {bio: "Javascript User Group Milano #milanojs", location: "Milan, Italy"}
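Under the hood, Collection.scrape presumably picks the first registered scraper whose can_scrape? accepts the URL. A hedged, self-contained sketch of that dispatch (the class names here are made up for illustration):

```ruby
# Hypothetical dispatch: find the first registered scraper that can handle a URL.
class FakeTwitterScraper
  def self.can_scrape?(url)
    url.start_with?('https://twitter.com/')
  end
end

class FakeGithubScraper
  def self.can_scrape?(url)
    url.start_with?('https://github.com/')
  end
end

registered_scrapers = [FakeTwitterScraper, FakeGithubScraper]

def scraper_for(registered, url)
  registered.find { |klass| klass.can_scrape?(url) }
end

scraper_for(registered_scrapers, 'https://twitter.com/Milano_JS')
# => FakeTwitterScraper
```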

Contributing

  1. Fork it ( https://github.com/[my-github-username]/sagrone_scraper/fork )
  2. Create your feature branch (git checkout -b my-new-feature)
  3. Commit your changes (git commit -am 'Add some feature')
  4. Push to the branch (git push origin my-new-feature)
  5. Create a new Pull Request
