imPixelity/web-crawler

Concurrent Web Crawler in Go

A concurrent web crawler written in Go that performs domain-restricted crawling and generates link frequency analytics.

The crawler uses goroutines with a semaphore-based concurrency control mechanism and mutex synchronization to safely explore pages in parallel. It applies URL normalization to reduce duplication and builds a link frequency map during traversal to produce a ranked crawl report.
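As a rough illustration of the pattern described above, the sketch below bounds parallelism with a buffered channel used as a semaphore and guards a shared count map with a mutex. The names `crawlState`, `addVisit`, `visit`, and `crawl` are hypothetical, and an in-memory `links` map stands in for fetching and parsing pages; this is not the repository's actual code.

```go
package main

import (
	"fmt"
	"sync"
)

// crawlState is a hypothetical container for the shared crawl data.
type crawlState struct {
	mu    sync.Mutex
	pages map[string]int // URL -> times seen (the start URL counts once)
	sem   chan struct{}  // buffered channel used as a counting semaphore
	wg    sync.WaitGroup
}

// addVisit records a sighting of url and reports whether it was the first one.
func (s *crawlState) addVisit(url string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.pages[url]++
	return s.pages[url] == 1
}

// visit processes one page and starts a goroutine for each outgoing link.
// The links map is an assumption that replaces real HTTP requests.
func (s *crawlState) visit(url string, links map[string][]string) {
	defer s.wg.Done()
	if !s.addVisit(url) {
		return // already crawled; only the count was updated
	}
	s.sem <- struct{}{} // acquire a slot; blocks once maxConcurrency slots are taken
	out := links[url]   // stand-in for "fetch and parse the page"
	<-s.sem             // release the slot
	for _, next := range out {
		s.wg.Add(1)
		go s.visit(next, links)
	}
}

// crawl runs the traversal from start and returns the frequency map.
func crawl(start string, links map[string][]string, maxConcurrency int) map[string]int {
	s := &crawlState{
		pages: make(map[string]int),
		sem:   make(chan struct{}, maxConcurrency),
	}
	s.wg.Add(1)
	go s.visit(start, links)
	s.wg.Wait()
	return s.pages
}

func main() {
	links := map[string][]string{
		"/":  {"/a", "/b"},
		"/a": {"/", "/b"},
		"/b": {"/"},
	}
	pages := crawl("/", links, 2)
	fmt.Println(len(pages), pages["/"], pages["/b"]) // 3 3 2
}
```

Without a page limit every page is processed exactly once, so the counts in this toy graph are the same on every run even though the goroutine schedule is not.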

This implementation follows an opportunistic crawling model, where pages are discovered and processed concurrently without a fixed traversal order. As a result, crawl paths depend on execution timing and may vary between runs. It serves as a foundation for understanding real-world crawler architectures and highlights the differences between execution-driven crawling and state-driven, queue-based systems used in production.

Preview

10 Workers

╔══════════════════════════════════════════════════════════════════════╗
║                          CRAWL REPORT                                ║
╚══════════════════════════════════════════════════════════════════════╝
Base URL                       : https://crawler-test.com/
Unique Pages                   : 1187
Total Links                    : 4128
Crawl Duration                 : 46.321s
Pages/Second                   : 25.63
Most Linked Page               : https://crawler-test.com (399 links)
Least Linked Page              : https://crawler-test.com/robots_protocol/robots_excluded_3 (1 links)

Top 5 Most Linked Pages:
  1 . https://crawler-test.com                           - 399 links
  2 . http://crawler-test.com                            - 391 links
  3 . https://crawler-test.com/other/duplicated_body_content_1 - 243 links
  4 . https://crawler-test.com/other/duplicated_body_content_2 - 243 links
  5 . http://crawler-test.com/other/duplicated_body_content_1 - 243 links
╔══════════════════════════════════════════════════════════════════════╗
║                          CRAWL REPORT                                ║
╚══════════════════════════════════════════════════════════════════════╝
Base URL                       : https://crawler-test.com/
Unique Pages                   : 1187
Total Links                    : 4128
Crawl Duration                 : 43.997s
Pages/Second                   : 26.98
Most Linked Page               : https://crawler-test.com (400 links)
Least Linked Page              : https://crawler-test.com/titles/page_title_length/10 (1 links)

Top 5 Most Linked Pages:
  1 . https://crawler-test.com                           - 400 links
  2 . http://crawler-test.com                            - 390 links
  3 . https://crawler-test.com/other/duplicated_body_content_1 - 243 links
  4 . http://crawler-test.com/other/duplicated_body_content_1 - 243 links
  5 . https://crawler-test.com/other/duplicated_body_content_2 - 243 links

20 Workers

╔══════════════════════════════════════════════════════════════════════╗
║                          CRAWL REPORT                                ║
╚══════════════════════════════════════════════════════════════════════╝
Base URL                       : https://crawler-test.com/
Unique Pages                   : 1187
Total Links                    : 4124
Crawl Duration                 : 23.589s
Pages/Second                   : 50.32
Most Linked Page               : https://crawler-test.com (396 links)
Least Linked Page              : https://crawler-test.com/urls/page_with_hreflang/2 (1 links)

Top 5 Most Linked Pages:
  1 . https://crawler-test.com                           - 396 links
  2 . http://crawler-test.com                            - 390 links
  3 . http://crawler-test.com/other/duplicated_body_content_1 - 243 links
  4 . https://crawler-test.com/other/duplicated_body_content_2 - 243 links
  5 . https://crawler-test.com/other/duplicated_body_content_1 - 243 links
╔══════════════════════════════════════════════════════════════════════╗
║                          CRAWL REPORT                                ║
╚══════════════════════════════════════════════════════════════════════╝
Base URL                       : https://crawler-test.com/
Unique Pages                   : 1187
Total Links                    : 4040
Crawl Duration                 : 27.442s
Pages/Second                   : 43.25
Most Linked Page               : https://crawler-test.com (396 links)
Least Linked Page              : http://crawler-test.com/robots_protocol/user_excluded_1/bar/baz (1 links)

Top 5 Most Linked Pages:
  1 . https://crawler-test.com                           - 396 links
  2 . http://crawler-test.com                            - 391 links
  3 . https://crawler-test.com/other/duplicated_body_content_1 - 243 links
  4 . http://crawler-test.com/other/duplicated_body_content_2 - 243 links
  5 . https://crawler-test.com/other/duplicated_body_content_2 - 243 links
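A ranked list like "Top 5 Most Linked Pages" can be derived by sorting the link frequency map. The sketch below shows one way to do that, assuming the map has type `map[string]int`; `pageCount` and `topN` are hypothetical names, and the sample values are copied from the first report. The entries tied at 243 links appear in a different order in each report above, which would be consistent with an unspecified tie order; breaking ties by URL, as done here, is one way to make the ranking reproducible.

```go
package main

import (
	"fmt"
	"sort"
)

// pageCount pairs a URL with the number of links pointing to it.
type pageCount struct {
	URL   string
	Count int
}

// topN returns the n most linked pages, highest count first.
// Ties are broken alphabetically by URL so the result is deterministic.
func topN(freq map[string]int, n int) []pageCount {
	ranked := make([]pageCount, 0, len(freq))
	for u, c := range freq {
		ranked = append(ranked, pageCount{URL: u, Count: c})
	}
	sort.Slice(ranked, func(i, j int) bool {
		if ranked[i].Count != ranked[j].Count {
			return ranked[i].Count > ranked[j].Count
		}
		return ranked[i].URL < ranked[j].URL
	})
	if n > len(ranked) {
		n = len(ranked)
	}
	return ranked[:n]
}

func main() {
	// Counts taken from the first 10-worker report.
	freq := map[string]int{
		"https://crawler-test.com": 399,
		"http://crawler-test.com":  391,
		"https://crawler-test.com/other/duplicated_body_content_1": 243,
		"https://crawler-test.com/other/duplicated_body_content_2": 243,
	}
	for i, p := range topN(freq, 3) {
		fmt.Printf("%d. %s - %d links\n", i+1, p.URL, p.Count)
	}
}
```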

Prerequisites

Installation

git clone https://github.com/imPixelity/web-crawler
cd web-crawler
go mod tidy

Usage

go build -o crawler
./crawler <baseURL> <maxConcurrency> <maxPages>
./crawler https://example.com 5 20
Argument        Description
baseURL         The starting URL and domain to restrict crawling to
maxConcurrency  Maximum number of goroutines running concurrently
maxPages        Maximum number of pages to crawl
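A minimal sketch of how these three positional arguments could be validated before a crawl starts. The names `config` and `parseArgs` are hypothetical, and the repository's own parsing may differ; in a real program the slice would come from `os.Args[1:]`.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
	"strconv"
)

// config holds the parsed command-line arguments (hypothetical type).
type config struct {
	BaseURL        *url.URL
	MaxConcurrency int
	MaxPages       int
}

// parseArgs validates <baseURL> <maxConcurrency> <maxPages>.
func parseArgs(args []string) (config, error) {
	if len(args) != 3 {
		return config{}, errors.New("usage: crawler <baseURL> <maxConcurrency> <maxPages>")
	}
	base, err := url.Parse(args[0])
	if err != nil {
		return config{}, fmt.Errorf("invalid baseURL: %w", err)
	}
	if base.Scheme == "" || base.Host == "" {
		return config{}, errors.New("baseURL must be absolute, e.g. https://example.com")
	}
	conc, err := strconv.Atoi(args[1])
	if err != nil || conc < 1 {
		return config{}, errors.New("maxConcurrency must be a positive integer")
	}
	pages, err := strconv.Atoi(args[2])
	if err != nil || pages < 1 {
		return config{}, errors.New("maxPages must be a positive integer")
	}
	return config{BaseURL: base, MaxConcurrency: conc, MaxPages: pages}, nil
}

func main() {
	// Same values as the usage example; pass os.Args[1:] in a real program.
	cfg, err := parseArgs([]string{"https://example.com", "5", "20"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(cfg.BaseURL.Host, cfg.MaxConcurrency, cfg.MaxPages) // example.com 5 20
}
```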

Limitations

  • The crawler uses an opportunistic concurrent model, so results may vary slightly between runs (nondeterministic traversal order).
  • Crawling and link counting are performed simultaneously, meaning link statistics depend on traversal order and discovery timing.
  • When a maximum page limit is set, the set of crawled pages depends on traversal order, so different subsets of pages may be crawled across runs.
  • No persistent storage: all crawl state is kept in memory and lost after execution.
  • No retry or scheduling system for failed requests.
  • URL normalization is basic and may not handle all edge cases.
  • Does not respect robots.txt or implement crawl politeness (rate limiting per domain).
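To make the normalization point concrete, here is a minimal sketch of the kind of basic normalization a crawler like this might apply, using only `net/url` and `strings`. The function `normalizeURL` and its rules (lowercase the scheme and host, drop the fragment, trim one trailing slash) are assumptions for illustration, not the repository's implementation. It keeps the scheme, so `http://` and `https://` variants remain separate entries, as they are in the reports above.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// normalizeURL applies a few simple, lossy rules so that trivially
// different spellings of the same address map to one key.
func normalizeURL(raw string) (string, error) {
	u, err := url.Parse(strings.TrimSpace(raw))
	if err != nil {
		return "", err
	}
	if u.Host == "" {
		return "", fmt.Errorf("not an absolute URL: %q", raw)
	}
	u.Scheme = strings.ToLower(u.Scheme)
	u.Host = strings.ToLower(u.Host)
	u.Fragment = ""                          // "#section" never reaches the server
	u.Path = strings.TrimSuffix(u.Path, "/") // treat "/docs/" and "/docs" as the same page
	return u.String(), nil
}

func main() {
	for _, raw := range []string{
		"https://crawler-test.com/",
		"https://Crawler-Test.com/other/#top",
	} {
		n, err := normalizeURL(raw)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(n)
	}
	// Prints:
	// https://crawler-test.com
	// https://crawler-test.com/other
}
```

Edge cases this sketch does not cover include default ports, query parameter order, percent-encoding differences, and dot segments in paths.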

Intended Use

This project is designed for educational purposes and experimentation with:

  • Concurrency patterns in Go
  • Graph traversal techniques
  • Basic web crawling concepts

It is not designed for production or large-scale crawling workloads.

Design Notes

This crawler follows an execution-driven, recursive concurrency model rather than a queue-based architecture. This design prioritizes simplicity and transparency over determinism and completeness, while demonstrating core concurrency concepts such as goroutines, synchronization, and bounded parallelism.
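For contrast, a queue-based (state-driven) traversal might look like the sketch below. A single FIFO frontier fixes the visit order, so a page limit always selects the same subset of pages. The sketch is sequential and uses an in-memory `links` map instead of HTTP; `crawlBFS` and the sample graph are hypothetical and are not part of this project.

```go
package main

import "fmt"

// crawlBFS visits pages breadth-first from start and stops after maxPages.
// links maps each page to its outgoing links (a stand-in for fetching HTML).
func crawlBFS(start string, links map[string][]string, maxPages int) []string {
	visited := map[string]bool{start: true}
	queue := []string{start} // the frontier: pages discovered but not yet processed
	var order []string
	for len(queue) > 0 && len(order) < maxPages {
		page := queue[0]
		queue = queue[1:]
		order = append(order, page)
		for _, next := range links[page] {
			if !visited[next] {
				visited[next] = true // mark on enqueue so a page is queued only once
				queue = append(queue, next)
			}
		}
	}
	return order
}

func main() {
	links := map[string][]string{
		"/":  {"/a", "/b"},
		"/a": {"/c"},
		"/b": {"/c", "/d"},
	}
	fmt.Println(crawlBFS("/", links, 3)) // [/ /a /b]
}
```

Because all state lives in the queue and the visited set, this shape is also easier to persist, resume, or distribute across workers than a recursive goroutine tree, at the cost of more bookkeeping.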
