Concurrent Web Crawler in Go

A concurrent web crawler written in Go that performs domain-restricted crawling and generates link frequency analytics.

The crawler explores pages in parallel using goroutines, with a semaphore bounding how many run at once and a mutex guarding shared crawl state. It normalizes URLs to reduce duplicates and builds a link frequency map during traversal, which it uses to produce a ranked crawl report.
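
As a rough illustration of the pattern described above, the sketch below bounds concurrency with a buffered channel used as a semaphore and protects a link frequency map with a mutex. It is not the repository's code: the `crawler` type, `record`, `crawl`, and `crawlSite` are hypothetical names, and an in-memory map stands in for fetching and parsing real pages.

```go
package main

import (
	"fmt"
	"sync"
)

// crawler holds the shared crawl state. Every name here is hypothetical.
type crawler struct {
	mu     sync.Mutex          // guards counts
	counts map[string]int      // URL -> times seen (the link frequency map)
	sem    chan struct{}       // buffered channel acting as a counting semaphore
	wg     sync.WaitGroup      // tracks in-flight crawl goroutines
	site   map[string][]string // in-memory stand-in for fetching and parsing pages
}

// record bumps the count for url and reports whether it was seen for the first time.
func (c *crawler) record(url string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.counts[url]++
	return c.counts[url] == 1
}

// crawl "fetches" one page while holding a semaphore slot, then starts a
// goroutine for every link that has not been seen before.
func (c *crawler) crawl(url string) {
	defer c.wg.Done()

	c.sem <- struct{}{} // acquire a slot; blocks while maxConcurrency fetches are running
	links := c.site[url]
	<-c.sem // release before recursing so children never wait on their parent

	for _, link := range links {
		if c.record(link) {
			c.wg.Add(1)
			go c.crawl(link)
		}
	}
}

// crawlSite crawls the fake site from start and returns the link frequency map.
// maxConcurrency must be at least 1.
func crawlSite(site map[string][]string, start string, maxConcurrency int) map[string]int {
	c := &crawler{
		counts: make(map[string]int),
		sem:    make(chan struct{}, maxConcurrency),
		site:   site,
	}
	c.record(start) // assumption: the seed counts as one sighting in this sketch
	c.wg.Add(1)
	go c.crawl(start)
	c.wg.Wait()
	return c.counts
}

func main() {
	site := map[string][]string{
		"/":  {"/a", "/b"},
		"/a": {"/", "/b"},
		"/b": {"/"},
	}
	counts := crawlSite(site, "/", 2)
	fmt.Println(counts["/"], counts["/a"], counts["/b"])
}
```

Releasing the semaphore slot before spawning child goroutines is one way to avoid a deadlock in which parents hold every slot while waiting on their children; whether the actual crawler does this is not stated here.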

This implementation follows an opportunistic crawling model, where pages are discovered and processed concurrently without a fixed traversal order. As a result, crawl paths depend on execution timing and may vary between runs. It serves as a foundation for understanding real-world crawler architectures and highlights the differences between execution-driven crawling and state-driven, queue-based systems used in production.

Preview

10 Workers

╔══════════════════════════════════════════════════════════════════════╗
║                          CRAWL REPORT                                ║
╚══════════════════════════════════════════════════════════════════════╝
Base URL                       : https://crawler-test.com/
Unique Pages                   : 1187
Total Links                    : 4128
Crawl Duration                 : 46.321s
Pages/Second                   : 25.63
Most Linked Page               : https://crawler-test.com (399 links)
Least Linked Page              : https://crawler-test.com/robots_protocol/robots_excluded_3 (1 links)

Top 5 Most Linked Pages:
  1 . https://crawler-test.com                           - 399 links
  2 . http://crawler-test.com                            - 391 links
  3 . https://crawler-test.com/other/duplicated_body_content_1 - 243 links
  4 . https://crawler-test.com/other/duplicated_body_content_2 - 243 links
  5 . http://crawler-test.com/other/duplicated_body_content_1 - 243 links

╔══════════════════════════════════════════════════════════════════════╗
║                          CRAWL REPORT                                ║
╚══════════════════════════════════════════════════════════════════════╝
Base URL                       : https://crawler-test.com/
Unique Pages                   : 1187
Total Links                    : 4128
Crawl Duration                 : 43.997s
Pages/Second                   : 26.98
Most Linked Page               : https://crawler-test.com (400 links)
Least Linked Page              : https://crawler-test.com/titles/page_title_length/10 (1 links)

Top 5 Most Linked Pages:
  1 . https://crawler-test.com                           - 400 links
  2 . http://crawler-test.com                            - 390 links
  3 . https://crawler-test.com/other/duplicated_body_content_1 - 243 links
  4 . http://crawler-test.com/other/duplicated_body_content_1 - 243 links
  5 . https://crawler-test.com/other/duplicated_body_content_2 - 243 links

20 Workers

╔══════════════════════════════════════════════════════════════════════╗
║                          CRAWL REPORT                                ║
╚══════════════════════════════════════════════════════════════════════╝
Base URL                       : https://crawler-test.com/
Unique Pages                   : 1187
Total Links                    : 4124
Crawl Duration                 : 23.589s
Pages/Second                   : 50.32
Most Linked Page               : https://crawler-test.com (396 links)
Least Linked Page              : https://crawler-test.com/urls/page_with_hreflang/2 (1 links)

Top 5 Most Linked Pages:
  1 . https://crawler-test.com                           - 396 links
  2 . http://crawler-test.com                            - 390 links
  3 . http://crawler-test.com/other/duplicated_body_content_1 - 243 links
  4 . https://crawler-test.com/other/duplicated_body_content_2 - 243 links
  5 . https://crawler-test.com/other/duplicated_body_content_1 - 243 links

╔══════════════════════════════════════════════════════════════════════╗
║                          CRAWL REPORT                                ║
╚══════════════════════════════════════════════════════════════════════╝
Base URL                       : https://crawler-test.com/
Unique Pages                   : 1187
Total Links                    : 4040
Crawl Duration                 : 27.442s
Pages/Second                   : 43.25
Most Linked Page               : https://crawler-test.com (396 links)
Least Linked Page              : http://crawler-test.com/robots_protocol/user_excluded_1/bar/baz (1 links)

Top 5 Most Linked Pages:
  1 . https://crawler-test.com                           - 396 links
  2 . http://crawler-test.com                            - 391 links
  3 . https://crawler-test.com/other/duplicated_body_content_1 - 243 links
  4 . http://crawler-test.com/other/duplicated_body_content_2 - 243 links
  5 . https://crawler-test.com/other/duplicated_body_content_2 - 243 links
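
A ranking like "Top 5 Most Linked Pages" can be derived from a link frequency map by sorting its entries by count. The sketch below is one possible way to do it, not the repository's implementation; `pageCount` and `topN` are hypothetical names. It breaks ties by URL so that pages with equal counts always appear in the same order, which is an added assumption: the reports above list pages tied at 243 links in different orders between runs.

```go
package main

import (
	"fmt"
	"sort"
)

// pageCount pairs a URL with the number of links pointing at it (hypothetical type).
type pageCount struct {
	URL   string
	Links int
}

// topN returns the n most linked pages, highest count first.
// Ties are broken alphabetically by URL to keep the output stable.
func topN(counts map[string]int, n int) []pageCount {
	ranked := make([]pageCount, 0, len(counts))
	for u, c := range counts {
		ranked = append(ranked, pageCount{URL: u, Links: c})
	}
	sort.Slice(ranked, func(i, j int) bool {
		if ranked[i].Links != ranked[j].Links {
			return ranked[i].Links > ranked[j].Links
		}
		return ranked[i].URL < ranked[j].URL
	})
	if n < 0 {
		n = 0
	}
	if n > len(ranked) {
		n = len(ranked)
	}
	return ranked[:n]
}

func main() {
	// Made-up counts for illustration only.
	counts := map[string]int{"/a": 3, "/b": 5, "/c": 3, "/d": 1}
	for i, p := range topN(counts, 3) {
		fmt.Printf("%d. %s - %d links\n", i+1, p.URL, p.Links)
	}
}
```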

Prerequisites

Go 1.25.8

Installation

git clone https://github.com/imPixelity/web-crawler
cd web-crawler
go mod tidy

Usage

go build -o crawler
./crawler <baseURL> <maxConcurrency> <maxPages>
./crawler https://example.com 5 20

| Argument | Description |
| --- | --- |
| `baseURL` | The starting URL and domain to restrict crawling to |
| `maxConcurrency` | Maximum number of goroutines running concurrently |
| `maxPages` | Maximum number of pages to crawl |
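
The three arguments above could be validated along the following lines. This is a minimal sketch, not the contents of `main.go`: the `config` type and `parseArgs` function are hypothetical, and the rule that both numeric arguments must be positive is an assumption. When run with no arguments, the sketch falls back to the example values from the usage line so that it runs standalone.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
	"os"
	"strconv"
)

// config holds the parsed command-line arguments (hypothetical type).
type config struct {
	baseURL        *url.URL
	maxConcurrency int
	maxPages       int
}

// parseArgs validates <baseURL> <maxConcurrency> <maxPages>.
func parseArgs(args []string) (config, error) {
	if len(args) != 3 {
		return config{}, errors.New("usage: crawler <baseURL> <maxConcurrency> <maxPages>")
	}
	base, err := url.Parse(args[0])
	if err != nil {
		return config{}, fmt.Errorf("invalid baseURL: %w", err)
	}
	if base.Scheme == "" || base.Host == "" {
		return config{}, fmt.Errorf("baseURL %q must include a scheme and host", args[0])
	}
	maxConcurrency, err := strconv.Atoi(args[1])
	if err != nil || maxConcurrency < 1 {
		return config{}, fmt.Errorf("maxConcurrency must be a positive integer, got %q", args[1])
	}
	maxPages, err := strconv.Atoi(args[2])
	if err != nil || maxPages < 1 {
		return config{}, fmt.Errorf("maxPages must be a positive integer, got %q", args[2])
	}
	return config{baseURL: base, maxConcurrency: maxConcurrency, maxPages: maxPages}, nil
}

func main() {
	args := os.Args[1:]
	if len(args) == 0 {
		// Fallback for running the sketch on its own.
		args = []string{"https://example.com", "5", "20"}
	}
	cfg, err := parseArgs(args)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("crawling %s with %d workers, up to %d pages\n",
		cfg.baseURL, cfg.maxConcurrency, cfg.maxPages)
}
```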

Limitations

The crawler uses an opportunistic concurrent model, so traversal order is nondeterministic and results may vary slightly between runs.
Crawling and link counting happen simultaneously, so link statistics depend on traversal order and discovery timing.
When a maximum page limit is set, which pages are crawled depends on traversal order, so different subsets of pages may be crawled across runs.
No persistent storage: all crawl state is kept in memory and lost when the program exits.
No retry or scheduling system for failed requests.
URL normalization is basic and may not handle all edge cases.
Does not respect robots.txt or implement crawl politeness (rate limiting per domain).
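
To make the URL normalization limitation concrete, here is a sketch of what a basic normalizer might look like. The exact rules the crawler applies are not documented here, so `normalizeURL` and its rules (lowercase the scheme and host, drop the fragment, trim trailing slashes) are assumptions. Like the crawl reports in the preview, it keeps `http://` and `https://` variants as distinct pages.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// normalizeURL returns a canonical form of raw under a few simple rules.
// It is a hypothetical helper and deliberately ignores harder cases such as
// query parameter ordering, default ports, and "www." prefixes.
func normalizeURL(raw string) (string, error) {
	u, err := url.Parse(strings.TrimSpace(raw))
	if err != nil {
		return "", err
	}
	if u.Host == "" {
		return "", fmt.Errorf("missing host in %q", raw)
	}
	u.Scheme = strings.ToLower(u.Scheme)
	u.Host = strings.ToLower(u.Host)
	u.Fragment = ""                         // "#section" never changes the fetched page
	u.Path = strings.TrimRight(u.Path, "/") // treat "/page/" and "/page" as the same
	return u.String(), nil
}

func main() {
	for _, raw := range []string{
		"https://crawler-test.com/",
		"HTTPS://Crawler-Test.com/Other/Page/#top",
		"http://crawler-test.com",
	} {
		norm, err := normalizeURL(raw)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s -> %s\n", raw, norm)
	}
}
```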

Intended Use

This project is designed for educational purposes and experimentation with:

Concurrency patterns in Go
Graph traversal techniques
Basic web crawling concepts

It is not designed for production or large-scale crawling workloads.

Design Notes

This crawler follows an execution-driven, recursive concurrency model rather than a queue-based architecture. This design prioritizes simplicity and transparency over determinism and completeness, while demonstrating core concurrency concepts such as goroutines, synchronization, and bounded parallelism.
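
For contrast, a queue-based traversal visits pages in a fixed first-in, first-out order, so the same input always yields the same crawl path, even under a page limit. The sketch below shows that alternative in its simplest single-threaded form. It is not part of this project: `crawlBFS` is a hypothetical name, and an in-memory map stands in for real page fetches.

```go
package main

import "fmt"

// crawlBFS visits pages breadth-first from start and returns them in visit
// order, stopping once maxPages pages have been visited.
func crawlBFS(site map[string][]string, start string, maxPages int) []string {
	visited := map[string]bool{start: true}
	queue := []string{start}
	var order []string

	for len(queue) > 0 && len(order) < maxPages {
		page := queue[0]
		queue = queue[1:]
		order = append(order, page)

		// Enqueue each unseen link exactly once, in the order it appears on the page.
		for _, link := range site[page] {
			if !visited[link] {
				visited[link] = true
				queue = append(queue, link)
			}
		}
	}
	return order
}

func main() {
	site := map[string][]string{
		"/":  {"/a", "/b"},
		"/a": {"/c"},
		"/b": {"/c"},
		"/c": {},
	}
	fmt.Println(crawlBFS(site, "/", 3))
}
```

Because the queue fixes the order in which pages are taken up, a page limit of 3 always selects the same three pages here, whereas the recursive concurrent model may select a different subset on each run.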
