A concurrent web crawler written in Go that performs domain-restricted crawling and generates link frequency analytics.
The crawler uses goroutines with a semaphore-based concurrency control mechanism and mutex synchronization to safely explore pages in parallel. It applies URL normalization to reduce duplication and builds a link frequency map during traversal to produce a ranked crawl report.
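The combination of a semaphore, a mutex, and a link frequency map can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the `crawler` type, its fields, and the `run` and `markSeen` helpers are hypothetical names, and an in-memory `graph` map stands in for fetching a page and extracting its links.

```go
package main

import (
	"fmt"
	"sync"
)

// crawler holds the shared crawl state. All names here are hypothetical
// and are not taken from the project's source.
type crawler struct {
	graph  map[string][]string // stand-in for "fetch page, extract links"
	mu     sync.Mutex          // guards counts
	counts map[string]int      // how often each URL was seen
	sem    chan struct{}       // buffered channel used as a counting semaphore
	wg     sync.WaitGroup
}

// markSeen increments the counter for url and reports whether this was
// the first time the URL was seen.
func (c *crawler) markSeen(url string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.counts[url]++
	return c.counts[url] == 1
}

// crawl processes one page and starts a goroutine for each outgoing link.
func (c *crawler) crawl(url string) {
	defer c.wg.Done()
	if !c.markSeen(url) {
		return // already visited: only the counter was updated
	}
	c.sem <- struct{}{}   // acquire a slot; at most cap(sem) fetches run at once
	links := c.graph[url] // a real crawler would perform the HTTP request here
	<-c.sem               // release the slot before spawning children
	for _, link := range links {
		c.wg.Add(1)
		go c.crawl(link)
	}
}

// run crawls graph from start with at most maxConcurrency concurrent fetches
// and returns the link frequency map.
func run(graph map[string][]string, start string, maxConcurrency int) map[string]int {
	c := &crawler{
		graph:  graph,
		counts: make(map[string]int),
		sem:    make(chan struct{}, maxConcurrency),
	}
	c.wg.Add(1)
	go c.crawl(start)
	c.wg.Wait()
	return c.counts
}

func main() {
	graph := map[string][]string{
		"/":  {"/a", "/b"},
		"/a": {"/", "/b"},
		"/b": {"/"},
	}
	counts := run(graph, "/", 2)
	fmt.Println(len(counts), counts["/"], counts["/a"], counts["/b"]) // prints "3 3 1 2"
}
```

In this sketch the semaphore bounds only the simulated fetch, so a goroutine never holds a slot while waiting on its children. Whether the project releases its slot at the same point is an assumption.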
This implementation follows an opportunistic crawling model, where pages are discovered and processed concurrently without a fixed traversal order. As a result, crawl paths depend on execution timing and may vary between runs. It serves as a foundation for understanding real-world crawler architectures and highlights the differences between execution-driven crawling and state-driven, queue-based systems used in production.
╔══════════════════════════════════════════════════════════════════════╗
║ CRAWL REPORT ║
╚══════════════════════════════════════════════════════════════════════╝
Base URL : https://crawler-test.com/
Unique Pages : 1187
Total Links : 4128
Crawl Duration : 46.321s
Pages/Second : 25.63
Most Linked Page : https://crawler-test.com (399 links)
Least Linked Page : https://crawler-test.com/robots_protocol/robots_excluded_3 (1 links)
Top 5 Most Linked Pages:
1 . https://crawler-test.com - 399 links
2 . http://crawler-test.com - 391 links
3 . https://crawler-test.com/other/duplicated_body_content_1 - 243 links
4 . https://crawler-test.com/other/duplicated_body_content_2 - 243 links
5 . http://crawler-test.com/other/duplicated_body_content_1 - 243 links

╔══════════════════════════════════════════════════════════════════════╗
║ CRAWL REPORT ║
╚══════════════════════════════════════════════════════════════════════╝
Base URL : https://crawler-test.com/
Unique Pages : 1187
Total Links : 4128
Crawl Duration : 43.997s
Pages/Second : 26.98
Most Linked Page : https://crawler-test.com (400 links)
Least Linked Page : https://crawler-test.com/titles/page_title_length/10 (1 links)
Top 5 Most Linked Pages:
1 . https://crawler-test.com - 400 links
2 . http://crawler-test.com - 390 links
3 . https://crawler-test.com/other/duplicated_body_content_1 - 243 links
4 . http://crawler-test.com/other/duplicated_body_content_1 - 243 links
5 . https://crawler-test.com/other/duplicated_body_content_2 - 243 links

╔══════════════════════════════════════════════════════════════════════╗
║ CRAWL REPORT ║
╚══════════════════════════════════════════════════════════════════════╝
Base URL : https://crawler-test.com/
Unique Pages : 1187
Total Links : 4124
Crawl Duration : 23.589s
Pages/Second : 50.32
Most Linked Page : https://crawler-test.com (396 links)
Least Linked Page : https://crawler-test.com/urls/page_with_hreflang/2 (1 links)
Top 5 Most Linked Pages:
1 . https://crawler-test.com - 396 links
2 . http://crawler-test.com - 390 links
3 . http://crawler-test.com/other/duplicated_body_content_1 - 243 links
4 . https://crawler-test.com/other/duplicated_body_content_2 - 243 links
5 . https://crawler-test.com/other/duplicated_body_content_1 - 243 links

╔══════════════════════════════════════════════════════════════════════╗
║ CRAWL REPORT ║
╚══════════════════════════════════════════════════════════════════════╝
Base URL : https://crawler-test.com/
Unique Pages : 1187
Total Links : 4040
Crawl Duration : 27.442s
Pages/Second : 43.25
Most Linked Page : https://crawler-test.com (396 links)
Least Linked Page : http://crawler-test.com/robots_protocol/user_excluded_1/bar/baz (1 links)
Top 5 Most Linked Pages:
1 . https://crawler-test.com - 396 links
2 . http://crawler-test.com - 391 links
3 . https://crawler-test.com/other/duplicated_body_content_1 - 243 links
4 . http://crawler-test.com/other/duplicated_body_content_2 - 243 links
5 . https://crawler-test.com/other/duplicated_body_content_2 - 243 links

git clone https://github.com/imPixelity/web-crawler
cd web-crawler
go mod tidy

go build -o crawler
./crawler <baseURL> <maxConcurrency> <maxPages>
./crawler https://example.com 5 20

| Argument | Description |
|---|---|
| `baseURL` | The starting URL and domain to restrict crawling to |
| `maxConcurrency` | Maximum number of goroutines running concurrently |
| `maxPages` | Maximum number of pages to crawl |
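
One way the three positional arguments could be validated is sketched below. The `config` type and `parseArgs` function are hypothetical and only illustrate the checks implied by the table; the project's own argument handling may differ.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
	"strconv"
)

// config mirrors the three command-line arguments (hypothetical type).
type config struct {
	baseURL        *url.URL
	maxConcurrency int
	maxPages       int
}

// parseArgs validates the positional arguments (without the program name).
func parseArgs(args []string) (config, error) {
	if len(args) != 3 {
		return config{}, errors.New("usage: crawler <baseURL> <maxConcurrency> <maxPages>")
	}
	u, err := url.Parse(args[0])
	if err != nil {
		return config{}, fmt.Errorf("invalid baseURL: %w", err)
	}
	if u.Scheme == "" || u.Host == "" {
		return config{}, errors.New("baseURL must include a scheme and a host")
	}
	conc, err := strconv.Atoi(args[1])
	if err != nil || conc < 1 {
		return config{}, errors.New("maxConcurrency must be a positive integer")
	}
	pages, err := strconv.Atoi(args[2])
	if err != nil || pages < 1 {
		return config{}, errors.New("maxPages must be a positive integer")
	}
	return config{baseURL: u, maxConcurrency: conc, maxPages: pages}, nil
}

func main() {
	// A real program would pass os.Args[1:]; fixed values keep the sketch runnable.
	cfg, err := parseArgs([]string{"https://example.com", "5", "20"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(cfg.baseURL.Host, cfg.maxConcurrency, cfg.maxPages) // prints "example.com 5 20"
}
```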
- The crawler uses an opportunistic concurrent model, so results may vary slightly between runs (nondeterministic traversal order).
- Crawling and link counting are performed simultaneously, meaning link statistics depend on traversal order and discovery timing.
- When a maximum page limit is used, traversal becomes order-dependent, meaning different subsets of pages may be crawled across runs.
- No persistent storage: all crawl state is kept in memory and lost after execution.
- No retry or scheduling system for failed requests.
- URL normalization is basic and may not handle all edge cases.
- Does not respect `robots.txt` or implement crawl politeness (rate limiting per domain).
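
The sample reports list `http://crawler-test.com` and `https://crawler-test.com` as separate entries, which is one visible effect of basic normalization. A stricter normalizer might look like the sketch below. The `normalizeURL` function is hypothetical, and treating the `http` and `https` variants as the same page is an assumption that does not hold for every site. Default ports and percent-encoding differences are not handled here.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// normalizeURL returns a canonical key for raw: the scheme and fragment are
// dropped, the host is lowercased, and one trailing slash is removed.
// The query string is kept because it can select different content.
func normalizeURL(raw string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", err
	}
	if u.Host == "" {
		return "", fmt.Errorf("missing host in %q", raw)
	}
	key := strings.ToLower(u.Host) + strings.TrimSuffix(u.Path, "/")
	if u.RawQuery != "" {
		key += "?" + u.RawQuery
	}
	return key, nil
}

func main() {
	for _, raw := range []string{
		"https://crawler-test.com/",
		"http://Crawler-Test.com",
		"https://crawler-test.com/a/b/#section",
	} {
		key, err := normalizeURL(raw)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(key)
	}
}
```

With a key like this, the first two inputs collapse into a single `crawler-test.com` entry in the frequency map.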
This project is designed for educational purposes and experimentation with:
- Concurrency patterns in Go
- Graph traversal techniques
- Basic web crawling concepts
It is not designed for production or large-scale crawling workloads.
This crawler follows an execution-driven, recursive concurrency model rather than a queue-based architecture. This design prioritizes simplicity and transparency over determinism and completeness, while demonstrating core concurrency concepts such as goroutines, synchronization, and bounded parallelism.
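
For contrast, a queue-based traversal can be sketched as below. This is not part of the project: `crawlBFS` is a hypothetical, single-threaded function that walks an in-memory `graph` map instead of fetching pages. Because the frontier is an explicit FIFO queue, the visit order and the subset of pages selected under a page limit are the same on every run.

```go
package main

import "fmt"

// crawlBFS visits pages breadth-first from start and stops after maxPages.
// The explicit queue (frontier) fixes the visit order, so repeated runs
// over the same graph return the same slice.
func crawlBFS(graph map[string][]string, start string, maxPages int) []string {
	visited := map[string]bool{start: true}
	queue := []string{start}
	var order []string
	for len(queue) > 0 && len(order) < maxPages {
		page := queue[0]
		queue = queue[1:]
		order = append(order, page)
		for _, link := range graph[page] {
			if !visited[link] {
				visited[link] = true
				queue = append(queue, link)
			}
		}
	}
	return order
}

func main() {
	graph := map[string][]string{
		"/":  {"/a", "/b"},
		"/a": {"/c"},
		"/b": {"/c", "/d"},
	}
	fmt.Println(crawlBFS(graph, "/", 3)) // prints "[/ /a /b]"
}
```

A concurrent variant would typically have a fixed pool of workers pull from the same frontier, trading some of this ordering guarantee for throughput.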