Live Demo → search.collie.codes · Metrics → metrics.collie.codes
Five Go microservices crawling the web, ranking pages with a custom PageRank implementation, and serving sub-100ms full-text search. Deployed live on Kubernetes with auto-scaling, TLS, and a full Prometheus/Grafana observability stack.
| Challenge | Approach |
|---|---|
| Relevance ranking | Custom PageRank (damping 0.85, 100K pages/sweep) fused with PostgreSQL full-text: score = (text × 0.3) + (pagerank × 0.7) |
| SSRF protection | Crawler validates every URL against private CIDRs, cloud metadata endpoints (AWS/Azure/GCP), and re-resolves DNS to defeat rebinding attacks |
| Backpressure | Spider and Conductor communicate over gRPC bidirectional streaming; Conductor signals Spider to throttle back when queue depth rises |
| Auto-scaling | Spider HPA scales 1–10 replicas on CPU (70%) and memory (80%) thresholds, with no manual intervention |
| Zero-downtime deploys | Rolling updates + Pod Disruption Budgets across all services |
| Observability | Every service exports Prometheus metrics; Grafana dashboards alert on error rate, queue depth, and p95 latency |
Five services, one database, no message broker.
Spider ──gRPC stream──► Conductor ──► PostgreSQL ◄── Cartographer (CronJob, 6h)
                                          ▲
                         Searcher ────────┘
                            ▲
                         Frontend ◄── User
| Service | Role |
|---|---|
| Spider | Crawls the web with SSRF protection, robots.txt caching, cycle detection, and per-domain rate limiting. Streams pages to Conductor over gRPC. |
| Conductor | Receives the stream, deduplicates against PostgreSQL, and manages the crawl queue. Applies backpressure when overwhelmed. |
| Cartographer | Runs PageRank over the full page graph on a 6-hour Kubernetes CronJob. Writes versioned, timestamped results. |
| Searcher | Combines PostgreSQL GIN full-text search with PageRank scores into a single ranked result set. |
| Frontend | React search UI: search box, paginated results, latency metrics. |
Source D2 files: assets/diagrams/
- Damping factor: 0.85 (standard web setting)
- Sampling: 100 sweeps × 100K random pages; avoids loading the full graph into memory
- Versioning: `is_latest` flag on `PageRankResults` table; stale sweeps retained for historical analysis
- Convergence: threshold 0.0001, max 20 iterations per sweep
Final Score = (ts_rank(search_vector, query) × 0.3) + (pagerank_score × 0.7)
PostgreSQL tsvector + GIN index handles full-text; PageRank provides the long-term authority signal.
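As a worked instance of the formula, the sketch below computes the blended score in Go. The function name `finalScore` and the sample inputs are hypothetical, and the Searcher may well compute this inside the SQL query instead. Whether the two inputs are normalized to a common range beforehand is not stated here.

```go
package main

import "fmt"

const (
	textWeight     = 0.3 // weight on ts_rank(search_vector, query)
	pageRankWeight = 0.7 // weight on pagerank_score
)

// finalScore blends the full-text rank with the PageRank score using the
// weights from the formula above.
func finalScore(textRank, pageRank float64) float64 {
	return textRank*textWeight + pageRank*pageRankWeight
}

func main() {
	// Hypothetical inputs: a strong text match on a low-authority page
	// versus a weaker match on a high-authority page.
	fmt.Println("strong match, low authority: ", finalScore(0.9, 0.1))
	fmt.Println("weaker match, high authority:", finalScore(0.4, 0.8))
}
```

With these sample inputs the second page scores higher (roughly 0.68 versus 0.34) despite the weaker text match, which shows how strongly the 0.7 weight favors authority.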
- SSRF: Blocks `10.0.0.0/8`, `172.16.0.0/12`, `192.168.0.0/16`, loopback, link-local, and cloud metadata endpoints (AWS `169.254.169.254`, Azure `169.254.169.253`, GCP `metadata.google.internal`)
- DNS rebinding defence: Resolves hostnames post-validation to catch late-binding attacks
- Rate limiting: 2 req/sec per domain, burst 5
- robots.txt caching: 24h TTL, giving a ~1000× reduction in redundant fetches
- Resource caps: 10MB response limit, max 5 redirects, Content-Type validation, semaphore (max 10 concurrent fetches)
- Kubernetes on DigitalOcean: Namespace isolation, RBAC, non-root containers, read-only root filesystems
- HPA: Spider scales 1–10 replicas automatically
- Traefik ingress: TLS termination + auto cert renewal via Let's Encrypt
- Connection pooling: 25 max / 5 idle / 5m lifetime across all services
- Monitoring: Prometheus + Grafana with alerts for `ServiceDown`, `HighErrorRate >10%`, `SpiderQueueDepth >50K`, `SearcherLatency p95>1s`
| Service | Throughput / Latency | Bottleneck |
|---|---|---|
| Spider | 10 concurrent crawls per replica (×10 replicas) | Network latency |
| Conductor | ~1,000 pages/sec | PostgreSQL write throughput |
| Cartographer | 100K pages/sweep | Random sampling (memory-efficient) |
| Searcher | <100ms p95 | GIN index + query cache |
CREATE INDEX idx_pages_url ON SeenPages(url);
CREATE INDEX idx_pages_search ON SeenPages USING GIN(search_vector);
CREATE INDEX idx_pagerank_latest ON PageRankResults(is_latest);
CREATE INDEX idx_pagerank_scores ON PageRankResults(score DESC);
CREATE INDEX idx_queue_url ON Queue(url);

- Go 1.24+
- PostgreSQL 12+
- Docker (integration tests use testcontainers)
- `protoc` and `mockery` for code generation
# Generate protobuf code and mocks
make gen
# Run all tests
go test ./...
# Run with coverage
go test -cover ./...
# Lint (revive)
make lint
# Build Docker images
make buildSpider
make buildConductor

export DB_USER=postgres DB_PASSWORD=yourpassword DB_HOST=localhost DB_NAME=databaseName
go run cmd/spider/main.go
go run cmd/conductor/main.go
go run cmd/cartographer/main.go
go run cmd/searcher/main.go

make proto_gen # Compile .proto → Go (pkg/generated/)
make mock_gen # Generate testify mocks (pkg/mocks/)
make proto_list # List proto files to be processed
make clean # Remove all generated files

Mocks use the expecter pattern for type-safe assertions:
mockClient := mockspider.NewMockSearcherClient(t)
mockClient.EXPECT().
SearchPages(mock.Anything, mock.Anything).
Return(&pb.SearchResponse{Pages: []*pb.Page{{Url: "https://example.com"}}}, nil).
Once()

Integration tests spin up real PostgreSQL via testcontainers, with no mocked databases.
Go · gRPC · Protocol Buffers · PostgreSQL · Kubernetes · Prometheus · Grafana · Traefik · React · TailwindCSS · Docker · DigitalOcean · D2
MIT, see LICENSE




