Production Monitoring Setup + Stabilization #28

@henry0816191

Problem

PaperScout is already deployed as a production Slack bot. The eval found that the scheduler's broad except Exception handler in run_forever logs and continues on any poll failure, so failures at the async/sync boundary end up as log entries rather than operationally actionable alerts. Line coverage is 90%, but there is a gap between line coverage and scenario coverage: production failure modes (sustained backpressure on MessageQueue, HTTP 429 Retry-After interleaved with DB writes, concurrent refresh + probe-list builds) are neither monitored nor alerted on. A production daemon needs health checks, structured alerting on failure categories, and visibility into probe success rates.
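One way to turn swallowed poll failures into an actionable signal is a small watchdog that tracks the last successful poll and posts to an ops channel once the poll is overdue. The sketch below is illustrative only: `PollWatchdog`, its fields, and the channel ID are hypothetical names, not existing PaperScout code. The `client` argument is assumed to expose `chat_postMessage(channel=..., text=...)`, as the slack_sdk `WebClient` behind a slack-bolt app does.

```python
from dataclasses import dataclass


@dataclass
class PollWatchdog:
    """Hypothetical helper: alerts once per outage when polls go stale."""

    poll_interval_s: float   # configured poll interval, in seconds
    ops_channel: str         # ops channel ID (assumed; never a user DM)
    last_success: float      # monotonic seconds; initialise to start-up time
    alerted: bool = False    # prevents repeat alerts during one outage

    def record_success(self, now: float) -> None:
        # Call after every poll that completes without raising.
        self.last_success = now
        self.alerted = False

    def is_stale(self, now: float) -> bool:
        # Stale means strictly more than 2x the poll interval has elapsed.
        return now - self.last_success > 2 * self.poll_interval_s

    def check(self, client, now: float) -> bool:
        """Post a single alert if stale; return True if one was sent."""
        if self.alerted or not self.is_stale(now):
            return False
        age = now - self.last_success
        client.chat_postMessage(
            channel=self.ops_channel,
            text=(
                f"PaperScout: no successful poll for {age:.0f}s "
                f"(poll interval {self.poll_interval_s:.0f}s)"
            ),
        )
        self.alerted = True
        return True
```

In a scheduler loop, `record_success(time.monotonic())` would run after each good poll and `check(app.client, time.monotonic())` on every iteration, including inside the exception handler, so an outage is reported even while the loop keeps running.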

Acceptance Criteria

  • Health check endpoint returns structured JSON with: last successful poll timestamp, probe hit rate (last cycle), message queue depth, PostgreSQL connection pool status
  • Structured log entries for RATE_LIMIT, NETWORK, TIMEOUT failure categories are emitted on every poll failure (not just generic exception messages)
  • A Slack alert (to an ops channel, not user DMs) fires if no successful poll has occurred in 2x the configured poll interval
  • Message queue backpressure is logged when queue depth exceeds a configurable threshold
  • At least 2 tests verify health check output format and failure-category log emission
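As a rough illustration of the health check and backpressure criteria above, the payload could be assembled by a pure function that is easy to test. Everything here is an assumption: the field names, the `build_health_payload` and `log_backpressure` helpers, and the way pool usage is passed in are hypothetical, and the real values would come from the scheduler, MessageQueue, and the PostgreSQL pool.

```python
import json
import logging
from datetime import datetime, timezone
from typing import Optional


def build_health_payload(
    last_ok: Optional[datetime],   # timezone-aware time of last good poll
    probes_attempted: int,         # probes tried in the last cycle
    probes_hit: int,               # probes that succeeded in the last cycle
    queue_depth: int,
    pool_size: int,
    pool_in_use: int,
) -> dict:
    """Return a JSON-serialisable health snapshot (field names assumed)."""
    hit_rate = probes_hit / probes_attempted if probes_attempted else None
    return {
        "last_successful_poll": (
            last_ok.astimezone(timezone.utc).isoformat() if last_ok else None
        ),
        "probe_hit_rate": hit_rate,
        "queue_depth": queue_depth,
        "db_pool": {
            "size": pool_size,
            "in_use": pool_in_use,
            "available": pool_size - pool_in_use,
        },
    }


def log_backpressure(logger: logging.Logger, depth: int, threshold: int) -> bool:
    """Log a structured warning when depth exceeds the threshold."""
    if depth <= threshold:
        return False
    logger.warning(
        json.dumps(
            {"event": "queue_backpressure", "depth": depth, "threshold": threshold}
        )
    )
    return True
```

Keeping the builder free of I/O means the format test in the last criterion can assert on the returned dict directly, without starting the HTTP endpoint.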

Implementation Notes

The eval notes the project already has a health endpoint in monitor.py. Extend it with the structured fields above. For alerting, leverage the existing slack-bolt connection to post to an ops channel. The CollectorFailureCategory-style taxonomy from boost-data-collector is a good model — adapt it for PaperScout's failure modes (probe timeout, index fetch failure, DB connection loss, Slack API rate limit).
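A minimal sketch of such a taxonomy, assuming the three categories named in the acceptance criteria plus a catch-all. This is not the boost-data-collector implementation: `FailureCategory`, `RateLimited`, `classify`, and `log_poll_failure` are hypothetical names, and the mapping from exception types to categories is a guess that would need adjusting to the exceptions PaperScout's HTTP and DB layers actually raise.

```python
import asyncio
import enum
import json
import logging
import socket


class FailureCategory(str, enum.Enum):
    RATE_LIMIT = "RATE_LIMIT"
    NETWORK = "NETWORK"
    TIMEOUT = "TIMEOUT"
    UNKNOWN = "UNKNOWN"   # anything not recognised; still logged, never dropped


class RateLimited(Exception):
    """Hypothetical error raised by the HTTP layer on a 429 response."""

    def __init__(self, retry_after_s: float):
        super().__init__(f"rate limited, retry after {retry_after_s}s")
        self.retry_after_s = retry_after_s


def classify(exc: BaseException) -> FailureCategory:
    """Map an exception to a failure category (mapping is an assumption)."""
    if isinstance(exc, RateLimited):
        return FailureCategory.RATE_LIMIT
    # Check timeouts before connection errors: TimeoutError is an OSError.
    if isinstance(exc, (TimeoutError, asyncio.TimeoutError)):
        return FailureCategory.TIMEOUT
    if isinstance(exc, (ConnectionError, socket.gaierror)):
        return FailureCategory.NETWORK
    return FailureCategory.UNKNOWN


def log_poll_failure(logger: logging.Logger, exc: BaseException) -> dict:
    """Emit one structured log entry per poll failure and return it."""
    record = {
        "event": "poll_failure",
        "category": classify(exc).value,
        "error_type": type(exc).__name__,
        "detail": str(exc),
    }
    if isinstance(exc, RateLimited):
        record["retry_after_s"] = exc.retry_after_s
    logger.error(json.dumps(record))
    return record
```

Calling `log_poll_failure(logger, exc)` from the existing handler in run_forever would keep the log-and-continue behaviour while giving each entry a category that alerting rules can match on. Categories for DB connection loss or index fetch failure could be added the same way.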

References

  • Eval finding: Section 5, "Error Contract Opacity" — None returns and untyped failures
  • Related files: monitor.py (scheduler, health endpoint), sources.py (probe error paths), scout.py (MessageQueue)

Metadata

Labels

enhancement: New feature or request
