feat: anon piece selection and retrieval #487
Conversation
force-pushed from e40d010 to 444a79b
To me, `sample`/`sampled` better conveys what is happening.
```sql
first_byte_ms Nullable(Float64), -- time to first response byte
last_byte_ms Nullable(Float64), -- time to last response byte
bytes_retrieved Nullable(UInt64), -- bytes received from /piece/{cid}
throughput_bps Nullable(UInt64), -- effective throughput, bytes per second
```
This is data that can be easily derived. Also, is it bytes/(ttlb − ttfb) or simply bytes/ttlb?
It's HTTP response size / total time of the HTTP request.
yeah, it could easily be derived, I agree
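(For reference, the derivation is trivial at read time. A minimal TypeScript sketch using the metric names from the snippet above; treating `totalTime` as milliseconds is an assumption:)

```ts
// Derive effective throughput instead of storing it: bytes received divided
// by the total wall-clock time of the HTTP request (not ttlb - ttfb).
function throughputBps(responseSizeBytes: number, totalTimeMs: number): number {
  if (totalTimeMs <= 0) return 0; // guard against divide-by-zero on instant responses
  return Math.round(responseSizeBytes / (totalTimeMs / 1000));
}

// Example: 32 MiB in 4.2 s ≈ 8 MB/s
console.log(throughputBps(32 * 1024 * 1024, 4200));
```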
One example data point in my local ClickHouse DB:
A discussion that just came up with @iand: this anon retrieval here persists data only in ClickHouse. AFAIU the basic retrieval persists data in Postgres, exposes aggregated metrics via Prometheus, and, if ClickHouse is enabled, stores the same data there as well.

My initial understanding was that metrics data would exclusively live in ClickHouse while Postgres would handle job queues/orchestration/keeping deal state/etc. To be consistent with the basic retrieval flow, I'll change this PR to store data primarily in Postgres and, on the write path, also insert into ClickHouse if it is enabled.
new ClickHouse row: …
new Postgres row: …
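(A minimal sketch of the dual-write path described above — repository/method names are assumptions; the no-op-when-disabled behavior of the ClickHouse insert is mentioned later in this thread:)

```ts
// Hypothetical persistence path: Postgres is the primary record, ClickHouse
// gets the same row for analytics. The ClickHouse insert is assumed to be a
// no-op when ClickHouse is disabled, so no explicit `enabled` check is needed.
async function persistAnonRetrieval(
  repo: { save(row: object): Promise<object> }, // e.g. a TypeORM repository
  clickhouse: { insert(table: string, rows: object[]): Promise<void> },
  row: Record<string, unknown>,
): Promise<void> {
  await repo.save(row); // new Postgres row
  await clickhouse.insert("anon_retrieval_checks", [row]); // new ClickHouse row
}
```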
idk if we need to store retrieval++ data in postgres; we should probably just store in prometheus + clickhouse
I'm fine either way. I can easily revert the latest commit. Just note that currently for the basic retrieval we store the same data in both db systems, and one could consider it inconsistent if we did it differently here. @iand please chime in here.
Pull request overview
Adds a new “anonymous retrieval” check flow to the backend, enabling scheduled sampling of non-dealbot pieces via the subgraph, retrieval via /piece/{cid}, optional CAR/IPNI validation, and persistence of results/metrics.
Changes:
- Introduces `retrieval-anon` module/services (piece selection, retrieval + CommP, CAR/IPNI validation) and wires it into pg-boss scheduling.
- Replaces the PDP-subgraph client with a unified `SubgraphService` and renames the env config to `SUBGRAPH_ENDPOINT`.
- Adds new Prometheus metrics and a ClickHouse table (`anon_retrieval_checks`) plus a Postgres schema/entity for anon retrieval results; adjusts HTTP/2 timeout/partial-download behavior.
Reviewed changes
Copilot reviewed 44 out of 47 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| pnpm-lock.yaml | Dependency lock updates (oclif/minimatch patch bumps). |
| kustomize/overlays/local/backend-configmap-local.yaml | Renames local env var to SUBGRAPH_ENDPOINT. |
| docs/environment-variables.md | Documents SUBGRAPH_ENDPOINT + anon retrieval job timeout. |
| docs/checks/production-configuration-and-approval-methodology.md | Updates production config reference to SUBGRAPH_ENDPOINT. |
| docs/checks/data-retention.md | Updates data-retention docs to reference SubgraphService/SUBGRAPH_ENDPOINT. |
| apps/backend/src/wallet-sdk/wallet-sdk.service.spec.ts | Updates config shape to subgraphEndpoint. |
| apps/backend/src/subgraph/types.ts | Adds anon piece sampling types + CID decoding + response validator. |
| apps/backend/src/subgraph/types.spec.ts | Adds unit tests for subgraph response validators. |
| apps/backend/src/subgraph/subgraph.service.ts | Renames/extends subgraph client; adds sampleAnonPiece() and generic query helper. |
| apps/backend/src/subgraph/subgraph.service.spec.ts | Updates tests and adds coverage for sampleAnonPiece(). |
| apps/backend/src/subgraph/subgraph.module.ts | New Nest module exporting SubgraphService. |
| apps/backend/src/subgraph/queries.ts | New subgraph query definitions incl. anon sampling query builder. |
| apps/backend/src/retrieval-anon/types.ts | New anon retrieval domain result types. |
| apps/backend/src/retrieval-anon/retrieval-anon.module.ts | New Nest module wiring anon retrieval dependencies. |
| apps/backend/src/retrieval-anon/piece-retrieval.service.ts | Implements /piece/{cid} download + CommP validation. |
| apps/backend/src/retrieval-anon/car-validation.service.ts | Implements CAR parsing, IPNI verification, sampled block fetch+hash verification. |
| apps/backend/src/retrieval-anon/anon-retrieval.service.ts | Orchestrates anon selection → retrieval → validation → persistence + metrics. |
| apps/backend/src/retrieval-anon/anon-retrieval.service.spec.ts | Adds unit tests for persistence/metrics behavior (including abort/partial). |
| apps/backend/src/retrieval-anon/anon-piece-selector.service.ts | Implements bucketed/pool sampling + dedup + fallback strategy. |
| apps/backend/src/retrieval-anon/anon-piece-selector.service.spec.ts | Adds unit tests for sampling/fallback/dedup/termination behavior. |
| apps/backend/src/pdp-subgraph/queries.ts | Removes legacy PDP-subgraph queries (migrated to subgraph/). |
| apps/backend/src/pdp-subgraph/pdp-subgraph.module.ts | Removes legacy PDP subgraph module (replaced by SubgraphModule). |
| apps/backend/src/metrics-prometheus/metrics-prometheus.module.ts | Registers new anon retrieval Prometheus metrics + provider. |
| apps/backend/src/metrics-prometheus/check-metrics.service.ts | Adds AnonRetrievalCheckMetrics helper class. |
| apps/backend/src/metrics-prometheus/check-metric-labels.ts | Adds anon_retrieval to the CheckType union. |
| apps/backend/src/jobs/jobs.service.ts | Adds retrieval_anon job type, queue, scheduler, and timeout handling. |
| apps/backend/src/jobs/jobs.service.spec.ts | Updates tests for new dependency ordering and new schedule rows. |
| apps/backend/src/jobs/jobs.module.ts | Imports RetrievalAnonModule for job execution. |
| apps/backend/src/jobs/job-queues.ts | Adds RETRIEVAL_ANON_QUEUE. |
| apps/backend/src/ipni/ipni-verification.service.ts | Changes IPNI verification to per-CID checks with per-CID failure tracking/counts. |
| apps/backend/src/http-client/types.ts | Extends request result with aborted/abortReason for partial HTTP/2 downloads. |
| apps/backend/src/http-client/http-client.service.ts | Reworks HTTP/2 timeout handling and returns partial bytes+metrics on abort mid-download. |
| apps/backend/src/http-client/http-client.service.spec.ts | Adds tests for headersTimeout mapping, signal behavior, partial-download returns, and rethrowing non-abort errors. |
| apps/backend/src/database/types.ts | Adds PieceFetchStatus and IpniCheckStatus enums for anon retrievals. |
| apps/backend/src/database/migrations/1776300000000-CreateAnonRetrievals.ts | Adds Postgres schema (table + enums + indexes) for anon retrievals. |
| apps/backend/src/database/entities/job-schedule-state.entity.ts | Adds retrieval_anon to scheduled job type union. |
| apps/backend/src/database/entities/anon-retrieval.entity.ts | Adds TypeORM entity mapping for anon_retrievals. |
| apps/backend/src/database/database.module.ts | Registers AnonRetrieval entity and exports it for injection. |
| apps/backend/src/data-retention/data-retention.service.ts | Switches data retention polling to SubgraphService and subgraphEndpoint. |
| apps/backend/src/data-retention/data-retention.service.spec.ts | Updates tests to mock SubgraphService and renamed config field. |
| apps/backend/src/data-retention/data-retention.module.ts | Switches module import from legacy PDP subgraph module to SubgraphModule. |
| apps/backend/src/config/app.config.ts | Renames env var to SUBGRAPH_ENDPOINT; adds anon retrieval rates/timeouts/block-sample config; derives HTTP timeouts from max job timeout. |
| apps/backend/src/clickhouse/clickhouse.schema.ts | Adds ClickHouse anon_retrieval_checks table schema. |
| apps/backend/src/app.module.ts | Imports RetrievalAnonModule. |
| apps/backend/README.md | Updates env var docs to SUBGRAPH_ENDPOINT. |
| apps/backend/.env.example | Renames env var, adds anon retrieval env vars, updates timeout guidance. |
| .gitignore | Ignores .tool-versions. |
Files not reviewed (1)
- pnpm-lock.yaml: Language not supported
```ts
DEALS_PER_SP_PER_HOUR: Joi.number().min(0.001).max(20).default(4),
DATASET_CREATIONS_PER_SP_PER_HOUR: Joi.number().min(0.001).max(20).default(1),
RETRIEVALS_PER_SP_PER_HOUR: Joi.number().min(0.001).max(20).default(2),
RETRIEVALS_ANON_PER_SP_PER_HOUR: Joi.number().min(0.001).max(20).optional(),
```
`RETRIEVALS_ANON_PER_SP_PER_HOUR` is declared as an optional number, but an empty-string value (as in the `.env.example` placeholder) will fail Joi number coercion and can prevent the app from booting. Consider adding `.empty("")` (or `.allow("")` with normalization) and/or providing a default in the schema to match `loadConfig()`'s fallback behavior.
Suggested change:

```diff
-RETRIEVALS_ANON_PER_SP_PER_HOUR: Joi.number().min(0.001).max(20).optional(),
+RETRIEVALS_ANON_PER_SP_PER_HOUR: Joi.number().min(0.001).max(20).empty("").optional(),
```
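(To make the failure mode concrete, a standalone sketch with plain Joi, outside the Nest config pipeline:)

```ts
import Joi from "joi";

const schema = Joi.object({
  RETRIEVALS_ANON_PER_SP_PER_HOUR: Joi.number().min(0.001).max(20).empty("").optional(),
});

// The .env.example placeholder yields the raw value "" at boot.
// Without .empty(""), Joi.number() fails to coerce "" and validation errors out;
// with it, "" is treated as "not provided" and validation passes.
const { error } = schema.validate({ RETRIEVALS_ANON_PER_SP_PER_HOUR: "" });
console.log(error); // undefined
```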
| [Jobs (pg-boss)](#jobs-pg-boss) | `DEALBOT_PGBOSS_SCHEDULER_ENABLED`, `DEALBOT_PGBOSS_POOL_MAX`, `DEALS_PER_SP_PER_HOUR`, `DATASET_CREATIONS_PER_SP_PER_HOUR`, `RETRIEVALS_PER_SP_PER_HOUR`, `JOB_SCHEDULER_POLL_SECONDS`, `JOB_WORKER_POLL_SECONDS`, `PG_BOSS_LOCAL_CONCURRENCY`, `JOB_CATCHUP_MAX_ENQUEUE`, `JOB_SCHEDULE_PHASE_SECONDS`, `JOB_ENQUEUE_JITTER_SECONDS`, `DEAL_JOB_TIMEOUT_SECONDS`, `RETRIEVAL_JOB_TIMEOUT_SECONDS`, `ANON_RETRIEVAL_JOB_TIMEOUT_SECONDS`, `IPFS_BLOCK_FETCH_CONCURRENCY` |
| [Dataset](#dataset-configuration) | `DEALBOT_LOCAL_DATASETS_PATH`, `RANDOM_PIECE_SIZES` |
The jobs section lists `ANON_RETRIEVAL_JOB_TIMEOUT_SECONDS`, but the PR also introduces `RETRIEVALS_ANON_PER_SP_PER_HOUR` and `ANON_RETRIEVAL_BLOCK_SAMPLE_COUNT`, which aren't documented here. This makes it hard to configure/understand the new anon retrieval behavior from the env-var reference.
```ts
const pieceBytes = Buffer.isBuffer(result.data) ? result.data : Buffer.from(result.data);
const commPValid = await this.validateCommP(pieceBytes, pieceCid);

this.logger.debug({
  event: "piece_fetch_success",
  message: "Piece fetched successfully",
  pieceCid,
  spAddress,
  bytesReceived: metrics.responseSize,
  latencyMs: metrics.totalTime,
  ttfbMs: metrics.ttfb,
});

return {
  success: true,
  pieceCid,
  bytesReceived: metrics.responseSize,
  pieceBytes,
  latencyMs: metrics.totalTime,
  ttfbMs: metrics.ttfb,
  throughputBps,
  statusCode: metrics.statusCode,
  commPValid,
};
} catch (error) {
```
`fetchPiece()` returns `success: true` for any 2xx response even when `commPValid` is false. That makes `pieceFetchStatus`/overall check status report success for corrupted/incorrect bytes, and can incorrectly run CAR validation on invalid data. Consider treating a CommP mismatch as a retrieval failure (set `success: false`, populate `errorMessage`, and/or split the status into `httpOk` vs `commpValid`).
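(A sketch of the suggested adjustment, reusing names from the quoted snippet; this slots into `fetchPiece()` and the exact result shape is an assumption:)

```ts
// Treat a CommP mismatch as a failed retrieval even though HTTP returned 2xx,
// so downstream CAR validation never runs on corrupted bytes.
const commPValid = await this.validateCommP(pieceBytes, pieceCid);
if (!commPValid) {
  return {
    success: false,
    pieceCid,
    bytesReceived: metrics.responseSize,
    latencyMs: metrics.totalTime,
    ttfbMs: metrics.ttfb,
    statusCode: metrics.statusCode,
    commPValid: false,
    errorMessage: "CommP mismatch: downloaded bytes do not match the piece CID",
  };
}
```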
```sql
service_type LowCardinality(String), -- 'direct_sp' (only mode for anon retrievals today)

piece_fetch_status LowCardinality(String), -- 'success' | 'failed' — outcome of GET /piece/<pieceCid> (HTTP 2xx AND CommP match). CAR/IPNI/block-fetch outcomes live in their own columns.
```
The ClickHouse column comment for `piece_fetch_status` says it represents "HTTP 2xx AND CommP match", but the current anon retrieval implementation sets `piece_fetch_status = 'success'` whenever the HTTP request succeeds, even if `commp_valid` is false. Either adjust the status semantics in code or update this comment to match reality so dashboards/analyses aren't misleading.
Suggested change:

```diff
-piece_fetch_status LowCardinality(String), -- 'success' | 'failed' — outcome of GET /piece/<pieceCid> (HTTP 2xx AND CommP match). CAR/IPNI/block-fetch outcomes live in their own columns.
+piece_fetch_status LowCardinality(String), -- 'success' | 'failed' — outcome of GET /piece/<pieceCid> at the fetch/HTTP layer. CommP verification is recorded separately in `commp_valid`; CAR/IPNI/block-fetch outcomes live in their own columns.
```
| [`SUBGRAPH_ENDPOINT`](../environment-variables.md#subgraph_endpoint) | TODO: fill this in | Uses the subgraph from [pdp-explorer](https://github.com/FilOzone/pdp-explorer). |
| [`MIN_NUM_DATASETS_FOR_CHECKS`](../environment-variables.md#dataset-configuration) | 15 | Ensure there are enough datasets with pieces being added so that statistical significance for [Data Retention Fault Rate](#data-retention-fault-rate) can be achieved quicker. Note that on mainnet each dataset incurs 5 challenges[^1] per daily proof[^2]. With this many datasets, an SP can be approved for data retention after a faultless ~7 days even if the SP doesn't have other datasets. |
This table still notes the subgraph “Uses the subgraph from pdp-explorer”, but the rest of the PR/documentation now describes using the dealbot-owned subgraph deployment (apps/subgraph, Goldsky slots). Consider updating the note to avoid pointing operators at the wrong endpoint/source.
we have been slowly attempting to move away from storing data in postgres, instead leaning on prometheus/clickhouse + logs, so not adding another dependency on the database would be ideal. If it's necessary, then I don't want to push back too hard, but it requires us to manage cleaning up the DB and other various concerns that are more easily handled with prom/clickhouse/betterstack expiry config. Is there a reason we need to store it in a postgres table instead of just piping output to our metrics/logs services?
This is my basic point. ClickHouse is an optional component, so without storing in Postgres there is no record of the retrieval.
This is a valid point too. It really boils down to what the purpose of Postgres is: whether it is to simply record working state, or to provide diagnostic data as well.
It's not necessary. Everything worked fine before the last commit 👍 Then ClickHouse becomes a hard dependency (which I think is fine, for the same reasons you enumerated, @SgtPooki), but IIUC this was the point to pause on for @iand. So what do we do? Revert to CH only?
Wow - good stuff! A few things from driving by. Let me know if it's key that I look more closely:
Yea, we should pump to clickhouse and prometheus. So I think the going consensus is: we should not be adding info to postgres unless it's required to read/infer state. We do need that for deals/data-storage checks; we don't need it for retrievals, but it's still there because we haven't removed it. We can eventually migrate to no data-storage/deal table in the future, but that would require a lot more RPC calls and chain walking to figure out what exists and where we want to upload things. This isn't a priority, but keeping new data out of postgres where possible is (a priority).
SgtPooki left a comment:
reviewed some and have a few high level questions:
- why change the `PDP_SUBGRAPH_ENDPOINT` env var? seems unnecessary and will cause issues with our deployed version, and potentially old stale code as well
- why move subgraph code to `/subgraph` instead of just leaving it in `/pdp-subgraph`? -- this is a large PR and keeping to existing norms rather than overwriting things would be ideal
- changing how we do IPNI verification can have a drastic impact on current metrics.. switching to serial individual CID checks could cause already-slow IPNI verification for some SPs to start fully failing.
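(For illustration, one middle ground between batch and fully serial checks: per-CID verification with bounded concurrency. A sketch; `verifyOne` stands in for whatever single-CID IPNI lookup the service uses:)

```ts
// Verify each CID individually (so failures can be attributed per CID) while
// keeping several lookups in flight, instead of strictly one at a time.
async function verifyCidsBounded(
  cids: string[],
  verifyOne: (cid: string) => Promise<boolean>,
  concurrency = 8,
): Promise<Map<string, boolean>> {
  const results = new Map<string, boolean>();
  const queue = [...cids];
  const worker = async () => {
    // Each worker pulls the next CID off the shared queue until it's empty.
    for (let cid = queue.shift(); cid !== undefined; cid = queue.shift()) {
      results.set(cid, await verifyOne(cid).catch(() => false));
    }
  };
  await Promise.all(Array.from({ length: concurrency }, worker));
  return results;
}
```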
force-pushed from e607b13 to 92ad643
This reverts commit 6824f75.
…undant clickhouse-enabled gate
- Replace string literals ("valid"|"invalid"|"skipped"|"error") with the `IpniCheckStatus` enum in anon-retrieval.service.ts
- Drop the `if (clickhouseService.enabled)` wrapper around the insert call; `ClickhouseService.insert` is already a no-op when disabled, matching the pattern used by other retrieval flows
- Fix the outdated `ipni_status` schema comment to include the 'error' value
Co-authored-by: Steve Loeppky <stvn@loeppky.com>
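(For reference, a minimal sketch of what that enum looks like — member names are assumptions; the string values come from the commit message:)

```ts
// Replaces ad-hoc string literals in anon-retrieval.service.ts.
export enum IpniCheckStatus {
  Valid = "valid",
  Invalid = "invalid",
  Skipped = "skipped",
  Error = "error",
}
```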
force-pushed from 9e56c41 to 8dfb3ca
Hi folks,
this PR adds the anon retrieval flow from #427. It is a follow-up of:
The main logic is in `./apps/backend/retrieval-anon`:

- `anon-retrieval.service.ts` - "called" on a schedule as a job. It starts the retrieval process (select piece, fetch piece, validate CAR, store result)
- `anon-piece-selector.service.ts` - implements the subgraph query logic
- `car-validation.service.ts` - parses the piece bytes as a CAR, checks IPNI availability, fetches `k` blocks from that CAR and validates their hashes
- `piece-retrieval.service.ts` - implements the HTTP request to download the piece and CommP validation
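(From those descriptions, the orchestration roughly looks like this — a sketch only; method names and shapes are assumptions, not the PR's actual signatures:)

```ts
// anon-retrieval.service.ts, conceptually: select → fetch → validate → persist.
async function runAnonRetrievalCheck(deps: {
  selectPiece(): Promise<{ pieceCid: string; spAddress: string }>;
  fetchPiece(pieceCid: string, spAddress: string): Promise<{ success: boolean; pieceBytes?: Buffer }>;
  validateCar(bytes: Buffer): Promise<{ ipniOk: boolean; blocksOk: boolean }>;
  persist(result: unknown): Promise<void>;
}): Promise<void> {
  const piece = await deps.selectPiece(); // subgraph sampling
  const fetched = await deps.fetchPiece(piece.pieceCid, piece.spAddress); // GET /piece/{cid} + CommP
  const validation =
    fetched.success && fetched.pieceBytes
      ? await deps.validateCar(fetched.pieceBytes) // CAR parse + IPNI + block hashes
      : undefined;
  await deps.persist({ piece, fetched, validation }); // DB rows + Prometheus metrics
}
```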
The anonymous piece selection logic works as follows. The `retrievalAnon` check probes an SP for non-dealbot pieces so we can detect SPs that behave well even if the teacher is not watching. To do this fairly, the piece selection should satisfy the following requirements:

- bias towards `withIPFSIndexing` pieces (so CAR/IPNI validation has something to check) but still exercise non-indexed pieces so an SP can't optimise only its CAR corpus

How it works in practice:
Every Root entity in the subgraph carries a `sampleKey = keccak256(setId-rootId)` populated once at insert time. Because `keccak256` is uniform over 256 bits and independent of creation order/size/dataset, `sampleKey` sorts roots into a uniform random permutation that is stable across queries.

This is necessary because you cannot just select a random element from a range query in GraphQL. If we knew the total number of pieces we could define a random `skip` value, but `skip` is capped at 5000, and I've read that it becomes very inefficient at higher values. It would also require non-trivial bookkeeping of active piece/dataset counts. The `sampleKey` is much easier.
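(For intuition, this is roughly how the per-root key is derived at insert time — shown here with ethers' `keccak256` for illustration; the actual subgraph mapping uses AssemblyScript crypto primitives, and the exact input encoding is an assumption:)

```ts
import { keccak256, toUtf8Bytes } from "ethers";

// Stable per-root sample key: uniform over 256 bits, fixed once at insert time.
function deriveSampleKey(setId: string, rootId: string): string {
  return keccak256(toUtf8Bytes(`${setId}-${rootId}`));
}
```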
Drawing a sample looks like this (a code sketch follows the list):

1. With probability 80%, filter for `withIPFSIndexing: true`; otherwise apply no filter.
2. Draw a random `$sampleKey` and query for the first root whose `sampleKey` is ≥ `$sampleKey`; this is effectively a uniform random pick, in O(log N).
3. Reject the candidate if its `pdpPaymentEndEpoch` has already passed the latest indexed block, or if its CID appears in the last 500 anonymous retrievals (so we don't sample the same block twice in fast succession). On a miss, redraw once with a fresh `$sampleKey`.
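(A minimal sketch of one draw, assuming the subgraph stores `sampleKey` as a hex string and exposes a `roots` query with a `sampleKey_gte` filter — entity, field, and filter names are illustrative, not copied from the PR:)

```ts
import { randomBytes } from "node:crypto";

// Uniformly random 256-bit key, hex-encoded like the stored sampleKey.
function randomSampleKey(): string {
  return "0x" + randomBytes(32).toString("hex");
}

// Builds the draw query; with probability 80% the caller restricts the draw
// to IPFS-indexed pieces by passing indexedOnly = true.
function buildSampleQuery(indexedOnly: boolean): string {
  const indexFilter = indexedOnly ? ", withIPFSIndexing: true" : "";
  return /* GraphQL */ `
    query SampleRoot($sampleKey: String!) {
      roots(
        first: 1
        orderBy: sampleKey
        orderDirection: asc
        where: { sampleKey_gte: $sampleKey${indexFilter} }
      ) {
        id
        cid
      }
    }
  `;
}

const query = buildSampleQuery(Math.random() < 0.8);
const variables = { sampleKey: randomSampleKey() };
```

One edge case a real implementation has to handle: a key drawn near the top of the 256-bit space can match zero rows, so the draw either wraps around to the smallest `sampleKey` or simply redraws.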
Subgraph

Note: The deployed subgraphs don't contain the latest changes from the recent PR review. They should still work for testing.
I have deployed the new subgraphs:
- mainnet: https://api.goldsky.com/api/public/project_cmo9sxe5xd4ai01x8cpageyid/subgraphs/dealbot-mainnet/0.3.0/gn
- calibration: https://api.goldsky.com/api/public/project_cmo9sxe5xd4ai01x8cpageyid/subgraphs/dealbot-calibration/0.3.0/gn

A deployment looks like this from within the `subgraph` folder (prerequisite is a call to `goldsky login`):

Comments
- `anon` or rather something like `sampled`?