feat: anon piece selection and retrieval #487
Conversation
force-pushed from e40d010 to 444a79b
To me, `sample`/`sampled` better conveys what is happening.
```sql
first_byte_ms Nullable(Float64), -- time to first response byte
last_byte_ms Nullable(Float64), -- time to last response byte
bytes_retrieved Nullable(UInt64), -- bytes received from /piece/{cid}
throughput_bps Nullable(UInt64), -- effective throughput, bytes per second
```
This is data that can be easily derived. Also, is it bytes/(ttlb − ttfb) or simply bytes/ttlb?
It's HTTP response size / total time of the HTTP request.
yeah, it could easily be derived, I agree
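(For reference, the derivation is trivial at read time. A minimal TypeScript sketch using the metric names from the snippet above; treating `totalTime` as milliseconds is an assumption:)

```ts
// Derive effective throughput instead of storing it: bytes received divided
// by the total wall-clock time of the HTTP request (not ttlb - ttfb).
function throughputBps(responseSizeBytes: number, totalTimeMs: number): number {
  if (totalTimeMs <= 0) return 0; // guard against divide-by-zero on instant responses
  return Math.round(responseSizeBytes / (totalTimeMs / 1000));
}

// Example: 32 MiB in 4.2 s ≈ 8 MB/s
console.log(throughputBps(32 * 1024 * 1024, 4200));
```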
One example data point in my local ClickHouse DB:
A discussion that just came up with @iand: this anon retrieval here persists data only in ClickHouse. AFAIU the basic retrieval persists data in Postgres, exposes aggregated metrics via Prometheus, and, if ClickHouse is enabled, stores the same data there as well.

My initial understanding was that metrics data would exclusively live in ClickHouse while Postgres would handle job queues/orchestration/keeping deal state/etc. To be consistent with the basic retrieval flow, I'll change this PR to store data primarily in Postgres and, on the write path, also insert into ClickHouse if it is enabled.
new ClickHouse row: …
new Postgres row: …
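(A minimal sketch of the dual-write path described above — repository/method names are assumptions; the no-op-when-disabled behavior of the ClickHouse insert is mentioned later in this thread:)

```ts
// Hypothetical persistence path: Postgres is the primary record, ClickHouse
// gets the same row for analytics. The ClickHouse insert is assumed to be a
// no-op when ClickHouse is disabled, so no explicit `enabled` check is needed.
async function persistAnonRetrieval(
  repo: { save(row: object): Promise<object> }, // e.g. a TypeORM repository
  clickhouse: { insert(table: string, rows: object[]): Promise<void> },
  row: Record<string, unknown>,
): Promise<void> {
  await repo.save(row); // new Postgres row
  await clickhouse.insert("anon_retrieval_checks", [row]); // new ClickHouse row
}
```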
idk if we need to store retrieval++ data in postgres; we should probably just store in prometheus + clickhouse
I'm fine either way. I can easily revert the latest commit. Just note that currently for the basic retrieval we store the same data in both db systems, and one could consider it inconsistent if we did it differently here. @iand please chime in here.
Pull request overview
Adds a new “anonymous retrieval” check flow to the backend, enabling scheduled sampling of non-dealbot pieces via the subgraph, retrieval via /piece/{cid}, optional CAR/IPNI validation, and persistence of results/metrics.
Changes:
- Introduces `retrieval-anon` module/services (piece selection, retrieval + CommP, CAR/IPNI validation) and wires it into pg-boss scheduling.
- Replaces the PDP-subgraph client with a unified `SubgraphService` and renames the env config to `SUBGRAPH_ENDPOINT`.
- Adds new Prometheus metrics and a ClickHouse table (`anon_retrieval_checks`) plus a Postgres schema/entity for anon retrieval results; adjusts HTTP/2 timeout/partial-download behavior.
Reviewed changes
Copilot reviewed 44 out of 47 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| pnpm-lock.yaml | Dependency lock updates (oclif/minimatch patch bumps). |
| kustomize/overlays/local/backend-configmap-local.yaml | Renames local env var to SUBGRAPH_ENDPOINT. |
| docs/environment-variables.md | Documents SUBGRAPH_ENDPOINT + anon retrieval job timeout. |
| docs/checks/production-configuration-and-approval-methodology.md | Updates production config reference to SUBGRAPH_ENDPOINT. |
| docs/checks/data-retention.md | Updates data-retention docs to reference SubgraphService/SUBGRAPH_ENDPOINT. |
| apps/backend/src/wallet-sdk/wallet-sdk.service.spec.ts | Updates config shape to subgraphEndpoint. |
| apps/backend/src/subgraph/types.ts | Adds anon piece sampling types + CID decoding + response validator. |
| apps/backend/src/subgraph/types.spec.ts | Adds unit tests for subgraph response validators. |
| apps/backend/src/subgraph/subgraph.service.ts | Renames/extends subgraph client; adds sampleAnonPiece() and generic query helper. |
| apps/backend/src/subgraph/subgraph.service.spec.ts | Updates tests and adds coverage for sampleAnonPiece(). |
| apps/backend/src/subgraph/subgraph.module.ts | New Nest module exporting SubgraphService. |
| apps/backend/src/subgraph/queries.ts | New subgraph query definitions incl. anon sampling query builder. |
| apps/backend/src/retrieval-anon/types.ts | New anon retrieval domain result types. |
| apps/backend/src/retrieval-anon/retrieval-anon.module.ts | New Nest module wiring anon retrieval dependencies. |
| apps/backend/src/retrieval-anon/piece-retrieval.service.ts | Implements /piece/{cid} download + CommP validation. |
| apps/backend/src/retrieval-anon/car-validation.service.ts | Implements CAR parsing, IPNI verification, sampled block fetch+hash verification. |
| apps/backend/src/retrieval-anon/anon-retrieval.service.ts | Orchestrates anon selection → retrieval → validation → persistence + metrics. |
| apps/backend/src/retrieval-anon/anon-retrieval.service.spec.ts | Adds unit tests for persistence/metrics behavior (including abort/partial). |
| apps/backend/src/retrieval-anon/anon-piece-selector.service.ts | Implements bucketed/pool sampling + dedup + fallback strategy. |
| apps/backend/src/retrieval-anon/anon-piece-selector.service.spec.ts | Adds unit tests for sampling/fallback/dedup/termination behavior. |
| apps/backend/src/pdp-subgraph/queries.ts | Removes legacy PDP-subgraph queries (migrated to subgraph/). |
| apps/backend/src/pdp-subgraph/pdp-subgraph.module.ts | Removes legacy PDP subgraph module (replaced by SubgraphModule). |
| apps/backend/src/metrics-prometheus/metrics-prometheus.module.ts | Registers new anon retrieval Prometheus metrics + provider. |
| apps/backend/src/metrics-prometheus/check-metrics.service.ts | Adds AnonRetrievalCheckMetrics helper class. |
| apps/backend/src/metrics-prometheus/check-metric-labels.ts | Adds anon_retrieval to the CheckType union. |
| apps/backend/src/jobs/jobs.service.ts | Adds retrieval_anon job type, queue, scheduler, and timeout handling. |
| apps/backend/src/jobs/jobs.service.spec.ts | Updates tests for new dependency ordering and new schedule rows. |
| apps/backend/src/jobs/jobs.module.ts | Imports RetrievalAnonModule for job execution. |
| apps/backend/src/jobs/job-queues.ts | Adds RETRIEVAL_ANON_QUEUE. |
| apps/backend/src/ipni/ipni-verification.service.ts | Changes IPNI verification to per-CID checks with per-CID failure tracking/counts. |
| apps/backend/src/http-client/types.ts | Extends request result with aborted/abortReason for partial HTTP/2 downloads. |
| apps/backend/src/http-client/http-client.service.ts | Reworks HTTP/2 timeout handling and returns partial bytes+metrics on abort mid-download. |
| apps/backend/src/http-client/http-client.service.spec.ts | Adds tests for headersTimeout mapping, signal behavior, partial-download returns, and rethrowing non-abort errors. |
| apps/backend/src/database/types.ts | Adds PieceFetchStatus and IpniCheckStatus enums for anon retrievals. |
| apps/backend/src/database/migrations/1776300000000-CreateAnonRetrievals.ts | Adds Postgres schema (table + enums + indexes) for anon retrievals. |
| apps/backend/src/database/entities/job-schedule-state.entity.ts | Adds retrieval_anon to scheduled job type union. |
| apps/backend/src/database/entities/anon-retrieval.entity.ts | Adds TypeORM entity mapping for anon_retrievals. |
| apps/backend/src/database/database.module.ts | Registers AnonRetrieval entity and exports it for injection. |
| apps/backend/src/data-retention/data-retention.service.ts | Switches data retention polling to SubgraphService and subgraphEndpoint. |
| apps/backend/src/data-retention/data-retention.service.spec.ts | Updates tests to mock SubgraphService and renamed config field. |
| apps/backend/src/data-retention/data-retention.module.ts | Switches module import from legacy PDP subgraph module to SubgraphModule. |
| apps/backend/src/config/app.config.ts | Renames env var to SUBGRAPH_ENDPOINT; adds anon retrieval rates/timeouts/block-sample config; derives HTTP timeouts from max job timeout. |
| apps/backend/src/clickhouse/clickhouse.schema.ts | Adds ClickHouse anon_retrieval_checks table schema. |
| apps/backend/src/app.module.ts | Imports RetrievalAnonModule. |
| apps/backend/README.md | Updates env var docs to SUBGRAPH_ENDPOINT. |
| apps/backend/.env.example | Renames env var, adds anon retrieval env vars, updates timeout guidance. |
| .gitignore | Ignores .tool-versions. |
Files not reviewed (1)
- pnpm-lock.yaml: Language not supported
```ts
DEALS_PER_SP_PER_HOUR: Joi.number().min(0.001).max(20).default(4),
DATASET_CREATIONS_PER_SP_PER_HOUR: Joi.number().min(0.001).max(20).default(1),
RETRIEVALS_PER_SP_PER_HOUR: Joi.number().min(0.001).max(20).default(2),
RETRIEVALS_ANON_PER_SP_PER_HOUR: Joi.number().min(0.001).max(20).optional(),
```
`RETRIEVALS_ANON_PER_SP_PER_HOUR` is declared as an optional number, but an empty-string value (as in the `.env.example` placeholder) will fail Joi number coercion and can prevent the app from booting. Consider adding `.empty("")` (or `.allow("")` with normalization) and/or providing a default in the schema to match `loadConfig()`'s fallback behavior.
Suggested change:

```diff
-RETRIEVALS_ANON_PER_SP_PER_HOUR: Joi.number().min(0.001).max(20).optional(),
+RETRIEVALS_ANON_PER_SP_PER_HOUR: Joi.number().min(0.001).max(20).empty("").optional(),
```
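(To make the failure mode concrete, a standalone sketch with plain Joi, outside the Nest config pipeline:)

```ts
import Joi from "joi";

const schema = Joi.object({
  RETRIEVALS_ANON_PER_SP_PER_HOUR: Joi.number().min(0.001).max(20).empty("").optional(),
});

// The .env.example placeholder yields the raw value "" at boot.
// Without .empty(""), Joi.number() fails to coerce "" and validation errors out;
// with it, "" is treated as "not provided" and validation passes.
const { error } = schema.validate({ RETRIEVALS_ANON_PER_SP_PER_HOUR: "" });
console.log(error); // undefined
```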
| [Jobs (pg-boss)](#jobs-pg-boss) | `DEALBOT_PGBOSS_SCHEDULER_ENABLED`, `DEALBOT_PGBOSS_POOL_MAX`, `DEALS_PER_SP_PER_HOUR`, `DATASET_CREATIONS_PER_SP_PER_HOUR`, `RETRIEVALS_PER_SP_PER_HOUR`, `JOB_SCHEDULER_POLL_SECONDS`, `JOB_WORKER_POLL_SECONDS`, `PG_BOSS_LOCAL_CONCURRENCY`, `JOB_CATCHUP_MAX_ENQUEUE`, `JOB_SCHEDULE_PHASE_SECONDS`, `JOB_ENQUEUE_JITTER_SECONDS`, `DEAL_JOB_TIMEOUT_SECONDS`, `RETRIEVAL_JOB_TIMEOUT_SECONDS`, `ANON_RETRIEVAL_JOB_TIMEOUT_SECONDS`, `IPFS_BLOCK_FETCH_CONCURRENCY` |
| [Dataset](#dataset-configuration) | `DEALBOT_LOCAL_DATASETS_PATH`, `RANDOM_PIECE_SIZES` |
The jobs section lists `ANON_RETRIEVAL_JOB_TIMEOUT_SECONDS`, but the PR also introduces `RETRIEVALS_ANON_PER_SP_PER_HOUR` and `ANON_RETRIEVAL_BLOCK_SAMPLE_COUNT`, which aren't documented here. This makes it hard to configure/understand the new anon retrieval behavior from the env-var reference.
```ts
const pieceBytes = Buffer.isBuffer(result.data) ? result.data : Buffer.from(result.data);
const commPValid = await this.validateCommP(pieceBytes, pieceCid);

this.logger.debug({
  event: "piece_fetch_success",
  message: "Piece fetched successfully",
  pieceCid,
  spAddress,
  bytesReceived: metrics.responseSize,
  latencyMs: metrics.totalTime,
  ttfbMs: metrics.ttfb,
});

return {
  success: true,
  pieceCid,
  bytesReceived: metrics.responseSize,
  pieceBytes,
  latencyMs: metrics.totalTime,
  ttfbMs: metrics.ttfb,
  throughputBps,
  statusCode: metrics.statusCode,
  commPValid,
};
} catch (error) {
```
`fetchPiece()` returns `success: true` for any 2xx response even when `commPValid` is false. That makes `pieceFetchStatus`/overall check status report success for corrupted/incorrect bytes, and can incorrectly run CAR validation on invalid data. Consider treating a CommP mismatch as a retrieval failure (set `success: false`, populate `errorMessage`, and/or split the status into `httpOk` vs `commpValid`).
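(A sketch of the suggested adjustment, reusing names from the quoted snippet; this slots into `fetchPiece()` and the exact result shape is an assumption:)

```ts
// Treat a CommP mismatch as a failed retrieval even though HTTP returned 2xx,
// so downstream CAR validation never runs on corrupted bytes.
const commPValid = await this.validateCommP(pieceBytes, pieceCid);
if (!commPValid) {
  return {
    success: false,
    pieceCid,
    bytesReceived: metrics.responseSize,
    latencyMs: metrics.totalTime,
    ttfbMs: metrics.ttfb,
    statusCode: metrics.statusCode,
    commPValid: false,
    errorMessage: "CommP mismatch: downloaded bytes do not match the piece CID",
  };
}
```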
```sql
service_type LowCardinality(String), -- 'direct_sp' (only mode for anon retrievals today)

piece_fetch_status LowCardinality(String), -- 'success' | 'failed' — outcome of GET /piece/<pieceCid> (HTTP 2xx AND CommP match). CAR/IPNI/block-fetch outcomes live in their own columns.
```
The ClickHouse column comment for `piece_fetch_status` says it represents "HTTP 2xx AND CommP match", but the current anon retrieval implementation sets `piece_fetch_status = 'success'` whenever the HTTP request succeeds, even if `commp_valid` is false. Either adjust the status semantics in code or update this comment to match reality so dashboards/analyses aren't misleading.
Suggested change:

```diff
-piece_fetch_status LowCardinality(String), -- 'success' | 'failed' — outcome of GET /piece/<pieceCid> (HTTP 2xx AND CommP match). CAR/IPNI/block-fetch outcomes live in their own columns.
+piece_fetch_status LowCardinality(String), -- 'success' | 'failed' — outcome of GET /piece/<pieceCid> at the fetch/HTTP layer. CommP verification is recorded separately in `commp_valid`; CAR/IPNI/block-fetch outcomes live in their own columns.
```
| [`SUBGRAPH_ENDPOINT`](../environment-variables.md#subgraph_endpoint) | TODO: fill this in | Uses the subgraph from [pdp-explorer](https://github.com/FilOzone/pdp-explorer). |
| [`MIN_NUM_DATASETS_FOR_CHECKS`](../environment-variables.md#dataset-configuration) | 15 | Ensure there are enough datasets with pieces being added so that statistical significance for [Data Retention Fault Rate](#data-retention-fault-rate) can be achieved quicker. Note that on mainnet each dataset incurs 5 challenges[^1] per daily proof[^2]. With this many datasets, an SP can be approved for data retention after a faultless ~7 days even if the SP doesn't have other datasets. |
This table still notes the subgraph “Uses the subgraph from pdp-explorer”, but the rest of the PR/documentation now describes using the dealbot-owned subgraph deployment (apps/subgraph, Goldsky slots). Consider updating the note to avoid pointing operators at the wrong endpoint/source.
we have been slowly attempting to move away from storing data in postgres, instead leaning on prometheus/clickhouse + logs, so not adding another dependency on the database would be ideal. If it's necessary, then I don't want to push back too hard, but it requires us to manage cleaning up the DB and other various concerns that are more easily handled with prom/clickhouse/betterstack expiry config. Is there a reason we need to store it in a postgres table instead of just piping output to our metrics/logs services?
This is my basic point. ClickHouse is an optional component, so without storing in Postgres there is no record of the retrieval.
This is a valid point too. It really boils down to what the purpose of Postgres is: whether it is to simply record working state, or to provide diagnostic data as well.
It's not necessary. Everything worked fine before the last commit 👍 Then ClickHouse becomes a hard dependency (which I think is fine, for the same reasons you enumerated, @SgtPooki), but IIUC this was the point to pause on for @iand. So what do we do? Revert to CH only?
Wow - good stuff! A few things from driving by. Let me know if it's key that I look more closely:
Yea, we should pump to clickhouse and prometheus. So I think the going consensus is: we should not be adding info to postgres unless it's required to read/infer state. We do need that for deals/data-storage checks; we don't need it for retrievals, but it's still there because we haven't removed it. We can eventually migrate to no data-storage/deal table in the future, but that would require a lot more RPC calls and chain walking to figure out what exists and where we want to upload things. This isn't a priority, but keeping new data out of postgres where possible is (a priority).
SgtPooki left a comment:
reviewed some and have a few high level questions:
- why change the `PDP_SUBGRAPH_ENDPOINT` env var? seems unnecessary and will cause issues with our deployed version, and potentially old stale code as well
- why move subgraph code to `/subgraph` instead of just leaving it in `/pdp-subgraph`? -- this is a large PR and keeping to existing norms rather than overwriting things would be ideal
- changing how we do IPNI verification can have a drastic impact on current metrics.. switching to serial individual CID checks could cause already-slow IPNI verification for some SPs to start fully failing.
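(For illustration, one middle ground between batch and fully serial checks: per-CID verification with bounded concurrency. A sketch; `verifyOne` stands in for whatever single-CID IPNI lookup the service uses:)

```ts
// Verify each CID individually (so failures can be attributed per CID) while
// keeping several lookups in flight, instead of strictly one at a time.
async function verifyCidsBounded(
  cids: string[],
  verifyOne: (cid: string) => Promise<boolean>,
  concurrency = 8,
): Promise<Map<string, boolean>> {
  const results = new Map<string, boolean>();
  const queue = [...cids];
  const worker = async () => {
    // Each worker pulls the next CID off the shared queue until it's empty.
    for (let cid = queue.shift(); cid !== undefined; cid = queue.shift()) {
      results.set(cid, await verifyOne(cid).catch(() => false));
    }
  };
  await Promise.all(Array.from({ length: concurrency }, worker));
  return results;
}
```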
force-pushed from e607b13 to 92ad643
This reverts commit 6824f75.
…undant clickhouse-enabled gate
- Replace string literals ("valid"|"invalid"|"skipped"|"error") with the `IpniCheckStatus` enum in anon-retrieval.service.ts
- Drop the `if (clickhouseService.enabled)` wrapper around the insert call; `ClickhouseService.insert` is already a no-op when disabled, matching the pattern used by other retrieval flows
- Fix the outdated `ipni_status` schema comment to include the 'error' value
Co-authored-by: Steve Loeppky <stvn@loeppky.com>
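(For reference, a minimal sketch of what that enum looks like — member names are assumptions; the string values come from the commit message:)

```ts
// Replaces ad-hoc string literals in anon-retrieval.service.ts.
export enum IpniCheckStatus {
  Valid = "valid",
  Invalid = "invalid",
  Skipped = "skipped",
  Error = "error",
}
```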
force-pushed from 9e56c41 to 8dfb3ca
Hi folks,
this PR adds the anon retrieval flow from #427. It is a follow-up of:
The main logic is in `./apps/backend/retrieval-anon`:

- `anon-retrieval.service.ts` - "called" on a schedule as a job. It starts the retrieval process (select piece, fetch piece, validate CAR, store result)
- `anon-piece-selector.service.ts` - implements the subgraph query logic
- `car-validation.service.ts` - parses the piece bytes as a CAR, checks IPNI availability, fetches `k` blocks from that CAR and validates their hashes
- `piece-retrieval.service.ts` - implements the HTTP request to download the piece and CommP validation
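(From those descriptions, the orchestration roughly looks like this — a sketch only; method names and shapes are assumptions, not the PR's actual signatures:)

```ts
// anon-retrieval.service.ts, conceptually: select → fetch → validate → persist.
async function runAnonRetrievalCheck(deps: {
  selectPiece(): Promise<{ pieceCid: string; spAddress: string }>;
  fetchPiece(pieceCid: string, spAddress: string): Promise<{ success: boolean; pieceBytes?: Buffer }>;
  validateCar(bytes: Buffer): Promise<{ ipniOk: boolean; blocksOk: boolean }>;
  persist(result: unknown): Promise<void>;
}): Promise<void> {
  const piece = await deps.selectPiece(); // subgraph sampling
  const fetched = await deps.fetchPiece(piece.pieceCid, piece.spAddress); // GET /piece/{cid} + CommP
  const validation =
    fetched.success && fetched.pieceBytes
      ? await deps.validateCar(fetched.pieceBytes) // CAR parse + IPNI + block hashes
      : undefined;
  await deps.persist({ piece, fetched, validation }); // DB rows + Prometheus metrics
}
```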
The anonymous piece selection logic works as follows. The `retrievalAnon` check probes an SP for non-dealbot pieces so we can detect SPs that behave well even if the teacher is not watching. To do this fairly, the piece selection should satisfy the following requirements:

- bias towards `withIPFSIndexing` pieces (so CAR/IPNI validation has something to check) but still exercise non-indexed pieces so an SP can't optimise only its CAR corpus

How it works in practice:
Every Root entity in the subgraph carries a `sampleKey = keccak256(setId-rootId)` populated once at insert time. Because `keccak256` is uniform over 256 bits and independent of creation order/size/dataset, `sampleKey` sorts roots into a uniform random permutation that is stable across queries.

This is necessary because you cannot just select a random element from a range query in GraphQL. If we knew the total number of pieces we could define a random `skip` value, but `skip` is capped at 5000, and I've read that it becomes very inefficient at higher values. It would also require non-trivial bookkeeping of active piece/dataset counts. The `sampleKey` is much easier.
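(For intuition, this is roughly how the per-root key is derived at insert time — shown here with ethers' `keccak256` for illustration; the actual subgraph mapping uses AssemblyScript crypto primitives, and the exact input encoding is an assumption:)

```ts
import { keccak256, toUtf8Bytes } from "ethers";

// Stable per-root sample key: uniform over 256 bits, fixed once at insert time.
function deriveSampleKey(setId: string, rootId: string): string {
  return keccak256(toUtf8Bytes(`${setId}-${rootId}`));
}
```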
Drawing a sample looks like this (a code sketch follows the list):

1. With probability 80%, filter for `withIPFSIndexing: true`; otherwise apply no filter.
2. Draw a random `$sampleKey` and query for the first root whose `sampleKey` is ≥ `$sampleKey`; this is effectively a uniform random pick, in O(log N).
3. Reject the candidate if its `pdpPaymentEndEpoch` has already passed the latest indexed block, or if its CID appears in the last 500 anonymous retrievals (so we don't sample the same block twice in fast succession). On a miss, redraw once with a fresh `$sampleKey`.
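(A minimal sketch of one draw, assuming the subgraph stores `sampleKey` as a hex string and exposes a `roots` query with a `sampleKey_gte` filter — entity, field, and filter names are illustrative, not copied from the PR:)

```ts
import { randomBytes } from "node:crypto";

// Uniformly random 256-bit key, hex-encoded like the stored sampleKey.
function randomSampleKey(): string {
  return "0x" + randomBytes(32).toString("hex");
}

// Builds the draw query; with probability 80% the caller restricts the draw
// to IPFS-indexed pieces by passing indexedOnly = true.
function buildSampleQuery(indexedOnly: boolean): string {
  const indexFilter = indexedOnly ? ", withIPFSIndexing: true" : "";
  return /* GraphQL */ `
    query SampleRoot($sampleKey: String!) {
      roots(
        first: 1
        orderBy: sampleKey
        orderDirection: asc
        where: { sampleKey_gte: $sampleKey${indexFilter} }
      ) {
        id
        cid
      }
    }
  `;
}

const query = buildSampleQuery(Math.random() < 0.8);
const variables = { sampleKey: randomSampleKey() };
```

One edge case a real implementation has to handle: a key drawn near the top of the 256-bit space can match zero rows, so the draw either wraps around to the smallest `sampleKey` or simply redraws.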
Subgraph

Note: The deployed subgraphs don't contain the latest changes from the recent PR review. They should still work for testing.
I have deployed the new subgraphs:
- mainnet: https://api.goldsky.com/api/public/project_cmo9sxe5xd4ai01x8cpageyid/subgraphs/dealbot-mainnet/0.3.0/gn
- calibration: https://api.goldsky.com/api/public/project_cmo9sxe5xd4ai01x8cpageyid/subgraphs/dealbot-calibration/0.3.0/gn

A deployment looks like this from within the `subgraph` folder (prerequisite is a call to `goldsky login`):

Comments
- `anon` or rather something like `sampled`?