
feat: anon piece selection and retrieval #487

Open
dennis-tra wants to merge 28 commits into FilOzone:main from probe-lab:anon-retrieval

Conversation

@dennis-tra (Contributor) commented Apr 28, 2026

Hi folks,

This PR adds the anon retrieval flow from #427. It is a follow-up of:

The main logic is in ./apps/backend/src/retrieval-anon:

  • anon-retrieval.service.ts - invoked on a schedule as a pg-boss job; it drives the retrieval process (select piece, fetch piece, validate CAR, store result)
  • anon-piece-selector.service.ts - implements the subgraph query logic
  • car-validation.service.ts - parses the piece bytes as a CAR, checks IPNI availability, fetches k blocks from that CAR, and validates their hashes (see the sketch below)
  • piece-retrieval.service.ts - implements the HTTP request to download the piece and CommP validation
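
To make the block-hash step concrete, here is a minimal sketch of that check, assuming sha2-256-hashed CIDs and the @ipld/car / multiformats packages (the real service additionally samples only k blocks and checks IPNI):

```ts
import { CarBlockIterator } from "@ipld/car";
import { sha256 } from "multiformats/hashes/sha2";
import { equals } from "multiformats/bytes";

// Walk the blocks in the CAR and recompute each hash; a single mismatch fails
// the whole piece. (The real service samples k blocks instead of checking all.)
async function verifyCarBlocks(carBytes: Uint8Array): Promise<boolean> {
  const blocks = await CarBlockIterator.fromBytes(carBytes);
  for await (const { cid, bytes } of blocks) {
    if (cid.multihash.code !== sha256.code) continue; // sketch only handles sha2-256
    const recomputed = await sha256.digest(bytes);
    if (!equals(recomputed.digest, cid.multihash.digest)) {
      return false; // block bytes don't match the CID's multihash
    }
  }
  return true;
}
```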

The anonymous piece selection logic works as follows:

The retrievalAnon check probes an SP for non-dealbot pieces so we can detect SPs that behave well even if the teacher is not watching. To do this fairly, the piece selection should satisfy the following requirements:

  1. Uniform randomness across the SP's entire set of active pieces (not biased toward recent writes, specific payers, or specific sizes).
  2. Prefer withIPFSIndexing pieces (so CAR/IPNI validation has something to check) but still exercise non-indexed pieces so an SP can't optimise only its CAR corpus.
  3. Cover a realistic spread of piece sizes: big enough for useful bandwidth measurements, not so big that SPs with only small deals are skipped.
  4. Avoid immediately re-testing the same piece across consecutive checks.

How it works in practice:

Every Root entity in the subgraph carries a sampleKey = keccak256(setId-rootId) populated once at insert time. Because keccak256 is uniform over 256 bits and independent of creation order/size/dataset, sampleKey sorts roots into a uniform random permutation that is stable across queries.
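
For illustration, this is roughly what populating such a key looks like (shown here with ethers' keccak256; the actual subgraph mapping is AssemblyScript and may format the preimage differently):

```ts
import { keccak256, toUtf8Bytes } from "ethers";

// Computed once when the Root entity is inserted and never updated, so the
// ordering it induces is stable across queries.
function computeSampleKey(setId: string, rootId: string): string {
  return keccak256(toUtf8Bytes(`${setId}-${rootId}`)); // 0x-prefixed 32-byte hex
}
```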

This is necessary because GraphQL gives us no way to select a random element from a range query. If we knew the total number of pieces we could use a random skip value, but skip is capped at 5000 (and I've read that it becomes very inefficient at higher values), and it would also require non-trivial bookkeeping of active piece/dataset counts. The sampleKey approach is much simpler.

Drawing a sample looks like this:

  1. Pick a size bucket (small < 20 MiB, medium 20 MiB to 100 MiB, large 100 MiB to 500 MiB) by weighted random, with weights 20% / 50% / 30% respectively.
  2. Pick the pool: withIPFSIndexing: true with probability 80%; otherwise no filter.
  3. Generate 32 random bytes as $sampleKey and query:
```graphql
query randomPiece {
  roots( # <- piece
    first: 1
    orderBy: sampleKey
    orderDirection: asc
    where: {
      sampleKey_gte: $sampleKey
      removed: false
      rawSize_gte: $sizeBucket_lo
      rawSize_lte: $sizeBucket_hi
      proofSet_: { # <- dataset
        fwssServiceProvider: $sp
        fwssPayer_not: $dealbotPayer
        isActive: true
        withIPFSIndexing: $pool
      }
    }
  ) {
    id # plus whatever root fields the selector needs
  }
}
```
  4. This returns the root with the smallest sampleKey >= $sampleKey, which is effectively a uniform random pick in O(log N).
  5. Drop it if pdpPaymentEndEpoch has already passed the latest indexed block, or if its CID appears in the last 500 anonymous retrievals (so we don't sample the same block twice in fast succession). On a miss, redraw once with a fresh $sampleKey.
  6. Fall back through (same bucket, opposite pool) -> (any bucket, indexed) -> (any bucket, any) before giving up. A condensed sketch of the whole draw follows below.
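
For illustration, a condensed TypeScript sketch of a single draw as described above (helper names like queryRandomPiece and pickBucket are mine, not the service's; the real logic lives in anon-piece-selector.service.ts):

```ts
import { randomBytes } from "node:crypto";

// Stand-in for the GraphQL query above; resolves to null when no root matches.
declare function queryRandomPiece(args: {
  sp: string;
  sampleKey: string;
  rawSizeGte: number;
  rawSizeLte: number;
  indexedOnly: boolean;
}): Promise<{ pieceCid: string } | null>;

const MiB = 1024 * 1024;

// Size buckets and weights as described above: 20% small, 50% medium, 30% large.
const BUCKETS = [
  { lo: 0, hi: 20 * MiB, weight: 0.2 },          // small:  < 20 MiB
  { lo: 20 * MiB, hi: 100 * MiB, weight: 0.5 },  // medium: 20-100 MiB
  { lo: 100 * MiB, hi: 500 * MiB, weight: 0.3 }, // large:  100-500 MiB
];

function pickBucket() {
  let r = Math.random();
  for (const bucket of BUCKETS) {
    if ((r -= bucket.weight) <= 0) return bucket;
  }
  return BUCKETS[BUCKETS.length - 1]; // guard against floating-point leftover
}

async function drawPiece(sp: string) {
  const bucket = pickBucket();
  // Prefer the withIPFSIndexing pool with probability 80%; otherwise no filter.
  const indexedOnly = Math.random() < 0.8;
  // Fresh uniform 32-byte key; the query returns the root with the smallest
  // sampleKey >= this value.
  const sampleKey = "0x" + randomBytes(32).toString("hex");
  return queryRandomPiece({
    sp,
    sampleKey,
    rawSizeGte: bucket.lo,
    rawSizeLte: bucket.hi,
    indexedOnly,
  });
}
```

The dedup against the last 500 retrievals and the bucket/pool fallbacks then wrap this single draw.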

Subgraph

Note

The deployed subgraphs don't contain the latest changes from the recent PR review. They should still work for testing.

I have deployed the new subgraphs:

A deployment looks like this from within the subgraph folder (prerequisite is a call to goldsky login):

pnpm run codegen && pnpm run build:calibration && VERSION=0.3.0 pnpm run deploy:calibration
pnpm run codegen && pnpm run build:mainnet && VERSION=0.3.0 pnpm run deploy:mainnet

Comments

  • Timeout handling was a bit tricky because we have (1) job timeouts, (2) a connect timeout, and (3) a transfer timeout. The connect and transfer timeouts were shared between the basic and anon retrievals, but because anon retrievals may download larger files they were too short. I've set the anon retrieval job timeout to 5 minutes (which should really also take the job rate into account, but doesn't at the moment), and the HTTP transfer timeout to the maximum of the basic and anon retrieval job timeouts, because both code paths use the same HTTP client (see the sketch after this list).
  • If an HTTP/2 retrieval times out after receiving partial data, it now returns partial information (TTFB, retrieved bytes, etc.). I've only added this behaviour to the HTTP/2 client.
  • This PR clashes with @iand's feat: add retrieval_type column to retrieval_checks clickhouse table #485. I have created a separate table for the anon retrievals because I figured that the overlap between the two types was too small.
  • Do we want to keep calling it anon or rather something like sampled?
  • I don't know how much the subgraph will cost.
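
To make the first bullet concrete, a sketch of the timeout relationship (the env var names appear elsewhere in this PR; the numeric defaults here are illustrative):

```ts
// Job timeouts per flow; the anon value defaults to 5 minutes per the
// description above, while the basic-retrieval default here is illustrative.
const retrievalJobTimeoutSeconds = Number(process.env.RETRIEVAL_JOB_TIMEOUT_SECONDS ?? 60);
const anonRetrievalJobTimeoutSeconds = Number(process.env.ANON_RETRIEVAL_JOB_TIMEOUT_SECONDS ?? 300);

// Both flows share one HTTP client, so the transfer timeout is derived from the
// slower of the two job timeouts rather than being configured independently.
const httpTransferTimeoutMs =
  Math.max(retrievalJobTimeoutSeconds, anonRetrievalJobTimeoutSeconds) * 1000;
```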

@FilOzzy added this to FOC on Apr 28, 2026
github-project-automation bot moved this to 📌 Triage in FOC on Apr 28, 2026
dennis-tra force-pushed the anon-retrieval branch 2 times, most recently from e40d010 to 444a79b on April 28, 2026 12:42
@BigLep moved this from 📌 Triage to ⌨️ In Progress in FOC on Apr 28, 2026
@iand (Contributor) commented Apr 29, 2026

To me, sample/sampled better conveys what is happening.

first_byte_ms Nullable(Float64), -- time to first response byte
last_byte_ms Nullable(Float64), -- time to last response byte
bytes_retrieved Nullable(UInt64), -- bytes received from /piece/{cid}
throughput_bps Nullable(UInt64), -- effective throughput, bytes per second
@iand (Contributor) commented:

This is data that can be easily derived. Also is it (ttlb-ttfb)/bytes or simply ttlb/bytes?

@dennis-tra (Contributor, Author) replied:

It's http response size / total time of the HTTP request
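
That derivation checks out against the example data point posted further down: 10,486,897 bytes over a 5,701 ms request is 10,486,897 / 5.701 ≈ 1,839,484 bytes per second, which matches the recorded throughput_bps.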

@dennis-tra (Contributor, Author) added:

yeah it could easily be derived, I agree

dennis-tra marked this pull request as ready for review on April 30, 2026 08:04
dennis-tra requested a review from iand on April 30, 2026 08:04
@dennis-tra (Contributor, Author) commented:
One example data point in my local ClickHouse DB:

timestamp:                  2026-04-30 08:02:34.818
probe_location:             unknown
sp_address:                 0xa3971A7234a3379A1813d9867B531e7EeB20ae07
sp_id:                      9
sp_name:                    ezpdpz-calib
retrieval_id:               488b4334-0bdf-4dae-a299-4c11869f49cc
piece_cid:                  bafkzcibfr737oaqtnih5h25biyipqjbrbv4ibik37mpbjrepqzz7irsgqp6ffszolmga
data_set_id:                6142
piece_id:                   2684
raw_size:                   10486897 -- 10.49 million
with_ipfs_indexing:         true
ipfs_root_cid:              bafybeiehkmndajr3dy5524mnglj3cqbljemwj5f3ldot4ulizzmw4sp4mu
service_type:               direct_sp
retrieval_endpoint:         https://calib.ezpdpz.net/piece/bafkzcibfr737oaqtnih5h25biyipqjbrbv4ibik37mpbjrepqzz7irsgqp6ffszolmga
piece_fetch_status:         success
http_response_code:         200
first_byte_ms:              347
last_byte_ms:               5701
bytes_retrieved:            10486897 -- 10.49 million
throughput_bps:             1839484 -- 1.84 million
commp_valid:                true
car_parseable:              true
car_block_count:            12
block_fetch_endpoint:       https://calib.ezpdpz.net/ipfs/
block_fetch_valid:          true
block_fetch_sampled_count:  5
block_fetch_failed_count:   0
ipni_status:                valid
ipni_verify_ms:             1087
ipni_verified_cids_count:   6
ipni_unverified_cids_count: 0
error_message:              ᴺᵁᴸᴸ

@dennis-tra (Contributor, Author) commented:

A discussion that just came up with @iand:

This anon retrieval persists data only in ClickHouse. AFAIU the basic retrieval persists data in Postgres, exposes aggregated metrics via Prometheus, and, if CLICKHOUSE_URL is configured, also writes a subset of the Postgres data to ClickHouse.

My initial understanding was that metrics data would live exclusively in ClickHouse while Postgres would handle job queues/orchestration/deal state/etc.

To be consistent with the basic retrieval flow, I'll change this PR to store data primarily in Postgres and, on the write path, if CLICKHOUSE_URL is configured, also store relevant metrics in ClickHouse.

@dennis-tra (Contributor, Author) commented:

@iand implemented in 6824f75

New ClickHouse row:

Row 1:
──────
timestamp:                  2026-04-30 11:15:04.945
probe_location:             unknown
sp_address:                 0xCb9e86945cA31E6C3120725BF0385CBAD684040c
sp_id:                      4
sp_name:                    infrafolio-calib
retrieval_id:               ebe6c618-868f-42e9-a70b-f37f21b56f1d
raw_size:                   26196779 -- 26.20 million
with_ipfs_indexing:         true
service_type:               direct_sp
piece_fetch_status:         success
http_response_code:         200
first_byte_ms:              515
last_byte_ms:               5898
bytes_retrieved:            26196779 -- 26.20 million
throughput_bps:             4441638 -- 4.44 million
commp_valid:                true
car_parseable:              true
car_block_count:            275
block_fetch_valid:          true
block_fetch_sampled_count:  5
block_fetch_failed_count:   0
ipni_status:                valid
ipni_verify_ms:             3410
ipni_verified_cids_count:   6
ipni_unverified_cids_count: 0

New Postgres row:

-[ RECORD 3 ]--------------+--------------------------------------------------------------------------------------------------------------------
id                         | ebe6c618-868f-42e9-a70b-f37f21b56f1d
started_at                 | 2026-04-30 11:15:04.945+00
probe_location             | unknown
sp_address                 | 0xCb9e86945cA31E6C3120725BF0385CBAD684040c
sp_id                      | 4
sp_name                    | infrafolio-calib
piece_cid                  | bafkzcibf2we3cayu64rkggkzjhdeqrzrhyoouoaweemqd76nx57jsmjuizzayvpwhafq
data_set_id                | 13199
piece_id                   | 283
raw_size                   | 26196779
with_ipfs_indexing         | t
ipfs_root_cid              | bafkreigd5yjmb4mf5luayeac3danaoglqeqqeqfqmbpq2fhrjyuj2fftam
service_type               | direct_sp
retrieval_endpoint         | https://caliberation-pdp.infrafolio.com/piece/bafkzcibf2we3cayu64rkggkzjhdeqrzrhyoouoaweemqd76nx57jsmjuizzayvpwhafq
piece_fetch_status         | success
http_response_code         | 200
first_byte_ms              | 515
last_byte_ms               | 5898
bytes_retrieved            | 26196779
throughput_bps             | 4441638
commp_valid                | t
car_parseable              | t
car_block_count            | 275
block_fetch_endpoint       | https://caliberation-pdp.infrafolio.com/ipfs/
block_fetch_valid          | t
block_fetch_sampled_count  | 5
block_fetch_failed_count   | 0
ipni_status                | valid
ipni_verify_ms             | 3410
ipni_verified_cids_count   | 6
ipni_unverified_cids_count | 0
error_message              |
created_at                 | 2026-04-30 11:15:04.945+00

@SgtPooki (Collaborator) commented:

@dennis-tra

> To be consistent with the basic retrieval flow, I'll change this PR to store data primarily in Postgres and, on the write path, if CLICKHOUSE_URL is configured, also store relevant metrics in ClickHouse.

idk if we need to store retrieval++ data in Postgres; we should probably just store in Prometheus + ClickHouse

@dennis-tra (Contributor, Author) commented Apr 30, 2026

I'm fine either way. I can easily revert the latest commit. Just note that currently for the basic retrieval we store the same data in both db systems and one could consider it inconsistent if we did it differently here. @iand please chime in here.

Copilot AI (Contributor) left a comment:

Pull request overview

Adds a new “anonymous retrieval” check flow to the backend, enabling scheduled sampling of non-dealbot pieces via the subgraph, retrieval via /piece/{cid}, optional CAR/IPNI validation, and persistence of results/metrics.

Changes:

  • Introduces retrieval-anon module/services (piece selection, retrieval + CommP, CAR/IPNI validation) and wires it into pg-boss scheduling.
  • Replaces the PDP-subgraph client with a unified SubgraphService and renames env config to SUBGRAPH_ENDPOINT.
  • Adds new Prometheus metrics and a ClickHouse table (anon_retrieval_checks) plus Postgres schema/entity for anon retrieval results; adjusts HTTP/2 timeout/partial-download behavior.

Reviewed changes

Copilot reviewed 44 out of 47 changed files in this pull request and generated 7 comments.

| File | Description |
| --- | --- |
| pnpm-lock.yaml | Dependency lock updates (oclif/minimatch patch bumps). |
| kustomize/overlays/local/backend-configmap-local.yaml | Renames local env var to SUBGRAPH_ENDPOINT. |
| docs/environment-variables.md | Documents SUBGRAPH_ENDPOINT + anon retrieval job timeout. |
| docs/checks/production-configuration-and-approval-methodology.md | Updates production config reference to SUBGRAPH_ENDPOINT. |
| docs/checks/data-retention.md | Updates data-retention docs to reference SubgraphService/SUBGRAPH_ENDPOINT. |
| apps/backend/src/wallet-sdk/wallet-sdk.service.spec.ts | Updates config shape to subgraphEndpoint. |
| apps/backend/src/subgraph/types.ts | Adds anon piece sampling types + CID decoding + response validator. |
| apps/backend/src/subgraph/types.spec.ts | Adds unit tests for subgraph response validators. |
| apps/backend/src/subgraph/subgraph.service.ts | Renames/extends subgraph client; adds sampleAnonPiece() and generic query helper. |
| apps/backend/src/subgraph/subgraph.service.spec.ts | Updates tests and adds coverage for sampleAnonPiece(). |
| apps/backend/src/subgraph/subgraph.module.ts | New Nest module exporting SubgraphService. |
| apps/backend/src/subgraph/queries.ts | New subgraph query definitions incl. anon sampling query builder. |
| apps/backend/src/retrieval-anon/types.ts | New anon retrieval domain result types. |
| apps/backend/src/retrieval-anon/retrieval-anon.module.ts | New Nest module wiring anon retrieval dependencies. |
| apps/backend/src/retrieval-anon/piece-retrieval.service.ts | Implements /piece/{cid} download + CommP validation. |
| apps/backend/src/retrieval-anon/car-validation.service.ts | Implements CAR parsing, IPNI verification, sampled block fetch+hash verification. |
| apps/backend/src/retrieval-anon/anon-retrieval.service.ts | Orchestrates anon selection → retrieval → validation → persistence + metrics. |
| apps/backend/src/retrieval-anon/anon-retrieval.service.spec.ts | Adds unit tests for persistence/metrics behavior (including abort/partial). |
| apps/backend/src/retrieval-anon/anon-piece-selector.service.ts | Implements bucketed/pool sampling + dedup + fallback strategy. |
| apps/backend/src/retrieval-anon/anon-piece-selector.service.spec.ts | Adds unit tests for sampling/fallback/dedup/termination behavior. |
| apps/backend/src/pdp-subgraph/queries.ts | Removes legacy PDP-subgraph queries (migrated to subgraph/). |
| apps/backend/src/pdp-subgraph/pdp-subgraph.module.ts | Removes legacy PDP subgraph module (replaced by SubgraphModule). |
| apps/backend/src/metrics-prometheus/metrics-prometheus.module.ts | Registers new anon retrieval Prometheus metrics + provider. |
| apps/backend/src/metrics-prometheus/check-metrics.service.ts | Adds AnonRetrievalCheckMetrics helper class. |
| apps/backend/src/metrics-prometheus/check-metric-labels.ts | Adds anon_retrieval to the CheckType union. |
| apps/backend/src/jobs/jobs.service.ts | Adds retrieval_anon job type, queue, scheduler, and timeout handling. |
| apps/backend/src/jobs/jobs.service.spec.ts | Updates tests for new dependency ordering and new schedule rows. |
| apps/backend/src/jobs/jobs.module.ts | Imports RetrievalAnonModule for job execution. |
| apps/backend/src/jobs/job-queues.ts | Adds RETRIEVAL_ANON_QUEUE. |
| apps/backend/src/ipni/ipni-verification.service.ts | Changes IPNI verification to per-CID checks with per-CID failure tracking/counts. |
| apps/backend/src/http-client/types.ts | Extends request result with aborted/abortReason for partial HTTP/2 downloads. |
| apps/backend/src/http-client/http-client.service.ts | Reworks HTTP/2 timeout handling and returns partial bytes+metrics on abort mid-download. |
| apps/backend/src/http-client/http-client.service.spec.ts | Adds tests for headersTimeout mapping, signal behavior, partial-download returns, and rethrowing non-abort errors. |
| apps/backend/src/database/types.ts | Adds PieceFetchStatus and IpniCheckStatus enums for anon retrievals. |
| apps/backend/src/database/migrations/1776300000000-CreateAnonRetrievals.ts | Adds Postgres schema (table + enums + indexes) for anon retrievals. |
| apps/backend/src/database/entities/job-schedule-state.entity.ts | Adds retrieval_anon to scheduled job type union. |
| apps/backend/src/database/entities/anon-retrieval.entity.ts | Adds TypeORM entity mapping for anon_retrievals. |
| apps/backend/src/database/database.module.ts | Registers AnonRetrieval entity and exports it for injection. |
| apps/backend/src/data-retention/data-retention.service.ts | Switches data retention polling to SubgraphService and subgraphEndpoint. |
| apps/backend/src/data-retention/data-retention.service.spec.ts | Updates tests to mock SubgraphService and renamed config field. |
| apps/backend/src/data-retention/data-retention.module.ts | Switches module import from legacy PDP subgraph module to SubgraphModule. |
| apps/backend/src/config/app.config.ts | Renames env var to SUBGRAPH_ENDPOINT; adds anon retrieval rates/timeouts/block-sample config; derives HTTP timeouts from max job timeout. |
| apps/backend/src/clickhouse/clickhouse.schema.ts | Adds ClickHouse anon_retrieval_checks table schema. |
| apps/backend/src/app.module.ts | Imports RetrievalAnonModule. |
| apps/backend/README.md | Updates env var docs to SUBGRAPH_ENDPOINT. |
| apps/backend/.env.example | Renames env var, adds anon retrieval env vars, updates timeout guidance. |
| .gitignore | Ignores .tool-versions. |
Files not reviewed (1)
  • pnpm-lock.yaml: Language not supported


DEALS_PER_SP_PER_HOUR: Joi.number().min(0.001).max(20).default(4),
DATASET_CREATIONS_PER_SP_PER_HOUR: Joi.number().min(0.001).max(20).default(1),
RETRIEVALS_PER_SP_PER_HOUR: Joi.number().min(0.001).max(20).default(2),
RETRIEVALS_ANON_PER_SP_PER_HOUR: Joi.number().min(0.001).max(20).optional(),
Copilot AI commented Apr 30, 2026:

RETRIEVALS_ANON_PER_SP_PER_HOUR is declared as an optional number, but an empty-string value (as in the .env.example placeholder) will fail Joi number coercion and can prevent the app from booting. Consider adding .empty("") (or .allow("") with normalization) and/or providing a default in the schema to match loadConfig()’s fallback behavior.

Suggested change
RETRIEVALS_ANON_PER_SP_PER_HOUR: Joi.number().min(0.001).max(20).optional(),
RETRIEVALS_ANON_PER_SP_PER_HOUR: Joi.number().min(0.001).max(20).empty("").optional(),

Comment thread docs/environment-variables.md Outdated
Comment on lines 14 to 15
| [Jobs (pg-boss)](#jobs-pg-boss) | `DEALBOT_PGBOSS_SCHEDULER_ENABLED`, `DEALBOT_PGBOSS_POOL_MAX`, `DEALS_PER_SP_PER_HOUR`, `DATASET_CREATIONS_PER_SP_PER_HOUR`, `RETRIEVALS_PER_SP_PER_HOUR`, `JOB_SCHEDULER_POLL_SECONDS`, `JOB_WORKER_POLL_SECONDS`, `PG_BOSS_LOCAL_CONCURRENCY`, `JOB_CATCHUP_MAX_ENQUEUE`, `JOB_SCHEDULE_PHASE_SECONDS`, `JOB_ENQUEUE_JITTER_SECONDS`, `DEAL_JOB_TIMEOUT_SECONDS`, `RETRIEVAL_JOB_TIMEOUT_SECONDS`, `ANON_RETRIEVAL_JOB_TIMEOUT_SECONDS`, `IPFS_BLOCK_FETCH_CONCURRENCY` |
| [Dataset](#dataset-configuration) | `DEALBOT_LOCAL_DATASETS_PATH`, `RANDOM_PIECE_SIZES` |
Copilot AI commented Apr 30, 2026:

The jobs section lists ANON_RETRIEVAL_JOB_TIMEOUT_SECONDS, but the PR also introduces RETRIEVALS_ANON_PER_SP_PER_HOUR and ANON_RETRIEVAL_BLOCK_SAMPLE_COUNT which aren’t documented here. This makes it hard to configure/understand the new anon retrieval behavior from the env-var reference.

Comment thread apps/backend/src/retrieval-anon/anon-piece-selector.service.ts
Comment thread apps/backend/src/retrieval-anon/car-validation.service.ts
Comment on lines +106 to +130
  const pieceBytes = Buffer.isBuffer(result.data) ? result.data : Buffer.from(result.data);
  const commPValid = await this.validateCommP(pieceBytes, pieceCid);

  this.logger.debug({
    event: "piece_fetch_success",
    message: "Piece fetched successfully",
    pieceCid,
    spAddress,
    bytesReceived: metrics.responseSize,
    latencyMs: metrics.totalTime,
    ttfbMs: metrics.ttfb,
  });

  return {
    success: true,
    pieceCid,
    bytesReceived: metrics.responseSize,
    pieceBytes,
    latencyMs: metrics.totalTime,
    ttfbMs: metrics.ttfb,
    throughputBps,
    statusCode: metrics.statusCode,
    commPValid,
  };
} catch (error) {
Copilot AI commented Apr 30, 2026:

fetchPiece() returns success: true for any 2xx response even when commPValid is false. That makes pieceFetchStatus/overall check status report success for corrupted/incorrect bytes, and can incorrectly run CAR validation on invalid data. Consider treating a CommP mismatch as a retrieval failure (set success: false, populate errorMessage, and/or split status into httpOk vs commpValid).
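
For illustration, one minimal way to express that, building on the snippet quoted above (field names follow that snippet; this is not a proposed final patch):

```ts
// Treat a CommP mismatch as a failed retrieval rather than a success:
if (!commPValid) {
  return {
    success: false,
    pieceCid,
    bytesReceived: metrics.responseSize,
    statusCode: metrics.statusCode,
    commPValid,
    errorMessage: "CommP mismatch: piece bytes do not match the requested piece CID",
  };
}
```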


service_type LowCardinality(String), -- 'direct_sp' (only mode for anon retrievals today)

piece_fetch_status LowCardinality(String), -- 'success' | 'failed' — outcome of GET /piece/<pieceCid> (HTTP 2xx AND CommP match). CAR/IPNI/block-fetch outcomes live in their own columns.
Copilot AI commented Apr 30, 2026:

The ClickHouse column comment for piece_fetch_status says it represents “HTTP 2xx AND CommP match”, but the current anon retrieval implementation sets piece_fetch_status=success whenever the HTTP request succeeds, even if commp_valid is false. Either adjust the status semantics in code or update this comment to match reality so dashboards/analyses aren’t misleading.

Suggested change
piece_fetch_status LowCardinality(String), -- 'success' | 'failed' outcome of GET /piece/<pieceCid> (HTTP 2xx AND CommP match). CAR/IPNI/block-fetch outcomes live in their own columns.
piece_fetch_status LowCardinality(String), -- 'success' | 'failed' outcome of GET /piece/<pieceCid> at the fetch/HTTP layer. CommP verification is recorded separately in `commp_valid`; CAR/IPNI/block-fetch outcomes live in their own columns.

Comment on lines 43 to 44
| [`SUBGRAPH_ENDPOINT`](../environment-variables.md#subgraph_endpoint) | TODO: fill this in | Uses the subgraph from [pdp-explorer](https://github.com/FilOzone/pdp-explorer). |
| [`MIN_NUM_DATASETS_FOR_CHECKS`](../environment-variables.md#dataset-configuration) | 15 | Ensure there are enough datasets with pieces being added so that statistical significance for [Data Retention Fault Rate](#data-retention-fault-rate) can be achieved quicker. Note that on mainnet each dataset incurs 5 challenges[^1] per daily proof[^2]. With this many datasets, an SP can be approved for data retention after a faultless ~7 days even if the SP doesn't have other datasets. |
Copilot AI commented Apr 30, 2026:

This table still notes the subgraph “Uses the subgraph from pdp-explorer”, but the rest of the PR/documentation now describes using the dealbot-owned subgraph deployment (apps/subgraph, Goldsky slots). Consider updating the note to avoid pointing operators at the wrong endpoint/source.

@SgtPooki (Collaborator) commented:

> I'm fine either way. I can easily revert the latest commit. Just note that currently for the basic retrieval we store the same data in both db systems and one could consider it inconsistent if we did it differently here. @iand please chime in here.

we have been slowly attempting to move away from storing data in Postgres, instead leaning on Prometheus/ClickHouse + logs, so not adding another dependency on the database would be ideal. If it's necessary, then I don't want to push back too hard, but it requires us to manage cleaning up the DB and other various concerns that are more easily handled with Prometheus/ClickHouse/Betterstack expiry config.

Is there a reason we need to store it in a Postgres table instead of just piping output to our metrics/logs services?

@iand (Contributor) commented Apr 30, 2026

> I'm fine either way. I can easily revert the latest commit. Just note that currently for the basic retrieval we store the same data in both db systems and one could consider it inconsistent if we did it differently here. @iand please chime in here.

This is my basic point: ClickHouse is an optional component, so without storing in Postgres there is no record of the retrieval.

@iand (Contributor) commented Apr 30, 2026

> I'm fine either way. I can easily revert the latest commit. Just note that currently for the basic retrieval we store the same data in both db systems and one could consider it inconsistent if we did it differently here. @iand please chime in here.

> we have been slowly attempting to move away from storing data in Postgres, instead leaning on Prometheus/ClickHouse + logs, so not adding another dependency on the database would be ideal. If it's necessary, then I don't want to push back too hard, but it requires us to manage cleaning up the DB and other various concerns that are more easily handled with Prometheus/ClickHouse/Betterstack expiry config.
>
> Is there a reason we need to store it in a Postgres table instead of just piping output to our metrics/logs services?

This is a valid point too. It really boils down to what the purpose of Postgres is: whether it simply records working state, or whether it also provides diagnostic data.

@dennis-tra (Contributor, Author) commented:

> If it's necessary, then I don't want to push back too hard [...].

It's not necessary; everything worked fine before the last commit 👍. ClickHouse then becomes a hard dependency (which I think is fine, for the same reasons you enumerated, @SgtPooki), but IIUC this was the point of concern for @iand.

So what do we do? Revert to CH only?

@BigLep (Contributor) commented May 1, 2026

Wow - good stuff! A few things from driving by. Let me know if it's key that I look more closely:

  1. I think we're fine to use ClickHouse only (assuming we also have Prometheus so we can see jobs running and completing).

  2. "Drop it ... if its CID appears in the last 500 anonymous retrievals (so we don't sample the same block twice in fast succession)." I don't know how important this is. Keeping it simple, where each run we just pick a random sample, is fine. If we randomly pick the same CID a couple of times in a row, I don't think it's a big deal. Also, what do you do when an SP has fewer than 500 CIDs?

  3. Documentation: I realize New check: retrieval++ #427 wasn't explicit about this (my bad), but I'd like https://github.com/FilOzone/dealbot/tree/main/docs/checks to stay updated in terms of events, metrics, and check descriptions. Basically I want a place where a human can reason about our checks and our metrics.

@SgtPooki (Collaborator) commented May 1, 2026

@dennis-tra

> I think we're fine to use ClickHouse only (assuming we also have Prometheus so we can see jobs running and completing).

Yea, we should pump to ClickHouse and Prometheus. So I think the going consensus is: we should not be adding info to Postgres unless it's required to read/infer state. We do need that for deals/data-storage checks; we don't need it for retrievals, but it's still there because we haven't removed it.

We can eventually migrate away from the data-storage/deal table in the future, but that would require a lot more RPC calls and chain walking to figure out what exists and where we want to upload things. This isn't a priority, but keeping new data out of Postgres where possible is.

@SgtPooki (Collaborator) left a review:

Reviewed some and have a few high-level questions:

  1. Why change the PDP_SUBGRAPH_ENDPOINT env var? It seems unnecessary and will cause issues with our deployed version, and potentially with old stale code as well.
  2. Why move the subgraph code to /subgraph instead of just leaving it in /pdp-subgraph? This is a large PR, and keeping to existing norms rather than overwriting things would be ideal.
  3. Changing how we do IPNI verification can have a drastic impact on current metrics: switching to serial, individual CID checks could cause already-slow IPNI verification for some SPs to start fully failing.
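
For illustration, per-CID checks need not be strictly serial; a hypothetical bounded-concurrency variant (the cid.contact lookup endpoint is the public IPNI API, but the names and limits here are assumptions, not the PR's code):

```ts
import pLimit from "p-limit";

// Cap parallel IPNI lookups so one slow SP doesn't serialize the whole batch,
// while still tracking pass/fail per CID.
const limit = pLimit(5); // concurrency limit is illustrative
const IPNI_ENDPOINT = "https://cid.contact";

async function verifyCidsOnIpni(cids: string[]): Promise<{ verified: string[]; unverified: string[] }> {
  const results = await Promise.all(
    cids.map((cid) =>
      limit(async () => {
        // 200 => at least one provider advertises the CID; 404 => not indexed
        const res = await fetch(`${IPNI_ENDPOINT}/cid/${cid}`);
        return { cid, ok: res.ok };
      }),
    ),
  );
  return {
    verified: results.filter((r) => r.ok).map((r) => r.cid),
    unverified: results.filter((r) => !r.ok).map((r) => r.cid),
  };
}
```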

Comment thread apps/backend/src/database/migrations/1776300000000-CreateAnonRetrievals.ts Outdated
Comment thread apps/backend/src/ipni/ipni-verification.service.ts Outdated
Comment thread apps/backend/src/config/app.config.ts
Comment thread kustomize/overlays/local/backend-configmap-local.yaml
Comment thread docs/environment-variables.md
dennis-tra and others added 22 commits May 15, 2026 21:59
…undant clickhouse-enabled gate

- Replace string literals ("valid"|"invalid"|"skipped"|"error") with
  IpniCheckStatus enum in anon-retrieval.service.ts
- Drop the `if (clickhouseService.enabled)` wrapper around the insert call;
  ClickhouseService.insert is already a no-op when disabled, matching the
  pattern used by other retrieval flows
- Fix outdated ipni_status schema comment to include the 'error' value
Co-authored-by: Steve Loeppky <stvn@loeppky.com>

Labels

None yet

Projects

Status: ⌨️ In Progress


6 participants