
feat: cap deal/retrievals with abort signals #263

Merged
SgtPooki merged 10 commits into main from 258-we-need-to-set-dealretrieval-max-timeout on Feb 16, 2026

Conversation

@SgtPooki (Collaborator) commented Feb 11, 2026

Summary
Adds end-to-end abort propagation with shared utilities, introduces job-level timeout enforcement for deal/retrieval jobs, and improves observability/testing around aborted runs and retrieval failures. Updates HTTP timeout defaults and documents new job timeout env vars.

Problem
Long-running deal/retrieval jobs and downstream steps lacked consistent abort handling, causing wasted work and incomplete observability when timeouts or cancellations occurred. Abort reasons could be lost, and job metrics didn't clearly distinguish aborts from failures.

Solution

  • Added abort-utils helpers (createAbortError, awaitWithAbort, delay) with tests; a sketch follows this list.
  • Propagated AbortSignal through deal/retrieval flows, add-ons, and IPNI polling, preventing new work on abort while preserving partial results.
  • Job runner now enforces per‑job timeouts (deal/retrieval) via AbortController, records handler_result="aborted", and keeps success vs business failure semantics.
  • Retrieval results carry an aborted flag; errors preserve non‑Error abort reasons.
  • Added/updated tests for abort behavior and error preservation.
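
For illustration, a minimal sketch of what helpers with these names might look like (TypeScript; the bodies below are assumptions, not the merged implementation):

    // Illustrative sketch only: names follow the PR description (createAbortError,
    // awaitWithAbort, delay); the bodies are assumptions, not the merged code.

    /** Build the error thrown on abort, preserving a non-Error abort reason. */
    export function createAbortError(reason?: unknown): Error {
      if (reason instanceof Error) return reason;
      const error = new Error(reason !== undefined ? String(reason) : "Aborted");
      error.name = "AbortError";
      return error;
    }

    /** Race a promise against an abort signal so callers stop waiting once aborted. */
    export async function awaitWithAbort<T>(promise: Promise<T>, signal?: AbortSignal): Promise<T> {
      if (!signal) return promise;
      signal.throwIfAborted();
      return new Promise<T>((resolve, reject) => {
        const onAbort = () => reject(createAbortError(signal.reason));
        signal.addEventListener("abort", onAbort, { once: true });
        promise.then(resolve, reject).finally(() => signal.removeEventListener("abort", onAbort));
      });
    }

    /** Abort-aware sleep (the underlying timer is not cleared in this sketch). */
    export function delay(ms: number, signal?: AbortSignal): Promise<void> {
      return awaitWithAbort(new Promise<void>((resolve) => setTimeout(resolve, ms)), signal);
    }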

Notes

  • New env vars DEAL_JOB_TIMEOUT_SECONDS and RETRIEVAL_JOB_TIMEOUT_SECONDS (defaults 6m and 1m respectively); docs updated. A wiring sketch follows this list.
  • HTTP request timeout defaults reduced to 4m to align with expected transfer throughput.
  • Metrics doc now describes jobs_completed_total handler result values (success, aborted, error).
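
To make the timeout wiring concrete, a rough sketch of how a per-job cap read from one of these env vars can drive an AbortController (the function and variable names below are illustrative assumptions, not the merged code):

    // Illustrative sketch: enforce a per-job timeout and report aborted runs
    // distinctly from business failures. runJobWithTimeout is a hypothetical name.
    async function runJobWithTimeout(
      handler: (signal: AbortSignal) => Promise<void>,
      timeoutSeconds: number,
    ): Promise<"success" | "aborted" | "error"> {
      const controller = new AbortController();
      const timer = setTimeout(() => controller.abort(new Error("job timed out")), timeoutSeconds * 1000);
      try {
        await handler(controller.signal); // handler checks/propagates the signal downstream
        return "success";
      } catch {
        // A timed-out run maps to handler_result="aborted"; anything else stays an error.
        return controller.signal.aborted ? "aborted" : "error";
      } finally {
        clearTimeout(timer);
      }
    }

    // e.g. deal jobs would read their cap from the env var documented above
    const dealTimeoutSeconds = Number(process.env.DEAL_JOB_TIMEOUT_SECONDS ?? 360);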

Fixes #258

Copilot AI review requested due to automatic review settings February 11, 2026 19:50
@SgtPooki SgtPooki linked an issue Feb 11, 2026 that may be closed by this pull request
@FilOzzy FilOzzy added this to FOC Feb 11, 2026
@github-project-automation github-project-automation Bot moved this to 📌 Triage in FOC Feb 11, 2026
Copilot AI (Contributor) left a comment

Pull request overview

Adds end-to-end abort propagation and job-level timeout enforcement for deal/retrieval workflows so long-running jobs can be actively cancelled, and improves metrics/logging that distinguish aborts from failures.

Changes:

  • Introduces shared abort helpers (abort-utils) and adopts AbortSignal propagation across deal/retrieval flows (including add-ons and IPNI polling).
  • Enforces per-job timeouts in the pg-boss job runner and records handler_result="aborted" for timed-out executions.
  • Updates defaults/docs for job timeouts and HTTP request timeouts, plus adds/updates tests for abort behavior and error preservation.

Reviewed changes

Copilot reviewed 18 out of 18 changed files in this pull request and generated 6 comments.

Summary per file

File - Description
docs/environment-variables.md - Documents new job timeout env vars and adds them to the quick reference.
apps/backend/src/retrieval/retrieval.service.ts - Propagates abort signals through retrieval execution and adjusts batch behavior on abort.
apps/backend/src/retrieval/retrieval.service.spec.ts - Updates tests for abort behavior and aligns Deal IDs with UUID strings.
apps/backend/src/retrieval-addons/types.ts - Extends retrieval test result shape with an optional aborted flag.
apps/backend/src/retrieval-addons/retrieval-addons.service.ts - Adds abort checks, uses shared abort-aware delay, and improves error capture for non-Error throws (partially).
apps/backend/src/retrieval-addons/retrieval-addons.service.spec.ts - Adds test ensuring non-Error throws are captured in execution results.
apps/backend/src/metrics-prometheus/metrics-prometheus.module.ts - Documents handler_result semantics for jobs_completed_total.
apps/backend/src/jobs/jobs.service.ts - Enforces job-level timeouts via AbortController and reports aborted jobs distinctly.
apps/backend/src/jobs/jobs.service.spec.ts - Adds metrics and timeout-abort tests for deal/retrieval jobs; updates private-call signatures.
apps/backend/src/deal/deal.service.ts - Propagates abort signal into deal creation/upload/IPNI/retrieval checks; preserves non-Error error messages.
apps/backend/src/deal/deal.service.spec.ts - Adds coverage for preserving non-Error abort reasons through deal creation failure recording.
apps/backend/src/deal-addons/strategies/ipni.strategy.ts - Propagates abort signal through IPNI monitoring/polling and uses abort-aware delay.
apps/backend/src/deal-addons/interfaces/deal-addon.interface.ts - Extends onUploadComplete to accept an optional abort signal.
apps/backend/src/deal-addons/deal-addons.service.ts - Propagates abort signal through upload-complete add-on handlers and uses awaitWithAbort.
apps/backend/src/config/app.config.ts - Adds config schema + loader for job timeout env vars; reduces default HTTP request timeouts.
apps/backend/src/common/abort-utils.ts - Adds shared helpers: createAbortError, awaitWithAbort, and abort-aware delay.
apps/backend/src/common/abort-utils.spec.ts - Adds unit tests for abort utilities.
apps/backend/.env.example - Adds new job timeout vars and updates HTTP timeout defaults/comments.
Comments suppressed due to low confidence (2)

apps/backend/src/retrieval-addons/retrieval-addons.service.ts:207

  • When a retrieval promise rejects in testAllRetrievalMethods, the recorded error uses result.reason?.message || "Unknown error". If a strategy throws a non-Error (the new spec covers this), .message will be undefined and the reason is lost. Prefer result.reason instanceof Error ? result.reason.message : String(result.reason) so execution results preserve the real failure details.
    const executionResults: RetrievalExecutionResult[] = results.map((result, index) => {
      if (result.status === "fulfilled") {
        return result.value;
      } else {
        // Create failed result - retryCount unknown for catastrophic failures
        return {
          url: urlResults[index].url,
          method: urlResults[index].method,
          data: Buffer.alloc(0),
          metrics: {
            latency: 0,
            ttfb: 0,
            throughput: 0,
            statusCode: 0,
            timestamp: new Date(),
            responseSize: 0,
          },
          success: false,
          error: result.reason?.message || "Unknown error",
          retryCount: undefined, // Unknown for catastrophic failures
        };
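
Applied to the excerpt above, the suggested extraction would read roughly as follows (illustrative, not necessarily the merged change):

          error: result.reason instanceof Error ? result.reason.message : String(result.reason),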

apps/backend/src/retrieval/retrieval.service.ts:201

  • performAllRetrievals logs "All retrievals failed" at error level for any thrown error, including aborts thrown via signal.throwIfAborted(). This will create noisy failure logs for expected cancellations/timeouts. Consider skipping the error log (or downgrading to warn) when signal?.aborted is true, similar to the batch-level handling above.
    } catch (error) {
      const errorMessage = error instanceof Error ? error.message : String(error);
      this.logger.error(`All retrievals failed for ${deal.pieceCid}: ${errorMessage}`);
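
As a sketch of the suggested handling (illustrative; signal, deal, and this.logger come from the excerpt above):

    } catch (error) {
      const errorMessage = error instanceof Error ? error.message : String(error);
      if (signal?.aborted) {
        // Expected cancellation/timeout: keep the log quieter than a real failure.
        this.logger.warn(`Retrievals aborted for ${deal.pieceCid}: ${errorMessage}`);
      } else {
        this.logger.error(`All retrievals failed for ${deal.pieceCid}: ${errorMessage}`);
      }
    }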



@SgtPooki SgtPooki self-assigned this Feb 11, 2026
@SgtPooki SgtPooki moved this from 📌 Triage to 🔎 Awaiting review in FOC Feb 11, 2026
@BigLep (Contributor) left a comment


@SgtPooki: if we don't get @silent-cipher's review during his 2026-02-13, I think this would be a good candidate for having another agent double-check the change. It should be able to reason about AbortControllers and their standard behavior and then trace through to make sure the signal is propagated everywhere.

@silent-cipher (Collaborator) left a comment


Looks good to me! Nothing blocking, just a few comments.

@BigLep BigLep moved this from 🔎 Awaiting review to ✔️ Approved by reviewer in FOC Feb 13, 2026
@SgtPooki (Collaborator, Author) commented

Also updated retrieval and deal timeouts to 6m and 1m respectively.

Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 19 out of 19 changed files in this pull request and generated 7 comments.

Comments suppressed due to low confidence (1)

apps/backend/src/jobs/jobs.service.ts:360

  • Inside handleRetrievalJob, timeoutMs is declared for the job-level abort timer and then re-declared inside recordJobExecution for the interval-based retrieval deadline. Reusing the same name makes it easy to pass the wrong value in future edits; renaming one of them would reduce confusion and prevent subtle bugs.
      try {
        const timeoutsConfig = this.configService.get("timeouts");
        const intervalMs = data.intervalSeconds * 1000;
        const timeoutMs = Math.max(10000, intervalMs - timeoutsConfig.retrievalTimeoutBufferMs);
        const httpTimeoutMs = Math.max(timeoutsConfig.httpRequestTimeoutMs, timeoutsConfig.http2RequestTimeoutMs);
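
One way to reduce that risk, sketched with illustrative variable names:

      try {
        const timeoutsConfig = this.configService.get("timeouts");
        const intervalMs = data.intervalSeconds * 1000;
        // Distinct name for the interval-derived retrieval deadline, so it cannot be
        // confused with the job-level abort timeout declared earlier in the handler.
        const retrievalDeadlineMs = Math.max(10000, intervalMs - timeoutsConfig.retrievalTimeoutBufferMs);
        const httpTimeoutMs = Math.max(timeoutsConfig.httpRequestTimeoutMs, timeoutsConfig.http2RequestTimeoutMs);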


@rjan90 rjan90 added this to the M4.1: mainnet ready milestone Feb 16, 2026
@SgtPooki SgtPooki merged commit 0623bcf into main Feb 16, 2026
6 checks passed
@github-project-automation github-project-automation Bot moved this from ✔️ Approved by reviewer to 🎉 Done in FOC Feb 16, 2026
@SgtPooki SgtPooki deleted the 258-we-need-to-set-dealretrieval-max-timeout branch February 16, 2026 12:58
SgtPooki added a commit that referenced this pull request May 5, 2026
* docs(checks): close resolved TBDs in data-storage, events, README

Items previously marked TBD that are now implemented in code:

- data-storage.md assertions #3, #5, #6, #7 (pieceConfirmed, IPNI
  discoverability, retrievability, all-checks-gated) -> Yes.
- data-storage.md poll intervals: replace TBD_VARIABLE refs with the
  concrete sources (hardcoded POLLING_INTERVAL_MS = 2.5s for SP piece
  status, IPNI_VERIFICATION_POLLING_MS env var with 2s default for IPNI
  verification; doc previously claimed 5s).
- data-storage.md section 7 header drops TBD; intro disclaimer removed.
- data-storage.md "TBD Summary" rewritten as "Implementation History"
  with code references for inline retrieval, CID integrity, per-deal
  timeout (AbortController -> DealStatus.FAILED), gated status, status
  model, onPieceConfirmed, IPFS gateway retrieval, filecoin-pin CAR.
- events-and-metrics.md: pieceConfirmed -> Yes (pieceConfirmedOnChainMs
  histogram); ipfsRetrievalIntegrityChecked -> implemented inline via
  per-block sha256 verification in ipfs-block.strategy.ts (no discrete
  event); ipfsRetrievalFirstByte/LastByteReceived marked Partial since
  duration histograms exist but no discrete event; histogram-buckets
  TBD replaced with link to metrics-prometheus.module.ts.
- README.md: name the dataset-creation job (data-set-creation) and
  reference its config envs.

Still TBD (not changed in this commit): uploadToSpStart,
ipniVerificationStart, ipfsRetrievalStart events; jobs.md PR #263
lookahead-skip; PDP_SUBGRAPH_ENDPOINT production value.

* docs(checks): address review feedback on callback names and event states

- data-storage.md: rename Synapse callbacks to plural form
  (onPiecesAdded, onPiecesConfirmed) to match deal.service.ts.
- events-and-metrics.md: same rename in the event list. Clarify that
  dealCreated maps to DealStatus.DEAL_CREATED only after all gates pass
  (upload alone sets UPLOADED, not DEAL_CREATED).
- events-and-metrics.md: ipfsRetrievalIntegrityChecked downgraded from
  Yes to Partial since no discrete event is emitted (inline check
  only).
- events-and-metrics.md: Mermaid timeline now matches the table -
  ipfsRetrievalFirstByteReceived/LastByteReceived labelled as
  "Partial: histogram only", ipfsRetrievalIntegrityChecked labelled
  "Partial: inline check, no event".
- README.md: refer to the canonical pg-boss job type
  data_set_creation (underscore) so operators can map the doc to
  jobType values.

* docs(checks): fix unreadable Mermaid rect fill in event timeline

The 'Data Storage Only' rect used rgb(50, 50, 50), which renders as a
near-black block that hides the message labels and arrows inside it
(both on GitHub light/dark themes). Switch to a translucent
rgba(120, 120, 200, 0.15) so the highlight is visible without
obscuring content.

* docs(events): reframe Event List as timing markers, not emitted events

The 'events' in this doc are named anchors used to define metric Timer
Starts/Ends; dealbot does not necessarily emit each as a discrete
Prometheus event or log line. Add an explicit note up top so readers
don't expect every entry to map to an emitted event, and update rows
that were marked TBD/Partial purely because no discrete event is
emitted.

- uploadToSpStart -> Yes (anchor: deal.uploadStartTime in
  deal.service.ts:255).
- ipniVerificationStart -> Yes (anchor: ipniVerificationStartTime in
  ipni-verification.service.ts:63 - drives ipniVerifyMs).
- ipfsRetrievalStart -> Yes (anchor: retrieval startTime in
  retrieval-addons.service.ts:227; logs 'retrieval_started').
- ipfsRetrievalFirstByteReceived -> Yes (drives
  ipfsRetrievalFirstByteMs).
- ipfsRetrievalLastByteReceived -> Yes (drives
  ipfsRetrievalLastByteMs).
- ipfsRetrievalIntegrityChecked -> Yes (per-block sha256 in
  ipfs-block.strategy.ts; inline, no discrete event).
- Mermaid timeline: drop the (TBD) / (Partial: ...) annotations on
  these markers so the diagram and the table agree.

* docs(events): drop Implemented column from Event List

All rows are now Yes (each marker is anchored in code), so the column
adds no signal. Anchor details folded into the Source-of-truth column.
Intro note tightened.

* Update docs/checks/data-storage.md

Co-authored-by: Puspendra Mahariya <95584952+silent-cipher@users.noreply.github.com>

* Update docs/checks/README.md

Co-authored-by: Puspendra Mahariya <95584952+silent-cipher@users.noreply.github.com>

---------

Co-authored-by: Puspendra Mahariya <95584952+silent-cipher@users.noreply.github.com>
dennis-tra added a commit to probe-lab/dealbot that referenced this pull request May 15, 2026
commit 008f0d8
Author: Puspendra Mahariya <95584952+silent-cipher@users.noreply.github.com>
Date:   Thu May 14 20:52:01 2026 +0530

    feat: filter out sps with dev tags (FilOzone#526)

    * feat: update Synapse stack for filecoin-pin 0.21

    feat: update Synapse stack for filecoin-pin 0.21

    * feat: filter out dev providers from active pool

    * feat: look for service_status

    * docs: document serviceStatus=dev opt-out mechanism for SPs

    * chore: remove excessive test cases

    * refactor: stick to dealbot defined serviceStatus format

    ---------

    Co-authored-by: Phi <orjan.roren@gmail.com>

commit 7aa2f8a
Author: Russell Dempsey <1173416+SgtPooki@users.noreply.github.com>
Date:   Thu May 14 11:15:33 2026 -0400

    fix: handle PDP-terminated datasets via data_set_creation repair (FilOzone#518)

    * fix(jobs): repair PDP-terminated datasets in data_set_creation (FilOzone#379)

    PDP can mark a dataset terminated while FWSS still has pdpEndEpoch=0.
    synapse-sdk createContext filters only on pdpEndEpoch, so it returns
    dead datasets. The next add-pieces fails with "Data set has been
    terminated due to unrecoverable proving failure".

    data_set_creation now classifies each slot as missing | live |
    terminated and runs a bounded repair on terminated: terminateDataSet,
    poll FWSS pdpEndEpoch != 0, mark affected Deal rows cleaned_up in a
    single transaction, defer the replacement to the next tick.

    Deal job now skips when the resolved context's dataSetId is PDP-dead,
    before any data-storage metric, upload, or Deal-row write.

    Includes a one-shot backfill script for existing terminated datasets.

    Upstream trackers (orthogonal): FilOzone/synapse-sdk#780,
    FilOzone/filecoin-services#473.

    * style: biome formatter fixes

    * fix(deal): only treat known terminal probe error as terminated; idempotent repair

    isDataSetLive previously returned false for ANY validateDataSet failure, so a
    transient RPC error could classify a healthy dataset as PDP-terminated and
    trigger destructive repair. It now returns false only for the known terminal
    "does not exist or is not live" message and rethrows everything else.

    repairTerminatedDataSet is now idempotent on partial-prior-run state:
      - If FWSS pdpEndEpoch is already non-zero, skip terminateDataSet entirely.
      - If terminateDataSet reverts with an already-terminated message, treat it
        as a no-op and continue to the FWSS state poll + cleanup.
      - After terminateDataSet, await the tx receipt before polling FWSS state.

    Adds tests for the rethrow path, the already-terminated skip, and the
    revert-as-noop path.

    * fix: address remaining copilot comments

    - waitForPdpEndEpoch: switch abortable sleep to node:timers/promises
      setTimeout({signal}). Removes the manual addEventListener/clearTimeout
      pair, which leaked listeners on resolve and had a race when the signal
      aborted between throwIfAborted() and addEventListener().
    - dev-tools background deal: emit a separate background_deal_skipped
      event/message when createDealForProvider returns null. The previous
      message claimed success on the skip path.

    * chore: drop one-shot backfill script; rely on data_set_creation ticks

    * fix: address silent-cipher review (orphan PENDING + upfront validation)

    - handleDealJob now probes baseline dataset via getDataSetProvisioningStatus
      unconditionally before deal preparation; terminated baseline or selected
      dsIndex fails the job (handler_result=error in Prometheus) instead of
      wasting upload prep.
    - triggerDeal marks the placeholder Deal row FAILED (with errorMessage)
      on the PDP-terminated skip path; preserves the row for HTTP polling and
      audit.
    - Remove dead checkDataSetExists (and its tests); getDataSetProvisioningStatus
      is strictly more informative (missing|live|terminated).

    * fix: replace null-on-skip with typed DealJobTerminatedDataSetError

    createDealForProvider/createDeal now return Promise<Deal> and throw a
    typed error when the targeted data set is PDP-terminated. Callers map
    the typed error to FAILED outcomes without relying on a null return:

    - jobs.service handleDealJob: upfront baseline and dsIndex probes throw
      the typed error; outer catch records handler_result=error and logs
      deal_job_failed_terminated_dataset. dsIndex probe also logs
      dataSetIndex locally before re-throw so per-slot context is preserved.
    - dev-tools triggerDeal: existing background catch updates the Deal row
      to FAILED with the thrown error message.
    - createDeal: a preUploadTerminated flag short-circuits the catch's
      failure metrics and the finally's saveDeal so the terminated path
      does not spam metrics or rows.
    - waitForPdpEndEpoch: wrap getDataSet in awaitWithAbort so in-flight
      polls honor the abort signal (Copilot 3229623471).

    * chore: trim redundant abort check and narrating comments

    - waitForPdpEndEpoch: drop signal?.throwIfAborted() at loop head;
      awaitWithAbort already performs it.
    - Trim narrating fragments from PDP-terminated guard comments; keep
      only the non-obvious FWSS-vs-PDP rationale and the issue link.

    * chore: biome format dev-tools event-name ternary

    * refactor: centralize data-set probe in DealService (FilOzone#535)

    * refactor: centralize data-set probe in DealService

    Lift dsIndex selection + provisioning probe out of handleDealJob into
    DealService.resolveDataSetMetadataForDeal, invoked from
    createDealForProvider. handleDealJob just delegates and maps
    DealJobTerminatedDataSetError to handler_result="error".

    Behavior change: when minNumDataSetsForChecks > 1 and the randomly
    selected indexed slot is PDP-terminated, the deal job falls back to the
    baseline slot instead of failing (logs deal_job_dataset_index_terminated
    first). data_set_creation still owns repair.

    The post-createContext isDataSetLive guard inside createDeal stays as the
    commit-time TOCTOU check on the exact dataSetId the upload will use.

    * style: biome format + consolidate indexed-slot fallback tests via it.each

    * docs: drop TOCTOU phrasing in resolveDataSetMetadataForDeal jsdoc

commit d2f21ce
Author: Russell Dempsey <1173416+SgtPooki@users.noreply.github.com>
Date:   Wed May 13 07:55:03 2026 -0400

    feat(web): link to combined approved-SP dashboard on landing (FilOzone#525)

    * feat(web): link to combined approved-SP dashboard on landing

    Adds a configurable link on the landing page pointing to the BetterStack
    dashboard that shows combined performance metrics for approved SPs. Lets
    visitors see overall FOC storage experience without first picking an SP.

    Configured via APPROVED_SP_DASHBOARD_URL (runtime) /
    VITE_APPROVED_SP_DASHBOARD_URL (build).

    Closes FilOzone#384

    * feat(web): network-aware approved-SP dashboard CTA

    - Split APPROVED_SP_DASHBOARD_URL into per-network vars (_MAINNET / _CALIBRATION)
      so a single web deployment can serve both networks correctly.
    - Render the link as a primary CTA card above the per-SP table, with copy
      qualified by the current network ("...on Calibration").
    - a11y: mark decorative ExternalLink icons aria-hidden.

    * fix(web): name both runtime and build-time vars in invalid-URL warning

    Copilot review feedback: warning previously named only the runtime var,
    but getConfigUrl falls back to the VITE_* build var too. Surface both
    so operators can find the right knob during local dev.

    * chore(web): biome format

commit c9ad711
Author: Phi-rjan <orjan.roren@gmail.com>
Date:   Wed May 13 03:50:42 2026 +0200

    chore: update Synapse stack for filecoin-pin 0.21 (FilOzone#521)

    * feat: update Synapse stack for filecoin-pin 0.21

    feat: update Synapse stack for filecoin-pin 0.21

    * docs(checks): rename Synapse progress events for filecoin-pin 0.21

    ---------

    Co-authored-by: Russell Dempsey <1173416+SgtPooki@users.noreply.github.com>

commit 0bb5217
Author: Russell Dempsey <1173416+SgtPooki@users.noreply.github.com>
Date:   Tue May 12 13:42:30 2026 -0400

    fix: stop retention counter double-counts (FilOzone#519)

    * fix: stop retention counter double-counts

    * refactor(data-retention): drop redundant poll guard and instance map

    * docs(data-retention): align wording with poll-local baselines

commit fd6ce8a
Author: FilOz Bot <infra+github-fil-ozzy@filoz.org>
Date:   Thu May 7 00:40:00 2026 -0700

    chore: release to production (main) (FilOzone#514)

commit 9ef0235
Author: Puspendra Mahariya <95584952+silent-cipher@users.noreply.github.com>
Date:   Thu May 7 12:57:21 2026 +0530

    fix: revert back to old synapse version (FilOzone#512)

commit 8658e34
Author: FilOz Bot <infra+github-fil-ozzy@filoz.org>
Date:   Tue May 5 17:44:14 2026 +0200

    chore: release to production (main) (FilOzone#458)

commit c410184
Author: Russell Dempsey <1173416+SgtPooki@users.noreply.github.com>
Date:   Tue May 5 09:50:50 2026 -0400

    docs(checks): close resolved TBDs in data-storage, events, README (FilOzone#481)

    * docs(checks): close resolved TBDs in data-storage, events, README

    Items previously marked TBD that are now implemented in code:

    - data-storage.md assertions #3, #5, #6, #7 (pieceConfirmed, IPNI
      discoverability, retrievability, all-checks-gated) -> Yes.
    - data-storage.md poll intervals: replace TBD_VARIABLE refs with the
      concrete sources (hardcoded POLLING_INTERVAL_MS = 2.5s for SP piece
      status, IPNI_VERIFICATION_POLLING_MS env var with 2s default for IPNI
      verification; doc previously claimed 5s).
    - data-storage.md section 7 header drops TBD; intro disclaimer removed.
    - data-storage.md "TBD Summary" rewritten as "Implementation History"
      with code references for inline retrieval, CID integrity, per-deal
      timeout (AbortController -> DealStatus.FAILED), gated status, status
      model, onPieceConfirmed, IPFS gateway retrieval, filecoin-pin CAR.
    - events-and-metrics.md: pieceConfirmed -> Yes (pieceConfirmedOnChainMs
      histogram); ipfsRetrievalIntegrityChecked -> implemented inline via
      per-block sha256 verification in ipfs-block.strategy.ts (no discrete
      event); ipfsRetrievalFirstByte/LastByteReceived marked Partial since
      duration histograms exist but no discrete event; histogram-buckets
      TBD replaced with link to metrics-prometheus.module.ts.
    - README.md: name the dataset-creation job (data-set-creation) and
      reference its config envs.

    Still TBD (not changed in this commit): uploadToSpStart,
    ipniVerificationStart, ipfsRetrievalStart events; jobs.md PR FilOzone#263
    lookahead-skip; PDP_SUBGRAPH_ENDPOINT production value.

    * docs(checks): address review feedback on callback names and event states

    - data-storage.md: rename Synapse callbacks to plural form
      (onPiecesAdded, onPiecesConfirmed) to match deal.service.ts.
    - events-and-metrics.md: same rename in the event list. Clarify that
      dealCreated maps to DealStatus.DEAL_CREATED only after all gates pass
      (upload alone sets UPLOADED, not DEAL_CREATED).
    - events-and-metrics.md: ipfsRetrievalIntegrityChecked downgraded from
      Yes to Partial since no discrete event is emitted (inline check
      only).
    - events-and-metrics.md: Mermaid timeline now matches the table -
      ipfsRetrievalFirstByteReceived/LastByteReceived labelled as
      "Partial: histogram only", ipfsRetrievalIntegrityChecked labelled
      "Partial: inline check, no event".
    - README.md: refer to the canonical pg-boss job type
      data_set_creation (underscore) so operators can map the doc to
      jobType values.

    * docs(checks): fix unreadable Mermaid rect fill in event timeline

    The 'Data Storage Only' rect used rgb(50, 50, 50), which renders as a
    near-black block that hides the message labels and arrows inside it
    (both on GitHub light/dark themes). Switch to a translucent
    rgba(120, 120, 200, 0.15) so the highlight is visible without
    obscuring content.

    * docs(events): reframe Event List as timing markers, not emitted events

    The 'events' in this doc are named anchors used to define metric Timer
    Starts/Ends; dealbot does not necessarily emit each as a discrete
    Prometheus event or log line. Add an explicit note up top so readers
    don't expect every entry to map to an emitted event, and update rows
    that were marked TBD/Partial purely because no discrete event is
    emitted.

    - uploadToSpStart -> Yes (anchor: deal.uploadStartTime in
      deal.service.ts:255).
    - ipniVerificationStart -> Yes (anchor: ipniVerificationStartTime in
      ipni-verification.service.ts:63 - drives ipniVerifyMs).
    - ipfsRetrievalStart -> Yes (anchor: retrieval startTime in
      retrieval-addons.service.ts:227; logs 'retrieval_started').
    - ipfsRetrievalFirstByteReceived -> Yes (drives
      ipfsRetrievalFirstByteMs).
    - ipfsRetrievalLastByteReceived -> Yes (drives
      ipfsRetrievalLastByteMs).
    - ipfsRetrievalIntegrityChecked -> Yes (per-block sha256 in
      ipfs-block.strategy.ts; inline, no discrete event).
    - Mermaid timeline: drop the (TBD) / (Partial: ...) annotations on
      these markers so the diagram and the table agree.

    * docs(events): drop Implemented column from Event List

    All rows are now Yes (each marker is anchored in code), so the column
    adds no signal. Anchor details folded into the Source-of-truth column.
    Intro note tightened.

    * Update docs/checks/data-storage.md

    Co-authored-by: Puspendra Mahariya <95584952+silent-cipher@users.noreply.github.com>

    * Update docs/checks/README.md

    Co-authored-by: Puspendra Mahariya <95584952+silent-cipher@users.noreply.github.com>

    ---------

    Co-authored-by: Puspendra Mahariya <95584952+silent-cipher@users.noreply.github.com>

commit 126b2d8
Author: Russell Dempsey <1173416+SgtPooki@users.noreply.github.com>
Date:   Tue May 5 08:24:47 2026 -0400

    fix(deal): cancel onStored addons when upload fails (FilOzone#505)

    * fix(deal): cancel onStored addons when upload fails

    The synapse-sdk StorageContext.upload fires onStored before commit/addPieces,
    so dealbot's IPNI monitoring runs detached. When executeUpload throws (e.g.
    409 on POST /pdp/data-sets/{id}/pieces against a Curio-terminated dataset),
    the leaked IPNI poll runs to its 120s timeout and logs a misleading
    ipni_tracking_failed event after the deal already failed.

    Wire an AbortController for the detached addons, composed with the parent
    signal via AbortSignal.any, and abort + drain in the catch path.

    Closes FilOzone#503.

    * fix(deal): clear ipniStatus on aborted onStored, fix TS narrowing

    Two follow-ups on the addon-cancel path:

    1. Use a wrapper object for onStoredAddons.promise so TS preserves the
       union type across closure mutation in onProgress; the prior
       `let x: Promise<boolean> | null = null` pattern narrowed to `null`
       in finally and broke typecheck.

    2. Clear deal.ipniStatus on aborted onStored runs. IpniAddonStrategy.onStored
       sets PENDING before awaiting; if we abort before terminal status is set,
       sp_performance_query.helper counts PENDING as `total_ipni_deals` and
       depresses ipni_success_rate. Set null on FAILED deals so aborted runs
       don't pollute the metric.

    * fix(deal): only clear ipniStatus when still PENDING after addon abort

    Earlier fix cleared ipniStatus for any FAILED deal, which would also
    wipe legitimate IpniStatus.FAILED set by IpniAddonStrategy on real IPNI
    failures and IpniStatus.VERIFIED on retrieval-stage failures. Narrow
    the condition to PENDING so only mid-flight aborts are cleared.

    * style: apply biome format

    * fix(deal): skip onStored addon abort when success path already awaited it

Labels: none yet

Projects: FOC (Status: 🎉 Done)

Development: successfully merging this pull request may close issue #258, "we need to set deal/retrieval max timeout".

6 participants