
feat: cap deal/retrievals with abort signals #263

Merged
SgtPooki merged 10 commits into main from 258-we-need-to-set-dealretrieval-max-timeout on Feb 16, 2026

Conversation

@SgtPooki (Collaborator) commented Feb 11, 2026

Summary
Adds end-to-end abort propagation with shared utilities, introduces job-level timeout enforcement for deal/retrieval jobs, and improves observability/testing around aborted runs and retrieval failures. Updates HTTP timeout defaults and documents new job timeout env vars.

Problem
Long-running deal/retrieval jobs and downstream steps lacked consistent abort handling, causing wasted work and incomplete observability when timeouts or cancellations occurred. Abort reasons could be lost, and job metrics didn't clearly distinguish aborts from failures.

Solution

  • Added abort-utils helpers (createAbortError, awaitWithAbort, delay) with tests; a sketch follows this list.
  • Propagated AbortSignal through deal/retrieval flows, add-ons, and IPNI polling, preventing new work on abort while preserving partial results.
  • Job runner now enforces per‑job timeouts (deal/retrieval) via AbortController, records handler_result="aborted", and keeps success vs business failure semantics.
  • Retrieval results carry an aborted flag; errors preserve non‑Error abort reasons.
  • Added/updated tests for abort behavior and error preservation.
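
For illustration, a minimal sketch of what helpers with these names might look like (TypeScript; the bodies below are assumptions, not the merged implementation):

    // Illustrative sketch only: names follow the PR description (createAbortError,
    // awaitWithAbort, delay); the bodies are assumptions, not the merged code.

    /** Build the error thrown on abort, preserving a non-Error abort reason. */
    export function createAbortError(reason?: unknown): Error {
      if (reason instanceof Error) return reason;
      const error = new Error(reason !== undefined ? String(reason) : "Aborted");
      error.name = "AbortError";
      return error;
    }

    /** Race a promise against an abort signal so callers stop waiting once aborted. */
    export async function awaitWithAbort<T>(promise: Promise<T>, signal?: AbortSignal): Promise<T> {
      if (!signal) return promise;
      signal.throwIfAborted();
      return new Promise<T>((resolve, reject) => {
        const onAbort = () => reject(createAbortError(signal.reason));
        signal.addEventListener("abort", onAbort, { once: true });
        promise.then(resolve, reject).finally(() => signal.removeEventListener("abort", onAbort));
      });
    }

    /** Abort-aware sleep (the underlying timer is not cleared in this sketch). */
    export function delay(ms: number, signal?: AbortSignal): Promise<void> {
      return awaitWithAbort(new Promise<void>((resolve) => setTimeout(resolve, ms)), signal);
    }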

Notes

  • New env vars DEAL_JOB_TIMEOUT_SECONDS and RETRIEVAL_JOB_TIMEOUT_SECONDS (defaults 6m and 1m respectively); docs updated. A wiring sketch follows this list.
  • HTTP request timeout defaults reduced to 4m to align with expected transfer throughput.
  • Metrics doc now describes jobs_completed_total handler result values (success, aborted, error).
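
To make the timeout wiring concrete, a rough sketch of how a per-job cap read from one of these env vars can drive an AbortController (the function and variable names below are illustrative assumptions, not the merged code):

    // Illustrative sketch: enforce a per-job timeout and report aborted runs
    // distinctly from business failures. runJobWithTimeout is a hypothetical name.
    async function runJobWithTimeout(
      handler: (signal: AbortSignal) => Promise<void>,
      timeoutSeconds: number,
    ): Promise<"success" | "aborted" | "error"> {
      const controller = new AbortController();
      const timer = setTimeout(() => controller.abort(new Error("job timed out")), timeoutSeconds * 1000);
      try {
        await handler(controller.signal); // handler checks/propagates the signal downstream
        return "success";
      } catch {
        // A timed-out run maps to handler_result="aborted"; anything else stays an error.
        return controller.signal.aborted ? "aborted" : "error";
      } finally {
        clearTimeout(timer);
      }
    }

    // e.g. deal jobs would read their cap from the env var documented above
    const dealTimeoutSeconds = Number(process.env.DEAL_JOB_TIMEOUT_SECONDS ?? 360);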

Fixes #258

Copilot AI review requested due to automatic review settings February 11, 2026 19:50
@SgtPooki SgtPooki linked an issue Feb 11, 2026 that may be closed by this pull request
@FilOzzy FilOzzy added this to FOC Feb 11, 2026
@github-project-automation github-project-automation Bot moved this to 📌 Triage in FOC Feb 11, 2026
Copilot AI (Contributor) left a comment

Pull request overview

Adds end-to-end abort propagation and job-level timeout enforcement for deal/retrieval workflows so long-running jobs can be actively cancelled, and improves metrics/logging that distinguish aborts from failures.

Changes:

  • Introduces shared abort helpers (abort-utils) and adopts AbortSignal propagation across deal/retrieval flows (including add-ons and IPNI polling).
  • Enforces per-job timeouts in the pg-boss job runner and records handler_result="aborted" for timed-out executions.
  • Updates defaults/docs for job timeouts and HTTP request timeouts, plus adds/updates tests for abort behavior and error preservation.

Reviewed changes

Copilot reviewed 18 out of 18 changed files in this pull request and generated 6 comments.

Summary per file

File - Description
docs/environment-variables.md - Documents new job timeout env vars and adds them to the quick reference.
apps/backend/src/retrieval/retrieval.service.ts - Propagates abort signals through retrieval execution and adjusts batch behavior on abort.
apps/backend/src/retrieval/retrieval.service.spec.ts - Updates tests for abort behavior and aligns Deal IDs with UUID strings.
apps/backend/src/retrieval-addons/types.ts - Extends retrieval test result shape with an optional aborted flag.
apps/backend/src/retrieval-addons/retrieval-addons.service.ts - Adds abort checks, uses shared abort-aware delay, and improves error capture for non-Error throws (partially).
apps/backend/src/retrieval-addons/retrieval-addons.service.spec.ts - Adds test ensuring non-Error throws are captured in execution results.
apps/backend/src/metrics-prometheus/metrics-prometheus.module.ts - Documents handler_result semantics for jobs_completed_total.
apps/backend/src/jobs/jobs.service.ts - Enforces job-level timeouts via AbortController and reports aborted jobs distinctly.
apps/backend/src/jobs/jobs.service.spec.ts - Adds metrics and timeout-abort tests for deal/retrieval jobs; updates private-call signatures.
apps/backend/src/deal/deal.service.ts - Propagates abort signal into deal creation/upload/IPNI/retrieval checks; preserves non-Error error messages.
apps/backend/src/deal/deal.service.spec.ts - Adds coverage for preserving non-Error abort reasons through deal creation failure recording.
apps/backend/src/deal-addons/strategies/ipni.strategy.ts - Propagates abort signal through IPNI monitoring/polling and uses abort-aware delay.
apps/backend/src/deal-addons/interfaces/deal-addon.interface.ts - Extends onUploadComplete to accept an optional abort signal.
apps/backend/src/deal-addons/deal-addons.service.ts - Propagates abort signal through upload-complete add-on handlers and uses awaitWithAbort.
apps/backend/src/config/app.config.ts - Adds config schema + loader for job timeout env vars; reduces default HTTP request timeouts.
apps/backend/src/common/abort-utils.ts - Adds shared helpers: createAbortError, awaitWithAbort, and abort-aware delay.
apps/backend/src/common/abort-utils.spec.ts - Adds unit tests for abort utilities.
apps/backend/.env.example - Adds new job timeout vars and updates HTTP timeout defaults/comments.
Comments suppressed due to low confidence (2)

apps/backend/src/retrieval-addons/retrieval-addons.service.ts:207

  • When a retrieval promise rejects in testAllRetrievalMethods, the recorded error uses result.reason?.message || "Unknown error". If a strategy throws a non-Error (the new spec covers this), .message will be undefined and the reason is lost. Prefer result.reason instanceof Error ? result.reason.message : String(result.reason) so execution results preserve the real failure details.
    const executionResults: RetrievalExecutionResult[] = results.map((result, index) => {
      if (result.status === "fulfilled") {
        return result.value;
      } else {
        // Create failed result - retryCount unknown for catastrophic failures
        return {
          url: urlResults[index].url,
          method: urlResults[index].method,
          data: Buffer.alloc(0),
          metrics: {
            latency: 0,
            ttfb: 0,
            throughput: 0,
            statusCode: 0,
            timestamp: new Date(),
            responseSize: 0,
          },
          success: false,
          error: result.reason?.message || "Unknown error",
          retryCount: undefined, // Unknown for catastrophic failures
        };
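
Applied to the excerpt above, the suggested extraction would read roughly as follows (illustrative, not necessarily the merged change):

          error: result.reason instanceof Error ? result.reason.message : String(result.reason),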

apps/backend/src/retrieval/retrieval.service.ts:201

  • performAllRetrievals logs "All retrievals failed" at error level for any thrown error, including aborts thrown via signal.throwIfAborted(). This will create noisy failure logs for expected cancellations/timeouts. Consider skipping the error log (or downgrading to warn) when signal?.aborted is true, similar to the batch-level handling above.
    } catch (error) {
      const errorMessage = error instanceof Error ? error.message : String(error);
      this.logger.error(`All retrievals failed for ${deal.pieceCid}: ${errorMessage}`);
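
As a sketch of the suggested handling (illustrative; signal, deal, and this.logger come from the excerpt above):

    } catch (error) {
      const errorMessage = error instanceof Error ? error.message : String(error);
      if (signal?.aborted) {
        // Expected cancellation/timeout: keep the log quieter than a real failure.
        this.logger.warn(`Retrievals aborted for ${deal.pieceCid}: ${errorMessage}`);
      } else {
        this.logger.error(`All retrievals failed for ${deal.pieceCid}: ${errorMessage}`);
      }
    }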



@SgtPooki SgtPooki self-assigned this Feb 11, 2026
@SgtPooki SgtPooki moved this from 📌 Triage to 🔎 Awaiting review in FOC Feb 11, 2026
@BigLep (Contributor) left a comment


@SgtPooki: if we don't get @silent-cipher's review during his 2026-02-13, I think this would be a good candidate for having another agent double-check the change. It should be able to reason about AbortControllers and their standard behavior and then trace through to make sure the signal is propagated everywhere.

@silent-cipher (Collaborator) left a comment


Looks good to me! Nothing blocking, just a few comments.

@BigLep BigLep moved this from 🔎 Awaiting review to ✔️ Approved by reviewer in FOC Feb 13, 2026
@SgtPooki (Collaborator, Author) commented

Also updated retrieval and deal timeouts to 6m and 1m respectively.

Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 19 out of 19 changed files in this pull request and generated 7 comments.

Comments suppressed due to low confidence (1)

apps/backend/src/jobs/jobs.service.ts:360

  • Inside handleRetrievalJob, timeoutMs is declared for the job-level abort timer and then re-declared inside recordJobExecution for the interval-based retrieval deadline. Reusing the same name makes it easy to pass the wrong value in future edits; renaming one of them would reduce confusion and prevent subtle bugs.
      try {
        const timeoutsConfig = this.configService.get("timeouts");
        const intervalMs = data.intervalSeconds * 1000;
        const timeoutMs = Math.max(10000, intervalMs - timeoutsConfig.retrievalTimeoutBufferMs);
        const httpTimeoutMs = Math.max(timeoutsConfig.httpRequestTimeoutMs, timeoutsConfig.http2RequestTimeoutMs);
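
One way to reduce that risk, sketched with illustrative variable names:

      try {
        const timeoutsConfig = this.configService.get("timeouts");
        const intervalMs = data.intervalSeconds * 1000;
        // Distinct name for the interval-derived retrieval deadline, so it cannot be
        // confused with the job-level abort timeout declared earlier in the handler.
        const retrievalDeadlineMs = Math.max(10000, intervalMs - timeoutsConfig.retrievalTimeoutBufferMs);
        const httpTimeoutMs = Math.max(timeoutsConfig.httpRequestTimeoutMs, timeoutsConfig.http2RequestTimeoutMs);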


@rjan90 rjan90 added this to the M4.1: mainnet ready milestone Feb 16, 2026
@SgtPooki SgtPooki merged commit 0623bcf into main Feb 16, 2026
6 checks passed
@github-project-automation github-project-automation Bot moved this from ✔️ Approved by reviewer to 🎉 Done in FOC Feb 16, 2026
@SgtPooki SgtPooki deleted the 258-we-need-to-set-dealretrieval-max-timeout branch February 16, 2026 12:58
SgtPooki added a commit that referenced this pull request May 5, 2026
* docs(checks): close resolved TBDs in data-storage, events, README

Items previously marked TBD that are now implemented in code:

- data-storage.md assertions #3, #5, #6, #7 (pieceConfirmed, IPNI
  discoverability, retrievability, all-checks-gated) -> Yes.
- data-storage.md poll intervals: replace TBD_VARIABLE refs with the
  concrete sources (hardcoded POLLING_INTERVAL_MS = 2.5s for SP piece
  status, IPNI_VERIFICATION_POLLING_MS env var with 2s default for IPNI
  verification; doc previously claimed 5s).
- data-storage.md section 7 header drops TBD; intro disclaimer removed.
- data-storage.md "TBD Summary" rewritten as "Implementation History"
  with code references for inline retrieval, CID integrity, per-deal
  timeout (AbortController -> DealStatus.FAILED), gated status, status
  model, onPieceConfirmed, IPFS gateway retrieval, filecoin-pin CAR.
- events-and-metrics.md: pieceConfirmed -> Yes (pieceConfirmedOnChainMs
  histogram); ipfsRetrievalIntegrityChecked -> implemented inline via
  per-block sha256 verification in ipfs-block.strategy.ts (no discrete
  event); ipfsRetrievalFirstByte/LastByteReceived marked Partial since
  duration histograms exist but no discrete event; histogram-buckets
  TBD replaced with link to metrics-prometheus.module.ts.
- README.md: name the dataset-creation job (data-set-creation) and
  reference its config envs.

Still TBD (not changed in this commit): uploadToSpStart,
ipniVerificationStart, ipfsRetrievalStart events; jobs.md PR #263
lookahead-skip; PDP_SUBGRAPH_ENDPOINT production value.

* docs(checks): address review feedback on callback names and event states

- data-storage.md: rename Synapse callbacks to plural form
  (onPiecesAdded, onPiecesConfirmed) to match deal.service.ts.
- events-and-metrics.md: same rename in the event list. Clarify that
  dealCreated maps to DealStatus.DEAL_CREATED only after all gates pass
  (upload alone sets UPLOADED, not DEAL_CREATED).
- events-and-metrics.md: ipfsRetrievalIntegrityChecked downgraded from
  Yes to Partial since no discrete event is emitted (inline check
  only).
- events-and-metrics.md: Mermaid timeline now matches the table -
  ipfsRetrievalFirstByteReceived/LastByteReceived labelled as
  "Partial: histogram only", ipfsRetrievalIntegrityChecked labelled
  "Partial: inline check, no event".
- README.md: refer to the canonical pg-boss job type
  data_set_creation (underscore) so operators can map the doc to
  jobType values.

* docs(checks): fix unreadable Mermaid rect fill in event timeline

The 'Data Storage Only' rect used rgb(50, 50, 50), which renders as a
near-black block that hides the message labels and arrows inside it
(both on GitHub light/dark themes). Switch to a translucent
rgba(120, 120, 200, 0.15) so the highlight is visible without
obscuring content.

* docs(events): reframe Event List as timing markers, not emitted events

The 'events' in this doc are named anchors used to define metric Timer
Starts/Ends; dealbot does not necessarily emit each as a discrete
Prometheus event or log line. Add an explicit note up top so readers
don't expect every entry to map to an emitted event, and update rows
that were marked TBD/Partial purely because no discrete event is
emitted.

- uploadToSpStart -> Yes (anchor: deal.uploadStartTime in
  deal.service.ts:255).
- ipniVerificationStart -> Yes (anchor: ipniVerificationStartTime in
  ipni-verification.service.ts:63 - drives ipniVerifyMs).
- ipfsRetrievalStart -> Yes (anchor: retrieval startTime in
  retrieval-addons.service.ts:227; logs 'retrieval_started').
- ipfsRetrievalFirstByteReceived -> Yes (drives
  ipfsRetrievalFirstByteMs).
- ipfsRetrievalLastByteReceived -> Yes (drives
  ipfsRetrievalLastByteMs).
- ipfsRetrievalIntegrityChecked -> Yes (per-block sha256 in
  ipfs-block.strategy.ts; inline, no discrete event).
- Mermaid timeline: drop the (TBD) / (Partial: ...) annotations on
  these markers so the diagram and the table agree.

* docs(events): drop Implemented column from Event List

All rows are now Yes (each marker is anchored in code), so the column
adds no signal. Anchor details folded into the Source-of-truth column.
Intro note tightened.

* Update docs/checks/data-storage.md

Co-authored-by: Puspendra Mahariya <95584952+silent-cipher@users.noreply.github.com>

* Update docs/checks/README.md

Co-authored-by: Puspendra Mahariya <95584952+silent-cipher@users.noreply.github.com>

---------

Co-authored-by: Puspendra Mahariya <95584952+silent-cipher@users.noreply.github.com>
dennis-tra added a commit to probe-lab/dealbot that referenced this pull request May 15, 2026
commit 008f0d8
Author: Puspendra Mahariya <95584952+silent-cipher@users.noreply.github.com>
Date:   Thu May 14 20:52:01 2026 +0530

    feat: filter out sps with dev tags (FilOzone#526)

    * feat: update Synapse stack for filecoin-pin 0.21

    feat: update Synapse stack for filecoin-pin 0.21

    * feat: filter out dev providers from active pool

    * feat: look for service_status

    * docs: document serviceStatus=dev opt-out mechanism for SPs

    * chore: remove excessive test cases

    * refactor: stick to dealbot defined serviceStatus format

    ---------

    Co-authored-by: Phi <orjan.roren@gmail.com>

commit 7aa2f8a
Author: Russell Dempsey <1173416+SgtPooki@users.noreply.github.com>
Date:   Thu May 14 11:15:33 2026 -0400

    fix: handle PDP-terminated datasets via data_set_creation repair (FilOzone#518)

    * fix(jobs): repair PDP-terminated datasets in data_set_creation (FilOzone#379)

    PDP can mark a dataset terminated while FWSS still has pdpEndEpoch=0.
    synapse-sdk createContext filters only on pdpEndEpoch, so it returns
    dead datasets. The next add-pieces fails with "Data set has been
    terminated due to unrecoverable proving failure".

    data_set_creation now classifies each slot as missing | live |
    terminated and runs a bounded repair on terminated: terminateDataSet,
    poll FWSS pdpEndEpoch != 0, mark affected Deal rows cleaned_up in a
    single transaction, defer the replacement to the next tick.

    Deal job now skips when the resolved context's dataSetId is PDP-dead,
    before any data-storage metric, upload, or Deal-row write.

    Includes a one-shot backfill script for existing terminated datasets.

    Upstream trackers (orthogonal): FilOzone/synapse-sdk#780,
    FilOzone/filecoin-services#473.

    * style: biome formatter fixes

    * fix(deal): only treat known terminal probe error as terminated; idempotent repair

    isDataSetLive previously returned false for ANY validateDataSet failure, so a
    transient RPC error could classify a healthy dataset as PDP-terminated and
    trigger destructive repair. It now returns false only for the known terminal
    "does not exist or is not live" message and rethrows everything else.

    repairTerminatedDataSet is now idempotent on partial-prior-run state:
      - If FWSS pdpEndEpoch is already non-zero, skip terminateDataSet entirely.
      - If terminateDataSet reverts with an already-terminated message, treat it
        as a no-op and continue to the FWSS state poll + cleanup.
      - After terminateDataSet, await the tx receipt before polling FWSS state.

    Adds tests for the rethrow path, the already-terminated skip, and the
    revert-as-noop path.

    * fix: address remaining copilot comments

    - waitForPdpEndEpoch: switch abortable sleep to node:timers/promises
      setTimeout({signal}). Removes the manual addEventListener/clearTimeout
      pair, which leaked listeners on resolve and had a race when the signal
      aborted between throwIfAborted() and addEventListener().
    - dev-tools background deal: emit a separate background_deal_skipped
      event/message when createDealForProvider returns null. The previous
      message claimed success on the skip path.

    * chore: drop one-shot backfill script; rely on data_set_creation ticks

    * fix: address silent-cipher review (orphan PENDING + upfront validation)

    - handleDealJob now probes baseline dataset via getDataSetProvisioningStatus
      unconditionally before deal preparation; terminated baseline or selected
      dsIndex fails the job (handler_result=error in Prometheus) instead of
      wasting upload prep.
    - triggerDeal marks the placeholder Deal row FAILED (with errorMessage)
      on the PDP-terminated skip path; preserves the row for HTTP polling and
      audit.
    - Remove dead checkDataSetExists (and its tests); getDataSetProvisioningStatus
      is strictly more informative (missing|live|terminated).

    * fix: replace null-on-skip with typed DealJobTerminatedDataSetError

    createDealForProvider/createDeal now return Promise<Deal> and throw a
    typed error when the targeted data set is PDP-terminated. Callers map
    the typed error to FAILED outcomes without relying on a null return:

    - jobs.service handleDealJob: upfront baseline and dsIndex probes throw
      the typed error; outer catch records handler_result=error and logs
      deal_job_failed_terminated_dataset. dsIndex probe also logs
      dataSetIndex locally before re-throw so per-slot context is preserved.
    - dev-tools triggerDeal: existing background catch updates the Deal row
      to FAILED with the thrown error message.
    - createDeal: a preUploadTerminated flag short-circuits the catch's
      failure metrics and the finally's saveDeal so the terminated path
      does not spam metrics or rows.
    - waitForPdpEndEpoch: wrap getDataSet in awaitWithAbort so in-flight
      polls honor the abort signal (Copilot 3229623471).

    * chore: trim redundant abort check and narrating comments

    - waitForPdpEndEpoch: drop signal?.throwIfAborted() at loop head;
      awaitWithAbort already performs it.
    - Trim narrating fragments from PDP-terminated guard comments; keep
      only the non-obvious FWSS-vs-PDP rationale and the issue link.

    * chore: biome format dev-tools event-name ternary

    * refactor: centralize data-set probe in DealService (FilOzone#535)

    * refactor: centralize data-set probe in DealService

    Lift dsIndex selection + provisioning probe out of handleDealJob into
    DealService.resolveDataSetMetadataForDeal, invoked from
    createDealForProvider. handleDealJob just delegates and maps
    DealJobTerminatedDataSetError to handler_result="error".

    Behavior change: when minNumDataSetsForChecks > 1 and the randomly
    selected indexed slot is PDP-terminated, the deal job falls back to the
    baseline slot instead of failing (logs deal_job_dataset_index_terminated
    first). data_set_creation still owns repair.

    The post-createContext isDataSetLive guard inside createDeal stays as the
    commit-time TOCTOU check on the exact dataSetId the upload will use.

    * style: biome format + consolidate indexed-slot fallback tests via it.each

    * docs: drop TOCTOU phrasing in resolveDataSetMetadataForDeal jsdoc

commit d2f21ce
Author: Russell Dempsey <1173416+SgtPooki@users.noreply.github.com>
Date:   Wed May 13 07:55:03 2026 -0400

    feat(web): link to combined approved-SP dashboard on landing (FilOzone#525)

    * feat(web): link to combined approved-SP dashboard on landing

    Adds a configurable link on the landing page pointing to the BetterStack
    dashboard that shows combined performance metrics for approved SPs. Lets
    visitors see overall FOC storage experience without first picking an SP.

    Configured via APPROVED_SP_DASHBOARD_URL (runtime) /
    VITE_APPROVED_SP_DASHBOARD_URL (build).

    Closes FilOzone#384

    * feat(web): network-aware approved-SP dashboard CTA

    - Split APPROVED_SP_DASHBOARD_URL into per-network vars (_MAINNET / _CALIBRATION)
      so a single web deployment can serve both networks correctly.
    - Render the link as a primary CTA card above the per-SP table, with copy
      qualified by the current network ("...on Calibration").
    - a11y: mark decorative ExternalLink icons aria-hidden.

    * fix(web): name both runtime and build-time vars in invalid-URL warning

    Copilot review feedback: warning previously named only the runtime var,
    but getConfigUrl falls back to the VITE_* build var too. Surface both
    so operators can find the right knob during local dev.

    * chore(web): biome format

commit c9ad711
Author: Phi-rjan <orjan.roren@gmail.com>
Date:   Wed May 13 03:50:42 2026 +0200

    chore: update Synapse stack for filecoin-pin 0.21 (FilOzone#521)

    * feat: update Synapse stack for filecoin-pin 0.21

    feat: update Synapse stack for filecoin-pin 0.21

    * docs(checks): rename Synapse progress events for filecoin-pin 0.21

    ---------

    Co-authored-by: Russell Dempsey <1173416+SgtPooki@users.noreply.github.com>

commit 0bb5217
Author: Russell Dempsey <1173416+SgtPooki@users.noreply.github.com>
Date:   Tue May 12 13:42:30 2026 -0400

    fix: stop retention counter double-counts (FilOzone#519)

    * fix: stop retention counter double-counts

    * refactor(data-retention): drop redundant poll guard and instance map

    * docs(data-retention): align wording with poll-local baselines

commit fd6ce8a
Author: FilOz Bot <infra+github-fil-ozzy@filoz.org>
Date:   Thu May 7 00:40:00 2026 -0700

    chore: release to production (main) (FilOzone#514)

commit 9ef0235
Author: Puspendra Mahariya <95584952+silent-cipher@users.noreply.github.com>
Date:   Thu May 7 12:57:21 2026 +0530

    fix: revert back to old synapse version (FilOzone#512)

commit 8658e34
Author: FilOz Bot <infra+github-fil-ozzy@filoz.org>
Date:   Tue May 5 17:44:14 2026 +0200

    chore: release to production (main) (FilOzone#458)

commit c410184
Author: Russell Dempsey <1173416+SgtPooki@users.noreply.github.com>
Date:   Tue May 5 09:50:50 2026 -0400

    docs(checks): close resolved TBDs in data-storage, events, README (FilOzone#481)

    * docs(checks): close resolved TBDs in data-storage, events, README

    Items previously marked TBD that are now implemented in code:

    - data-storage.md assertions #3, #5, #6, #7 (pieceConfirmed, IPNI
      discoverability, retrievability, all-checks-gated) -> Yes.
    - data-storage.md poll intervals: replace TBD_VARIABLE refs with the
      concrete sources (hardcoded POLLING_INTERVAL_MS = 2.5s for SP piece
      status, IPNI_VERIFICATION_POLLING_MS env var with 2s default for IPNI
      verification; doc previously claimed 5s).
    - data-storage.md section 7 header drops TBD; intro disclaimer removed.
    - data-storage.md "TBD Summary" rewritten as "Implementation History"
      with code references for inline retrieval, CID integrity, per-deal
      timeout (AbortController -> DealStatus.FAILED), gated status, status
      model, onPieceConfirmed, IPFS gateway retrieval, filecoin-pin CAR.
    - events-and-metrics.md: pieceConfirmed -> Yes (pieceConfirmedOnChainMs
      histogram); ipfsRetrievalIntegrityChecked -> implemented inline via
      per-block sha256 verification in ipfs-block.strategy.ts (no discrete
      event); ipfsRetrievalFirstByte/LastByteReceived marked Partial since
      duration histograms exist but no discrete event; histogram-buckets
      TBD replaced with link to metrics-prometheus.module.ts.
    - README.md: name the dataset-creation job (data-set-creation) and
      reference its config envs.

    Still TBD (not changed in this commit): uploadToSpStart,
    ipniVerificationStart, ipfsRetrievalStart events; jobs.md PR FilOzone#263
    lookahead-skip; PDP_SUBGRAPH_ENDPOINT production value.

    * docs(checks): address review feedback on callback names and event states

    - data-storage.md: rename Synapse callbacks to plural form
      (onPiecesAdded, onPiecesConfirmed) to match deal.service.ts.
    - events-and-metrics.md: same rename in the event list. Clarify that
      dealCreated maps to DealStatus.DEAL_CREATED only after all gates pass
      (upload alone sets UPLOADED, not DEAL_CREATED).
    - events-and-metrics.md: ipfsRetrievalIntegrityChecked downgraded from
      Yes to Partial since no discrete event is emitted (inline check
      only).
    - events-and-metrics.md: Mermaid timeline now matches the table -
      ipfsRetrievalFirstByteReceived/LastByteReceived labelled as
      "Partial: histogram only", ipfsRetrievalIntegrityChecked labelled
      "Partial: inline check, no event".
    - README.md: refer to the canonical pg-boss job type
      data_set_creation (underscore) so operators can map the doc to
      jobType values.

    * docs(checks): fix unreadable Mermaid rect fill in event timeline

    The 'Data Storage Only' rect used rgb(50, 50, 50), which renders as a
    near-black block that hides the message labels and arrows inside it
    (both on GitHub light/dark themes). Switch to a translucent
    rgba(120, 120, 200, 0.15) so the highlight is visible without
    obscuring content.

    * docs(events): reframe Event List as timing markers, not emitted events

    The 'events' in this doc are named anchors used to define metric Timer
    Starts/Ends; dealbot does not necessarily emit each as a discrete
    Prometheus event or log line. Add an explicit note up top so readers
    don't expect every entry to map to an emitted event, and update rows
    that were marked TBD/Partial purely because no discrete event is
    emitted.

    - uploadToSpStart -> Yes (anchor: deal.uploadStartTime in
      deal.service.ts:255).
    - ipniVerificationStart -> Yes (anchor: ipniVerificationStartTime in
      ipni-verification.service.ts:63 - drives ipniVerifyMs).
    - ipfsRetrievalStart -> Yes (anchor: retrieval startTime in
      retrieval-addons.service.ts:227; logs 'retrieval_started').
    - ipfsRetrievalFirstByteReceived -> Yes (drives
      ipfsRetrievalFirstByteMs).
    - ipfsRetrievalLastByteReceived -> Yes (drives
      ipfsRetrievalLastByteMs).
    - ipfsRetrievalIntegrityChecked -> Yes (per-block sha256 in
      ipfs-block.strategy.ts; inline, no discrete event).
    - Mermaid timeline: drop the (TBD) / (Partial: ...) annotations on
      these markers so the diagram and the table agree.

    * docs(events): drop Implemented column from Event List

    All rows are now Yes (each marker is anchored in code), so the column
    adds no signal. Anchor details folded into the Source-of-truth column.
    Intro note tightened.

    * Update docs/checks/data-storage.md

    Co-authored-by: Puspendra Mahariya <95584952+silent-cipher@users.noreply.github.com>

    * Update docs/checks/README.md

    Co-authored-by: Puspendra Mahariya <95584952+silent-cipher@users.noreply.github.com>

    ---------

    Co-authored-by: Puspendra Mahariya <95584952+silent-cipher@users.noreply.github.com>

commit 126b2d8
Author: Russell Dempsey <1173416+SgtPooki@users.noreply.github.com>
Date:   Tue May 5 08:24:47 2026 -0400

    fix(deal): cancel onStored addons when upload fails (FilOzone#505)

    * fix(deal): cancel onStored addons when upload fails

    The synapse-sdk StorageContext.upload fires onStored before commit/addPieces,
    so dealbot's IPNI monitoring runs detached. When executeUpload throws (e.g.
    409 on POST /pdp/data-sets/{id}/pieces against a Curio-terminated dataset),
    the leaked IPNI poll runs to its 120s timeout and logs a misleading
    ipni_tracking_failed event after the deal already failed.

    Wire an AbortController for the detached addons, composed with the parent
    signal via AbortSignal.any, and abort + drain in the catch path.

    Closes FilOzone#503.

    * fix(deal): clear ipniStatus on aborted onStored, fix TS narrowing

    Two follow-ups on the addon-cancel path:

    1. Use a wrapper object for onStoredAddons.promise so TS preserves the
       union type across closure mutation in onProgress; the prior
       `let x: Promise<boolean> | null = null` pattern narrowed to `null`
       in finally and broke typecheck.

    2. Clear deal.ipniStatus on aborted onStored runs. IpniAddonStrategy.onStored
       sets PENDING before awaiting; if we abort before terminal status is set,
       sp_performance_query.helper counts PENDING as `total_ipni_deals` and
       depresses ipni_success_rate. Set null on FAILED deals so aborted runs
       don't pollute the metric.

    * fix(deal): only clear ipniStatus when still PENDING after addon abort

    Earlier fix cleared ipniStatus for any FAILED deal, which would also
    wipe legitimate IpniStatus.FAILED set by IpniAddonStrategy on real IPNI
    failures and IpniStatus.VERIFIED on retrieval-stage failures. Narrow
    the condition to PENDING so only mid-flight aborts are cleared.

    * style: apply biome format

    * fix(deal): skip onStored addon abort when success path already awaited it

Labels: none yet

Projects: FOC (Status: 🎉 Done)

Development: successfully merging this pull request may close issue #258, "we need to set deal/retrieval max timeout".

6 participants