feat: cap deal/retrievals with abort signals #263
Conversation
Pull request overview
Adds end-to-end abort propagation and job-level timeout enforcement for deal/retrieval workflows so long-running jobs can be actively cancelled, and improves metrics/logging around abort vs. failure.
Changes:
- Introduces shared abort helpers (`abort-utils`) and adopts `AbortSignal` propagation across deal/retrieval flows (including add-ons and IPNI polling); a sketch of such helpers follows this list.
- Enforces per-job timeouts in the pg-boss job runner and records `handler_result="aborted"` for timed-out executions.
- Updates defaults/docs for job timeouts and HTTP request timeouts, plus adds/updates tests for abort behavior and error preservation.
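For orientation, here is a minimal sketch of abort-aware helpers with the shapes these names suggest. The names (`createAbortError`, `awaitWithAbort`, `delay`) come from the PR; the bodies below are illustrative, not the contents of `apps/backend/src/common/abort-utils.ts`:

```ts
// Illustrative sketch only - not the PR's actual implementation.
export function createAbortError(reason?: unknown): Error {
  const error = new Error(typeof reason === "string" ? reason : "The operation was aborted");
  error.name = "AbortError";
  return error;
}

/** Race a promise against an AbortSignal so callers stop waiting on abort. */
export async function awaitWithAbort<T>(promise: Promise<T>, signal?: AbortSignal): Promise<T> {
  if (!signal) return promise;
  signal.throwIfAborted();
  return new Promise<T>((resolve, reject) => {
    const onAbort = () => reject(createAbortError(signal.reason));
    signal.addEventListener("abort", onAbort, { once: true });
    promise.then(resolve, reject).finally(() => signal.removeEventListener("abort", onAbort));
  });
}

/** Abort-aware delay: resolves after ms, rejects as soon as the signal fires. */
export function delay(ms: number, signal?: AbortSignal): Promise<void> {
  return new Promise((resolve, reject) => {
    if (signal?.aborted) return reject(createAbortError(signal.reason));
    const timer = setTimeout(resolve, ms);
    signal?.addEventListener(
      "abort",
      () => {
        clearTimeout(timer);
        reject(createAbortError(signal.reason));
      },
      { once: true },
    );
  });
}
```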
Reviewed changes
Copilot reviewed 18 out of 18 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| docs/environment-variables.md | Documents new job timeout env vars and adds them to the quick reference. |
| apps/backend/src/retrieval/retrieval.service.ts | Propagates abort signals through retrieval execution and adjusts batch behavior on abort. |
| apps/backend/src/retrieval/retrieval.service.spec.ts | Updates tests for abort behavior and aligns Deal IDs with UUID strings. |
| apps/backend/src/retrieval-addons/types.ts | Extends retrieval test result shape with an optional aborted flag. |
| apps/backend/src/retrieval-addons/retrieval-addons.service.ts | Adds abort checks, uses shared abort-aware delay, and improves error capture for non-Error throws (partially). |
| apps/backend/src/retrieval-addons/retrieval-addons.service.spec.ts | Adds test ensuring non-Error throws are captured in execution results. |
| apps/backend/src/metrics-prometheus/metrics-prometheus.module.ts | Documents handler_result semantics for jobs_completed_total. |
| apps/backend/src/jobs/jobs.service.ts | Enforces job-level timeouts via AbortController and reports aborted jobs distinctly. |
| apps/backend/src/jobs/jobs.service.spec.ts | Adds metrics and timeout-abort tests for deal/retrieval jobs; updates private-call signatures. |
| apps/backend/src/deal/deal.service.ts | Propagates abort signal into deal creation/upload/IPNI/retrieval checks; preserves non-Error error messages. |
| apps/backend/src/deal/deal.service.spec.ts | Adds coverage for preserving non-Error abort reasons through deal creation failure recording. |
| apps/backend/src/deal-addons/strategies/ipni.strategy.ts | Propagates abort signal through IPNI monitoring/polling and uses abort-aware delay. |
| apps/backend/src/deal-addons/interfaces/deal-addon.interface.ts | Extends onUploadComplete to accept an optional abort signal. |
| apps/backend/src/deal-addons/deal-addons.service.ts | Propagates abort signal through upload-complete add-on handlers and uses awaitWithAbort. |
| apps/backend/src/config/app.config.ts | Adds config schema + loader for job timeout env vars; reduces default HTTP request timeouts. |
| apps/backend/src/common/abort-utils.ts | Adds shared helpers: createAbortError, awaitWithAbort, and abort-aware delay. |
| apps/backend/src/common/abort-utils.spec.ts | Adds unit tests for abort utilities. |
| apps/backend/.env.example | Adds new job timeout vars and updates HTTP timeout defaults/comments. |
Comments suppressed due to low confidence (2)
apps/backend/src/retrieval-addons/retrieval-addons.service.ts:207
- When a retrieval promise rejects in `testAllRetrievalMethods`, the recorded `error` uses `result.reason?.message || "Unknown error"`. If a strategy throws a non-Error (the new spec covers this), `.message` will be undefined and the reason is lost. Prefer `result.reason instanceof Error ? result.reason.message : String(result.reason)` so execution results preserve the real failure details.
```ts
const executionResults: RetrievalExecutionResult[] = results.map((result, index) => {
  if (result.status === "fulfilled") {
    return result.value;
  } else {
    // Create failed result - retryCount unknown for catastrophic failures
    return {
      url: urlResults[index].url,
      method: urlResults[index].method,
      data: Buffer.alloc(0),
      metrics: {
        latency: 0,
        ttfb: 0,
        throughput: 0,
        statusCode: 0,
        timestamp: new Date(),
        responseSize: 0,
      },
      success: false,
      error: result.reason?.message || "Unknown error",
      retryCount: undefined, // Unknown for catastrophic failures
    };
  }
});
```
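A hedged illustration of the suggested fix, using a hypothetical `reasonToMessage` helper that is not in the PR:

```ts
// Hypothetical helper capturing the review suggestion: keep the rejection
// reason even when a strategy throws a non-Error value (string, object, ...).
function reasonToMessage(reason: unknown): string {
  return reason instanceof Error ? reason.message : String(reason);
}

// In the failed branch of the map above, the field would then become:
//   error: reasonToMessage(result.reason),
```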
apps/backend/src/retrieval/retrieval.service.ts:201
- `performAllRetrievals` logs "All retrievals failed" at error level for any thrown error, including aborts thrown via `signal.throwIfAborted()`. This will create noisy failure logs for expected cancellations/timeouts. Consider skipping the error log (or downgrading to warn) when `signal?.aborted` is true, similar to the batch-level handling above.
```ts
} catch (error) {
  const errorMessage = error instanceof Error ? error.message : String(error);
  this.logger.error(`All retrievals failed for ${deal.pieceCid}: ${errorMessage}`);
```
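One way the suggestion could look in that same catch block (a sketch of the pattern, not the PR's code):

```ts
} catch (error) {
  const errorMessage = error instanceof Error ? error.message : String(error);
  if (signal?.aborted) {
    // Expected cancellation/timeout: keep the log, drop the severity.
    this.logger.warn(`Retrievals aborted for ${deal.pieceCid}: ${errorMessage}`);
  } else {
    this.logger.error(`All retrievals failed for ${deal.pieceCid}: ${errorMessage}`);
  }
  throw error;
}
```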
BigLep left a comment
@SgtPooki: if we don't get @silent-cipher's review during his 2026-02-13, I think this would be a good candidate for having another agent do a double check of the change. It should be able to reason about AbortControllers and their standard behavior and then trace through to make sure the signal is propagated everywhere.
silent-cipher left a comment
Looks good to me! Nothing blocking - just a few comments
also updated retrieval and deal timeouts to 6m and 1m respectively
Pull request overview
Copilot reviewed 19 out of 19 changed files in this pull request and generated 7 comments.
Comments suppressed due to low confidence (1)
apps/backend/src/jobs/jobs.service.ts:360
- Inside `handleRetrievalJob`, `timeoutMs` is declared for the job-level abort timer and then re-declared inside `recordJobExecution` for the interval-based retrieval deadline. Reusing the same name makes it easy to pass the wrong value in future edits; renaming one of them would reduce confusion and prevent subtle bugs.
```ts
try {
  const timeoutsConfig = this.configService.get("timeouts");
  const intervalMs = data.intervalSeconds * 1000;
  const timeoutMs = Math.max(10000, intervalMs - timeoutsConfig.retrievalTimeoutBufferMs);
  const httpTimeoutMs = Math.max(timeoutsConfig.httpRequestTimeoutMs, timeoutsConfig.http2RequestTimeoutMs);
```
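A possible shape for the rename (the identifier name here is hypothetical, not from the PR):

```ts
// Inside recordJobExecution (hypothetical rename): avoid shadowing the
// job-level timeoutMs by giving the interval-based deadline its own name.
const retrievalDeadlineMs = Math.max(10000, intervalMs - timeoutsConfig.retrievalTimeoutBufferMs);
const httpTimeoutMs = Math.max(timeoutsConfig.httpRequestTimeoutMs, timeoutsConfig.http2RequestTimeoutMs);
```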
commit 008f0d8
Author: Puspendra Mahariya <95584952+silent-cipher@users.noreply.github.com>
Date: Thu May 14 20:52:01 2026 +0530

feat: filter out sps with dev tags (FilOzone#526)
* feat: update Synapse stack for filecoin-pin 0.21
* feat: filter out dev providers from active pool
* feat: look for service_status
* docs: document serviceStatus=dev opt-out mechanism for SPs
* chore: remove excessive test cases
* refactor: stick to dealbot defined serviceStatus format

Co-authored-by: Phi <orjan.roren@gmail.com>

commit 7aa2f8a
Author: Russell Dempsey <1173416+SgtPooki@users.noreply.github.com>
Date: Thu May 14 11:15:33 2026 -0400

fix: handle PDP-terminated datasets via data_set_creation repair (FilOzone#518)
* fix(jobs): repair PDP-terminated datasets in data_set_creation (FilOzone#379)
  PDP can mark a dataset terminated while FWSS still has pdpEndEpoch=0. synapse-sdk createContext filters only on pdpEndEpoch, so it returns dead datasets. The next add-pieces fails with "Data set has been terminated due to unrecoverable proving failure". data_set_creation now classifies each slot as missing | live | terminated and runs a bounded repair on terminated: terminateDataSet, poll FWSS pdpEndEpoch != 0, mark affected Deal rows cleaned_up in a single transaction, defer the replacement to the next tick. Deal job now skips when the resolved context's dataSetId is PDP-dead, before any data-storage metric, upload, or Deal-row write. Includes a one-shot backfill script for existing terminated datasets. Upstream trackers (orthogonal): FilOzone/synapse-sdk#780, FilOzone/filecoin-services#473.
* style: biome formatter fixes
* fix(deal): only treat known terminal probe error as terminated; idempotent repair
  isDataSetLive previously returned false for ANY validateDataSet failure, so a transient RPC error could classify a healthy dataset as PDP-terminated and trigger destructive repair. It now returns false only for the known terminal "does not exist or is not live" message and rethrows everything else. repairTerminatedDataSet is now idempotent on partial-prior-run state:
  - If FWSS pdpEndEpoch is already non-zero, skip terminateDataSet entirely.
  - If terminateDataSet reverts with an already-terminated message, treat it as a no-op and continue to the FWSS state poll + cleanup.
  - After terminateDataSet, await the tx receipt before polling FWSS state.
  Adds tests for the rethrow path, the already-terminated skip, and the revert-as-noop path.
* fix: address remaining copilot comments
  - waitForPdpEndEpoch: switch abortable sleep to node:timers/promises setTimeout({signal}). Removes the manual addEventListener/clearTimeout pair, which leaked listeners on resolve and had a race when the signal aborted between throwIfAborted() and addEventListener().
  - dev-tools background deal: emit a separate background_deal_skipped event/message when createDealForProvider returns null. The previous message claimed success on the skip path.
* chore: drop one-shot backfill script; rely on data_set_creation ticks
* fix: address silent-cipher review (orphan PENDING + upfront validation)
  - handleDealJob now probes the baseline dataset via getDataSetProvisioningStatus unconditionally before deal preparation; a terminated baseline or selected dsIndex fails the job (handler_result=error in Prometheus) instead of wasting upload prep.
  - triggerDeal marks the placeholder Deal row FAILED (with errorMessage) on the PDP-terminated skip path; preserves the row for HTTP polling and audit.
  - Remove dead checkDataSetExists (and its tests); getDataSetProvisioningStatus is strictly more informative (missing|live|terminated).
* fix: replace null-on-skip with typed DealJobTerminatedDataSetError
  createDealForProvider/createDeal now return Promise<Deal> and throw a typed error when the targeted data set is PDP-terminated. Callers map the typed error to FAILED outcomes without relying on a null return:
  - jobs.service handleDealJob: upfront baseline and dsIndex probes throw the typed error; the outer catch records handler_result=error and logs deal_job_failed_terminated_dataset. The dsIndex probe also logs dataSetIndex locally before re-throw so per-slot context is preserved.
  - dev-tools triggerDeal: the existing background catch updates the Deal row to FAILED with the thrown error message.
  - createDeal: a preUploadTerminated flag short-circuits the catch's failure metrics and the finally's saveDeal so the terminated path does not spam metrics or rows.
  - waitForPdpEndEpoch: wrap getDataSet in awaitWithAbort so in-flight polls honor the abort signal (Copilot 3229623471).
* chore: trim redundant abort check and narrating comments
  - waitForPdpEndEpoch: drop signal?.throwIfAborted() at loop head; awaitWithAbort already performs it.
  - Trim narrating fragments from PDP-terminated guard comments; keep only the non-obvious FWSS-vs-PDP rationale and the issue link.
* chore: biome format dev-tools event-name ternary
* refactor: centralize data-set probe in DealService (FilOzone#535)
  Lift dsIndex selection + provisioning probe out of handleDealJob into DealService.resolveDataSetMetadataForDeal, invoked from createDealForProvider. handleDealJob just delegates and maps DealJobTerminatedDataSetError to handler_result="error". Behavior change: when minNumDataSetsForChecks > 1 and the randomly selected indexed slot is PDP-terminated, the deal job falls back to the baseline slot instead of failing (logs deal_job_dataset_index_terminated first). data_set_creation still owns repair. The post-createContext isDataSetLive guard inside createDeal stays as the commit-time TOCTOU check on the exact dataSetId the upload will use.
  - style: biome format + consolidate indexed-slot fallback tests via it.each
  - docs: drop TOCTOU phrasing in resolveDataSetMetadataForDeal jsdoc

commit d2f21ce
Author: Russell Dempsey <1173416+SgtPooki@users.noreply.github.com>
Date: Wed May 13 07:55:03 2026 -0400

feat(web): link to combined approved-SP dashboard on landing (FilOzone#525)
* feat(web): link to combined approved-SP dashboard on landing
  Adds a configurable link on the landing page pointing to the BetterStack dashboard that shows combined performance metrics for approved SPs. Lets visitors see overall FOC storage experience without first picking an SP. Configured via APPROVED_SP_DASHBOARD_URL (runtime) / VITE_APPROVED_SP_DASHBOARD_URL (build). Closes FilOzone#384
* feat(web): network-aware approved-SP dashboard CTA
  - Split APPROVED_SP_DASHBOARD_URL into per-network vars (_MAINNET / _CALIBRATION) so a single web deployment can serve both networks correctly.
  - Render the link as a primary CTA card above the per-SP table, with copy qualified by the current network ("...on Calibration").
  - a11y: mark decorative ExternalLink icons aria-hidden.
* fix(web): name both runtime and build-time vars in invalid-URL warning
  Copilot review feedback: the warning previously named only the runtime var, but getConfigUrl falls back to the VITE_* build var too. Surface both so operators can find the right knob during local dev.
* chore(web): biome format

commit c9ad711
Author: Phi-rjan <orjan.roren@gmail.com>
Date: Wed May 13 03:50:42 2026 +0200

chore: update Synapse stack for filecoin-pin 0.21 (FilOzone#521)
* feat: update Synapse stack for filecoin-pin 0.21
* docs(checks): rename Synapse progress events for filecoin-pin 0.21

Co-authored-by: Russell Dempsey <1173416+SgtPooki@users.noreply.github.com>

commit 0bb5217
Author: Russell Dempsey <1173416+SgtPooki@users.noreply.github.com>
Date: Tue May 12 13:42:30 2026 -0400

fix: stop retention counter double-counts (FilOzone#519)
* fix: stop retention counter double-counts
* refactor(data-retention): drop redundant poll guard and instance map
* docs(data-retention): align wording with poll-local baselines

commit fd6ce8a
Author: FilOz Bot <infra+github-fil-ozzy@filoz.org>
Date: Thu May 7 00:40:00 2026 -0700

chore: release to production (main) (FilOzone#514)

commit 9ef0235
Author: Puspendra Mahariya <95584952+silent-cipher@users.noreply.github.com>
Date: Thu May 7 12:57:21 2026 +0530

fix: revert back to old synapse version (FilOzone#512)

commit 8658e34
Author: FilOz Bot <infra+github-fil-ozzy@filoz.org>
Date: Tue May 5 17:44:14 2026 +0200

chore: release to production (main) (FilOzone#458)

commit c410184
Author: Russell Dempsey <1173416+SgtPooki@users.noreply.github.com>
Date: Tue May 5 09:50:50 2026 -0400

docs(checks): close resolved TBDs in data-storage, events, README (FilOzone#481)
* docs(checks): close resolved TBDs in data-storage, events, README
  Items previously marked TBD that are now implemented in code:
  - data-storage.md assertions #3, #5, #6, #7 (pieceConfirmed, IPNI discoverability, retrievability, all-checks-gated) -> Yes.
  - data-storage.md poll intervals: replace TBD_VARIABLE refs with the concrete sources (hardcoded POLLING_INTERVAL_MS = 2.5s for SP piece status, IPNI_VERIFICATION_POLLING_MS env var with 2s default for IPNI verification; doc previously claimed 5s).
  - data-storage.md section 7 header drops TBD; intro disclaimer removed.
  - data-storage.md "TBD Summary" rewritten as "Implementation History" with code references for inline retrieval, CID integrity, per-deal timeout (AbortController -> DealStatus.FAILED), gated status, status model, onPieceConfirmed, IPFS gateway retrieval, filecoin-pin CAR.
  - events-and-metrics.md: pieceConfirmed -> Yes (pieceConfirmedOnChainMs histogram); ipfsRetrievalIntegrityChecked -> implemented inline via per-block sha256 verification in ipfs-block.strategy.ts (no discrete event); ipfsRetrievalFirstByte/LastByteReceived marked Partial since duration histograms exist but no discrete event; histogram-buckets TBD replaced with link to metrics-prometheus.module.ts.
  - README.md: name the dataset-creation job (data-set-creation) and reference its config envs.
  Still TBD (not changed in this commit): uploadToSpStart, ipniVerificationStart, ipfsRetrievalStart events; jobs.md PR FilOzone#263 lookahead-skip; PDP_SUBGRAPH_ENDPOINT production value.
* docs(checks): address review feedback on callback names and event states
  - data-storage.md: rename Synapse callbacks to plural form (onPiecesAdded, onPiecesConfirmed) to match deal.service.ts.
  - events-and-metrics.md: same rename in the event list. Clarify that dealCreated maps to DealStatus.DEAL_CREATED only after all gates pass (upload alone sets UPLOADED, not DEAL_CREATED).
  - events-and-metrics.md: ipfsRetrievalIntegrityChecked downgraded from Yes to Partial since no discrete event is emitted (inline check only).
  - events-and-metrics.md: Mermaid timeline now matches the table: ipfsRetrievalFirstByteReceived/LastByteReceived labelled as "Partial: histogram only", ipfsRetrievalIntegrityChecked labelled "Partial: inline check, no event".
  - README.md: refer to the canonical pg-boss job type data_set_creation (underscore) so operators can map the doc to jobType values.
* docs(checks): fix unreadable Mermaid rect fill in event timeline
  The 'Data Storage Only' rect used rgb(50, 50, 50), which renders as a near-black block that hides the message labels and arrows inside it (both on GitHub light/dark themes). Switch to a translucent rgba(120, 120, 200, 0.15) so the highlight is visible without obscuring content.
* docs(events): reframe Event List as timing markers, not emitted events
  The 'events' in this doc are named anchors used to define metric Timer Starts/Ends; dealbot does not necessarily emit each as a discrete Prometheus event or log line. Add an explicit note up top so readers don't expect every entry to map to an emitted event, and update rows that were marked TBD/Partial purely because no discrete event is emitted.
  - uploadToSpStart -> Yes (anchor: deal.uploadStartTime in deal.service.ts:255).
  - ipniVerificationStart -> Yes (anchor: ipniVerificationStartTime in ipni-verification.service.ts:63; drives ipniVerifyMs).
  - ipfsRetrievalStart -> Yes (anchor: retrieval startTime in retrieval-addons.service.ts:227; logs 'retrieval_started').
  - ipfsRetrievalFirstByteReceived -> Yes (drives ipfsRetrievalFirstByteMs).
  - ipfsRetrievalLastByteReceived -> Yes (drives ipfsRetrievalLastByteMs).
  - ipfsRetrievalIntegrityChecked -> Yes (per-block sha256 in ipfs-block.strategy.ts; inline, no discrete event).
  - Mermaid timeline: drop the (TBD) / (Partial: ...) annotations on these markers so the diagram and the table agree.
* docs(events): drop Implemented column from Event List
  All rows are now Yes (each marker is anchored in code), so the column adds no signal. Anchor details folded into the Source-of-truth column. Intro note tightened.
* Update docs/checks/data-storage.md
* Update docs/checks/README.md

Co-authored-by: Puspendra Mahariya <95584952+silent-cipher@users.noreply.github.com>

commit 126b2d8
Author: Russell Dempsey <1173416+SgtPooki@users.noreply.github.com>
Date: Tue May 5 08:24:47 2026 -0400

fix(deal): cancel onStored addons when upload fails (FilOzone#505)
* fix(deal): cancel onStored addons when upload fails
  The synapse-sdk StorageContext.upload fires onStored before commit/addPieces, so dealbot's IPNI monitoring runs detached. When executeUpload throws (e.g. 409 on POST /pdp/data-sets/{id}/pieces against a Curio-terminated dataset), the leaked IPNI poll runs to its 120s timeout and logs a misleading ipni_tracking_failed event after the deal already failed. Wire an AbortController for the detached addons, composed with the parent signal via AbortSignal.any, and abort + drain in the catch path. Closes FilOzone#503.
* fix(deal): clear ipniStatus on aborted onStored, fix TS narrowing
  Two follow-ups on the addon-cancel path:
  1. Use a wrapper object for onStoredAddons.promise so TS preserves the union type across closure mutation in onProgress; the prior `let x: Promise<boolean> | null = null` pattern narrowed to `null` in finally and broke typecheck.
  2. Clear deal.ipniStatus on aborted onStored runs. IpniAddonStrategy.onStored sets PENDING before awaiting; if we abort before terminal status is set, sp_performance_query.helper counts PENDING as `total_ipni_deals` and depresses ipni_success_rate. Set null on FAILED deals so aborted runs don't pollute the metric.
* fix(deal): only clear ipniStatus when still PENDING after addon abort
  The earlier fix cleared ipniStatus for any FAILED deal, which would also wipe legitimate IpniStatus.FAILED set by IpniAddonStrategy on real IPNI failures and IpniStatus.VERIFIED on retrieval-stage failures. Narrow the condition to PENDING so only mid-flight aborts are cleared.
* style: apply biome format
* fix(deal): skip onStored addon abort when success path already awaited it
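The abortable-sleep change in commit 7aa2f8a replaces a hand-rolled listener/timer pair with the promise-based timer from `node:timers/promises`. A minimal sketch of that pattern (the `pollUntil` wrapper is illustrative, not code from the repo):

```ts
import { setTimeout as sleep } from "node:timers/promises";

// sleep(ms, value, { signal }) resolves after ms, or rejects with an
// AbortError as soon as the signal fires - and it cleans up its own abort
// listener, avoiding the leak/race the commit message describes.
async function pollUntil(
  check: () => Promise<boolean>,
  intervalMs: number,
  signal?: AbortSignal,
): Promise<void> {
  while (!(await check())) {
    await sleep(intervalMs, undefined, { signal });
  }
}
```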
Summary
Adds end-to-end abort propagation with shared utilities, introduces job-level timeout enforcement for deal/retrieval jobs, and improves observability/testing around aborted runs and retrieval failures. Updates HTTP timeout defaults and documents new job timeout env vars.
Problem
Long-running deal/retrieval jobs and downstream steps lacked consistent abort handling, causing wasted work and incomplete observability when timeouts or cancellations occurred. Abort reasons could be lost, and job metrics didn’t clearly distinguish aborts from failures.
Solution
- Shared `abort-utils` helpers (`createAbortError`, `awaitWithAbort`, `delay`), with tests.
- `AbortSignal` propagated through deal/retrieval flows, add-ons, and IPNI polling; new work stops on abort while partial results are preserved.
- Job-level timeouts enforced via `AbortController`; timed-out runs record `handler_result="aborted"`, keeping success vs. business-failure semantics distinct (see the sketch below).
- Retrieval results gain an optional `aborted` flag; recorded errors preserve non-Error abort reasons.
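As a rough sketch of the job-level enforcement described above, assuming a handler that accepts an `AbortSignal` (the `runWithTimeout` wrapper and its names are illustrative, not the PR's code):

```ts
type HandlerResult = "success" | "aborted" | "error";

async function runWithTimeout<T>(
  handler: (signal: AbortSignal) => Promise<T>,
  timeoutMs: number,
): Promise<{ result?: T; handlerResult: HandlerResult }> {
  const controller = new AbortController();
  // Abort the handler's signal when the job-level budget is exhausted.
  const timer = setTimeout(
    () => controller.abort(new Error(`job timed out after ${timeoutMs}ms`)),
    timeoutMs,
  );
  try {
    const result = await handler(controller.signal);
    return { result, handlerResult: "success" };
  } catch {
    // Timed-out runs are reported distinctly from business failures.
    return { handlerResult: controller.signal.aborted ? "aborted" : "error" };
  } finally {
    clearTimeout(timer);
  }
}
```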
Notes
- New env vars `DEAL_JOB_TIMEOUT_SECONDS` and `RETRIEVAL_JOB_TIMEOUT_SECONDS` (defaults 6m/1m); docs updated.
- `jobs_completed_total` handler result values: `success`, `aborted`, `error`.

Fixes #258