chore: merge new changes from ipfs/kubo master#2

Open
alvin-reyes wants to merge 882 commits into IPFSR:master from ipfs:master

Conversation

@alvin-reyes

No description provided.

lidel and others added 30 commits October 8, 2025 18:29
* test: add migration tests for Windows and macOS

- add dedicated CI workflow for migration tests on Windows/macOS
- workflow triggers on migration-related file changes only

* build: remove redundant go version checks

- remove GO_MIN_VERSION and check_go_version scripts
- go.mod already enforces minimum version (go 1.25)
- fixes make build on Windows

* fix: windows migration panic by reading config into memory

fixes migration panic on Windows when upgrading from v0.37 to v0.38
by reading the entire config file into memory before performing atomic
operations. this avoids file locking issues on Windows where open files
cannot be renamed.

also fixes:
- TestRepoDir to set USERPROFILE on Windows (not just HOME)
- CLI migration tests to sanitize directory names (remove colons)

minimal fix that solves the "panic: error can't be dealt with
transactionally: Access is denied" error without adding unnecessary
platform-specific complexity.

* fix: set PATH for CLI migration tests in CI

the CLI tests need the built ipfs binary to be in PATH

* fix: use ipfs shutdown for graceful daemon termination in tests

replaces platform-specific signal handling with ipfs shutdown command
which works consistently across all platforms including Windows

* fix: isolate PATH modifications in parallel migration tests

tests running in parallel with t.Parallel() were interfering with each
other through global PATH modifications via os.Setenv(). this caused
tests to download real migration binaries instead of using mocks,
leading to Windows failures due to path separator issues in external tools.

now each test builds its own custom PATH and passes it explicitly to
commands, preventing interference between parallel tests.

* chore: improve error messages in WithBackup

* fix: Windows CI migration test failures

- add .exe extension to mock migration binaries on Windows
- handle repo lock file properly in mock migration binary
- ensure lock is created and removed to prevent conflicts

* refactor: align atomicfile error handling with fs-repo-migrations

- check close error in Abort() before attempting removal
- leave temp file on rename failure for debugging (like fs-repo-15-to-16)
- improves consistency with external migration implementations

* fix: use req.Context in repo migrate to avoid double-lock

The repo migrate command was calling cctx.Context() which has a hidden
side effect: it lazily constructs the IPFS node by calling GetNode(),
which opens the repository and acquires repo.lock. When migrations then
tried to acquire the same lock, it failed with "lock is already held by us"
because go4.org/lock tracks locks per-process in a global map.

The fix uses req.Context instead, which is a plain context.Context with
no side effects. This provides what migrations need (cancellation handling)
without triggering node construction or repo opening.

Context types explained:
- req.Context: Standard Go context for request lifetime, cancellation,
  and timeouts. No side effects.
- cctx.Context(): Kubo-specific method that lazily constructs the full
  IPFS node (opens repo, acquires lock, initializes subsystems). Returns
  the node's internal context.

Why req.Context is correct here:
- Migrations work on raw filesystem (only need ConfigRoot path)
- Command has SetDoesNotUseRepo(true) - doesn't need running node
- Migrations handle their own locking via lockfile.Lock()
- Need cancellation support but not node lifecycle

The bug only appeared with embedded migrations (v16+) because they run
in-process. External migrations (pre-v16) were separate processes, so
each had isolated state. Sequential migrations (forward then backward)
in the same process exposed this latent double-lock issue.

Also adds repo.lock acquisition to RunEmbeddedMigrations to prevent
concurrent migration access, and removes the now-unnecessary daemon
lock check from the migrate command handler.

* fix: use req.Context for migrations and autoconf in daemon startup

daemon.go was incorrectly using cctx.Context() in two critical places:

1. Line 337: migrations call - cctx.Context() triggers GetNode() which
   opens the repo and acquires repo.lock BEFORE migrations run, causing
   "lock is already held by us" errors when migrations try to lock

2. Line 390: autoconf client.Start() - uses context for HTTP timeouts
   and background updater lifecycle, doesn't need node construction

Both now use req.Context (plain Go context) which provides:
- request lifetime and cancellation
- no side effects (doesn't construct node or open repo)
- correct lifecycle for HTTP requests and background goroutines

(cherry picked from commit f4834e7)
keep -dev version from master
- clarify staging environment step for FINAL releases
- mark infrastructure updates (collab cluster, bootstrappers) as FINAL only
- improve ipfs-desktop release step wording
- update discourse topic examples to v0.38.0
- reference v0.38.0 release issue in metadata comment
Increase default Provide.DHT.MaxProvideConnsPerWorker value to match the
DHT replication factor (16 -> 20). A similar value is used in legacy
systems (with and without accelerated DHT client).
Upgrade to latest go-dsqueue and go-ds-pebble
* feat: provide stats

* added N/A

* format

* workers stats alignment

* ipfs provide stat --all --compact

* consolidating compact stat

* update column alignment

* flags combinations errors

* command description

* change schedule AvgPrefixLen to float

* changelog

* alignments

* provide stat description draft

* rephrased provide-stats.md

* linking provide-stats.md from command description

* documentation test

* fix: refactor provide stat command type handling

- add extractSweepingProvider() helper to reduce nested type switching
- extract lowWorkerThreshold constant for worker availability check
- fix --lan error handling to work with buffered providers

* docs: add clarifying comments

* fix(commands): improve provide stat compact mode

- prevent panic when both columns are empty
- fix column alignment with UTF-8 characters
- only track col0MaxWidth for first column (as intended)
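
One common way to get column alignment right with UTF-8 text is to pad by rune count rather than byte length. The sketch below assumes a hypothetical `padRight` helper and is not the code from `core/commands/provide.go`. Rune count is still only an approximation of display width (wide and combining characters differ).

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// padRight (hypothetical helper) pads s with spaces up to width runes.
// len(s) would count bytes and misalign any non-ASCII content.
func padRight(s string, width int) string {
	n := utf8.RuneCountInString(s)
	if n >= width {
		return s
	}
	return s + strings.Repeat(" ", width-n)
}

func main() {
	for _, label := range []string{"workers", "régions"} {
		fmt.Printf("%s| value\n", padRight(label, 10))
	}
}
```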

* test: add tests for ipfs provide stat command

- test basic functionality, flags, JSON output
- test legacy provider behavior
- test integration with content scheduling
- test disabled provider configurations
- add parseSweepStats helper with t.Helper()

* docs: improve provide command help text

- update tagline to "Control and monitor content providing"
- simplify help descriptions
- make error messages more consistent
- update tests to match new error messages

* metrics rename

```
Next reprovide at:
Next prefix:
```
updated to:
```
Next region prefix:
Next region reprovide:
```

* docs: improve Provide system documentation clarity

Enhance documentation for the Provide system to better explain how provider
records work and the differences between sweep and legacy modes.

Changes to docs/config.md:
- Provide section: add clear explanation of provider records and their role
- Provide.DHT: add provider record lifecycle and two provider systems overview
- Provide.DHT.Interval: explain relationship to expiration, contrast sweep vs legacy behavior
- Provide.DHT.SweepEnabled: rewrite to explain legacy problem, sweep solution, and efficiency gains
- Monitoring section: prioritize command-line tools (ipfs provide stat) before Prometheus

Changes to core/commands/provide.go:
- ipfs provide stat help: add explanation of provider records, TTL expiration, and how sweep batching works

Changes to docs/changelogs/v0.39.md:
- Add context about why stats matter for monitoring provider health
- Emphasize real-time monitoring workflow with watch command
- Explain what users can observe (rates, queues, worker availability)

* depend on latest kad-dht master

* docs: nits

---------

Co-authored-by: Marcin Rataj <lidel@lidel.org>
* chore(deps): update go-libp2p to v0.44.0

- includes self-healing UPnP port mappings after router restarts
- update go-netroute to v0.3.0
- update quic-go to v0.55.0
- add changelog entry for UPnP fix

* docs: improve provide and UPnP clarity in changelog and docs

- add alert polling rationale to changelog
- add UPnP config note with default clarification
- clarify sweep timing and prefix length explanations
- add concrete examples for time offset and record holders
- improve workers stats formatting
- add See Also section to provide-stats.md

* docs: add RISC-V prebuilt binaries to changelog and README

- highlight linux-riscv64 availability with open hardware context
- update README with arm64 builds, remove 32-bit examples
Bumps [actions/upload-artifact](https://github.com/actions/upload-artifact) from 4 to 5.
- [Release notes](https://github.com/actions/upload-artifact/releases)
- [Commits](actions/upload-artifact@v4...v5)

---
updated-dependencies:
- dependency-name: actions/upload-artifact
  dependency-version: '5'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps [actions/download-artifact](https://github.com/actions/download-artifact) from 5 to 6.
- [Release notes](https://github.com/actions/download-artifact/releases)
- [Commits](actions/download-artifact@v5...v6)

---
updated-dependencies:
- dependency-name: actions/download-artifact
  dependency-version: '6'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps [actions/setup-node](https://github.com/actions/setup-node) from 5 to 6.
- [Release notes](https://github.com/actions/setup-node/releases)
- [Commits](actions/setup-node@v5...v6)

---
updated-dependencies:
- dependency-name: actions/setup-node
  dependency-version: '6'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* bump kad-dht: resume reprovide cycle

* daemon: --provide-fresh-start flag

* changelog

* docs

* go-fmt

* chore: latest go-libp2p-kad-dht#1170

after conflict resolution, to confirm CI is still green

* kad-dht: depend on latest master

* move daemon flag to Provide.DHT.ResumeEnabled config

* refactor: sweep provider datastore

* bump kad-dht

* bump kad-dht

* bump kad-dht

* make datastore keys constant

* use kad-dht master

* add emoji to changelog entry

* go-fmt

* bump kad-dht

* test(provider): add tests for resume cycle feature

validates Provide.DHT.ResumeEnabled behavior:
- preserves cycle state when enabled (default)
- resets cycle when disabled

tests verify current_time_offset across restarts using JSON output

---------

Co-authored-by: Marcin Rataj <lidel@lidel.org>
addresses stream frame memory pooling issue where StreamFrame objects
weren't properly returned to sync.Pool during stream cancellation

see quic-go/quic-go#5327
* Upgrade to Boxo v0.35.1
* use tagged boxo release
* fix lint error
* provider: protect libp2p connections

Use latest kad-dht version, introducing connection protection and
retention of addresses in peerstore during provide operations.

* depend on kad-dht master
* fix: reprovide alert bug

* number formatting

* show full number for peer count
…11039)

This fix restores dynamic log level control and tail for go-libp2p loggers

Updated to:
https://github.com/libp2p/go-libp2p/releases/tag/v0.45.0
https://github.com/ipfs/go-log/releases/tag/v2.9.0

these changes restore dynamic log level control and tail for go-libp2p
subsystems after the migration to slog, fixing the regression introduced
in libp2p/go-libp2p#3364

Fixes #11035

For details why and how, see explainer in
https://github.com/ipfs/go-log/releases/tag/v2.9.0
* Upgrade to Boxo v0.35.2
lidel and others added 30 commits April 20, 2026 21:58
Provide/reprovide messages from core/node/provider.go were emitted
under core:constructor (the shared core/node constructor subsystem),
making GOLOG_LOG_LEVEL and `ipfs log level` hard to target for
provide visibility. Scope them to "provider", matching boxo's
provider package so a single lever covers both layers.

- core/node/provider.go: new providerLog at the "provider" subsystem,
  applied to 25 keystore/reprovide/strategy/throughput call sites
- test/cli/provider_test.go: reprovide dedup subtest raises
  provider=info instead of core:constructor=info
- docs/debug-guide.md: new "Known logger subsystems" section listing
  provider, dht/provider, dht/provider/lan, dsqueue
- docs/environment-variables.md: link to the new section from under
  GOLOG_LOG_LEVEL
* Upgrade to Boxo v0.39.0
* chore: bump boxo to ipfs/boxo#1140

picks up dspinner fix that snapshots the index before emitting pins,
avoiding the streaming lock convoy.

* docs: changelog entry for pinner stall fix

* docs: clarify pinner snapshot behavior

* chore: bump boxo to include ipfs/boxo#1146

Picks up the fix for "panic: pebble: closed" on shutdown (#11292):
the dspinner streamIndex goroutine now recovers from any datastore
panic and reports it as an error on the output channel, so the
daemon exits cleanly instead of crashing when the datastore closes
before pin enumeration drains.

* fix(provider): quiet keystore-close on shutdown

When the daemon shuts down, the keystore Close fires while the
startup sync goroutine may still be in flight: the OnStart ctx is
not yet cancelled, so ResetCids returning keystore.ErrClosed gets
logged at Error as "sync failed".

Treat keystore.ErrClosed the same as a cancelled ctx and log at
Debug as "interrupted by shutdown". Apply the same rule to the
periodic reprovide GC loop (whose error log got a unified message
in the process).

* test(cli): keystore-close log + pin ls shutdown

Adds TestProviderKeystoreSyncShutdownQuiet, a CLI test that:

1. Verifies no shutdown-caused keystore-sync error (err="keystore
   is closed" or err="context canceled") is logged at Error level.
   Scans stderr line-by-line so unrelated Error logs (e.g.
   "reset already in progress" from the startup+periodic overlap
   at tight Intervals) do not false-positive the assertion.

2. Runs `ipfs pin ls --stream` against the live daemon, shuts the
   daemon down mid-stream, and asserts the CLI returns within 15s,
   does not observe a daemon panic, and produces a meaningful
   error message if it exited non-zero.

Uses Provide.DHT.Interval=10ms so the periodic reprovide loop is
always inside ResetCids when StopDaemon fires, making the shutdown
race deterministic enough to catch the regression on most runs
(verified empirically against the pre-fix provider.go).
Provide/reprovide messages from core/node/provider.go were emitted
under core:constructor (the shared core/node constructor subsystem),
making GOLOG_LOG_LEVEL and `ipfs log level` hard to target for
provide visibility. Scope them to "provider", matching boxo's
provider package so a single lever covers both layers.

- core/node/provider.go: new providerLog at the "provider" subsystem,
  applied to 25 keystore/reprovide/strategy/throughput call sites
- test/cli/provider_test.go: reprovide dedup subtest raises
  provider=info instead of core:constructor=info
- docs/debug-guide.md: new "Known logger subsystems" section listing
  provider, dht/provider, dht/provider/lan, dsqueue
- docs/environment-variables.md: link to the new section from under
  GOLOG_LOG_LEVEL

(cherry picked from commit 6059743)
* Upgrade to Boxo v0.39.0

(cherry picked from commit d62ee27)
* chore: bump boxo to ipfs/boxo#1140

picks up dspinner fix that snapshots the index before emitting pins,
avoiding the streaming lock convoy.

* docs: changelog entry for pinner stall fix

* docs: clarify pinner snapshot behavior

* chore: bump boxo to include ipfs/boxo#1146

Picks up the fix for "panic: pebble: closed" on shutdown (#11292):
the dspinner streamIndex goroutine now recovers from any datastore
panic and reports it as an error on the output channel, so the
daemon exits cleanly instead of crashing when the datastore closes
before pin enumeration drains.

* fix(provider): quiet keystore-close on shutdown

When the daemon shuts down, the keystore Close fires while the
startup sync goroutine may still be in flight: the OnStart ctx is
not yet cancelled, so ResetCids returning keystore.ErrClosed gets
logged at Error as "sync failed".

Treat keystore.ErrClosed the same as a cancelled ctx and log at
Debug as "interrupted by shutdown". Apply the same rule to the
periodic reprovide GC loop (whose error log got a unified message
in the process).

* test(cli): keystore-close log + pin ls shutdown

Adds TestProviderKeystoreSyncShutdownQuiet, a CLI test that:

1. Verifies no shutdown-caused keystore-sync error (err="keystore
   is closed" or err="context canceled") is logged at Error level.
   Scans stderr line-by-line so unrelated Error logs (e.g.
   "reset already in progress" from the startup+periodic overlap
   at tight Intervals) do not false-positive the assertion.

2. Runs `ipfs pin ls --stream` against the live daemon, shuts the
   daemon down mid-stream, and asserts the CLI returns within 15s,
   does not observe a daemon panic, and produces a meaningful
   error message if it exited non-zero.

Uses Provide.DHT.Interval=10ms so the periodic reprovide loop is
always inside ResetCids when StopDaemon fires, making the shutdown
race deterministic enough to catch the regression on most runs
(verified empirically against the pre-fix provider.go).

(cherry picked from commit 8416f38)
# Conflicts:
#	docs/changelogs/v0.41.md
#	version.go
0.41.0's httpRouterAddrFunc only resolved 0.0.0.0/:: when AutoNATv2 had a
confirmed reachable address. Otherwise it forwarded raw Addresses.Swarm
strings to HTTP routers, so isolated or LAN-only nodes published
unreachable provider records.

- core/node/libp2p/routingopt.go: fallback now calls host.Addrs(), which
  resolves wildcard binds to concrete interface addrs and applies the
  libp2p AddrsFactory (NoAnnounce CIDR, Swarm.AddrFilters); matches the
  DHT provide path (core/node/provider.go selfAddrsFunc)
- core/node/libp2p/routingopt_test.go: stubHost.Addrs is configurable;
  cases rewritten around resolved host addrs, with a new case pinning
  that NoAnnounce CIDR filtering belongs upstream in host.Addrs
- test/cli/delegated_routing_v1_http_client_test.go: new end-to-end
  case asserts provider records sent over HTTP never contain 0.0.0.0
  or :: when Addresses.Swarm uses the default wildcard bind

Fixes #11213
These tests verify behavior that is independent of who serves the
release JSON: TestUpdate exercises the `ipfs update` command tree, and
TestUpdateWhileDaemonRuns checks that read-only subcommands still work
while the daemon holds the repo lock. They hit the real GitHub
Releases API only by accident, which makes them flake on rate limits,
transient 5xx, or release-asset upload races. A flake panics the
harness and takes every parallel test in test/cli down with it.

Replace the network call with a shared `httptest.Server` helper
(`newMockGitHubReleases`) and point the spawned binary at it via
`TEST_KUBO_UPDATE_GITHUB_URL`, the same hook `TestUpdateInstall`
already uses. The mock returns one stable release with a matching
binary asset and follows the convention used by real kubo releases:
`kubo_<tag>_<os>-<arch>.<ext>`, where ext is `zip` on Windows and
`tar.gz` elsewhere. This must match `assetNameForPlatformTag` in
`core/commands/update_github.go`, otherwise `findReleaseAsset`
reports "no release found with a binary for <os>/<arch>".

No network, no token, no flake. Local runtime drops from ~70s to
under 1s.
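
The asset naming convention quoted above can be written as a small function. This is a sketch of the stated pattern only; `assetName` is a hypothetical name and is not the real `assetNameForPlatformTag` from `core/commands/update_github.go`.

```go
package main

import (
	"fmt"
	"runtime"
)

// assetName (hypothetical) follows kubo_<tag>_<os>-<arch>.<ext>,
// with ext "zip" on Windows and "tar.gz" elsewhere.
func assetName(tag, goos, goarch string) string {
	ext := "tar.gz"
	if goos == "windows" {
		ext = "zip"
	}
	return fmt.Sprintf("kubo_%s_%s-%s.%s", tag, goos, goarch, ext)
}

func main() {
	// The tag is illustrative; a mock server would serve one stable release.
	fmt.Println(assetName("v0.0.1", runtime.GOOS, runtime.GOARCH))
}
```

A mock release server has to produce exactly this name for the running platform, otherwise the asset lookup finds nothing.
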
Bumps [actions/github-script](https://github.com/actions/github-script) from 8 to 9.
- [Release notes](https://github.com/actions/github-script/releases)
- [Commits](actions/github-script@v8...v9)

---
updated-dependencies:
- dependency-name: actions/github-script
  dependency-version: '9'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Andrew Gillis <11790789+gammazero@users.noreply.github.com>
* docs(config): clarify BlockKeyCacheSize and BloomFilterSize

BlockKeyCacheSize was documented as "size in bytes" but the underlying
boxo blockstore wires it directly to lru.New2Q[K,V](size int) which is
an entry count, not a byte budget. Fix the unit and add memory sizing
guidance (~200 B/entry) plus what the cache actually short-circuits
(per-block flatfs Stat on the bitswap server hot path).

BloomFilterSize section expanded with: what the filter answers (negative
Has only), saturation behavior at runtime growth, startup AllKeysChan
rebuild cost (one-time, scales with keyset not data volume), and a
cross-link to BlockKeyCacheSize as the complementary positive-path
cache. Drop the dead go-ipfs-blockstore link.

* docs(datastores): explain flatfs next-to-last/3 for large blockstores

The default next-to-last/2 shard depth (~1024 dirs) becomes a per-shard
file-count problem on nodes growing past a few million blocks: bulk
enumeration (GC, BloomFilterSize rebuild on startup, Provide.Strategy=all
reprovider) and per-block Stat both pay readdir cost proportional to
files-per-shard.

next-to-last/3 (~32k dirs) keeps per-directory counts in a range modern
filesystems handle well and is the recommended choice for pinning
clusters, public gateways, and mirrors. Note that shard depth is fixed
at ipfs init time and re-sharding requires a full export/import.

* docs(config): expand BloomFilterSize sizing with bbloom specifics

Replace the generic worked example with a power-of-two reference table
covering 10M to 500M blocks, and document two kubo-specific behaviors
that the generic bloom-filter math does not capture:

- ipfs/bbloom rounds the bit count up to the next power of two, so
  non-power-of-two BloomFilterSize values silently allocate more
  memory than configured (e.g. the historical 1199120-byte example
  actually allocates a 2 MiB internal filter).
- kubo wires bbloom with k=7 hash positions; the FPR formula is fixed
  at (1 - exp(-7n/m))^7. Memory cost is roughly 1.2 B/entry at ~1%
  FPR and scales linearly with target FPR.

Add a saturation section showing FPR degradation at 2x / 4x / 8x the
design n (~11% / ~58% / >95% respectively), and a Risks subsection
clarifying that a poorly sized filter is an operational waste rather
than a correctness issue (no false negatives), with a quick
bytes-per-block health check.

Update the hur.st calculator URL from the n=1e6 default (dev-laptop
scale) to n=10e6 (representative of real kubo deployments). Reference
sizes verified empirically against ipfs/bbloom v0.1.0: a 16 MiB filter
at n=10M gave 0.1875% observed FPR vs 0.18% predicted; the historical
~1.14 MiB worked example at n=1M gave 0.0545% vs 0.054% predicted at
the rounded 2 MiB allocation.
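
The two behaviors above (power-of-two rounding and the fixed k=7 formula) can be checked with a few lines of arithmetic. This is a sketch built only from the numbers in this commit message; `allocatedBytes` and `fpr` are hypothetical helpers and do not call ipfs/bbloom.

```go
package main

import (
	"fmt"
	"math"
	"math/bits"
)

const hashes = 7 // k, as stated for kubo's bbloom wiring

// allocatedBytes rounds the configured size (in bytes) up so that the
// bit count is a power of two.
func allocatedBytes(configured uint64) uint64 {
	if configured == 0 {
		return 0
	}
	wantBits := configured * 8
	rounded := uint64(1) << bits.Len64(wantBits-1)
	return rounded / 8
}

// fpr evaluates (1 - exp(-k*n/m))^k for n entries and a filter of
// filterBytes bytes (m = filterBytes * 8 bits).
func fpr(n float64, filterBytes uint64) float64 {
	m := float64(filterBytes * 8)
	return math.Pow(1-math.Exp(-hashes*n/m), hashes)
}

func main() {
	size := allocatedBytes(1199120)
	fmt.Printf("configured 1199120 B, allocated %d B\n", size)
	fmt.Printf("FPR at n=1M:  %.4f%%\n", 100*fpr(1e6, size))
	fmt.Printf("FPR at n=10M, 16 MiB: %.4f%%\n", 100*fpr(10e6, 16<<20))
}
```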

* docs(config): define FPR up front in BloomFilterSize section

The BloomFilterSize section uses "FPR" throughout without defining it.
Explain in the intro that the false-positive rate is the probability of
a "maybe present" answer for a CID that is not actually in the
blockstore, that a false positive costs at most one wasted datastore
lookup (no data loss or incorrect retrieval), and that lower FPR means
more inbound Has() calls answered from RAM alone.

* docs: fold rounding penalty into bloom filter budget

byte/entry sizing figures now report the operationally-useful number
after bbloom's power-of-two rounding, so an operator following the
guidance lands close to the true memory footprint instead of the
raw design-point size.

- config.md (BloomFilterSize): bump byte/entry to ~1.8/2.8/4.2 at
  ~1% / 0.1% / 0.01% FPR, state the average ~1.5x rounding penalty
  (worst case ~2x); drop 500M / 1 GiB row whose m/n=17.18 broke the
  uniform 10.74 ratio of the rest of the table; active-voice and
  comma fixes in saturation, risks, and startup prose
- config.md (BlockKeyCacheSize): split a comma splice; active voice
  in 2Q replacement description
- datastores.md (flatfs): align shard table columns; soften reshard
  wording to note kubo ships no in-place tool, not that flatfs
  forbids it
A zero-value Multiaddr (a slice type since go-multiaddr v0.15)
encodes to zero bytes on the wire. AddrsFactory was passing such
empty entries through to the host's signed peer record, where peers
that skip the empty-input check render them as "/" and reject the
address. js-libp2p autonatv2 first flagged this against a
kubo/0.39.0/2896aed/docker agent.

AddrsFactory is the central chokepoint for kubo's announced
addresses, so filtering here scrubs every downstream consumer until
the upstream go-libp2p fix lands.

See libp2p/js-libp2p#3478 (comment)
* docs(server-profile): warn about local reverse proxy gotcha

`Swarm.AddrFilters` is consulted on inbound `InterceptAccept` as well
as outbound dials, so loopback CIDRs in the filter list cause Kubo to
reject every incoming connection from a local nginx or Caddy reverse
proxy that fronts a `/ws` (or other libp2p) listener on `127.0.0.1`.
The condition is silent: the OS accepts the TCP, then Kubo closes the
socket before the libp2p handshake.

Add an explicit note to the `Swarm.AddrFilters` section, a new row in
the `server` profile override table for the reverse-proxy case, and a
matching CAUTION block in the v0.41 changelog. Each pointer says:
remove the loopback CIDRs from `Swarm.AddrFilters` only, and keep them
in `Addresses.NoAnnounce`.

* feat(libp2p): log ERROR for listeners blocked by AddrFilters or NoAnnounce

Surface misconfigured listeners at startup and on every libp2p
`EvtLocalAddressesUpdated` event, instead of silently dropping
incoming connections or staying unadvertised.

`findDeadListeners` is a pure function that walks the host's resolved
listen addresses (the output of
`host.Network().InterfaceListenAddresses()`, matching the
post-resolution view used in #11297 for `host.Addrs()`) and matches
each IP component against every CIDR rule in `Swarm.AddrFilters` and
`Addresses.NoAnnounce`. Working from resolved addresses means wildcard
listens like `/ip4/0.0.0.0` and `/ip6/::` are already expanded to
concrete interface addresses, so the check does not flag a listener
just because the unspecified address itself happens to fall inside a
filter CIDR (for example `::` is in `::/3` even though the listener
still accepts inbound from globally-routable peers).

`MonitorDeadListeners` wires the check into fx: it runs once at
startup, subscribes to `event.EvtLocalAddressesUpdated`, and re-runs
the check whenever the host's address set changes (NAT mapping comes
online, new interface, AutoTLS cert ready). Findings are deduplicated
against the previous run so a stable misconfiguration is logged once
until it is resolved or a new finding shows up.

Loopback `Addresses.NoAnnounce` matches are skipped on the grounds
that suppressing loopback advertisement is operator-intent on every
`server`-profile node, not a misconfiguration. Loopback in
`Swarm.AddrFilters` is the bug pattern that motivated this check;
that match is always reported.

Each ERROR line names the offending listener, the matching CIDR rule,
and the field to remove the rule from to revive the listener:

    Addresses.Swarm listener "/ip4/127.0.0.1/tcp/8081/ws" matches
    Swarm.AddrFilters rule "/ip4/127.0.0.0/ipcidr/8", so Kubo
    rejects every incoming connection to it. Remove
    "/ip4/127.0.0.0/ipcidr/8" from Swarm.AddrFilters to allow
    connections to this listener.
* chore(deps): align deps with ipfs/boxo#1152

Bumps boxo to the head of ipfs/boxo#1152 and lands the kubo-only
direct deps from this week's dependabot batch in one go so go.mod
stays consistent.

Direct:
- boxo: v0.39.0 to ipfs/boxo#1152 head
- libp2p-pubsub: 0.15.0 to 0.16.0
- fsnotify: 1.9.0 to 1.10.0
- go-fuse/v2: 2.9.1-pre to 2.10.1
- otelhttp: 0.67.0 to 0.68.0
- otel, otel/sdk, otel/sdk/metric, otel/trace: 1.42.0 to 1.43.0
- otel/exporters/prometheus: 0.56.0 to 0.65.0
- contrib/propagators/autoprop: 0.46.1 to 0.68.0

Pulled in via boxo: zap 1.28.0, go-unixfsnode 1.10.4.

Skipped: cheggaaa/pb v1 to v2 (incompatible API; v2 drops
pb.U_BYTES and pb.New64(...).SetUnits, breaking the progress
bar usage in core/commands/{cat,add,get,dag/export}.go).

Supersedes #11306, #11307, #11308, #11309, #11311, #11312.

* fix(metrics): drop otel_scope_info, expose scope as labels

The otel prometheus exporter v0.59.0 stopped emitting the standalone
otel_scope_info metric. Scope identity is now carried by
otel_scope_name, otel_scope_version, and otel_scope_schema_url
labels on every metric, added in v0.58.0. The bump to v0.65.0 in
this branch crosses that boundary, so the t0119 baseline failed.

Update the sharness baseline, docs/metrics.md, and add a v0.42
changelog highlight so operators scraping otel_scope_info know to
switch their dashboards to the per-metric labels.

* chore(deps): bump boxo to main (incl. ipfs/boxo#1152)

boxo@main now includes ipfs/boxo#1152, replacing the temporary
PR-pinned revision used in 681a4b9.
denylists only block content retrieval and local IPNS resolution.
they do not stop a DHT server from storing or serving provider and
IPNS records for denied keys on behalf of other peers, and they do
not gate /routing/v1/ responses. document this explicitly and point
operators at Routing.Type=autoclient as the way to opt out of acting
as a routing intermediary for blocked content.

Closes #11317
Closes #11318
Closes #11319

these issues track the implementation work to push denylists into
the kad-dht provider store, the IPNS validator and pubsub path, and
the /routing/v1/ HTTP layer. until that lands, autoclient is the
only operator-facing knob with the same effect, so the docs need
to say so.
This upgrades the pebble database to v2.1.5.
* update go-log to v2.9.2
* feat(pinner): close pinner before repo on shutdown

The pinner's streaming goroutines hold a reference to the backing
datastore, and pebble panics on use after Close. Before this change
the panic was recovered inside the pinner (see ipfs/boxo#1146) and the
symptom was only a transient log trace on daemon exit, but the race
remained.

Register a new fx OnStop hook that calls pinner.Close before the repo
(and therefore the datastore) closes. Close drains all in-flight
stream goroutines, so the datastore is closed only after the pinner
is fully quiesced.

Bumps boxo to pick up Pinner.Close from ipfs/boxo#1150.

Fixes #11292

* chore(deps): bump boxo to ipfs/boxo#1150 (70ffcfa)

* chore(deps): bump boxo to ipfs/boxo#1150 (75481f4)

ipfs/boxo#1150 was reworked to use context fan-out instead of a done
channel. Pinner.Close now cancels every admitted op and waits for
them to return, broadening the shutdown contract from "drain
streams" to "drain everything". Comments and changelog reworded to
match.

* chore(deps): bump boxo to latest main (b2b5d8a)
* feat: bound graceful shutdown, add diag healthy

Replace unbounded app.Stop(context.Background()) with a deadline-bounded
context driven by a new Internal.ShutdownTimeout config (default 12h,
0 disables). Add an os.Exit(1) watchdog at the same deadline so an FX
OnStop hook that never returns can no longer hang the daemon.

Add ipfs diag healthy: fails when shutdown has been initiated or when
the DAG pipeline cannot resolve the well-known empty-directory CID.
Dockerfile HEALTHCHECK now uses it so orchestrators recycle half-
shutdown daemons.

- core/shutdown: new pkg; atomic startedAt + CloseWithCtx helper
- core/builder.go: app.Stop bounded by ShutdownTimeout
- cmd/ipfs/kubo/daemon.go: watchdog + MarkStarted on signal
- core/commands/diag.go: new healthy subcommand
- core/node/{bitswap,libp2p/host,libp2p/routing}.go: OnStop hooks wrapped
- config/internal.go: ShutdownTimeout + DefaultShutdownTimeout=12h
- Dockerfile: HEALTHCHECK uses "ipfs diag healthy"
- docs/{config,changelogs/v0.42}.md: documented
- test/cli: enabled + disabled path tests

* feat: bound provider stats and ADD_PROVIDER sends

bumps go-libp2p-kad-dht past v0.39.2 to b73e1e8 to pick up two
related provider bug fixes.

- ipfs provide stat now honors client cancellation and deadlines
  instead of blocking indefinitely behind a slow keystore lookup
- adds Provide.DHT.SendProviderRecordTimeout capping each
  ADD_PROVIDER RPC so unresponsive peers cannot pin a provide
  worker and stall reprovide cycles
- internal reprovide-alert poller bounds its Stats call so a
  hung keystore.Size cannot delay shutdown

* test(shutdown): use synctest for timeout test, document sleep

CloseWithCtx_timesOut now runs in a synctest bubble so the deadline
assertion is exact (no wall-clock slack), and the simulated close uses
a release channel to drain the bubble cleanly after the leak point.
The two happy-path tests stay unchanged because their close funcs
return immediately and gain nothing from a fake clock.

Comment the 2ms sleep in TestMarkStartedPreservesFirstTimestamp so
its role (forcing time.Now() to advance between the two MarkStarted
calls so a CAS to Store regression is detectable) is not lost.

Addresses #11329 (review).

* fix(pinner): bound pinner Close with shutdown deadline

The boxo Pinner.Close contract notes that an in-flight op ignoring
its ctx (a downstream bug) can block Close, so the host must bound
it at the call site. Wrapping the OnStop hook with CloseWithCtx
honors Internal.ShutdownTimeout and surfaces an actionable
"subsystem 'pinner' failed to close" log on hang instead of leaving
only the watchdog os.Exit(1) trace.

* fix(shutdown): bound remaining I/O-touching OnStop hooks

Wrap the OnStop hooks whose Close can plausibly block on disk or
network: repo (datastore flush + lock release), mfs-root (datastore
writes via DAGService), peering (waits on libp2p peer goroutines),
legacy-provider (in-flight reprovide RPCs), and the dht-provider
plus keystore pair under SweepingProvider.

In-memory closes (blockservice, peerstore, resource-manager) are
left as-is since they cannot realistically hang.

For the dht-provider/keystore pair, provider closes first so nothing
can access the keystore afterwards. If the shutdown ctx fires
mid-provider-drain, the keystore close sees an expired ctx and
returns immediately; the watchdog os.Exit(1) is the ultimate
backstop, and keystore writes are fsync'd on put so missing the
explicit close is recoverable on next boot.

* fix(shutdown): bound remaining in-memory OnStop hooks

Wrap blockservice, peerstore, and resource-manager Close hooks with
CloseWithCtx for uniformity. These are pure in-memory operations
unlikely to hang in practice, but wrapping costs nothing and makes
the shutdown audit trail uniform: every OnStop hook now honors the
deadline and surfaces a named subsystem on timeout.

* fix(shutdown): bound autoRelayFeeder OnStop on ctx

OnStop waited on the feeder goroutine via <-done without honoring
the shutdown ctx. The goroutine itself selects on ctx in every
loop case, so cancel() normally suffices, but a stuck downstream
dht.WAN.GetClosestPeers that ignored its ctx could block fx.Stop
indefinitely. Adding the ctx.Done() select case mirrors the
reprovideAlert pattern in provider.go and lets the shutdown
deadline reclaim control even with a misbehaving DHT.

* docs(changelog): merge shutdown entries into one user-facing section

Combine the pinner-on-shutdown paragraph with the bounded-shutdown
section under a single "Reliable shutdown and container health checks"
heading. Lead with the visible symptoms (half-shutdown daemons,
healthy-but-dead container reports, manual docker restart) instead of
fx OnStop jargon. Frame Internal.ShutdownTimeout as a
belt-and-suspenders ceiling, with the 12-hour default sized against
the 22-hour DHT provider record expiration.
…11321)

* feat(provide): add ipfs provide once for ad-hoc announcements

Adds an experimental subcommand that submits provider records for
the given CIDs through the provider system right away, without
waiting for the next reprovide cycle. Use -r to walk the DAG and
announce every reachable block.

Designed against the sweep provider (the default since v0.39):
StartProviding queues to the burst-provide workers, which publish
records to the DHT efficiently. Works with the legacy provider too,
though it queues into the slower serial worker pool.

CIDs must already exist in the local blockstore. Re-announcement on
the regular schedule is governed by Provide.Strategy and
Provide.DHT.Interval; this command does not change either.

* refactor(routing): deprecate ipfs routing provide

Marks `ipfs routing provide` as deprecated and points users at the new
`ipfs provide once`. The command keeps its existing Run, Encoders, and
flags so existing scripts continue to work; only the status flag and
helptext change.

* docs(routing): clarify when ipfs routing reprovide applies

Tightens the helptext and the sweep-mode error message so the
constraint is obvious: this command only triggers a cycle on the
legacy provider, and points users at 'ipfs provide stat --all' for
monitoring the default sweep schedule.

* docs: tighten provide helptext and update routing-provide references

Updates docs/config.md and docs/experimental-features.md to reference
'ipfs provide once' instead of 'ipfs routing provide'. Tightens the
helptext for 'ipfs provide clear' and the 'ipfs provide stat' overview:
drops headings around short paragraphs, prefers active voice, and notes
that the sweep provider is the default.

* docs: changelog entry for ipfs provide once

* docs: use 'provide system' wording consistently

* test(provide): cover --recursive and multi-CID paths for provide once

Adds two subtests under runProviderSuite (run for both Legacy and Sweep):

- --recursive walks the DAG and announces every chunk of a 2 MiB file
  added with --pin=false under Provide.Strategy=roots, so the auto-
  provide path stays out of the way.
- multiple CIDs in a single invocation succeed and the text encoder
  reports 'queued 3 CID(s) for immediate provide'.

* feat(provide): stream cids and per-cid output for ipfs provide once

Each CID flows through the command independently, so stdin can be piped
without buffering and consumers see results as they happen.

- Run reads CIDs from argv and then from BodyArgs (stdin scanner) one at
  a time, calling StartProviding per CID.
- With -r, the dag.Walk visit callback emits per visited block; the walk
  cancels its context on the first announce error to stop fetching.
- A typed ProvideOnceEvent (one per queued CID) replaces the prior batch
  result. JSON output streams {"Queued":"<cid>"} per line.
- Text output via PostRun: when stderr is a tty, the running count is
  redrawn on a single line; otherwise a final count is printed. The text
  encoder still works for HTTP/RPC consumers (one CID per line).
- Adds tests for stdin streaming and --enc=json one-event-per-line.
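
The one-event-per-line JSON shape can be sketched with the standard library alone. `ProvideOnceEvent` and the `Queued` key come from the commit message; the string field type and the helper names are assumptions made for illustration.

```go
package main

import (
	"encoding/json"
	"io"
	"strings"
)

// ProvideOnceEvent is assumed to carry the CID as a plain string.
type ProvideOnceEvent struct {
	Queued string
}

// streamEvents writes one JSON object per CID as soon as it is handled.
// json.Encoder.Encode appends a newline after each value, which yields
// newline-delimited JSON without any batching.
func streamEvents(w io.Writer, cids []string) error {
	enc := json.NewEncoder(w)
	for _, c := range cids {
		if err := enc.Encode(ProvideOnceEvent{Queued: c}); err != nil {
			return err
		}
	}
	return nil
}

// render is a small helper that collects the stream into a string.
func render(cids []string) (string, error) {
	var sb strings.Builder
	err := streamEvents(&sb, cids)
	return sb.String(), err
}
```

Consumers can then process the output line by line instead of waiting for a closing bracket.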

* feat(provide): dedupe across all roots and recursive walks

Previously the cid set was scoped per root, so a CID shared by two
arguments or by two recursive DAG walks was announced twice. Move the
set out to the Run scope so each unique CID is announced exactly once
per invocation, regardless of how many times it shows up in argv,
stdin, or the DAG walks.

For -r, hitting an already-seen CID also stops descent into that
subtree, avoiding redundant block fetches when DAGs overlap.
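
A toy sketch of the shared-set behaviour described above, using an exact in-memory set and a map in place of a real DAG service. The `dag` type and `announceAll` are hypothetical names for illustration only.

```go
package main

// dag maps a block ID to the IDs of its children.
type dag map[string][]string

// announceAll walks every root depth-first using ONE seen set for the
// whole invocation, so a block shared between roots is announced once.
func announceAll(d dag, roots []string, announce func(id string)) {
	seen := make(map[string]struct{})

	var walk func(id string)
	walk = func(id string) {
		if _, ok := seen[id]; ok {
			// Already announced: skip the block and do not descend,
			// since its whole subtree was covered the first time.
			return
		}
		seen[id] = struct{}{}
		announce(id)
		for _, child := range d[id] {
			walk(child)
		}
	}

	for _, r := range roots {
		walk(r)
	}
}
```

Had the set been created inside the loop over roots, the shared subtree would be walked and announced once per root.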

* style(provide): rename useTTY to isTTY in PostRun

* refactor(provide): align ipfs provide once with kubo cmds-lib idioms

- Use the existing argumentIterator helper from cid.go to read argv
  followed by stdin, replacing the inlined two-loop variant.
- Document why PostRun forks on encoder type (TTY redraw needs to
  bypass the encoder; json/xml must keep streaming through it).
- Log an ERROR for unexpected response types instead of dropping
  them silently, mirroring the defensive pattern in cat.go's PostRun.

* docs(routing): document streaming limitations of routing provide

Spell out what 'ipfs routing provide' does worse than 'ipfs provide
once' so users on the deprecation path know why to switch: input
buffering, no per-cid output, no dedup across recursive roots, and the
sync dht lookup that defeats sweep batching.

* docs(changelog): rewrite ipfs provide once entry around user impact

Recasts the highlight to lead with what the user can now do, not what
the code does internally. Adds a one-line example showing the streaming
stdin path that the previous version did not surface, and replaces
"namespace" plumbing language with the actual capabilities (running
count, json-per-line, single announcement per shared block under -r).

* feat(provide): use boxo BloomTracker for cross-input dedup

Swaps the cid.Set used by 'ipfs provide once' for the autoscaling
boxo BloomTracker, the same dedup mechanism that powers
Provide.Strategy=+unique.

Run executes on the daemon, not the cli, so this caps daemon memory
under hostile or accidental input: a user piping 100M cids previously
would have grown the daemon's set to ~7 GB of resident memory; with
the bloom chain it plateaus around 700 MB at the default fp rate, and
stays under 100 MB up to 10M unique cids.

The trade-off is a small false-positive rate (~1 in 4.75M, the kubo
default) that can cause an occasional cid to be silently skipped. For
ad-hoc providing this is acceptable; the regular reprovide cycle will
pick up anything matched by Provide.Strategy on the next pass.
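
As a rough sanity check on the scale involved, the textbook Bloom filter sizing formula can be evaluated for the quoted false-positive rate. This is generic math for a single ideal filter, not the BloomTracker implementation; the autoscaling chain has its own overhead, so the memory figures quoted above are not expected to fall out of it exactly.

```go
package main

import "math"

// bitsPerKey returns the optimal size of a single Bloom filter, in bits
// per stored key, for a target false-positive probability p:
// -ln(p) / (ln 2)^2.
func bitsPerKey(p float64) float64 {
	return -math.Log(p) / (math.Ln2 * math.Ln2)
}

// filterBytes estimates the size in bytes of one ideal filter holding n keys.
func filterBytes(n uint64, p float64) float64 {
	return float64(n) * bitsPerKey(p) / 8
}
```

For p = 1/4.75M this comes to roughly 32 bits per key, so the cost per cid is a few bytes rather than the tens of bytes an exact set needs.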

* docs(changelog): use ipfs refs as the provide once example

* style(provide): goimports import order

* docs(provide): soften dedup wording, comment re.Emit gate, cover Provide.Enabled=false

- Change "exactly once per invocation" to acknowledge the bloom
  false-positive rate now that the dedup is probabilistic.
- Add a comment to the text branch of PostRun warning future readers
  not to call re.Emit there, since the encoder would race with the
  TTY counter.
- Add a runProviderSuite subtest that exercises Provide.Enabled=false
  through the new code path (the existing routing-provide test only
  covers the deprecated alias's Run).

* docs(changelog): clarify provide once use case and add second example

- Note that provide once is also for fine-tuned control over which
  CIDs get announced when, alongside the regular reprovide schedule.
- Add a second example using ipfs pin ls so users see the pattern for
  replaying their pinset alongside the dag-walk pattern.

* feat(provide): error on ipfs provide once with Provide.DHT.Interval=0

When Provide.DHT.Interval=0, kubo wires NoopProvider via
OnlineProviders -> OfflineProviders, so StartProviding silently
no-ops and the cid never gets announced. provide once was returning
success without any DHT publish: a footgun.

Add an explicit precondition check that mirrors the routing
reprovide error path. Decoupling the wiring so ad-hoc provide works
under Interval=0 is tracked separately.

* chore(deps): pin go-libp2p-kad-dht to PR #1246 head

Pulls in the WithReprovideInterval(0) burst-only mode from
libp2p/go-libp2p-kad-dht#1246 so the kubo side
of the Provide.DHT.Interval=0 decoupling can be developed against it.

* chore(deps): re-pin go-libp2p-kad-dht to PR #1246 head

Updates to the latest commit on the upstream branch (817031b) which
also relaxes the dual SweepingProvider's reprovide-interval validator
to accept 0, on top of the single-provider relaxation in the previous
pseudo-version.

* feat(provide): decouple Provide.DHT.Interval=0 from the master kill-switch

Provide.Enabled is now the only switch that fully turns off the provide
system. Provide.DHT.Interval=0 disables only the periodic reprovide
schedule; new CIDs still announce via fast-provide-root and
'ipfs provide once'.

- groups.go: drop the Interval=0 factor from isProviderEnabled. The
  real provider (sweep or legacy) is now wired even when Interval=0.
- provider.go: skip the keystore sync goroutine in no-schedule mode.
  The ticker would panic on a zero interval, and with no schedule the
  keystore has no reader.
- cmdenv/env.go: drop the fast-provide-root short-circuit on
  Interval=0. Provide.Enabled=false is now the only short-circuit.
- commands/provide.go: drop the temporary 'cannot provide:
  Provide.DHT.Interval is 0' error from 'ipfs provide once'.
- test/cli: replace the 'Reprovide.Interval=0 disables announcement of
  new CID too' test (premise is now false) with one asserting that
  Interval=0 + Enabled=true keeps announcing. Convert the
  provide-once + Interval=0 test from error path to success path.
  Tighten the legacy 'Manual Reprovide trigger' test to focus on the
  error contract.

Requires upstream go-libp2p-kad-dht support for
WithReprovideInterval(0) (kept under PR #1246).

* feat(config): require explicit Provide.Enabled when Provide.DHT.Interval=0

Provide.DHT.Interval=0 used to disable the entire provide system as a
side effect. After the decoupling it disables only the periodic
reprovide schedule, while new CIDs still announce via fast-provide-root
and 'ipfs provide once'. To prevent silent semantic drift on upgrade,
the daemon now refuses to start when Interval is explicitly set to 0
unless Provide.Enabled is also set explicitly:

  - Provide.Enabled=false fully disables providing (the old behaviour).
  - Provide.Enabled=true keeps ad-hoc providing while skipping the
    periodic reprovide schedule.

The error message names both options so operators can pick the one
that matches their intent without reading the changelog.

* docs: explain new Provide.DHT.Interval=0 semantic

Updates docs/config.md and the v0.42 changelog: Interval=0 now disables
only the periodic reprovide schedule, and the daemon refuses to start
without an explicit Provide.Enabled in that configuration. Calls out
both upgrade paths (Provide.Enabled=false to fully disable, or =true
to keep ad-hoc providing).

* chore(deps): re-pin go-libp2p-kad-dht to amended PR #1246 head

Picks up the timeOffset/timeBetween zero-guards so SweepingProvider.Stats()
no longer panics with reprovideInterval=0. Required for 'ipfs provide stat'
to work in no-schedule mode.

* test(provide): align test expectations with new no-schedule semantic

- core/commands/commands_test.go: register /provide/once in the
  expected command list.
- test/cli/provide_stats_test.go: 'ipfs provide stat' with
  Provide.DHT.Interval=0 now returns valid stats (with the schedule
  timing fields zeroed) instead of erroring out. Update the assertion
  to match.

* chore(deps): re-pin go-libp2p-kad-dht to amended PR #1246 head

Picks up the scheduleEnabled() consistency cleanup so timeOffset and
timeBetween match the rest of the upstream gates.

* chore(deps): re-pin go-libp2p-kad-dht to PR #1246 merge on master

picks up the three follow-up commits guillaumemichel pushed before
merging libp2p/go-libp2p-kad-dht#1246:

- refactor: simplify StartProvide()
- refactor: minimize change diff
- fix: don't remove from keystore on StopProviding()

* fix(provide): use ProvideOnce in `ipfs provide once`

`ipfs provide once` was calling StartProviding, which in sweep mode
persists keys to the keystore and adds them to the periodic reprovide
schedule. that contradicts the command's name and help text.

switch to ProvideOnce so the command publishes once and leaves the
schedule untouched. for the legacy provider StartProviding already
wraps ProvideOnce, so legacy behaviour is unchanged.

also tighten the help text to state plainly that the schedule is not
modified.

* fix(provider): keep keystore inert when Provide.DHT.Interval=0

In no-schedule mode the keystore has no reader (no reprovide loop) and
no writer (kad-dht's burst path skips Put/Delete). Until now we still
opened on-disk leveldb/pebble files for it: wasted disk and noise on
upgrade/downgrade.

Switch the keystore to an in-memory map in no-schedule mode and make
destroyDs a no-op. Also purge any pre-existing keystore directory once
at startup so users who toggle from schedule to no-schedule reclaim
disk.

Replace the literal `reprovideInterval == 0` check at the second call
site with the named noScheduleMode flag for consistency.

Update boxo to ipfs/boxo#1128 which removes io.Seeker from the
files.File interface. Callers that need seeking now type-assert
to io.Seeker.

- core/commands/cat: type-assert before seeking
- core/coreiface/tests: type-assert before seeking

## Problem

On a repo from `go-ipfs` or Kubo older than v0.27, the one-time migration uses `http.DefaultClient` (no timeouts) against a single hardcoded `trustless-gateway.link`. If that gateway is slow or blocked, the daemon hangs indefinitely before the data store opens, with no fallback. Reported in ipfs/ipfs-desktop#3147, where a user with a v11 repo thought they had lost 4,444 added images.

## Fix

- HTTP client gets dial, TLS, and response-header timeouts (15s, 15s, and boxo's `DefaultRetrievalTimeout` of 30s).
- The `"HTTPS"` alias in `Migration.DownloadSources` expands to five trustless community gateways instead of one. Trust is in local per-block multihash verification, not the operator.
- Outbound requests send `?format=car` (or `?format=ipns-record`) alongside `Accept`, since some gateways honor only one.
- `MultiFetcher` gets a session-scoped quarantine: a failing fetcher moves to the back of the rotation; after three full failed loops it latches `ErrMultiFetcherExhausted` pointing the user at `Migration.DownloadSources`. A cancelled context exits the loop early so it never poisons the quarantine.
- `RetryFetcher` is removed; rotation across distinct gateways replaces same-gateway retries.

Also fixes two pre-existing bugs in the same path: `NewHttpFetcher` ignored the `userAgent` argument so every request shipped Go's default `Go-http-client/1.1`, and `resolveIPNS` leaked the response body.

The `Migration` config and `"HTTPS"` alias keep working the same way for users; the alias just expands to more gateways internally.

Closes #7933
Closes #3137
Closes #8911
Closes ipfs/ipfs-desktop#3147

Surfaced by ipfs/service-worker-gateway#1067, where operators behind a
default-deny firewall hit unreachable nodes from browser peers because
UDP/4001 (QUIC, WebTransport, WebRTC-Direct) was not opened alongside
TCP/4001.

- new docs/production/firewall.md: inspect ufw rules, open 4001/tcp
  and 4001/udp, optional Kubo application profile, custom-port and
  rule-removal notes
- daemon health (ipfs diag healthy) split from reachability
  (ipfs swarm addrs autonat), with Swarm.DisableNatPortMap and
  Swarm.EnableHolePunching pointers for nodes that stay Private
- link the walkthrough from Addresses.Swarm and the Security section
  in docs/config.md, and from the Production index in docs/README.md