fix(filesystem): timeouts + max-visited cap for hangs on lazy provider paths (#4162)#4212
Conversation
b871b4f to
2c35146
Compare
|
The The one place the discipline gets dropped is on the cap side of This is the same problem the timeout side carefully avoids. Two shapes that would close it without disturbing the rest of the design:
The throw shape is consistent with what the PR already does for timeouts and is probably the lower-friction choice if you want to keep the public signature stable. Either way, the asymmetric shape is the structural thing worth fixing here, because the typed-error discipline elsewhere in this diff is exactly what makes the silent-partial-result on the cap side stand out. Worth confirming on the test file too. The new |
|
@temurkhan13 — thanks for taking the time to read through this so carefully, the framing around There's one piece I'd gently flag, because I think it might be easy to miss in the diff — the cap path in The flow is:
if (aborted) {
throw new Error(
`Search aborted after visiting ${visited} entries ` +
`(cap FS_SEARCH_MAX_VISITED=${FS_SEARCH_MAX_VISITED}). ` +
`Refine the pattern (e.g. '**/name') or narrow the search path.`
);
}
return results;So a search that hits the cap throws with a clear message including the visit count — the caller doesn't see a silent partial array. On the test side, the cap-reached signal is asserted from outside the function: it('aborts with a clear error once the visited cap is exceeded', async () => {
...
await expect(searchFilesWithValidation(ROOT, '**/*', ALLOWED))
.rejects.toThrow(/Search aborted after visiting/);
});That said, there's a real point in your comment that I do think is worth raising as a separate question for @JonoGitty / the maintainers: the timeout path uses a typed Thanks again for the careful read — the instinct that the timeout and cap paths should give callers equivalent visibility is the right one. |
…r paths (modelcontextprotocol#4162) fs.realpath (in validatePath) and recursive fs.readdir (in search_files / directory_tree) are called with no timeout, and recursive search has no upper bound on entries visited. On macOS provider-backed paths (~/Library/CloudStorage/..., ~/Library/Mobile Documents/com~apple~CloudDocs/..., NFS/SMB) these can block for minutes while the provider fetches metadata; the host eventually reports the connector unresponsive while the subprocess stays stuck. Follow-up to modelcontextprotocol#4181, which only skipped excluded entries before realpath. lib.ts: - withFsTimeout() wraps fs.realpath and fs.readdir; on timeout rejects with a typed FsTimeoutError instead of hanging. A no-op catch on the timeout promise avoids spurious unhandled-rejection reports when the operation wins the race. - FS_SEARCH_MAX_VISITED caps recursive search; on cap it throws a typed FsSearchTruncatedError (carrying visited + maxVisited), mirroring FsTimeoutError so callers can discriminate a safety-cap abort from a genuine empty result via instanceof rather than string-matching the message. - FS_SEARCH_EXCLUDE_PREFIXES (opt-in, empty default) refuses recursive operations on configured slow roots, via the shared boundary-aware assertSearchPathNotExcluded helper (reuses isPathWithinAllowedDirectories, expandHome applied symmetrically to the candidate root). - ENOENT branch in validatePath refactored so a hung parent realpath surfaces as FsTimeoutError instead of being masked as "Parent directory does not exist". - Per-entry catch in search re-throws FsTimeoutError so a search never returns silently-partial results. index.ts: - directory_tree gets the same timeout + visited-cap treatment and shares the exclude-prefix helper; throws FsSearchTruncatedError immediately on cap-exceed rather than threading an aborted flag through unwinding recursion. Config (all env-var, conservative defaults; invalid values warn + fall back): FS_OP_TIMEOUT_MS=15000, FS_SEARCH_MAX_VISITED=50000, FS_SEARCH_EXCLUDE_PREFIXES=(empty) README documents the three vars, the NFS/SMB tuning note, and that FS_SEARCH_EXCLUDE_PREFIXES is a lexical prefilter (symlink-aware excludes tracked in modelcontextprotocol#4208). Tests: new __tests__/4162-timeouts-and-caps.test.ts (14 cases) using vitest fake timers + mocked fs/promises, so no real CloudStorage path is needed in CI; cap test asserts the typed FsSearchTruncatedError + its visited/maxVisited fields. Full suite 160/160. Verified on a real macOS Google Drive tree by @alemuenchen. Deliberately out of scope: libuv threadpool concurrency, fs.opendir streaming for huge single directories, symlink-aware excludes (modelcontextprotocol#4208). Co-authored-by: Alessandro <alemuenchen@users.noreply.github.com> Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2c35146 to
fc52c33
Compare
|
Thanks both. @alemuenchen — yes, that's the flow exactly: the cap path already throws after @temurkhan13 — the consistency point is fair and I've folded it in: the cap now throws a typed |
|
@alemuenchen — you're right, that's a genuine misread on my end. I grepped the diff for the post-recursion throw and missed it sitting right above The typed-vs-untyped consistency was the substantive thread of it, and |
Description
Bounds two unbounded code paths in the filesystem server that can hang for minutes on macOS CloudStorage / FileProvider-backed directories:
fs.realpathinvalidatePath, and recursivefs.readdirinsearch_files/directory_tree. Addresses the structural concerns from #4162 that #4181 (excludePatterns before realpath) left open.Follow-up design question about symlink-aware excludes is tracked separately in #4208 — intentionally not in scope here.
Server Details
search_files,directory_tree, internalvalidatePath)Motivation and Context
fs.realpathandfs.readdirare called with no timeout. On macOS provider-backed paths (~/Library/CloudStorage/...,~/Library/Mobile Documents/com~apple~CloudDocs/..., and NFS/SMB), these can block while the provider fetches metadata; recursive search additionally has no upper bound on entries visited. The host (e.g. Claude Desktop) eventually reports the connector unresponsive (~4 min) while the subprocess stays alive but stuck. See #4162 for the full report.The defensive shape here originated from @alemuenchen, who has run equivalent mitigations in production against Google Drive + iCloud Drive for several weeks and contributed the reference diff on #4162.
What this adds
src/filesystem/lib.ts:withFsTimeout()wrapsfs.realpath(invalidatePath) andfs.readdir(insearchFilesWithValidation); on timeout it rejects with a typedFsTimeoutErrorrather than hanging.FS_SEARCH_MAX_VISITEDcaps recursive search; abort produces a clear "narrow the path / refine the pattern" error.FS_SEARCH_EXCLUDE_PREFIXES(opt-in, empty default) refuses recursive operations on configured slow roots, via the shared, boundary-awareassertSearchPathNotExcludedhelper (reusesisPathWithinAllowedDirectories).validatePathrefactored so a hung parent realpath surfaces asFsTimeoutErrorinstead of being masked as "Parent directory does not exist".catchin search re-throwsFsTimeoutErrorso a search never returns silently-partial results.src/filesystem/index.ts:directory_treegets the same timeout + visited-cap treatment and shares the exclude-prefix helper.Configuration
FS_OP_TIMEOUT_MS15000fs.realpath/fs.readdir.FS_SEARCH_MAX_VISITED50000search_files/directory_treebefore aborting.FS_SEARCH_EXCLUDE_PREFIXESInvalid numeric values warn to stderr and fall back to the default, so a typo cannot silently disable a guard.
How Has This Been Tested?
__tests__/4162-timeouts-and-caps.test.ts(14 cases): realpath/readdir timeouts, ENOENT preservation + ENOENT-realpath-timeout regression, per-entryFsTimeoutErrorpropagation, max-visited abort, exclude-prefix containment (descendant + boundary-aware sibling), env-var validation. Uses vitest fake timers + mockedfs/promises, so no real CloudStorage path is needed in CI.lib.tscoverage 82% stmts / 88% branch / 94% funcs.Breaking Changes
None. All new behaviour is bounded by env vars that default to safe values for ordinary local trees; existing configurations are unaffected.
Types of changes
Checklist
Additional context
Deliberately out of scope (worth their own follow-ups, flagged so as not to muddy this PR):
fscalls still occupy the threadpool until they settle; a concurrency limiter is a separate mitigation.readdiron a directory with millions of entries is still a single allocation;fs.opendirstreaming would address this.FS_SEARCH_EXCLUDE_PREFIXESis a lexical prefilter; a symlink whose target is under an excluded prefix is not blocked. Design discussion in filesystem: FS_SEARCH_EXCLUDE_PREFIXES is lexical; should excludes be canonical/symlink-aware? #4208.Credit: the core defensive diff is @alemuenchen's, contributed on #4162 with his blessing to take it forward. Tests,
directory_treeintegration, the shared exclude helper, README, and this PR are mine.