feat(indexing): port gitignore-aware file walker from semble#8
Conversation
Port of src/semble/index/file_walker.py.
- walkFiles(root, extensions, ignore?) async generator
- _isIgnored, _loadIgnoreForDir, _walk helpers exported with underscore prefix to mirror semble
- DEFAULT_IGNORED_DIRS includes .csp/ (replacement for .semble/)
- Honours .gitignore and .cspignore (semble's .sembleignore)
- Recreates semble's negation-with-extension bypass: when a winning gitignore
pattern is a negation (!pattern) AND the pattern (stripped of trailing /)
has a file extension suffix, the file bypasses the extension allowlist via
the 'found' flag. Implemented by parsing patterns ourselves and storing
{ pattern, negated, hasExtSuffix, matcher } per line, since the 'ignore'
npm package does not expose the winning rule for negations.
Tests cover: recursion, symlink skipping, default dirs, .gitignore exclusion,
negation-with-extension bypass, .cspignore, nested .gitignore, extra ignore
arg, case-insensitive extension matching.
There was a problem hiding this comment.
Code Review
This pull request ports the file walker and its corresponding tests from Python to TypeScript, implementing recursive file traversal that respects .gitignore and .cspignore files. The reviewer suggests a significant performance optimization for the _isIgnored function: by caching whether an IgnoreSpec contains any negated patterns with file extensions, the walker can use the aggregate matcher's .test() method to bypass the expensive per-pattern loop for the vast majority of files.
There was a problem hiding this comment.
No issues found across 2 files
Architecture diagram
sequenceDiagram
participant Caller as Caller (indexer)
participant walkFiles as walkFiles()
participant _walk as _walk()
participant _loadIgnore as _loadIgnoreForDir()
participant _isIgnored as _isIgnored()
participant FS as File System
participant ignorePkg as ignore package (dynamic import)
Note over Caller,_walk: Entry Point: walkFiles(root, extensions, ignore?)
Caller->>walkFiles: root, [".ts"], optional extra ignore patterns
walkFiles->>walkFiles: Build extensions Set (lowercased)
walkFiles->>walkFiles: Merge DEFAULT_IGNORED_DIRS + extra ignore patterns
walkFiles->>ignorePkg: dynamic import (lazy, cached)
ignorePkg-->>walkFiles: Ignore factory
walkFiles->>walkFiles: buildSpec(root, dirPatterns) → baseSpec
walkFiles->>_walk: yield * _walk(root, [baseSpec], extensionsSet)
loop Per directory (recursive)
_walk->>_loadIgnore: _loadIgnoreForDir(directory)
_loadIgnore->>FS: stat .gitignore
FS-->>_loadIgnore: File info or error
alt .gitignore exists
_loadIgnore->>FS: readFile(.gitignore)
FS-->>_loadIgnore: gitignore content
end
_loadIgnore->>FS: stat .cspignore
FS-->>_loadIgnore: File info or error
alt .cspignore exists
_loadIgnore->>FS: readFile(.cspignore)
FS-->>_loadIgnore: cspignore content
end
_loadIgnore->>_loadIgnore: Merge lines, skip blanks/comments
_loadIgnore->>ignorePkg: buildSpec(lines)
ignorePkg-->>_loadIgnore: Aggregate matcher + ParsedPattern[]
_loadIgnore-->>_walk: IgnoreSpec or null
_walk->>_walk: Merge dirSpec into inheritedSpecs
_walk->>FS: readdir(directory, withFileTypes)
FS-->>_walk: Dirent[]
loop Each entry (sorted)
_walk->>_walk: Skip symlinks
_walk->>_isIgnored: _isIgnored(fullPath, isDir, specs)
Note over _isIgnored: Walks patterns in source order
_isIgnored->>_isIgnored: For each pattern: pattern.matcher.ignores(candidate)
alt Pattern matches
_isIgnored->>_isIgnored: ignored = !pattern.negated
_isIgnored->>_isIgnored: found = !ignored && pattern.hasExtSuffix
end
_isIgnored-->>_walk: { ignored, found }
alt ignored == true
_walk->>_walk: Skip entry
else isDir
_walk->>_walk: yield * _walk(full, specs, extensions)
else isFile
alt found == true (negation bypass)
_walk->>_walk: yield full (even if extension not in allowlist)
else extensionsSet.has(extname)
_walk->>_walk: yield full
else
_walk->>_walk: Skip (extension not allowed)
end
end
end
end
_walk-->>walkFiles: File paths
walkFiles-->>Caller: AsyncGenerator of matching file paths
Apply gemini-code-assist's performance optimization to _isIgnored: - Add hasNegatedExtPattern cache on IgnoreSpec, computed once in buildSpec - Consult the aggregate ignore matcher via .test() as the hot path - Skip the per-pattern walk when the path doesn't match, is plainly ignored, or is unignored but the spec contains no negated extension patterns (so 'found' cannot change) - Fall back to the per-pattern loop only when a negation could win AND the spec carries at least one negated extension pattern — the only case where 'found' may flip to true Behavior is equivalent: the per-pattern loop's last-winning-pattern rule matches gitignore semantics that .test() already enforces. Verified by the existing 17 negation-bypass / nested-gitignore tests plus 4 new tests covering hasNegatedExtPattern and cross-spec state preservation. Mirrors semble's _is_ignored in src/semble/index/file_walker.py. Co-Authored-By: gemini-code-assist[bot] <gemini-code-assist[bot]@users.noreply.github.com>
amondnet
left a comment
There was a problem hiding this comment.
Applied 3, deferred 0.
All three gemini-code-assist comments addressed in 3a43e48:
- Added
hasNegatedExtPattern: booleantoIgnoreSpec(with updated doc comment) buildSpeccomputes the flag once per spec_isIgnoreduses the aggregate matcher's.test()as the hot path, falling back to the per-pattern loop only when a negation could win AND the spec carries a negated ext pattern — preserving the load-bearing negation-bypass semantics from semble's_is_ignored
cubic-dev-ai found no issues.
Tests: 21 pass (17 existing + 4 new covering hasNegatedExtPattern cache and cross-spec state preservation).
There was a problem hiding this comment.
1 issue found across 2 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="src/indexing/file-walker.ts">
<violation number="1" location="src/indexing/file-walker.ts:206">
P2: The new `.test()` double-false shortcut is not semantically safe: `ignore.test()` can return `{ignored:false, unignored:false}` for indirectly-unignored paths, so this early `continue` can preserve stale outer `ignored=true` and skip files that should be included.</violation>
</file>
Reply with feedback, questions, or to request a fix.
Re-trigger cubic
|
|
||
| const { ignored: isIgnoredBySpec, unignored: isUnignoredBySpec } = aggregateResult | ||
|
|
||
| if (!isIgnoredBySpec && !isUnignoredBySpec) { |
There was a problem hiding this comment.
P2: The new .test() double-false shortcut is not semantically safe: ignore.test() can return {ignored:false, unignored:false} for indirectly-unignored paths, so this early continue can preserve stale outer ignored=true and skip files that should be included.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At src/indexing/file-walker.ts, line 206:
<comment>The new `.test()` double-false shortcut is not semantically safe: `ignore.test()` can return `{ignored:false, unignored:false}` for indirectly-unignored paths, so this early `continue` can preserve stale outer `ignored=true` and skip files that should be included.</comment>
<file context>
@@ -175,14 +191,47 @@ export function _isIgnored(
+
+ const { ignored: isIgnoredBySpec, unignored: isUnignoredBySpec } = aggregateResult
+
+ if (!isIgnoredBySpec && !isUnignoredBySpec) {
+ // No pattern in this spec matched — preserve outer state.
+ continue
</file context>
Summary
Port of
src/semble/index/file_walker.pyto TypeScript. Provides thegitignore-aware directory traversal used by every indexing entry point in csp.
Unit 8 of the parallel port effort (MinishLab/semble → @pleaseai/csp).
What was ported
walkFiles(root, extensions, ignore?)— async generator yielding filesmatching
extensions, skipping ignored paths_isIgnored,_loadIgnoreForDir,_walk— internal helpers exported underthe same underscore-prefixed names as semble for parity
DEFAULT_IGNORED_DIRS—ReadonlySet<string>mirroring_DEFAULT_IGNORED_DIRSwith.semble/replaced by.csp/IgnoreSpec,IgnoreChecktypesNegation-bypass implementation choice
Semble's Python original iterates over
pathspec.GitIgnoreSpec.patterns,calling
pattern.match_file(...)and inspectingpattern.includeandpattern.patternper match. The npmignorepackage's.test()returnsthe winning rule when ignored is true but not when a negation
un-ignores — making the negation-with-extension detection impossible from
the package's public surface.
To recreate semble's behaviour exactly, we parse the gitignore lines
ourselves and build a
ParsedPattern[]of{ pattern, negated, hasExtSuffix, matcher }, wherematcheris a tinysingle-pattern
Ignoreinstance._isIgnoredthen walks the parsed listin source order, asking each per-pattern matcher whether it matches the
relative path; the last winning pattern decides
ignored, andfound = !ignored && pattern.hasExtSuffixreproduces the Python rule:Trade-off: more pattern compilation than a single aggregate matcher. The
aggregate matcher is still built and kept on
IgnoreSpec.specfor paritywith the Python
GitIgnoreSpecfield and as a reservation for futurefast-paths, but
_isIgnoredis the authoritative decision path.path.extname(pattern.replace(/\/+$/, ''))is used in place ofPath(pat.rstrip("/")).suffix— manual spot-checks show equivalent resultsfor the gitignore-relevant cases (
*.log→.log,vendor/→'',.github/*→'',__init__.py→.py).Tests
bun test src/indexing/file-walker.test.ts— 17 passing, 31 assertions.Coverage:
.tsdiscovery.git/andnode_modules/always ignored.gitignorewith*.logexcludesfoo.log.gitignorewith*.log\n!special.logincludesspecial.log(negation-with-extension bypass).cspignorehonoured alongside.gitignore.gitignoreignoreargument honoured_loadIgnoreForDirnull/merge/comment behaviour_isIgnoredfound-flag semanticsNotes for integrators
package.jsonis unchanged. Theignorepackage is referenced via dynamicimport; the rest of this module compiles even when the dep is not yet
installed. Add
ignoreto dependencies in Unit 0 (or the dep-consolidationPR) and replace the dynamic import with a static one.
await import('ignore')succeeding; when the dep isabsent the
describeblocks become.skip.Source of truth
/Users/lms/.ask/github/github.com/MinishLab/semble/main/src/semble/index/file_walker.pySummary by cubic
Ports the gitignore-aware file walker from semble to TypeScript for indexing. It honors
.gitignore/.cspignore, skips noisy dirs, and preserves the negation-with-extension bypass so!*.extcan include files outside the allowlist.New Features
walkFiles(root, extensions, ignore?)async generator; respects nested ignore files, skips symlinks, and matches extensions case-insensitively._isIgnored,_loadIgnoreForDir,_walk, andDEFAULT_IGNORED_DIRS(uses.csp/instead of.semble/)._isIgnoredadds a fast path via the aggregateignorematcher and ahasNegatedExtPatterncache; per-pattern walk runs only when a negated ext rule could win.ignorearg,hasNegatedExtPattern, and cross-spec state.Dependencies
ignorevia dynamic import; addignorelater and switch to a static import. Tests auto-skip whenignoreis not installed.Written for commit 3a43e48. Summary will update on new commits.