Skip to content

feat(indexing): port gitignore-aware file walker from semble#8

Merged
amondnet merged 2 commits into
mainfrom
feat/unit-8-file-walker
May 28, 2026
Merged

feat(indexing): port gitignore-aware file walker from semble#8
amondnet merged 2 commits into
mainfrom
feat/unit-8-file-walker

Conversation

@amondnet

@amondnet amondnet commented May 28, 2026

Copy link
Copy Markdown
Contributor

Summary

Port of src/semble/index/file_walker.py to TypeScript. Provides the
gitignore-aware directory traversal used by every indexing entry point in csp.

Unit 8 of the parallel port effort (MinishLab/semble → @pleaseai/csp).

What was ported

  • walkFiles(root, extensions, ignore?) — async generator yielding files
    matching extensions, skipping ignored paths
  • _isIgnored, _loadIgnoreForDir, _walk — internal helpers exported under
    the same underscore-prefixed names as semble for parity
  • DEFAULT_IGNORED_DIRSReadonlySet<string> mirroring
    _DEFAULT_IGNORED_DIRS with .semble/ replaced by .csp/
  • IgnoreSpec, IgnoreCheck types

Negation-bypass implementation choice

Semble's Python original iterates over pathspec.GitIgnoreSpec.patterns,
calling pattern.match_file(...) and inspecting pattern.include and
pattern.pattern per match. The npm ignore package's .test() returns
the winning rule when ignored is true but not when a negation
un-ignores — making the negation-with-extension detection impossible from
the package's public surface.

To recreate semble's behaviour exactly, we parse the gitignore lines
ourselves and build a ParsedPattern[] of
{ pattern, negated, hasExtSuffix, matcher }, where matcher is a tiny
single-pattern Ignore instance. _isIgnored then walks the parsed list
in source order, asking each per-pattern matcher whether it matches the
relative path; the last winning pattern decides ignored, and
found = !ignored && pattern.hasExtSuffix reproduces the Python rule:

Bypass the extension filter only for negation patterns with a file
extension suffix (e.g. !special.kjs, !*.py). Patterns without a
suffix (e.g. !vendor/, !.github/*) target directories or broad globs
and should not bypass extension filtering.

Trade-off: more pattern compilation than a single aggregate matcher. The
aggregate matcher is still built and kept on IgnoreSpec.spec for parity
with the Python GitIgnoreSpec field and as a reservation for future
fast-paths, but _isIgnored is the authoritative decision path.

path.extname(pattern.replace(/\/+$/, '')) is used in place of
Path(pat.rstrip("/")).suffix — manual spot-checks show equivalent results
for the gitignore-relevant cases (*.log.log, vendor/'',
.github/*'', __init__.py.py).

Tests

bun test src/indexing/file-walker.test.ts — 17 passing, 31 assertions.

Coverage:

  • Recursive .ts discovery
  • Symlinks skipped
  • .git/ and node_modules/ always ignored
  • .gitignore with *.log excludes foo.log
  • .gitignore with *.log\n!special.log includes special.log (negation-with-extension bypass)
  • .cspignore honoured alongside .gitignore
  • Nested per-directory .gitignore
  • Extra ignore argument honoured
  • Case-insensitive extension matching
  • _loadIgnoreForDir null/merge/comment behaviour
  • _isIgnored found-flag semantics

Notes for integrators

  • package.json is unchanged. The ignore package is referenced via dynamic
    import; the rest of this module compiles even when the dep is not yet
    installed. Add ignore to dependencies in Unit 0 (or the dep-consolidation
    PR) and replace the dynamic import with a static one.
  • Tests are gated on await import('ignore') succeeding; when the dep is
    absent the describe blocks become .skip.

Source of truth

/Users/lms/.ask/github/github.com/MinishLab/semble/main/src/semble/index/file_walker.py


Summary by cubic

Ports the gitignore-aware file walker from semble to TypeScript for indexing. It honors .gitignore/.cspignore, skips noisy dirs, and preserves the negation-with-extension bypass so !*.ext can include files outside the allowlist.

  • New Features

    • walkFiles(root, extensions, ignore?) async generator; respects nested ignore files, skips symlinks, and matches extensions case-insensitively.
    • Exports _isIgnored, _loadIgnoreForDir, _walk, and DEFAULT_IGNORED_DIRS (uses .csp/ instead of .semble/).
    • _isIgnored adds a fast path via the aggregate ignore matcher and a hasNegatedExtPattern cache; per-pattern walk runs only when a negated ext rule could win.
    • Tests cover recursion, default ignores, negation-with-extension bypass, the extra ignore arg, hasNegatedExtPattern, and cross-spec state.
  • Dependencies

    • Uses ignore via dynamic import; add ignore later and switch to a static import. Tests auto-skip when ignore is not installed.

Written for commit 3a43e48. Summary will update on new commits.

Port of src/semble/index/file_walker.py.

- walkFiles(root, extensions, ignore?) async generator
- _isIgnored, _loadIgnoreForDir, _walk helpers exported with underscore prefix to mirror semble
- DEFAULT_IGNORED_DIRS includes .csp/ (replacement for .semble/)
- Honours .gitignore and .cspignore (semble's .sembleignore)
- Recreates semble's negation-with-extension bypass: when a winning gitignore
  pattern is a negation (!pattern) AND the pattern (stripped of trailing /)
  has a file extension suffix, the file bypasses the extension allowlist via
  the 'found' flag. Implemented by parsing patterns ourselves and storing
  { pattern, negated, hasExtSuffix, matcher } per line, since the 'ignore'
  npm package does not expose the winning rule for negations.

Tests cover: recursion, symlink skipping, default dirs, .gitignore exclusion,
negation-with-extension bypass, .cspignore, nested .gitignore, extra ignore
arg, case-insensitive extension matching.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request ports the file walker and its corresponding tests from Python to TypeScript, implementing recursive file traversal that respects .gitignore and .cspignore files. The reviewer suggests a significant performance optimization for the _isIgnored function: by caching whether an IgnoreSpec contains any negated patterns with file extensions, the walker can use the aggregate matcher's .test() method to bypass the expensive per-pattern loop for the vast majority of files.

Comment thread src/indexing/file-walker.ts
Comment thread src/indexing/file-walker.ts
Comment thread src/indexing/file-walker.ts

@cubic-dev-ai cubic-dev-ai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No issues found across 2 files

Architecture diagram
sequenceDiagram
    participant Caller as Caller (indexer)
    participant walkFiles as walkFiles()
    participant _walk as _walk()
    participant _loadIgnore as _loadIgnoreForDir()
    participant _isIgnored as _isIgnored()
    participant FS as File System
    participant ignorePkg as ignore package (dynamic import)

    Note over Caller,_walk: Entry Point: walkFiles(root, extensions, ignore?)

    Caller->>walkFiles: root, [".ts"], optional extra ignore patterns
    walkFiles->>walkFiles: Build extensions Set (lowercased)
    walkFiles->>walkFiles: Merge DEFAULT_IGNORED_DIRS + extra ignore patterns
    walkFiles->>ignorePkg: dynamic import (lazy, cached)
    ignorePkg-->>walkFiles: Ignore factory
    walkFiles->>walkFiles: buildSpec(root, dirPatterns) → baseSpec
    walkFiles->>_walk: yield * _walk(root, [baseSpec], extensionsSet)

    loop Per directory (recursive)
        _walk->>_loadIgnore: _loadIgnoreForDir(directory)
        _loadIgnore->>FS: stat .gitignore
        FS-->>_loadIgnore: File info or error
        alt .gitignore exists
            _loadIgnore->>FS: readFile(.gitignore)
            FS-->>_loadIgnore: gitignore content
        end
        _loadIgnore->>FS: stat .cspignore
        FS-->>_loadIgnore: File info or error
        alt .cspignore exists
            _loadIgnore->>FS: readFile(.cspignore)
            FS-->>_loadIgnore: cspignore content
        end
        _loadIgnore->>_loadIgnore: Merge lines, skip blanks/comments
        _loadIgnore->>ignorePkg: buildSpec(lines)
        ignorePkg-->>_loadIgnore: Aggregate matcher + ParsedPattern[]
        _loadIgnore-->>_walk: IgnoreSpec or null

        _walk->>_walk: Merge dirSpec into inheritedSpecs
        _walk->>FS: readdir(directory, withFileTypes)
        FS-->>_walk: Dirent[]

        loop Each entry (sorted)
            _walk->>_walk: Skip symlinks
            _walk->>_isIgnored: _isIgnored(fullPath, isDir, specs)
            Note over _isIgnored: Walks patterns in source order
            _isIgnored->>_isIgnored: For each pattern: pattern.matcher.ignores(candidate)
            alt Pattern matches
                _isIgnored->>_isIgnored: ignored = !pattern.negated
                _isIgnored->>_isIgnored: found = !ignored && pattern.hasExtSuffix
            end
            _isIgnored-->>_walk: { ignored, found }

            alt ignored == true
                _walk->>_walk: Skip entry
            else isDir
                _walk->>_walk: yield * _walk(full, specs, extensions)
            else isFile
                alt found == true (negation bypass)
                    _walk->>_walk: yield full (even if extension not in allowlist)
                else extensionsSet.has(extname)
                    _walk->>_walk: yield full
                else
                    _walk->>_walk: Skip (extension not allowed)
                end
            end
        end
    end

    _walk-->>walkFiles: File paths
    walkFiles-->>Caller: AsyncGenerator of matching file paths
Loading

Re-trigger cubic

Apply gemini-code-assist's performance optimization to _isIgnored:

- Add hasNegatedExtPattern cache on IgnoreSpec, computed once in buildSpec
- Consult the aggregate ignore matcher via .test() as the hot path
- Skip the per-pattern walk when the path doesn't match, is plainly
  ignored, or is unignored but the spec contains no negated extension
  patterns (so 'found' cannot change)
- Fall back to the per-pattern loop only when a negation could win AND
  the spec carries at least one negated extension pattern — the only
  case where 'found' may flip to true

Behavior is equivalent: the per-pattern loop's last-winning-pattern
rule matches gitignore semantics that .test() already enforces. Verified
by the existing 17 negation-bypass / nested-gitignore tests plus 4 new
tests covering hasNegatedExtPattern and cross-spec state preservation.

Mirrors semble's _is_ignored in src/semble/index/file_walker.py.

Co-Authored-By: gemini-code-assist[bot] <gemini-code-assist[bot]@users.noreply.github.com>

@amondnet amondnet left a comment

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Applied 3, deferred 0.

All three gemini-code-assist comments addressed in 3a43e48:

  1. Added hasNegatedExtPattern: boolean to IgnoreSpec (with updated doc comment)
  2. buildSpec computes the flag once per spec
  3. _isIgnored uses the aggregate matcher's .test() as the hot path, falling back to the per-pattern loop only when a negation could win AND the spec carries a negated ext pattern — preserving the load-bearing negation-bypass semantics from semble's _is_ignored

cubic-dev-ai found no issues.

Tests: 21 pass (17 existing + 4 new covering hasNegatedExtPattern cache and cross-spec state preservation).

@amondnet amondnet self-assigned this May 28, 2026

@cubic-dev-ai cubic-dev-ai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 issue found across 2 files (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="src/indexing/file-walker.ts">

<violation number="1" location="src/indexing/file-walker.ts:206">
P2: The new `.test()` double-false shortcut is not semantically safe: `ignore.test()` can return `{ignored:false, unignored:false}` for indirectly-unignored paths, so this early `continue` can preserve stale outer `ignored=true` and skip files that should be included.</violation>
</file>

Reply with feedback, questions, or to request a fix.

Re-trigger cubic


const { ignored: isIgnoredBySpec, unignored: isUnignoredBySpec } = aggregateResult

if (!isIgnoredBySpec && !isUnignoredBySpec) {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: The new .test() double-false shortcut is not semantically safe: ignore.test() can return {ignored:false, unignored:false} for indirectly-unignored paths, so this early continue can preserve stale outer ignored=true and skip files that should be included.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At src/indexing/file-walker.ts, line 206:

<comment>The new `.test()` double-false shortcut is not semantically safe: `ignore.test()` can return `{ignored:false, unignored:false}` for indirectly-unignored paths, so this early `continue` can preserve stale outer `ignored=true` and skip files that should be included.</comment>

<file context>
@@ -175,14 +191,47 @@ export function _isIgnored(
+
+    const { ignored: isIgnoredBySpec, unignored: isUnignoredBySpec } = aggregateResult
+
+    if (!isIgnoredBySpec && !isUnignoredBySpec) {
+      // No pattern in this spec matched — preserve outer state.
+      continue
</file context>

@amondnet amondnet merged commit d0b1ec5 into main May 28, 2026
1 check passed
@amondnet amondnet deleted the feat/unit-8-file-walker branch May 28, 2026 16:05
This was referenced Jun 18, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant