2 changes: 1 addition & 1 deletion Directory.Build.props
@@ -7,7 +7,7 @@
 <LangVersion>latest</LangVersion>
 <Nullable>enable</Nullable>
 <TreatWarningsAsErrors>true</TreatWarningsAsErrors>
-<VersionPrefix>0.1.2</VersionPrefix>
+<VersionPrefix>0.1.3</VersionPrefix>
 <VersionSuffix>alpha</VersionSuffix>
 </PropertyGroup>
 <PropertyGroup>
22 changes: 22 additions & 0 deletions IMPLEMENTATION_PLAN.md
@@ -205,10 +205,32 @@ bulldoze priorities.
- [x] `dotnet pack` validation: icon embedded in `.nupkg` confirmed
- [ ] Re-pack to validate icon embeds

### 16. Bash line comments (#25) — 0.1.3-alpha

- [x] `BashTokenKind.Comment` enum member (internal)
- [x] `BashLexer.ConsumeLineComment` helper; `#` dispatch in main scan loop
- [x] `BashCommandParser.FilterSignificant` drops Comment tokens
- [x] SPEC.md §4 BNF note + §5 "Comment handling" subsection
- [x] 10 new lexer unit tests + 8 new parser unit tests
- [x] 9 new corpus entries (123–131) including both Netclaw repros
(sanitized paths per SPEC §14)
- [x] `Directory.Build.props` `VersionPrefix` 0.1.2 → 0.1.3
- [x] `RELEASE_NOTES.md` 0.1.3-alpha section
- [ ] Cut 0.1.3-alpha tag once branch is merged

---

## NEXT (0.1.x — additive, post-alpha)

- Newline-as-statement-separator at the parser level (SPEC §4 gap
surfaced by #25). The lexer already emits Whitespace tokens for
newlines with the intent of acting as separators (see the lexer
comment at the newline branch), but `BashCommandParser.SplitIntoSegments`
only splits on `&&` / `||` / `;` / `|`. As a result, `cmd1\ncmd2`
currently parses to one clause `[cmd1]` with `cmd2` as an argument.
v0.1.3 corpus entry 126 works around this with an explicit `;`;
the long-term fix is to bridge newline-Whitespace tokens to the
segment splitter as synthetic `;` separators.
- Seed 50–100 corpus entries from sanitized real-world dogfood logs
(SPEC §14 workflow)
- Expand verb tables as corpus surfaces real commands
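
The newline-bridging idea in the first NEXT item could look roughly like the sketch below. It is only an illustration under stated assumptions: `TokKind`, `Tok`, and `NewlineBridgeSketch.Bridge` are hypothetical stand-ins rather than the library's internal types, and the sketch assumes each whitespace token carries its own text (the real `Whitespace` tokens have an empty `Value`, so the real fix would have to slice the source instead).

```csharp
using System.Collections.Generic;

// Hypothetical stand-ins for the lexer's internal token types.
public enum TokKind { Word, Whitespace, Comment, Operator }
public readonly record struct Tok(TokKind Kind, string Text);

public static class NewlineBridgeSketch
{
    // Drops comments and whitespace, but turns a newline-bearing whitespace
    // token into a synthetic `;` when it directly follows command text, so a
    // splitter that only knows `&&` / `||` / `;` / `|` sees a boundary.
    public static List<Tok> Bridge(IEnumerable<Tok> tokens)
    {
        var result = new List<Tok>();
        foreach (var t in tokens)
        {
            if (t.Kind == TokKind.Comment)
            {
                continue; // comments never reach the splitter
            }

            if (t.Kind == TokKind.Whitespace)
            {
                // Only bridge after a Word: this avoids a leading `;`, a
                // doubled `;`, and a `;` right after an operator such as `&&`.
                var followsWord = result.Count > 0 && result[^1].Kind == TokKind.Word;
                if (followsWord && t.Text.Contains('\n'))
                {
                    result.Add(new Tok(TokKind.Operator, ";"));
                }
                continue;
            }

            result.Add(t);
        }
        return result;
    }
}
```

With this shape, a trailing newline after the last command yields a trailing synthetic `;`, so a real splitter would need to tolerate an empty final segment.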
40 changes: 40 additions & 0 deletions RELEASE_NOTES.md
@@ -1,3 +1,43 @@
#### 0.1.3-alpha May 12th 2026 ####

Bash line comment handling. Public API unchanged.

**Fixed**

- **Bash line comments are now recognized and skipped (#25).** `BashLexer`
treats `#` at a word boundary (start of input, or preceded by
whitespace, a newline, or any operator) as the start of a comment
that runs to the next newline. The comment text is emitted as a new
internal `BashTokenKind.Comment` token for source fidelity and is
filtered by the parser alongside `Whitespace` / `Continuation`, so
it contributes no verb, args, redirects, or flags to any clause.
Comment-only input parses to `Clauses = []`, `IsUnparseable = false`,
matching the existing empty-/whitespace-only path. Quoting and
escape rules are honored: `#` inside single or double quotes is
literal, `#` in the interior of an unquoted word (e.g. `abc#def`)
is literal, and `\#` outside quotes is literal.

Before this fix, `# Extract worktree branches\ngit worktree list`
parsed to a single clause with verb chain `[#, Extract]`. The
comment text leaked into downstream approval prompts and broke
approval-state caching in consumers that extracted verb chains
asymmetrically: the persistence-time and retry-authorization paths
saw different verb sets, so tool calls failed after the user had
already clicked Approve.

**Behavior notes**

- Public API surface is unchanged (no `PublicApiSnapshotTests` delta).
- SPEC.md §4 / §5: new "Comment handling" subsection in §5 documents
the boundary rules; §4 BNF notes that comments are
whitespace-equivalent at the lexer level.
- Corpus: 9 new entries (123–131) pin every case from the issue
report, plus the two Netclaw repros (sanitized paths per §14).
- v0.1 still does not treat top-level newlines as statement separators
(SPEC §4 gap, tracked separately in IMPLEMENTATION_PLAN NEXT), so two
commands on separate lines, with or without a comment between them,
still need an explicit `;` separator to split into two clauses.

#### 0.1.2-alpha May 11th 2026 ####

Three parser correctness fixes. Public API unchanged.
45 changes: 45 additions & 0 deletions SPEC.md
@@ -357,6 +357,11 @@ quoted_string := single-quoted | double-quoted

- Whitespace between tokens is one or more spaces or tabs.
- `\` followed by a newline is a line continuation (treat as whitespace).
- Bash line comments (`#` at a word boundary through end-of-line) are
whitespace-equivalent at the lexer level — they emit a Comment token
for source fidelity but are filtered alongside Whitespace by the
parser, so they do not appear in the grammar. See §5 "Comment
handling" for boundary rules.
- `\` before a metachar inside a double-quoted string escapes the metachar.
- Single-quoted strings preserve all bytes literally — no escape processing.
- Heredocs (`<<EOF ... EOF`) are recognized as a redirect operator but the
@@ -400,6 +405,14 @@ The lexer produces tokens consumed by the parser. Token kinds:
the matching close (`))` or `}` respectively) and emits a sentinel
whose reason names the rejected construct. The parser consumes this
token by setting outer `ParsedCommand.IsUnparseable = true` (see §11).
- **COMMENT** — `#` at a word boundary (start of input, or preceded by
whitespace, a newline, an operator, or any other lexer-recognized
boundary) starts a line comment running to (but not including) the
next newline. The lexer emits a single Comment token covering the
`#` and the comment text, for source fidelity. The parser drops
Comment tokens in `FilterSignificant` alongside Whitespace and
Continuation — comments produce no clauses, args, redirects, or
flags. See "Comment handling" below for boundary rules.

### Quote handling

@@ -428,6 +441,38 @@ Operators terminate the current token. `cd /tmp&&ls` lexes as
`[cd, /tmp, &&, ls]` — no whitespace required around operators. The lexer
must handle this.

### Comment handling

- An unquoted `#` that appears at a **word boundary** starts a comment
that runs to (but does not include) the next newline. A word boundary
is: start of input, or the position immediately after a whitespace
run, a newline, an operator (`&&`, `||`, `;`, `|`, `>`, `>>`, `<`,
`2>`, `2>>`, `(`, `)`, `<<`, `<<-`), a quoted string, or an opaque
substitution. Equivalently: `#` is comment-start everywhere the
outer lexer dispatch loop sits, because every other lexer rule has
already consumed its territory before `#` is considered.
- `#` **inside** single or double quotes is a literal character (no
comment).
- `#` in the **interior** of an unquoted word (e.g. `abc#def`) is a
literal character. `ReadWord` consumes the whole word before the
outer loop can see the embedded `#`; there is no re-scanning.
- `\#` (backslash-escaped `#` outside quotes) is consumed by the
normal escape rule: the backslash is dropped and `#` becomes a
regular word character. For example, in `cmd \#abc` the second word
lexes as a single Word token `#abc`.
- The terminating newline is **not** consumed by the Comment token.
It survives as a Whitespace token, preserving statement-boundary
semantics for the parser (see §4).
- A Comment token's `Value` is empty (matching `Whitespace` /
`Continuation`); `SourceStart` / `SourceLength` identify the slice
including the leading `#` so callers that need the literal text can
recover it from the original input span.
- **Effect on parsing**: comment-only input parses to
`Clauses = []`, `IsUnparseable = false` — mirroring empty-input
behavior. A comment leading, trailing, or interleaved with a clause
contributes no tokens to the verb chain, args, or redirects of any
clause.
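
The boundary rules above can be sketched as a small standalone scanner. This is an illustrative sketch, not the library's `BashLexer`: `CommentBoundarySketch`, `FindComments`, and `IsOperatorChar` are hypothetical names, and the sketch only models quotes, backslash escapes, single-character operator boundaries, and unquoted words (no heredocs, continuations, or opaque substitutions).

```csharp
using System.Collections.Generic;

// Hypothetical sketch of the word-boundary rule; names are not from the library.
public static class CommentBoundarySketch
{
    // Returns (Start, Length) for each comment. The span includes the leading
    // `#` and excludes the terminating newline.
    public static List<(int Start, int Length)> FindComments(string input)
    {
        var spans = new List<(int Start, int Length)>();
        var i = 0;
        while (i < input.Length)
        {
            var c = input[i];
            if (c == '\'')
            {
                // Single quotes: everything up to the closing quote is literal.
                i++;
                while (i < input.Length && input[i] != '\'') i++;
                i++; // step past the closing quote (or past end of input)
            }
            else if (c == '"')
            {
                // Double quotes: a backslash escapes the next character.
                i++;
                while (i < input.Length && input[i] != '"')
                {
                    i += input[i] == '\\' ? 2 : 1;
                }
                i++;
            }
            else if (c == '#')
            {
                // Seen from the dispatch loop, so this `#` is at a word boundary.
                var start = i;
                while (i < input.Length && input[i] != '\n' && input[i] != '\r') i++;
                spans.Add((start, i - start));
            }
            else if (char.IsWhiteSpace(c) || IsOperatorChar(c))
            {
                i++; // separators leave the scanner at a word boundary
            }
            else
            {
                // Unquoted word: consumed whole, so an interior `#` or an
                // escaped `\#` never reaches the comment branch.
                while (i < input.Length
                       && !char.IsWhiteSpace(input[i])
                       && !IsOperatorChar(input[i])
                       && input[i] != '\'' && input[i] != '"')
                {
                    i += input[i] == '\\' ? 2 : 1;
                }
            }
        }
        return spans;
    }

    private static bool IsOperatorChar(char c) =>
        c == '&' || c == '|' || c == ';' || c == '<' || c == '>' || c == '(' || c == ')';
}
```

Each returned span can be turned back into the literal comment text with `input.Substring(span.Start, span.Length)`, in the same spirit as the `SourceStart` / `SourceLength` rule above. Under these assumptions, `git pull # update local` yields one span starting at the `#`, while `echo abc#def`, `echo 'use #foo'`, and `cmd \#abc` yield none.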

---

## 6. Verb Tables
32 changes: 32 additions & 0 deletions src/ShellSyntaxTree/Internal/Bash/Lexing/BashLexer.cs
@@ -183,6 +183,18 @@ internal static IReadOnlyList<BashToken> Tokenize(string input)
                continue;
            }

            // ---- line comment ----
            // Reaching this branch implies a word boundary (quotes,
            // operators, and opaque regions are dispatched above), so `#`
            // here starts a comment to EOL. Mid-word `#` is consumed by
            // ReadWord, and `\#` by its escape handling; neither reaches
            // this point. SPEC §5.
            if (c == '#')
            {
                i = ConsumeLineComment(src, i, tokens);
                continue;
            }

            // ---- word ----
            i = ReadWord(src, i, tokens);
        }
@@ -392,6 +404,26 @@ private static int ConsumeBacktickSubstitution(
        return start + length;
    }

    // Consume `#` through (but not including) the next newline. The
    // terminating newline stays in the stream so the outer loop emits it
    // as a Whitespace token, preserving SPEC §4 clause-boundary
    // semantics. Value is "" to match Whitespace/Continuation; callers
    // that need the literal text can slice the source via
    // SourceStart/SourceLength.
    private static int ConsumeLineComment(
        ReadOnlySpan<char> src, int start, List<BashToken> tokens)
    {
        var i = start;
        while (i < src.Length && src[i] != '\n' && src[i] != '\r')
        {
            i++;
        }

        tokens.Add(new BashToken(
            BashTokenKind.Comment, "", null, start, i - start, null));
        return i;
    }

    private static int ConsumeArithmetic(
        ReadOnlySpan<char> src, int start, List<BashToken> tokens)
    {
11 changes: 11 additions & 0 deletions src/ShellSyntaxTree/Internal/Bash/Lexing/BashTokenKind.cs
@@ -38,6 +38,17 @@ internal enum BashTokenKind
/// whitespace by the parser. SPEC §5.</summary>
Continuation,

/// <summary>A bash line comment — <c>#</c> at a word boundary
/// through end-of-line (the terminating newline is preserved as
/// a separate <see cref="Whitespace"/> token so statement
/// boundaries are unaffected). Emitted for source fidelity with
/// empty <see cref="BashToken.Value"/>; <see cref="BashToken.SourceStart"/>
/// and <see cref="BashToken.SourceLength"/> identify the slice.
/// The parser drops these in <c>FilterSignificant</c> alongside
/// <see cref="Whitespace"/> and <see cref="Continuation"/>.
/// SPEC §5.</summary>
Comment,

/// <summary>An opaque region — <c>$(…)</c> or backtick-quoted
/// <c>`…`</c>. The parser consumes one of these as a single
/// <c>Arg{ Kind = DynamicSkip, IsPath = false }</c> per the v0.1
@@ -496,7 +496,9 @@ private static List<BashToken> FilterSignificant(IReadOnlyList<BashToken> tokens
     var filtered = new List<BashToken>(tokens.Count);
     foreach (var t in tokens)
     {
-        if (t.Kind == BashTokenKind.Whitespace || t.Kind == BashTokenKind.Continuation)
+        if (t.Kind == BashTokenKind.Whitespace
+            || t.Kind == BashTokenKind.Continuation
+            || t.Kind == BashTokenKind.Comment)
         {
             continue;
         }
@@ -0,0 +1,9 @@
{
"name": "Comment-only input produces zero clauses",
"input": "# just a note",
"expected": {
"isUnparseable": false,
"clauses": []
},
"notes": "Issue #25 / SPEC §5: a comment-only script is whitespace-equivalent — zero significant tokens → zero clauses, not IsUnparseable."
}
@@ -0,0 +1,18 @@
{
"name": "Leading comment followed by a single command",
"input": "# fetch the latest\ngit pull",
"expected": {
"isUnparseable": false,
"clauses": [
{
"operator": "None",
"verb": ["git", "pull"],
"args": [],
"redirects": [],
"isSubshell": false,
"isBashCWrapped": false
}
]
},
"notes": "Issue #25: comment-as-verb regression. The leading explanatory line must not leak into the verb chain; v0.1.3 lexer skips `#`-to-EOL."
}
@@ -0,0 +1,18 @@
{
"name": "Inline trailing comment is dropped from the clause",
"input": "git pull # update local",
"expected": {
"isUnparseable": false,
"clauses": [
{
"operator": "None",
"verb": ["git", "pull"],
"args": [],
"redirects": [],
"isSubshell": false,
"isBashCWrapped": false
}
]
},
"notes": "Issue #25 / SPEC §5: a `#` preceded by whitespace starts a comment. The trailing text becomes a Comment token and is filtered out by the parser."
}
@@ -0,0 +1,26 @@
{
"name": "Comment between two ;-separated commands",
"input": "git pull ; # now build\ndotnet build",
"expected": {
"isUnparseable": false,
"clauses": [
{
"operator": "None",
"verb": ["git", "pull"],
"args": [],
"redirects": [],
"isSubshell": false,
"isBashCWrapped": false
},
{
"operator": "Sequence",
"verb": ["dotnet", "build"],
"args": [],
"redirects": [],
"isSubshell": false,
"isBashCWrapped": false
}
]
},
"notes": "Issue #25: a comment between two clauses must not pollute either verb chain. Here the comment trails the `;` on the first line. v0.1 uses `;` (not `\\n`) as the statement separator; newline-as-separator is a separate SPEC §4 gap tracked in IMPLEMENTATION_PLAN NEXT."
}
@@ -0,0 +1,20 @@
{
"name": "# inside double quotes is literal, not a comment",
"input": "echo \"hash is #1234\"",
"expected": {
"isUnparseable": false,
"clauses": [
{
"operator": "None",
"verb": ["echo"],
"args": [
{ "raw": "\"hash is #1234\"", "kind": "Literal", "isPath": false, "resolved": "__NULL__" }
],
"redirects": [],
"isSubshell": false,
"isBashCWrapped": false
}
]
},
"notes": "Issue #25 / SPEC §5: `#` is comment-start only at a word boundary. Inside double quotes it's a literal character."
}
@@ -0,0 +1,20 @@
{
"name": "# inside single quotes is literal, not a comment",
"input": "echo 'use #foo'",
"expected": {
"isUnparseable": false,
"clauses": [
{
"operator": "None",
"verb": ["echo"],
"args": [
{ "raw": "'use #foo'", "kind": "Literal", "isPath": false, "resolved": "__NULL__" }
],
"redirects": [],
"isSubshell": false,
"isBashCWrapped": false
}
]
},
"notes": "Issue #25 / SPEC §5: `#` inside single quotes is a literal byte (single-quote literalness rule)."
}
@@ -0,0 +1,20 @@
{
"name": "# in the interior of an unquoted word is literal",
"input": "echo abc#def",
"expected": {
"isUnparseable": false,
"clauses": [
{
"operator": "None",
"verb": ["echo"],
"args": [
{ "raw": "abc#def", "kind": "Literal", "isPath": false, "resolved": "__NULL__" }
],
"redirects": [],
"isSubshell": false,
"isBashCWrapped": false
}
]
},
"notes": "Issue #25 / SPEC §5: `#` is comment-start only at a word boundary (start-of-input or preceded by whitespace/operator). Mid-word `#` is consumed by ReadWord and stays literal."
}