
feat: add Bash/Shell (.sh) parsing support (#197) #230

Closed
azizur100389 wants to merge 1 commit into tirth8205:main from azizur100389:feat/bash-parsing

@azizur100389
Contributor

Summary

Register .sh / .bash / .zsh / .ksh with tree-sitter-bash. Extract File and Function nodes for shell function definitions, CALLS edges for command invocations (local functions and external binaries), and IMPORTS_FROM edges for source / . dot-includes. Functions with a test_* prefix are classified as Test kind.

Closes #197.

Root cause

Shell-script-heavy repos (installers, infra, CI tooling) previously indexed to 0 nodes / 0 edges because .sh was not in EXTENSION_TO_LANGUAGE. build_or_update_graph reported parsed 0 files for FragHub-style repos, which blocked architecture mapping and flow detection for the category of projects that most need them.

What's supported

  • Function definitions in both tree-sitter-bash shapes (paren form and function-keyword form, the latter with or without parentheses)
    foo() { ... }
    function foo { ... }
    function foo() { ... }
  • Call extraction: tree-sitter-bash wraps every invocation in a command node with a command_name > word child. Local calls resolve to qualified names (file.sh::greet); external binaries (curl, grep, awk) are recorded with their raw name, consistent with how unresolved calls work in other languages.
  • Dot-includes as imports: source lib.sh and . lib.sh emit IMPORTS_FROM edges. Both bareword (source lib/utils.sh) and quoted-string (source "lib/helpers.sh") arguments are handled.
  • Test detection: a test_* prefix is classified as Test kind via the existing _TEST_PATTERNS list.
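
As a rough illustration of the bareword versus quoted-string handling for includes, the sketch below uses Python's standard shlex module on raw script lines. It is a stand-in rather than the PR's code: the PR reads tree-sitter-bash nodes in _bash_get_source_target(), and the helper name source_target is hypothetical.

```python
import shlex
from typing import Optional


def source_target(line: str) -> Optional[str]:
    """Return the file included by a `source X` or `. X` line, else None.

    Hypothetical helper for illustration; the PR inspects tree-sitter nodes.
    """
    try:
        # posix-mode splitting strips the quotes around "lib/helpers.sh"
        tokens = shlex.split(line, comments=True)
    except ValueError:
        # Unbalanced quotes: nothing we can interpret from a single line.
        return None
    if len(tokens) >= 2 and tokens[0] in ("source", "."):
        target = tokens[1]
        # Targets built from expansions cannot be resolved statically.
        if "$" in target or "`" in target:
            return None
        return target
    return None


for line in ("source lib/utils.sh", ". lib.sh", 'source "lib/helpers.sh"', "echo hi"):
    print(repr(line), "->", source_target(line))
```

Both include spellings and both argument styles reduce to the same target string, which is what an IMPORTS_FROM edge would point at.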

Implementation

  1. EXTENSION_TO_LANGUAGE — added .sh, .bash, .zsh, .ksh → "bash"
  2. _CLASS_TYPES["bash"] = [] (no class concept in bash)
  3. _FUNCTION_TYPES["bash"] = ["function_definition"]
  4. _IMPORT_TYPES["bash"] = [] (source handled via constructs handler)
  5. _CALL_TYPES["bash"] = ["command"]
  6. New _extract_bash_constructs() + _bash_get_source_target() helpers following the _extract_lua_constructs pattern, dispatched next to the existing Lua/Luau handler.
  7. _get_name — new "bash" branch that reads the first word child of a function_definition (handles both paren and function-keyword forms).
  8. _get_call_name — new "bash" branch that reads command_name > word, skipping commands whose name is a variable expansion (can't be resolved statically).

Scope caveats

  • No cross-file call resolution beyond direct source includes (bash has no formal module system).
  • Calls whose name is a variable expansion ($foo, $(cmd)) are not statically resolvable and are skipped by _get_call_name.
  • Pipelines, subshells, heredocs are parsed by tree-sitter but not semantically linked yet.
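
The call-name rules described above (skip expansions, qualify local functions, keep external binaries raw) can be sketched in a few lines. The function name resolve_call and its parameters are hypothetical; the PR splits this logic between _get_call_name and the existing resolution code.

```python
from typing import Optional, Set


def resolve_call(name: str, file_path: str, local_functions: Set[str]) -> Optional[str]:
    """Map a command name to a CALLS-edge target, or None if it is skipped."""
    # Variable expansions and command substitutions are only known at runtime.
    if not name or name.startswith("$") or name.startswith("`"):
        return None
    # Functions defined in the same file get a qualified name.
    if name in local_functions:
        return f"{file_path}::{name}"
    # Anything else is treated as an external binary and kept as-is.
    return name


local = {"greet"}
for cmd in ("greet", "curl", "$foo", "$(cmd)"):
    print(cmd, "->", resolve_call(cmd, "file.sh", local))
```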

Tests added (tests/test_multilang.py::TestBashParsing — 10 tests)

  • test_detects_language (.sh, .bash, .zsh)
  • test_finds_function_definitions_paren_form
  • test_finds_function_definitions_function_keyword_form
  • test_finds_source_imports (source + . + quoted string)
  • test_finds_calls_between_local_functions
  • test_finds_external_command_calls
  • test_finds_contains_edges
  • test_detects_test_functions
  • test_nodes_have_bash_language
  • test_calls_inside_functions

New fixture: tests/fixtures/sample.sh exercises both function-definition forms, both import forms, local + external call mix, and test_* naming.

Test results

| Stage | Result |
| --- | --- |
| Stage 1: new targeted tests | 10/10 passed |
| Stage 2: tests/test_multilang.py full | 140/140 passed, zero regressions across any language |
| Stage 3: adjacent tests/test_parser.py | 67/67 passed |
| Stage 4: full suite | 705 passed (up from 699 baseline, +6 new); 6 pre-existing Windows failures in test_incremental / test_notebook (verified identical on unchanged main) |
| Stage 5: ruff check on changed files | clean |
| Stage 6: fixture smoke parse | 8 nodes, 26 edges; all expected function, call, import, and test nodes present |

Zero regressions. All new code is gated on the bash language check so existing languages are untouched.

@tirth8205
Owner

Thank you for this @CodeBlackwell — closing as superseded by PR #227 which already shipped Bash/Shell support in v2.3.0 (now on PyPI, along with v2.3.1 that adds the Windows MCP hang fix).

Your PR was in flight at the same time I shipped #227, so this isn't redundant work on your end — the two implementations arrived essentially in parallel. Both approaches are very similar: .sh/.bash/.zsh registered with tree-sitter-bash, function_definition → Function nodes, command → CALLS edges, source/. → IMPORTS_FROM. The shipped version uses _extract_bash_source_command for the import hook and has 7 multilang tests covering the same cases you tested.

If there's anything your implementation does that mine doesn't (e.g. .ksh extension, additional test coverage), please open a follow-up issue pointing at the specific feature and I'll merge it in. The .ksh extension in particular looks worth adding — I didn't include it in #227.

Really appreciate the thorough test breakdown (10/10 targeted, 140/140 multilang, 705 full suite). That's exactly how I'd want a PR tested.

@tirth8205 tirth8205 closed this Apr 11, 2026
azizur100389 added a commit to azizur100389/code-review-graph that referenced this pull request Apr 11, 2026
Register .ksh (Korn shell) with tree-sitter-bash alongside the existing
.sh / .bash / .zsh entries added in tirth8205#227. Korn shell is close enough to
bash syntactically that tree-sitter-bash handles the structural features
the graph captures (function definitions, commands, source/. includes)
correctly.

Context
-------
In the close comment on PR tirth8205#230, @tirth8205 explicitly flagged .ksh as a
missing extension:

    "The .ksh extension in particular looks worth adding — I didn't
     include it in tirth8205#227."

This PR addresses exactly that gap. Issue tirth8205#235 tracks the request.

Why it matters
--------------
Korn shell is still used in legacy AIX/Solaris operations, IBM internal
tooling, and enterprise CI scripts. Repositories that ship .ksh scripts
currently index to 0 nodes because the extension is unrecognized — the
same failure mode that motivated tirth8205#197.

Implementation
--------------
One line added to EXTENSION_TO_LANGUAGE in parser.py:
    ".ksh": "bash"

All of the bash parsing machinery shipped in tirth8205#227 (_FUNCTION_TYPES,
_CALL_TYPES, _extract_bash_source_command, name/call resolution) already
supports any file parsed through the "bash" language path, so no further
changes are needed.

Tests added (tests/test_multilang.py::TestBashParsing)
------------------------------------------------------
1. test_detects_language — extended with a .ksh assertion to lock in
   the extension mapping (regression guard for tirth8205#235).
2. test_ksh_extension_parses_as_bash — end-to-end regression test that
   copies the existing tests/fixtures/sample.sh to a temp .ksh file,
   parses it through the real CodeParser, and asserts:
     - every node's language field is "bash"
     - the set of extracted Function names is identical to the .sh run
     - the CONTAINS / CALLS / IMPORTS_FROM edge counts per kind match
   The second assertion proves the .ksh path is fully wired through to
   the same structural extraction as .sh, not a degenerate zero-result
   read.
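
The comparison that test_ksh_extension_parses_as_bash describes could be sketched as follows. The tuple shapes for nodes and edges are assumptions made for this illustration; the real test works on whatever CodeParser returns.

```python
from collections import Counter
from typing import Iterable, Tuple

Node = Tuple[str, str, str]  # (kind, name, language) -- assumed shape
Edge = Tuple[str, str, str]  # (kind, source, target) -- assumed shape


def structural_summary(nodes: Iterable[Node], edges: Iterable[Edge]):
    """Reduce a parse result to the facts the test compares."""
    nodes = list(nodes)
    languages = {lang for _, _, lang in nodes}
    functions = {name for kind, name, _ in nodes if kind == "Function"}
    edge_counts = Counter(kind for kind, _, _ in edges)
    return languages, functions, edge_counts


def assert_ksh_matches_sh(sh_result, ksh_result) -> None:
    """Raise AssertionError unless the .ksh parse mirrors the .sh parse."""
    _, sh_funcs, sh_edges = structural_summary(*sh_result)
    ksh_langs, ksh_funcs, ksh_edges = structural_summary(*ksh_result)
    assert ksh_langs == {"bash"}  # every node tagged language=bash
    assert ksh_funcs == sh_funcs  # same Function names as the .sh run
    for kind in ("CONTAINS", "CALLS", "IMPORTS_FROM"):
        assert ksh_edges[kind] == sh_edges[kind]
```

Comparing names and per-kind counts, rather than full qualified names, keeps the check independent of the temporary file path the .ksh copy is written to.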

Test results
------------
Stage 1 (new targeted tests): 2/2 passed.
Stage 2 (tests/test_multilang.py full): 152/152 passed — zero regressions
  across any language.
Stage 3 (tests/test_parser.py adjacent): 67/67 passed.
Stage 4 (full suite): 733 passed. 8 pre-existing Windows failures in
  test_incremental (3) + test_main async coroutine detection (1) +
  test_notebook Databricks (4) — verified identical on unchanged main.
Stage 5 (ruff check on parser.py and test_multilang.py): clean.
Stage 6 (end-to-end smoke): detect_language("legacy.ksh") -> "bash";
  parsing a real .ksh file produces 6 Function nodes, 18 edges, all
  tagged language=bash.

Zero regressions. Single-line extension mapping change plus a targeted
regression guard against the specific issue the maintainer flagged.
sebastianbreguel added a commit to sebastianbreguel/code-review-graph that referenced this pull request Apr 14, 2026
Adds .ksh to EXTENSION_TO_LANGUAGE so Korn shell scripts are parsed
through the existing tree-sitter-bash grammar. Follow-up to the bash
parser shipped in v2.3.1 (tirth8205#197, PR tirth8205#230). The maintainer explicitly
flagged .ksh as worth adding in his close comment on PR tirth8205#230.

Changes:
- parser.py: add ".ksh": "bash" entry
- tests/test_multilang.py: extend test_detects_language with .ksh
  and add test_ksh_extension_parses_as_bash (functions + CALLS)
- tests/fixtures/sample.ksh: minimal Korn shell fixture exercising
  functions, source, and internal calls

Test plan:
- uv run pytest tests/test_multilang.py::TestBashParsing  -> 8 passed
- uv run pytest tests/test_multilang.py                   -> 152 passed
- uv run ruff check code_review_graph/parser.py tests/test_multilang.py -> clean
azizur100389 added a commit to azizur100389/code-review-graph that referenced this pull request Apr 14, 2026
azizur100389 added a commit to azizur100389/code-review-graph that referenced this pull request Apr 14, 2026
…-less scripts (tirth8205#235, tirth8205#237)

Two parser improvements that expand code-review-graph's file coverage
to extension-less Unix scripts and Korn shell files.

Feature 1: .ksh extension → bash parser (tirth8205#235)
-----------------------------------------------
Register .ksh (Korn shell) with tree-sitter-bash alongside the existing
.sh / .bash / .zsh entries shipped in v2.3.0.  Korn shell is close enough
to bash syntactically that tree-sitter-bash handles the structural
features the graph captures correctly.

Context: in the close comment on PR tirth8205#230, @tirth8205 explicitly flagged
this as worth adding: "The .ksh extension in particular looks worth
adding — I didn't include it in tirth8205#227."

Tests: test_detects_language extended with .ksh assertion;
test_ksh_extension_parses_as_bash — end-to-end regression test that
copies sample.sh to a temp .ksh file, parses it, and asserts identical
function set and edge counts.

Feature 2: shebang-based language detection (tirth8205#237)
--------------------------------------------------
detect_language() was extension-only — any file with no extension returned
None and was silently skipped.  This misses a huge category of production
files: git hooks, CI scripts, bin/ entry points, installers.

New SHEBANG_INTERPRETER_TO_LANGUAGE table maps common interpreter
basenames to languages already registered:
  bash/sh/zsh/ksh/dash/ash -> bash
  python/python2/python3/pypy/pypy3 -> python
  node/nodejs -> javascript
  ruby, perl, lua, Rscript, php

New _detect_language_from_shebang(path) static method reads the first
256 bytes, handles direct form (#!/bin/bash), env indirection
(#!/usr/bin/env bash), env -S flags, trailing flags (#!/bin/bash -e),
CRLF, binary content, and strict UTF-8 decoding.

detect_language() now falls back to the shebang probe for files with
no extension (suffix == "").  Files with a known extension are never
re-read — extension-based detection stays authoritative.
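
A minimal sketch of the shebang probe and the fallback described above, assuming a simplified interpreter table. The target language names for ruby, perl, lua, Rscript, and php are not stated in the commit message, so those entries are left out here; the two-entry extension table is likewise a placeholder.

```python
from pathlib import Path
from typing import Optional

EXTENSION_TO_LANGUAGE = {".sh": "bash", ".py": "python"}  # placeholder subset

SHEBANG_INTERPRETER_TO_LANGUAGE = {
    "bash": "bash", "sh": "bash", "zsh": "bash",
    "ksh": "bash", "dash": "bash", "ash": "bash",
    "python": "python", "python2": "python", "python3": "python",
    "pypy": "python", "pypy3": "python",
    "node": "javascript", "nodejs": "javascript",
}


def detect_language_from_shebang(path: Path) -> Optional[str]:
    """Probe the first 256 bytes of a file for a recognizable shebang."""
    try:
        with open(path, "rb") as fh:
            head = fh.read(256)
    except OSError:
        return None
    # No shebang, or NUL bytes that suggest binary content.
    if not head.startswith(b"#!") or b"\x00" in head:
        return None
    first_line = head.split(b"\n", 1)[0].rstrip(b"\r")  # tolerate CRLF
    try:
        text = first_line[2:].decode("utf-8")  # strict decoding by default
    except UnicodeDecodeError:
        return None
    parts = text.split()
    if not parts:
        return None
    interpreter = parts[0].rsplit("/", 1)[-1]
    if interpreter == "env":
        # Skip env's own options (such as -S) and take the first real argument.
        rest = [p for p in parts[1:] if not p.startswith("-")]
        if not rest:
            return None
        interpreter = rest[0].rsplit("/", 1)[-1]
    return SHEBANG_INTERPRETER_TO_LANGUAGE.get(interpreter)


def detect_language(path: Path) -> Optional[str]:
    """Extension lookup first; shebang probe only when there is no suffix."""
    if path.suffix:
        return EXTENSION_TO_LANGUAGE.get(path.suffix.lower())
    return detect_language_from_shebang(path)
```

Because the probe runs only when the suffix is empty, a file such as script.py with a bash shebang is still reported as python, and files with an extension are never opened during detection.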

Tests (16 new in test_parser.py): every interpreter mapping, env -S flag,
trailing flags, missing shebang, empty file, binary content, unknown
interpreter, extension-does-not-get-overridden, and end-to-end
parse_file producing function nodes from an extension-less bash script.

Files changed
-------------
- code_review_graph/parser.py — .ksh mapping + SHEBANG_INTERPRETER_TO_LANGUAGE
  table + _detect_language_from_shebang() + detect_language() fallback
- tests/test_multilang.py — .ksh detection + end-to-end ksh parsing test
- tests/test_parser.py — 16 shebang detection tests
azizur100389 added a commit to azizur100389/code-review-graph that referenced this pull request Apr 14, 2026
…-less scripts (tirth8205#235, tirth8205#237)

azizur100389 added a commit to azizur100389/code-review-graph that referenced this pull request Apr 14, 2026
…-less scripts (tirth8205#235, tirth8205#237)

tirth8205 pushed a commit that referenced this pull request Apr 18, 2026
…on-less scripts (#276)

npkriami18 pushed a commit to npkriami18/code-review-graph that referenced this pull request Apr 21, 2026
…on-less scripts (tirth8205#276)


(cherry picked from commit e6e3144)

Linked issue: Support Bash/Shell (.sh) parsing for script-heavy repositories