fix: threading crash, duplicate symbols, logging, and embedding insert (4 bugs)#11

Merged
kapillamba4 merged 2 commits into kapillamba4:main from jimdawdy-hub:fix/threading-duplicate-symbols-logging on May 20, 2026


@jimdawdy-hub jimdawdy-hub commented May 19, 2026

Summary

Four bugs found while indexing openclaw/openclaw — a large TypeScript/Swift/Kotlin monorepo with 17,212 source files, 945 doc files, and 60+ extensions. All four bugs surface only at scale and were invisible in small test cases. A fifth bug (schema mismatch in get_index_stats) was found via MCP client validation.


Bug 1 — Cross-thread SQLite access crashes ~30% of file parses

Symptom: sqlite3.InterfaceError: bad parameter or other API misuse on roughly 30% of files during the parallel parse phase. Appeared as NoneType: None in logs due to Bug 4 below.

Root cause: _parse_file_for_indexing runs inside ThreadPoolExecutor workers but calls db.execute() on the shared main-thread connection. Passing check_same_thread=False only disables the sqlite3 module's thread-ownership check; it does not serialize access, so using one connection from several threads at once is still unsafe without explicit locking.

Fix: Pre-fetch all existing file records into a dict[path → mtime] in the main thread before launching the pool. Workers receive the dict and do a dict.get() lookup instead of a DB query. No DB access occurs in any worker thread.
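A minimal sketch of the pre-fetch pattern, for illustration only. The table name `files`, the column name `path`, and the helpers `load_known_mtimes`, `needs_reindex`, and `find_stale` are assumptions, not the names used in the patch; `last_modified` is the column mentioned under Bug 5.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor


def load_known_mtimes(db: sqlite3.Connection) -> dict[str, float]:
    """Read every (path, mtime) pair once, on the thread that owns the connection."""
    # Assumed schema: files(path, last_modified).
    rows = db.execute("SELECT path, last_modified FROM files").fetchall()
    return {path: mtime for path, mtime in rows}


def needs_reindex(path: str, current_mtime: float, known: dict[str, float]) -> bool:
    """Worker-side check: a plain dict lookup, no database access."""
    previous = known.get(path)
    return previous is None or current_mtime > previous


def find_stale(db: sqlite3.Connection, candidates: dict[str, float], workers: int = 4) -> list[str]:
    """Return the candidate paths that are new or modified since the last index."""
    known = load_known_mtimes(db)  # the only DB access, done before the pool starts
    with ThreadPoolExecutor(max_workers=workers) as pool:
        flags = list(
            pool.map(lambda item: needs_reindex(item[0], item[1], known), candidates.items())
        )
    return [path for path, stale in zip(candidates, flags) if stale]
```

The dict is only read by the workers, never mutated, so no locking is needed around it.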


Bug 2 — Duplicate symbols from tree-sitter AST crash DB write

Symptom: sqlite3.IntegrityError: UNIQUE constraint failed: symbols.file_id, symbols.name, symbols.kind, symbols.line_start in Phase 3 (DB write), after all parsing and GPU embedding had already completed.

Root cause: tree-sitter can produce multiple symbols with identical (name, kind, line_start) for a single file. The plain INSERT INTO symbols raised on the second occurrence and killed the entire run.

Fix: INSERT OR IGNORE INTO symbols. Use cursor.rowcount == 1 to detect a real insert. Important: cursor.lastrowid is not reliable here — after a no-op INSERT OR IGNORE it retains the rowid from the previous successful insert, not 0.
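The insert-or-lookup logic can be sketched roughly as follows. The helper name `insert_symbol` and the `id` primary-key column are assumptions; the four uniqueness columns come from the error message above.

```python
import sqlite3


def insert_symbol(db: sqlite3.Connection, file_id: int, name: str, kind: str, line_start: int):
    """Insert a symbol if it is not already present. Returns (symbol_id, is_new)."""
    cur = db.execute(
        "INSERT OR IGNORE INTO symbols (file_id, name, kind, line_start) "
        "VALUES (?, ?, ?, ?)",
        (file_id, name, kind, line_start),
    )
    if cur.rowcount == 1:
        # A row was really written, so lastrowid refers to it.
        return cur.lastrowid, True
    # The insert was ignored. Do not trust lastrowid here; look the row up instead.
    row = db.execute(
        "SELECT id FROM symbols "
        "WHERE file_id = ? AND name = ? AND kind = ? AND line_start = ?",
        (file_id, name, kind, line_start),
    ).fetchone()
    return row[0], False
```

Returning `is_new` alongside the ID lets the caller decide whether follow-up work, such as queuing an embedding, is needed.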


Bug 3 — Embedding insert crashes on sqlite-vec virtual table

Symptom: sqlite3.OperationalError: UNIQUE constraint failed on symbol_embeddings primary key immediately after the Bug 2 fix was applied.

Root cause: symbol_embeddings is a sqlite-vec virtual table. It rejects INSERT OR IGNORE at the SQL level (OperationalError, not IntegrityError). When Bug 2's fallback SELECT returned an existing symbol_id, the code still queued an embedding for it — and the existing row already had one.

Fix: Guard embedding_pairs.append() with if is_new — only queue embeddings for freshly inserted symbols. Existing symbols already have an embedding in the virtual table.
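A small sketch of the guard, independent of sqlite-vec. `embedding_pairs` and `is_new` are the names used in this PR; the function `queue_embeddings` and its argument shapes are hypothetical and only serve the illustration.

```python
def queue_embeddings(results, vectors):
    """Pair each freshly inserted symbol with its embedding vector.

    results: list of (symbol_id, is_new) tuples, one per parsed symbol.
    vectors: list of embeddings in the same order as results.
    """
    embedding_pairs = []
    for (symbol_id, is_new), vector in zip(results, vectors):
        if is_new:
            # Only new symbols are queued; an existing symbol_id already has
            # a row in symbol_embeddings, and a second insert would fail.
            embedding_pairs.append((symbol_id, vector))
    return embedding_pairs
```

Filtering before the write avoids relying on conflict handling in the virtual table at all.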


Bug 4 — logger.exception() reports all errors as NoneType: None

Symptom: Every parse failure logged as NoneType: None with no traceback, making Bug 1 completely invisible.

Root cause: Exceptions from worker threads are stored as return values. logger.exception("Failed to index %s", fpath) reads sys.exc_info() — the current thread's exception context — which is (None, None, None) since the exception happened in a different thread.

Fix: logger.error("Failed to index %s", fpath, exc_info=error) passes the stored exception explicitly.
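For illustration, a self-contained sketch of the pattern. `parse_in_worker` and `log_failure` are hypothetical names, and the ValueError is a stand-in for a real parse failure; the `(fpath, None, e)` return shape is the one described in this PR.

```python
import logging


def parse_in_worker(fpath):
    """Return (fpath, result, error) instead of raising."""
    try:
        raise ValueError(f"cannot parse {fpath}")  # stand-in for a real failure
    except Exception as e:
        return (fpath, None, e)


def log_failure(logger: logging.Logger, fpath, error: BaseException) -> None:
    # exc_info accepts an exception instance, so the traceback stored on
    # `error` is logged even though no exception is being handled here.
    logger.error("Failed to index %s", fpath, exc_info=error)
```

Because the exception object carries its own `__traceback__`, the log record shows where the failure happened in the worker.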


Bug 5 — get_index_stats freshness fields fail MCP schema validation

Symptom: MCP clients that validate tool output against api_types.py receive a Pydantic ValidationError on every get_index_stats call. The tool appears to fail even though the server responds successfully.

Root cause: db.py builds the freshness dict with key last_file_indexed (a raw float Unix timestamp from MAX(last_modified)), but api_types.IndexFreshness declares last_code_indexed: str | None. Two mismatches: wrong field name and wrong type.

Fix: Rename key to last_code_indexed and convert the float via datetime.fromtimestamp(...).isoformat().
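The conversion might look like the sketch below. `build_freshness` is a hypothetical helper, not the function in db.py, and the None branch for an empty index is an assumption based on the `str | None` declaration.

```python
from datetime import datetime


def build_freshness(max_last_modified):
    """Build the freshness dict with the key and value type the schema declares."""
    if max_last_modified is None:
        # Assumed case: MAX(last_modified) is NULL when no files are indexed.
        return {"last_code_indexed": None}
    # fromtimestamp() without a tz argument yields a naive local-time datetime.
    return {"last_code_indexed": datetime.fromtimestamp(max_last_modified).isoformat()}
```

Note that the resulting string carries no UTC offset, so clients that compare timestamps across machines would need to account for that.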


Test results on openclaw/openclaw

| Metric | Before | After |
|---|---|---|
| Files successfully parsed | ~70% (~12k/17k) | 100% (17,212/17,212) |
| DB write crash | Yes (every run) | No |
| Symbols indexed | | 111,000 |
| Final DB size | | 750 MB |
| Total runtime (RTX 5060 Ti) | Never completed | ~59 min |
| get_index_stats MCP validation | Fails (schema mismatch) | Passes |

Verification

# extensions/telegram (337 files) — fast integration test
rm -f extensions/telegram/code_memory.db*
python -m code_memory.parser  # or via MCP index_codebase
# Before patch: 330/337 files, 7 skipped with NoneType: None
# After patch:  337/337 files, 0 errors

🤖 Generated with Claude Code

Four bugs found while indexing openclaw/openclaw (17,212 source files,
945 doc files) on an RTX 5060 Ti. The repo is a large TypeScript/Swift/
Kotlin monorepo (~17k files across 60+ extensions). All bugs surface only
at scale and were invisible in small test cases.

---

Bug 1: cross-thread SQLite access crashes ~30% of file parses

_parse_file_for_indexing ran inside ThreadPoolExecutor workers and called
db.execute() on the shared main-thread connection. This caused:

  sqlite3.InterfaceError: bad parameter or other API misuse

on roughly 30% of files, even though the connection was opened with
check_same_thread=False. Python's sqlite3 binding is not safe for
concurrent access without explicit locking.

Fix: pre-fetch all existing file records into a dict[path → mtime] in the
main thread before launching the pool. Workers receive the dict and do a
dict.get() lookup instead of a DB query. No DB access in any worker thread.

---

Bug 2: duplicate symbols from tree-sitter AST crash DB write

tree-sitter can produce multiple symbols with the same (name, kind,
line_start) for a single file. The plain INSERT INTO symbols raised:

  sqlite3.IntegrityError: UNIQUE constraint failed:
    symbols.file_id, symbols.name, symbols.kind, symbols.line_start

This killed the entire DB write phase after all parsing and GPU embedding
had already completed — wasting the entire indexing run.

Fix: INSERT OR IGNORE INTO symbols. Use cursor.rowcount == 1 to detect
whether the insert actually happened. cursor.lastrowid is NOT reliable
here — after a no-op INSERT OR IGNORE it retains the rowid from the
previous successful insert on the same connection, not 0.

---

Bug 3: embedding insert crashes on sqlite-vec virtual table

After the Bug 2 fix, a duplicate symbol falls through to a SELECT that
returns the existing symbol_id. That ID already has an entry in
symbol_embeddings (a sqlite-vec virtual table). Attempting to insert
another embedding for it raised:

  sqlite3.OperationalError: UNIQUE constraint failed on
    symbol_embeddings primary key

INSERT OR IGNORE does not work on sqlite-vec virtual tables — the
conflict-resolution clause is rejected at the SQL level (OperationalError
instead of the usual IntegrityError).

Fix: guard embedding_pairs.append() with `if is_new` — only freshly
inserted symbols get embeddings queued. Existing symbols already have one.

---

Bug 4: logger.exception() reports all errors as "NoneType: None"

Exceptions from worker threads are stored as return values:
  return (fpath, None, e)

Then in the main thread:
  logger.exception("Failed to index %s", fpath)

logger.exception() reads sys.exc_info() — the current thread's exception
context — which is (None, None, None) since the exception occurred in a
different thread. Every failure logged as "NoneType: None" with no
traceback, making Bug 1 completely invisible.

Fix: logger.error("Failed to index %s", fpath, exc_info=error)

---

Tested against openclaw/openclaw:
  Before: ~30% of files silently skipped; DB write crash on first run
  After:  17,212/17,212 code files indexed, 111,000 symbols, 750 MB DB

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

get_index_stats was returning `last_file_indexed` (a raw float Unix
timestamp) in the freshness dict, but api_types.py defines the field
as `last_code_indexed: str | None`. This caused a Pydantic validation
error in MCP clients that validate tool output against the schema.

Two changes in get_index_stats():
- Rename key from `last_file_indexed` to `last_code_indexed`
- Convert float timestamp to ISO-8601 string via datetime.fromtimestamp().isoformat()

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@kapillamba4 kapillamba4 merged commit a9a4be0 into kapillamba4:main May 20, 2026
kapillamba4 added a commit that referenced this pull request May 20, 2026
PR #11 left an unsorted import block in db.py (`from datetime import
datetime` placed among the plain `import` statements), which fails
`ruff check` (I001) and broke CI on main. Move it into the sorted
from-import group.

Bump version 1.0.32 -> 1.0.33 in pyproject.toml, server.json (x2),
and uv.lock.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jimdawdy-hub jimdawdy-hub deleted the fix/threading-duplicate-symbols-logging branch May 21, 2026 00:48