fix: threading crash, duplicate symbols, logging, and embedding insert (4 bugs) by jimdawdy-hub · Pull Request #11 · kapillamba4/code-memory

jimdawdy-hub · 2026-05-19T19:50:53Z

Summary

Four bugs found while indexing openclaw/openclaw — a large TypeScript/Swift/Kotlin monorepo with 17,212 source files, 945 doc files, and 60+ extensions. All four bugs surface only at scale and were invisible in small test cases. A fifth bug (schema mismatch in get_index_stats) was found via MCP client validation.

Bug 1 — Cross-thread SQLite access crashes ~30% of file parses

Symptom: sqlite3.InterfaceError: bad parameter or other API misuse on roughly 30% of files during the parallel parse phase. Appeared as NoneType: None in logs due to Bug 4 below.

Root cause: _parse_file_for_indexing runs inside ThreadPoolExecutor workers but calls db.execute() on the shared main-thread connection. Python's sqlite3 binding is not safe for concurrent multi-thread access even with check_same_thread=False.

Fix: Pre-fetch all existing file records into a dict[path → mtime] in the main thread before launching the pool. Workers receive the dict and do a dict.get() lookup instead of a DB query. No DB access occurs in any worker thread.
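
The pre-fetch pattern described above can be sketched as follows. This is a minimal illustration, not the code from the PR: the table name `files`, the helper names, and the staleness rule are assumptions, and only the `last_modified` column and the path-to-mtime dict come from the description.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor


def load_known_mtimes(db: sqlite3.Connection) -> dict[str, float]:
    """Main thread only: snapshot path -> mtime once, before the pool starts."""
    # Table name `files` is hypothetical; `last_modified` is named in the PR.
    rows = db.execute("SELECT path, last_modified FROM files").fetchall()
    return {path: mtime for path, mtime in rows}


def needs_reindex(path: str, current_mtime: float, known: dict[str, float]) -> bool:
    """Worker thread: a read-only dict lookup, no DB access."""
    previous = known.get(path)
    return previous is None or current_mtime > previous


def plan_work(
    db: sqlite3.Connection, candidates: dict[str, float], max_workers: int = 4
) -> list[str]:
    """Return the candidate paths that are new or modified since the last index."""
    known = load_known_mtimes(db)  # the only DB query, issued from the main thread
    paths = list(candidates)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        flags = list(pool.map(lambda p: needs_reindex(p, candidates[p], known), paths))
    return [p for p, stale in zip(paths, flags) if stale]
```

Because the workers only read the dict and never mutate it, no locking is needed around the lookup.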

Bug 2 — Duplicate symbols from tree-sitter AST crash DB write

Symptom: sqlite3.IntegrityError: UNIQUE constraint failed: symbols.file_id, symbols.name, symbols.kind, symbols.line_start in Phase 3 (DB write), after all parsing and GPU embedding had already completed.

Root cause: tree-sitter can produce multiple symbols with identical (name, kind, line_start) for a single file. The plain INSERT INTO symbols raised on the second occurrence and killed the entire run.

Fix: INSERT OR IGNORE INTO symbols. Use cursor.rowcount == 1 to detect a real insert. Important: cursor.lastrowid is not reliable here — after a no-op INSERT OR IGNORE it retains the rowid from the previous successful insert, not 0.
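
A minimal sketch of the rowcount check, assuming a `symbols` table with an integer `id` primary key and the unique constraint named in the error message. The function name and the fallback SELECT are illustrative rather than taken from the PR diff.

```python
import sqlite3


def insert_symbol(
    db: sqlite3.Connection, file_id: int, name: str, kind: str, line_start: int
) -> tuple[int, bool]:
    """Insert a symbol, returning (symbol_id, is_new)."""
    key = (file_id, name, kind, line_start)
    cur = db.execute(
        "INSERT OR IGNORE INTO symbols (file_id, name, kind, line_start) "
        "VALUES (?, ?, ?, ?)",
        key,
    )
    if cur.rowcount == 1:
        # A row was really written, so lastrowid refers to it.
        return cur.lastrowid, True
    # The insert was ignored: do not trust lastrowid, look up the existing row.
    row = db.execute(
        "SELECT id FROM symbols "
        "WHERE file_id = ? AND name = ? AND kind = ? AND line_start = ?",
        key,
    ).fetchone()
    return row[0], False
```

Returning the `is_new` flag alongside the id lets the caller decide what to do with duplicates without a second query.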

Bug 3 — Embedding insert crashes on sqlite-vec virtual table

Symptom: sqlite3.OperationalError: UNIQUE constraint failed on symbol_embeddings primary key immediately after the Bug 2 fix was applied.

Root cause: symbol_embeddings is a sqlite-vec virtual table. It rejects INSERT OR IGNORE at the SQL level (OperationalError, not IntegrityError). When Bug 2's fallback SELECT returned an existing symbol_id, the code still queued an embedding for it — and the existing row already had one.

Fix: Guard embedding_pairs.append() with if is_new — only queue embeddings for freshly inserted symbols. Existing symbols already have an embedding in the virtual table.
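
The guard itself is small. The sketch below assumes the write phase has a list of `(symbol_id, is_new)` results and a parallel list of vectors; the function name and the data shapes are hypothetical, and only `embedding_pairs` and `is_new` appear in the PR description.

```python
def queue_embeddings(
    inserted: list[tuple[int, bool]], vectors: list[list[float]]
) -> list[tuple[int, list[float]]]:
    """Pair each freshly inserted symbol with its vector; skip existing symbols."""
    embedding_pairs: list[tuple[int, list[float]]] = []
    for (symbol_id, is_new), vector in zip(inserted, vectors):
        if is_new:  # an existing symbol already has a row in symbol_embeddings
            embedding_pairs.append((symbol_id, vector))
    return embedding_pairs
```

Filtering before the write means the virtual table never sees a conflicting primary key, so no conflict-resolution clause is needed there.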

Bug 4 — `logger.exception()` reports all errors as `NoneType: None`

Symptom: Every parse failure logged as NoneType: None with no traceback, making Bug 1 completely invisible.

Root cause: Exceptions from worker threads are stored as return values. logger.exception("Failed to index %s", fpath) reads sys.exc_info() — the current thread's exception context — which is (None, None, None) since the exception happened in a different thread.

Fix: logger.error("Failed to index %s", fpath, exc_info=error) passes the stored exception explicitly.
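
The following sketch reproduces the situation in miniature: a worker returns its exception as a value, and the main thread logs it with `exc_info=error`. The function names, the logger name, and the deliberately failing parser are made up for illustration.

```python
import io
import logging
from concurrent.futures import ThreadPoolExecutor


def parse_in_worker(fpath: str):
    """Hypothetical worker: always fails, returning the exception as a value."""
    try:
        raise ValueError(f"cannot parse {fpath}")
    except Exception as e:
        return (fpath, None, e)


def log_failures(logger: logging.Logger, results) -> None:
    """Main thread: pass the stored exception explicitly via exc_info."""
    for fpath, _parsed, error in results:
        if error is not None:
            logger.error("Failed to index %s", fpath, exc_info=error)


def demo() -> str:
    """Run one failing parse in a pool and return the captured log text."""
    stream = io.StringIO()
    logger = logging.getLogger("index_demo")
    logger.addHandler(logging.StreamHandler(stream))
    logger.propagate = False
    with ThreadPoolExecutor(max_workers=2) as pool:
        results = list(pool.map(parse_in_worker, ["a.ts"]))
    log_failures(logger, results)
    return stream.getvalue()
```

The exception instance carries its own `__traceback__`, so the log record includes the worker's stack even though it is emitted from the main thread.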

Bug 5 — `get_index_stats` freshness fields fail MCP schema validation

Symptom: MCP clients that validate tool output against api_types.py receive a Pydantic ValidationError on every get_index_stats call. The tool appears to fail even though the server responds successfully.

Root cause: db.py builds the freshness dict with key last_file_indexed (a raw float Unix timestamp from MAX(last_modified)), but api_types.IndexFreshness declares last_code_indexed: str | None. Two mismatches: wrong field name and wrong type.

Fix: Rename key to last_code_indexed and convert the float via datetime.fromtimestamp(...).isoformat().

Test results on openclaw/openclaw

Metric	Before	After
Files successfully parsed	~70% (~12k/17k)	100% (17,212/17,212)
DB write crash	Yes (every run)	No
Symbols indexed	—	111,000
Final DB size	—	750 MB
Total runtime (RTX 5060 Ti)	Never completed	~59 min
`get_index_stats` MCP validation	Fails (schema mismatch)	Passes

Verification

# extensions/telegram (337 files) — fast integration test
rm -f extensions/telegram/code_memory.db*
python -m code_memory.parser  # or via MCP index_codebase
# Before patch: 330/337 files, 7 skipped with NoneType: None
# After patch:  337/337 files, 0 errors

🤖 Generated with Claude Code

Four bugs found while indexing openclaw/openclaw (17,212 source files, 945 doc files) on an RTX 5060 Ti. The repo is a large TypeScript/Swift/Kotlin monorepo (~17k files across 60+ extensions). All bugs surface only at scale and were invisible in small test cases.

---

Bug 1: cross-thread SQLite access crashes ~30% of file parses

_parse_file_for_indexing ran inside ThreadPoolExecutor workers and called db.execute() on the shared main-thread connection. This caused:

    sqlite3.InterfaceError: bad parameter or other API misuse

on roughly 30% of files, even though the connection was opened with check_same_thread=False. Python's sqlite3 binding is not safe for concurrent access without explicit locking.

Fix: pre-fetch all existing file records into a dict[path → mtime] in the main thread before launching the pool. Workers receive the dict and do a dict.get() lookup instead of a DB query. No DB access in any worker thread.

---

Bug 2: duplicate symbols from tree-sitter AST crash DB write

tree-sitter can produce multiple symbols with the same (name, kind, line_start) for a single file. The plain INSERT INTO symbols raised:

    sqlite3.IntegrityError: UNIQUE constraint failed: symbols.file_id, symbols.name, symbols.kind, symbols.line_start

This killed the entire DB write phase after all parsing and GPU embedding had already completed, wasting the entire indexing run.

Fix: INSERT OR IGNORE INTO symbols. Use cursor.rowcount == 1 to detect whether the insert actually happened. cursor.lastrowid is NOT reliable here: after a no-op INSERT OR IGNORE it retains the rowid from the previous successful insert on the same connection, not 0.

---

Bug 3: embedding insert crashes on sqlite-vec virtual table

After the Bug 2 fix, a duplicate symbol falls through to a SELECT that returns the existing symbol_id. That ID already has an entry in symbol_embeddings (a sqlite-vec virtual table). Attempting to insert another embedding for it raised:

    sqlite3.OperationalError: UNIQUE constraint failed on symbol_embeddings primary key

INSERT OR IGNORE does not work on sqlite-vec virtual tables: the conflict-resolution clause is rejected at the SQL level (OperationalError instead of the usual IntegrityError).

Fix: guard embedding_pairs.append() with `if is_new`, so only freshly inserted symbols get embeddings queued. Existing symbols already have one.

---

Bug 4: logger.exception() reports all errors as "NoneType: None"

Exceptions from worker threads are stored as return values:

    return (fpath, None, e)

Then in the main thread:

    logger.exception("Failed to index %s", fpath)

logger.exception() reads sys.exc_info() (the current thread's exception context), which is (None, None, None) since the exception occurred in a different thread. Every failure logged as "NoneType: None" with no traceback, making Bug 1 completely invisible.

Fix: logger.error("Failed to index %s", fpath, exc_info=error)

---

Tested against openclaw/openclaw:
  Before: ~30% of files silently skipped; DB write crash on first run
  After:  17,212/17,212 code files indexed, 111,000 symbols, 750 MB DB

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

get_index_stats was returning `last_file_indexed` (a raw float Unix timestamp) in the freshness dict, but api_types.py defines the field as `last_code_indexed: str | None`. This caused a Pydantic validation error in MCP clients that validate tool output against the schema.

Two changes in get_index_stats():
- Rename key from `last_file_indexed` to `last_code_indexed`
- Convert float timestamp to ISO-8601 string via datetime.fromtimestamp().isoformat()

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

PR #11 left an unsorted import block in db.py (`from datetime import datetime` placed among the plain `import` statements), which fails `ruff check` (I001) and broke CI on main. Move it into the sorted from-import group.

Bump version 1.0.32 -> 1.0.33 in pyproject.toml, server.json (x2), and uv.lock.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

jimdawdy-hub mentioned this pull request May 19, 2026

[Suggestion/Dev Tool] Code-Memory DB for OpenClaw development use openclaw/openclaw#84295

Closed

kapillamba4 approved these changes May 20, 2026


kapillamba4 merged commit a9a4be0 into kapillamba4:main May 20, 2026

jimdawdy-hub deleted the fix/threading-duplicate-symbols-logging branch May 21, 2026 00:48

kapillamba4 merged 2 commits into kapillamba4:main from jimdawdy-hub:fix/threading-duplicate-symbols-logging
