feat(types): port Chunk/SearchResult/ContentType from semble#5
Code Review
This pull request ports Python types and serialization helpers to TypeScript in src/types.ts, along with corresponding unit tests in src/types.test.ts. The feedback suggests adding runtime validation to chunkFromDict to safely handle malformed or untrusted input data, and adding unit tests to verify that invalid inputs correctly throw errors.
No issues found across 2 files
Architecture diagram
sequenceDiagram
participant Client as Client/CLI
participant Types as Types Module
participant Storage as Index Storage
participant Telemetry as Telemetry Logger
Note over Client,Telemetry: NEW: Type definitions imported from semble
Client->>Types: Import ContentType/CallType
Types-->>Client: String literal values ('code'/'docs'/'config', 'search'/'find_related')
Note over Client,Types: Chunk lifecycle
Client->>Types: chunkToDict(chunk)
alt language is undefined
Types->>Types: Emit language: null
else language is set
Types->>Types: Emit language value
end
Types->>Types: Compute location from filePath:startLine-endLine
Types-->>Client: ChunkDict (includes location)
alt Persisting to storage
Client->>Storage: Write ChunkDict (JSON)
Storage-->>Client: Confirmation
end
alt Reading from storage
Client->>Storage: Read ChunkDict (JSON)
Storage-->>Client: ChunkDict data
Client->>Types: chunkFromDict(data)
alt location present
Types->>Types: Strip location (derived, not trusted)
end
alt language is null
Types->>Types: Convert to undefined
end
alt language is string
Types->>Types: Keep as string
end
Types-->>Client: Chunk (immutable, readonly)
end
Note over Client,Types: SearchResult serialization
Client->>Types: searchResultToDict(result)
Types->>Types: chunkToDict(chunk) for nested chunk
Types-->>Client: SearchResultDict (chunk + score)
Note over Client,Telemetry: Telemetry parity
Client->>Telemetry: Log with CallType ('search'/'find_related')
Telemetry-->>Client: Acknowledged
Note over Client,Types: Embedding matrix operations
Client->>Types: Create EmbeddingMatrix (Float32Array)
Types-->>Client: Flat row-major buffer
Client->>Types: Create EmbeddingShape { rows, dim }
Types-->>Client: Shape descriptor
Client->>Client: Compute embeddings @ query (contiguous BLAS sweep)
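The last step of the diagram, scoring a query against a flat row-major buffer, could look roughly like the sketch below. The names EmbeddingMatrix and EmbeddingShape come from the PR; scoreRows and its plain nested loop are hypothetical stand-ins for whatever the port actually does, shown only to illustrate the row * dim offset arithmetic.

```typescript
// Type names follow the PR description; everything else is an assumption.
type EmbeddingMatrix = Float32Array;

interface EmbeddingShape {
  readonly rows: number;
  readonly dim: number;
}

// Hypothetical helper: dot product of every row with the query,
// walking the flat buffer once from start to end.
function scoreRows(
  matrix: EmbeddingMatrix,
  shape: EmbeddingShape,
  query: Float32Array,
): Float32Array {
  if (matrix.length !== shape.rows * shape.dim) {
    throw new RangeError("matrix length does not match rows * dim");
  }
  if (query.length !== shape.dim) {
    throw new RangeError("query length does not match dim");
  }
  const scores = new Float32Array(shape.rows);
  for (let r = 0; r < shape.rows; r++) {
    const base = r * shape.dim; // row r starts at offset r * dim
    let acc = 0;
    for (let d = 0; d < shape.dim; d++) {
      // The non-null assertions keep noUncheckedIndexedAccess satisfied.
      acc += matrix[base + d]! * query[d]!;
    }
    scores[r] = acc;
  }
  return scores;
}

// Three rows of dimension 2: [1, 0], [0, 1], [1, 1].
const matrix: EmbeddingMatrix = new Float32Array([1, 0, 0, 1, 1, 1]);
const scores = scoreRows(matrix, { rows: 3, dim: 2 }, new Float32Array([2, 3]));
// scores holds 2, 3, 5
```

Because the rows sit back to back in one buffer, the inner loop reads memory sequentially, and the same Float32Array can be written to disk without any per-row bookkeeping.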
Add runtime validation to chunkFromDict to fail loudly with TypeError on malformed JSON / untrusted payloads, since TypeScript's compile-time ChunkDictInput is bypassed at the JSON boundary. Cover null/non-object, missing fields, wrong-typed fields, and bad language with tests. Identified by gemini-code-assist.
amondnet
left a comment
Applied 2 (both gemini-code-assist findings), deferred 0.
- src/types.ts: chunkFromDict now validates input at runtime — null/non-object, missing required fields, and wrong-typed language all throw TypeError. The compile-time ChunkDictInput is bypassed at the JSON boundary, so failing loudly here prevents bad data (e.g. NaN line numbers) from polluting the index.
- src/types.test.ts: added 3 new tests covering invalid inputs (null/non-object, missing/wrong-typed required fields, wrong-typed language).
cubic found no issues. All 12 tests pass under bun test.
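A minimal sketch of the kind of runtime validation described in this comment, assuming the wire keys are the same camelCase names as the Chunk fields. The actual src/types.ts is not shown here, so the error messages and the exact order of checks are illustrative only.

```typescript
interface Chunk {
  readonly content: string;
  readonly filePath: string;
  readonly startLine: number;
  readonly endLine: number;
  readonly language?: string;
}

// Illustrative version: accepts unknown because JSON.parse output
// carries no compile-time guarantees.
function chunkFromDict(data: unknown): Chunk {
  if (typeof data !== "object" || data === null || Array.isArray(data)) {
    throw new TypeError("chunk payload must be a non-null object");
  }
  const { content, filePath, startLine, endLine, language } =
    data as Record<string, unknown>;

  // A missing field is undefined, so it fails the same typeof checks.
  if (typeof content !== "string") throw new TypeError("content must be a string");
  if (typeof filePath !== "string") throw new TypeError("filePath must be a string");
  // typeof alone still admits NaN and ±Infinity; finiteness is a separate check.
  if (typeof startLine !== "number") throw new TypeError("startLine must be a number");
  if (typeof endLine !== "number") throw new TypeError("endLine must be a number");
  if (language !== undefined && language !== null && typeof language !== "string") {
    throw new TypeError("language must be a string, null, or absent");
  }

  // Any location key on the payload is ignored: it is derived data.
  const base = { content, filePath, startLine, endLine };
  // Omit language entirely when it is null/undefined so the result
  // stays valid under exactOptionalPropertyTypes.
  return typeof language === "string" ? { ...base, language } : base;
}

const parsed: unknown = JSON.parse(
  '{"content":"x","filePath":"a.ts","startLine":1,"endLine":2,"language":null,"location":"wrong:9-9"}',
);
const chunk = chunkFromDict(parsed);
```

Taking unknown rather than ChunkDictInput in this sketch forces every field through a check before it reaches the returned Chunk.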
1 issue found across 2 files (changes from recent commits).
cubic-dev-ai P2 follow-up: typeof === 'number' permits NaN, Infinity, -Infinity, which would propagate as broken line numbers downstream (file-saturation math, location strings, find-related boundary checks). Add Number.isFinite() checks alongside the existing typeof guard and cover NaN, +Infinity, -Infinity in the test matrix.
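The gap is easy to reproduce in isolation. In the sketch below, isFiniteNumber and requireLineNumber are hypothetical helper names, not taken from the PR; Number.isFinite is the standard built-in.

```typescript
// Hypothetical guard: a number in the typeof sense that is also finite.
function isFiniteNumber(value: unknown): value is number {
  return typeof value === "number" && Number.isFinite(value);
}

// Hypothetical field check that fails loudly, in the spirit of the review.
function requireLineNumber(value: unknown, field: string): number {
  if (!isFiniteNumber(value)) {
    throw new TypeError(`${field} must be a finite number`);
  }
  return value;
}

// All three values pass a bare typeof check but fail the finite guard.
for (const bad of [NaN, Infinity, -Infinity]) {
  console.log(typeof bad === "number", isFiniteNumber(bad)); // true false
}
```

Number.isFinite does not coerce its argument, unlike the global isFinite, so a string such as "12" is rejected rather than silently converted.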
Port of src/semble/types.py to TypeScript. Unit 1 of the parallel semble → csp port effort.

What's ported
- ContentType: string-literal const (Code = 'code', Docs = 'docs', Config = 'config'). Values match the Python str enum so CLI flags and persisted indices round-trip.
- CallType: string-literal const (Search = 'search', FindRelated = 'find_related'). Values match the Python str enum for ~/.csp/savings.jsonl telemetry parity.
- Chunk interface: content, filePath, startLine, endLine, language?. Public fields are camelCase, per ARCHITECTURE.md ("Public field names are camelCase, not snake_case"). All readonly to mirror the Python frozen=True dataclass.
- SearchResult interface: { chunk, score }.
- IndexStats interface: { indexedFiles, totalChunks, languages }.
- EmbeddingMatrix = Float32Array (flat row-major) + EmbeddingShape = { rows, dim } companion. Comment explains the rationale: dense retrieval is one contiguous BLAS-style sweep, so a flat buffer beats Float32Array[] for cache locality and persistence simplicity.

Helper functions
- chunkLocation(chunk) → filePath:startLine-endLine (port of the Python Chunk.location @property; kept as a free function because Chunk is a plain interface).
- chunkToDict(chunk) → ChunkDict including location. Emits language: null (not omitted) to mirror the Python dataclasses.asdict JSON shape.
- chunkFromDict(data) → Chunk. Strips location before reconstruction (it's derived; trusting it would let a malformed payload desync from the line range). Accepts null | undefined | string for language (wire-format tolerant).
- searchResultToDict(result) → SearchResultDict.

Tests (src/types.test.ts)
9 tests covering:
- ContentType / CallType enum-value parity with Python.
- chunkLocation formatting (multi-line and single-line).
- chunkToDict ↔ chunkFromDict roundtrip with language set and omitted.
- chunkFromDict strips location (verified by passing a deliberately-wrong location).
- chunkFromDict accepts language: null (wire format).
- searchResultToDict shape.

Verification
- bun test src/types.test.ts: 9 pass, 0 fail.
- tsc --noEmit: clean (strict + noUncheckedIndexedAccess + exactOptionalPropertyTypes + verbatimModuleSyntax).

Notes
- No package.json changes.
- Imports are extensionless (from './types') to match the existing cli.ts convention; tsconfig does not have allowImportingTsExtensions enabled.
- /Users/lms/.ask/github/github.com/MinishLab/semble/main/src/semble/types.py.

Summary by cubic
Ports core types and helpers from semble (Python) to TypeScript to keep wire-format and telemetry parity. Adds enums, interfaces, embeddings, serialization helpers, and runtime validation for untrusted JSON, including rejecting non-finite line numbers.
- ContentType and CallType string-literal consts matching Python enums.
- Chunk (camelCase, readonly), SearchResult, IndexStats.
- EmbeddingMatrix = Float32Array and EmbeddingShape { rows, dim }.
- chunkLocation, chunkToDict (emits language: null), chunkFromDict (ignores location, accepts language: null, validates input and throws TypeError on malformed payloads; rejects NaN/±Infinity for startLine/endLine), searchResultToDict.
- chunkLocation, roundtrips, error handling for bad input (including non-finite line numbers), and result shape.

Written for commit 7dc025d. Summary will update on new commits.
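To make the serialization side concrete, here is a minimal sketch of the helpers as the description presents them. The function and type names come from the PR; the bodies, and the assumption that the dict keys are the same camelCase names as the interface fields, are guesses rather than the contents of src/types.ts.

```typescript
interface Chunk {
  readonly content: string;
  readonly filePath: string;
  readonly startLine: number;
  readonly endLine: number;
  readonly language?: string;
}

interface SearchResult {
  readonly chunk: Chunk;
  readonly score: number;
}

// Assumed wire shape: language is always present (possibly null),
// and the derived location string rides along.
interface ChunkDict {
  content: string;
  filePath: string;
  startLine: number;
  endLine: number;
  language: string | null;
  location: string;
}

interface SearchResultDict {
  chunk: ChunkDict;
  score: number;
}

function chunkLocation(chunk: Chunk): string {
  return `${chunk.filePath}:${chunk.startLine}-${chunk.endLine}`;
}

function chunkToDict(chunk: Chunk): ChunkDict {
  return {
    content: chunk.content,
    filePath: chunk.filePath,
    startLine: chunk.startLine,
    endLine: chunk.endLine,
    // JSON.stringify drops undefined properties, so emit an explicit null
    // to keep the key on the wire.
    language: chunk.language ?? null,
    location: chunkLocation(chunk),
  };
}

function searchResultToDict(result: SearchResult): SearchResultDict {
  return { chunk: chunkToDict(result.chunk), score: result.score };
}

const example: Chunk = { content: "fn()", filePath: "src/a.ts", startLine: 3, endLine: 7 };
const resultDict = searchResultToDict({ chunk: example, score: 0.5 });
const wire = JSON.stringify(resultDict);
```

In this sketch the serialized chunk carries "language":null and "location":"src/a.ts:3-7", whereas serializing the bare Chunk with JSON.stringify would omit the language key altogether.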