DeepNSM: COCA 5K vocabulary + 16Kbit fingerprint (47 tests) #150
Merged
Word frequency files from AdaWorldAPI/DeepNSM:
- `word_rank_lookup.csv` (101 KB): vocabulary.rs load source
- `lemmas_5k.csv` (670 KB): 5051 COCA lemmas with PoS + frequency
- `forms_5k.csv` (586 KB): word forms
- `lemmas_compact.csv` (156 KB): compact format
- `forms_compact.csv` (145 KB): compact forms
- `word_forms.csv` (378 KB): all forms

Binary files gitignored (`*.bin`, `*.json`, `subgenres_5k.csv`).
47 DeepNSM tests passing (including 8 fingerprint16k).
https://claude.ai/code/session_019RzHP8tpJu55ESTxhfUy1A
AdaWorldAPI pushed a commit that referenced this pull request on Apr 17, 2026
…eam #150

## Stale artifact removal (182 files, 3 MB)

`AdaWorldAPI-lance-graph-d9df43b/` was a committed snapshot of an older upstream version (48 .rs files vs our 98). Full audit confirmed:
- ZERO files exist only in the artifact (every file has a counterpart)
- Every differing file: ours >= artifact in LOC (ours is strictly ahead)
- All upstream features (#125 parameter_substitution, #140 lance_vector_search) are already in our src tree

The directory created GitHub path confusion: duplicate navigation paths for datafusion_planner, spo, blasgraph, neighborhood, arigraph. Removing it eliminates that confusion with zero content loss.

## Cherry-pick: spark_dialect.rs from upstream PR #150

The ONE file upstream has that we didn't:
- `crates/lance-graph/src/spark_dialect.rs` (107 LOC)
  Spark SQL dialect for DataFusion unparser: backtick quoting, STRING type casting, EXTRACT for dates, BIGINT/INT types, LENGTH(), derived table aliases.
- `crates/lance-graph/tests/test_to_spark_sql.rs` (293 LOC)
  Full test suite for Spark SQL output.
- `pub mod spark_dialect;` added to lib.rs

Adapted from upstream's DF 50.3 to our DF 51: same API surface, no changes needed.

## Upstream audit result (for the record)

Upstream (lance-format/lance-graph) is at v0.5.4. Our fork is at v0.5.3 with newer deps (arrow 57 vs 56.2, datafusion 51 vs 50.3).

Other than spark_dialect, every upstream feature and fix is already present in our source tree: parameter_substitution (#125), lance_vector_search (#140), complex RETURN clauses (#142), duplicate columns fix (#128) are all in `crates/lance-graph/src/`.

Their deleted `simple_executor` was a prototype cold-path executor we never had. Our `ExecutionStrategy::DataFusion` path (6K LOC planner) + `ExecutionStrategy::BlasGraph` (semiring algebra) subsume it. The user has flagged adding a deliberate `ExecutionStrategy::Simple` cold path as a 4th strategy for trivial queries; that's a separate PR per the documented matrix of execution strategies.
https://claude.ai/code/session_01NYGrxVopyszZYgLBxe4hgj
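For reviewers who want to re-run the kind of audit described in the commit message above, here is a minimal sketch of the two checks it names (no file exists only in the snapshot; no snapshot file is longer than ours). The `AuditReport` type, the `audit` function, and the path-to-LOC maps are hypothetical and not taken from the repository. LOC is only a rough proxy for "ours is ahead", so treat a clean report as a hint, not proof.

```rust
use std::collections::BTreeMap;

/// Result of comparing a stale snapshot against the live source tree.
/// (Hypothetical type, for illustration only.)
#[derive(Debug, Default, PartialEq, Eq)]
pub struct AuditReport {
    /// Paths present in the snapshot but missing from the live tree.
    pub only_in_snapshot: Vec<String>,
    /// Paths where the snapshot has more lines than the live tree.
    pub snapshot_longer: Vec<String>,
}

impl AuditReport {
    /// True when neither check found a problem.
    pub fn safe_to_delete(&self) -> bool {
        self.only_in_snapshot.is_empty() && self.snapshot_longer.is_empty()
    }
}

/// Both maps go from a relative file path to that file's line count.
pub fn audit(
    snapshot: &BTreeMap<String, usize>,
    live: &BTreeMap<String, usize>,
) -> AuditReport {
    let mut report = AuditReport::default();
    for (path, &snap_loc) in snapshot {
        match live.get(path) {
            // No counterpart in the live tree: deleting would lose content.
            None => report.only_in_snapshot.push(path.clone()),
            // Snapshot is longer: it may hold code the live tree lacks.
            Some(&live_loc) if snap_loc > live_loc => {
                report.snapshot_longer.push(path.clone())
            }
            Some(_) => {}
        }
    }
    report
}
```

Files that exist only in the live tree are ignored on purpose, since they cannot be lost by deleting the snapshot.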
AdaWorldAPI added a commit that referenced this pull request on Apr 17, 2026
chore: remove stale upstream snapshot + port spark_dialect from upstream #150
AdaWorldAPI pushed a commit that referenced this pull request on Apr 23, 2026
## Summary
- Add `SqlDialect` enum (`Default`, `Spark`, `PostgreSql`, `MySql`,
`Sqlite`) and `SparkDialect` implementation using DataFusion's unparser
`Dialect` trait
- Refactor `to_sql()` to accept an optional `dialect` parameter instead
of a separate method per dialect
- Add Python API support: `query.to_sql(datasets, dialect="spark")`
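As a rough illustration of the "one entry point, optional dialect" shape summarized above, the sketch below models the enum and a string-to-dialect mapping such as a Python binding might need. Only the variant names and the `"spark"` spelling come from this summary; the other accepted strings, the `FromStr` impl, and `resolve_dialect` are assumptions, not the crate's actual code.

```rust
use std::str::FromStr;

/// Variant names follow the summary; everything else here is illustrative.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum SqlDialect {
    #[default]
    Default,
    Spark,
    PostgreSql,
    MySql,
    Sqlite,
}

impl FromStr for SqlDialect {
    type Err = String;

    fn from_str(name: &str) -> Result<Self, Self::Err> {
        // Only "spark" appears in the PR text; the other spellings are guesses.
        match name.to_ascii_lowercase().as_str() {
            "default" => Ok(Self::Default),
            "spark" => Ok(Self::Spark),
            "postgresql" | "postgres" => Ok(Self::PostgreSql),
            "mysql" => Ok(Self::MySql),
            "sqlite" => Ok(Self::Sqlite),
            other => Err(format!("unknown SQL dialect: {other}")),
        }
    }
}

/// Hypothetical helper: resolve an optional dialect the way a single
/// `to_sql(.., dialect)` entry point might, falling back to the default.
pub fn resolve_dialect(dialect: Option<SqlDialect>) -> SqlDialect {
    dialect.unwrap_or_default()
}
```

Taking `Option<SqlDialect>` keeps existing callers working with `None` while avoiding one method per dialect.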
### Spark SQL dialect differences
- Backtick identifier quoting
- `STRING` type instead of `VARCHAR`
- `EXTRACT(field FROM expr)` for date parts
- `LENGTH()` instead of `CHARACTER_LENGTH()`
- `TIMESTAMP` without timezone info
- Subqueries in FROM require aliases
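To make a few of the differences listed above concrete, here is a small std-only sketch contrasting Spark with a generic dialect for identifier quoting, the string type name, and the length function. It does not use DataFusion's `Dialect` trait; `Flavor` and the helper functions are hypothetical names invented for this example.

```rust
/// Hypothetical two-dialect model, just enough to contrast the listed rules.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Flavor {
    Generic,
    Spark,
}

/// Quote an identifier: backticks for Spark, double quotes otherwise.
/// An embedded quote character is escaped by doubling it.
pub fn quote_identifier(flavor: Flavor, ident: &str) -> String {
    let q = match flavor {
        Flavor::Spark => '`',
        Flavor::Generic => '"',
    };
    let mut out = String::with_capacity(ident.len() + 2);
    out.push(q);
    for ch in ident.chars() {
        if ch == q {
            out.push(q);
        }
        out.push(ch);
    }
    out.push(q);
    out
}

/// Name of the string type used in casts.
pub fn string_type(flavor: Flavor) -> &'static str {
    match flavor {
        Flavor::Spark => "STRING",
        Flavor::Generic => "VARCHAR",
    }
}

/// Name of the character-length function.
pub fn length_fn(flavor: Flavor) -> &'static str {
    match flavor {
        Flavor::Spark => "LENGTH",
        Flavor::Generic => "CHARACTER_LENGTH",
    }
}

/// Render `length(cast(column as string))` in the given flavor.
pub fn cast_length_expr(flavor: Flavor, column: &str) -> String {
    format!(
        "{}(CAST({} AS {}))",
        length_fn(flavor),
        quote_identifier(flavor, column),
        string_type(flavor)
    )
}
```

For a column called `name`, the Spark flavor renders ``LENGTH(CAST(`name` AS STRING))`` and the generic flavor renders `CHARACTER_LENGTH(CAST("name" AS VARCHAR))`.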
### Usage
**Rust:**
```rust
use lance_graph::{CypherQuery, SqlDialect};
let sql = query.to_sql(datasets, Some(SqlDialect::Spark)).await?;
```
**Python:**
```python
sql = query.to_sql(datasets, dialect="spark")
```
## Test plan
- [x] 7 new Spark SQL integration tests (backtick quoting, filters,
relationships, complex queries, dialect comparison, PostgreSQL dialect)
- [x] 5 unit tests for SparkDialect trait implementation
- [x] 12 existing `to_sql` tests updated and passing
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Yu Chen <yu.chen@databricks.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
COCA Vocabulary Migrated to DeepNSM
Word frequency data from AdaWorldAPI/DeepNSM → crates/deepnsm/word_frequency/:
- word_rank_lookup.csv: vocabulary.rs load source (4096 words)
- lemmas_5k.csv: 5051 COCA lemmas with PoS + frequency
- forms_5k.csv, word_forms.csv, compact variants

Binary files gitignored. ~2 MB CSVs in git.
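For context on what "load source" could look like, below is a minimal sketch of parsing a word-to-rank CSV into a lookup table. The two-column `word,rank` layout, the optional header row, and the `parse_word_ranks` name are assumptions; the real column layout of word_rank_lookup.csv and the loader in vocabulary.rs are not shown in this PR. A vocabulary of 4096 words needs 12 bits per rank (2^12 = 4096), so `u16` is more than wide enough.

```rust
use std::collections::HashMap;

/// Parse `word,rank` rows into a lookup table.
/// The column layout and header handling are assumptions for this sketch.
pub fn parse_word_ranks(csv: &str) -> Result<HashMap<String, u16>, String> {
    let mut ranks = HashMap::new();
    for (idx, line) in csv.lines().enumerate() {
        let line = line.trim();
        if line.is_empty() {
            continue;
        }
        let (word, rank) = line
            .split_once(',')
            .ok_or_else(|| format!("line {}: expected `word,rank`", idx + 1))?;
        let rank: u16 = match rank.trim().parse() {
            Ok(r) => r,
            // A non-numeric rank on the first line is treated as a header row.
            Err(_) if idx == 0 => continue,
            Err(e) => return Err(format!("line {}: bad rank: {e}", idx + 1)),
        };
        ranks.insert(word.trim().to_lowercase(), rank);
    }
    Ok(ranks)
}
```

Reading the file itself would be a `std::fs::read_to_string` call on the CSV path, with the result passed to this function.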
16Kbit VSA Fingerprint (from previous merge)
Already in deepnsm/src/fingerprint16k.rs.
DeepNSM = Complete Grammar+Vocabulary+Fingerprint+SPO Crate
47 tests passing. No separate grammar crate needed.
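As background on the 16Kbit fingerprint mentioned above, here is a generic sketch of a binary VSA-style fingerprint. It assumes "16Kbit" means 16,384 bits (256 words of 64 bits) and uses XOR binding with Hamming distance, which is a common choice for binary hypervectors. None of this is taken from fingerprint16k.rs; the `Fingerprint` type, its methods, and the SplitMix64 seeding are illustrative only.

```rust
/// 16 Kbit taken as 16_384 bits, i.e. 256 words of 64 bits (an assumption).
pub const WORDS: usize = 16_384 / 64;

/// Hypothetical fixed-width binary fingerprint.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct Fingerprint(pub [u64; WORDS]);

impl Fingerprint {
    /// All-zero fingerprint, the identity element for XOR binding.
    pub fn zero() -> Self {
        Self([0; WORDS])
    }

    /// Deterministic pseudo-random fingerprint from a seed (SplitMix64).
    pub fn from_seed(seed: u64) -> Self {
        let mut state = seed;
        let mut words = [0u64; WORDS];
        for w in words.iter_mut() {
            state = state.wrapping_add(0x9E37_79B9_7F4A_7C15);
            let mut z = state;
            z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
            z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
            *w = z ^ (z >> 31);
        }
        Self(words)
    }

    /// XOR binding: self-inverse, so binding twice with the same key undoes it.
    pub fn bind(&self, other: &Self) -> Self {
        let mut out = [0u64; WORDS];
        for (o, (a, b)) in out.iter_mut().zip(self.0.iter().zip(other.0.iter())) {
            *o = a ^ b;
        }
        Self(out)
    }

    /// Number of differing bits between two fingerprints.
    pub fn hamming(&self, other: &Self) -> u32 {
        self.0
            .iter()
            .zip(other.0.iter())
            .map(|(a, b)| (a ^ b).count_ones())
            .sum()
    }
}
```

With XOR binding, `a.bind(&b).bind(&b)` recovers `a`, which is what makes role-filler style encodings (such as an SPO triple) reversible in schemes of this kind.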
https://claude.ai/code/session_019RzHP8tpJu55ESTxhfUy1A