feat(rel): IR expression lowering to DataFusion Expr#691
Conversation
- Add ExprLowerer<'a> with lower(ExprId) -> Result<DfExpr, LoweringError> - Add VarMap: VarId → DataFusion column name string - Add LoweringError: UnknownFunction, UnsupportedExpr, UnboundVar - Mappings: all IrLiteral variants, VarRef, PropertyAccess, all BinaryOpKind variants (including Mod/Pow/In/StringOps/RegexMatch), all UnaryOpKind variants, FunctionCall (built-in table), Parameter (Expr::Placeholder), Case, ListLiteral stub, MapLiteral stub - Built-in function table: toUpper/toLower/trim/concat, toString/ toInteger/toFloat, abs/ceil/floor/round/sqrt/power, char_length - Add datafusion + gf-ontology deps to gf-rel - 18 unit tests covering all variants, compound predicates, unbound variable error, unknown function error Closes #574 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
WalkthroughImplements ExprLowerer in gf-rel to lower GF IR ExprArena expressions to DataFusion Exprs, adds VarMap and LoweringError, wires the module and dependencies, and includes comprehensive unit tests for literals, variables, operators, functions, parameters, and case expressions. ChangesExpression Lowering Implementation
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 inconclusive)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #691 +/- ##
=======================================
Coverage 97.08% 97.08%
=======================================
Files 2 2
Lines 274 274
Branches 41 41
=======================================
Hits 266 266
Misses 5 5
Partials 3 3
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report in Codecov by Sentry.
🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
crates/gf-rel/src/expr.rs (1)
152-160: 💤 Low valueComment doesn't match behavior.
The comment says elements are encoded "as a JSON-string literal for now," but the arm actually lowers the elements only to discard them and return
UnsupportedExpr. Consider correcting the comment to reflect that this is a validation-then-reject path.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/gf-rel/src/expr.rs` around lines 152 - 160, The comment for the IrExpr::ListLiteral arm is inaccurate: it claims elements are encoded "as a JSON-string literal for now" while the code actually lowers each element (via self.lower on ids), discards the result (elems?), then returns LoweringError::UnsupportedExpr; update the comment to state this is performing validation (lowering each id to ensure subexpressions are valid) and then intentionally rejecting ListLiteral with an UnsupportedExpr error (reference IrExpr::ListLiteral, ids, self.lower, elems, and LoweringError::UnsupportedExpr) so the comment matches the implemented behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@crates/gf-rel/src/expr.rs`:
- Around line 152-160: The comment for the IrExpr::ListLiteral arm is
inaccurate: it claims elements are encoded "as a JSON-string literal for now"
while the code actually lowers each element (via self.lower on ids), discards
the result (elems?), then returns LoweringError::UnsupportedExpr; update the
comment to state this is performing validation (lowering each id to ensure
subexpressions are valid) and then intentionally rejecting ListLiteral with an
UnsupportedExpr error (reference IrExpr::ListLiteral, ids, self.lower, elems,
and LoweringError::UnsupportedExpr) so the comment matches the implemented
behavior.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: cbc22d80-62d3-4df9-82b5-37ff11b13e87
⛔ Files ignored due to path filters (1)
Cargo.lockis excluded by!**/*.lock,!**/*.lock
📒 Files selected for processing (3)
crates/gf-rel/Cargo.tomlcrates/gf-rel/src/expr.rscrates/gf-rel/src/lib.rs
| "size" | "length" | "char_length" => Some( | ||
| datafusion::functions::unicode::expr_fn::char_length(a.remove(0)), | ||
| ), |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Inspect how size/length are produced in the IR to confirm argument typing.
rg -nP -C3 '\b(size|length)\b' --type=rust -g '*ir*' -g '!**/gf-rel/**'Repository: DecisionNerd/graphforge
Length of output: 696
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# 1) Find other occurrences of the "size"/"length" function mapping
rg -nP '\"(size|length)\"' crates/gf-rel -S
# 2) Inspect the implementation around gf-rel/src/expr.rs where the mapping exists
sed -n '300,380p' crates/gf-rel/src/expr.rs
# 3) See how `char_length` is used elsewhere (possible evidence of string-only routing)
rg -nP 'char_length\(' -S crates/gf-rel
# 4) Locate IR/function construction sites for openCypher functions `size`/`length`
rg -nP '\b(size|length)\b' --type=rust -S crates | head -n 200
# 5) If there are type-checking passes/IR nodes, search for "String"/"List"/"Path" in the same area
rg -nP 'String|list|path' crates/gf-rel -S | head -n 200Repository: DecisionNerd/graphforge
Length of output: 5513
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Find where resolve_builtin is called (to see what IR/type constraints exist)
rg -n "resolve_builtin\(" crates/gf-rel/src/expr.rs
# Inspect the surrounding code paths where built-ins are lowered
sed -n '240,420p' crates/gf-rel/src/expr.rs
# Inspect the expression lowerer for any type checks/casts around function calls
sed -n '420,650p' crates/gf-rel/src/expr.rs
# Search in gf-ir (or plan/lowering) for the built-in function typing rules
rg -n "size\"|\"length\"|\"char_length" crates/gf-ir crates/gf-plan crates/gf-core crates/gf-rel -SRepository: DecisionNerd/graphforge
Length of output: 15112
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Inspect how FunctionCall lowering happens around the earlier resolve_builtin call.
sed -n '1,220p' crates/gf-rel/src/expr.rs
# Locate the IrExpr::FunctionCall definition (and any type info attached).
rg -n "enum IrExpr|IrExpr::FunctionCall|FunctionCall" crates/gf-ir -S
# Inspect gf-ir expr module around FunctionCall.
rg -n "FunctionCall" crates/gf-ir/src -SRepository: DecisionNerd/graphforge
Length of output: 9898
🏁 Script executed:
#!/bin/bash
set -euo pipefail
rg -n --hidden --no-ignore -S 'char_length|\"size\"|\"length\"' crates/gf-ir crates/gf-plan crates/gf-ast crates/gf-cypherRepository: DecisionNerd/graphforge
Length of output: 177
🏁 Script executed:
#!/bin/bash
set -euo pipefail
rg -n --hidden --no-ignore -S '"size"' crates/gf-ir crates/gf-plan crates/gf-cypher
rg -n --hidden --no-ignore -S '"length"' crates/gf-ir crates/gf-plan crates/gf-cypher
rg -n --hidden --no-ignore -S 'char_length' crates/gf-ir crates/gf-plan crates/gf-cypher
rg -n --hidden --no-ignore -S 'FunctionCall \{[^}]*name' crates/gf-ir/src/binder.rs
sed -n '520,620p' crates/gf-ir/src/binder.rsRepository: DecisionNerd/graphforge
Length of output: 49
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# 1) Inspect the binder section that handles FunctionCall lowering (for any built-in remapping)
ls -la crates/gf-ir/src/binder.rs
sed -n '480,620p' crates/gf-ir/src/binder.rs
# 2) Search for openCypher built-in names in gf-ir tests/golden IR snapshots
rg -n -S -F '"size"' crates/gf-ir/tests crates/gf-*/tests || true
rg -n -S -F '"length"' crates/gf-ir/tests crates/gf-*/tests || true
rg -n -S -F 'char_length' crates/gf-ir/tests crates/gf-*/tests || true
# 3) Search in generated/golden snapshots for FunctionCall names
rg -n -S 'FunctionCall' crates/gf-ir/tests/ir_goldens || true
rg -n -S -F 'name: "size"' crates/gf-ir/tests crates/gf-*/tests || true
rg -n -S -F 'name: "length"' crates/gf-ir/tests crates/gf-*/tests || true
# 4) Search parser/AST handling for these function names
rg -n -S -F 'size' crates/gf-cypher/src crates/gf-ast/src || true
rg -n -S -F 'length' crates/gf-cypher/src crates/gf-ast/src || true
rg -n -S 'char_length' crates/gf-cypher crates/gf-ast || trueRepository: DecisionNerd/graphforge
Length of output: 8863
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Inspect binder around FunctionCall handling (likely earlier in the file than the snippet we saw)
rg -n "Expr::FunctionCall\\(FunctionCall" crates/gf-ir/src/binder.rs
sed -n '400,620p' crates/gf-ir/src/binder.rs
# Direct string searches for built-in names across gf-ir / gf-cypher / gf-plan
rg -n -S -F 'char_length' crates/gf-ir crates/gf-plan crates/gf-cypher || true
rg -n -S -F '"size"' crates/gf-ir crates/gf-plan crates/gf-cypher || true
rg -n -S -F '"length"' crates/gf-ir crates/gf-plan crates/gf-cypher || true
rg -n -S -F 'size(' crates/gf-ir crates/gf-plan crates/gf-cypher || true
rg -n -S -F 'length(' crates/gf-ir crates/gf-plan crates/gf-cypher || true
# Look for any central built-in function / signature dispatch tables
rg -n -S -i 'builtin|built-in|function signature|signature|resolve_function|FunctionCall' crates/gf-ir/src crates/gf-plan/src crates/gf-rel/src | head -n 200Repository: DecisionNerd/graphforge
Length of output: 9695
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Find where FunctionCall is handled outside gf-rel/expr.rs
rg -n "FunctionCall" crates/gf-plan crates/gf-*/src -S
# Look for built-in function name literals in the repo
rg -n -S -F '"size"' crates || true
rg -n -S -F '"length"' crates || true
rg -n -S -F 'char_length' crates || true
# Look for any function signature/type checking tables that might mention "size"/"length"/"char_length"
rg -n -S -i "builtin|built-in|function signature|signature|typecheck|type check|validate|semantics|cypher.*function" crates/gf-* || trueRepository: DecisionNerd/graphforge
Length of output: 7848
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Look for special-casing of built-in functions in gf-cypher parser/binder
rg -n -S -i '\bsize\b' crates/gf-cypher/src crates/gf-ir/src || true
rg -n -S -i '\blength\b' crates/gf-cypher/src crates/gf-ir/src || true
rg -n -S -i 'char_length' crates/gf-cypher/src crates/gf-ir/src || true
# Also inspect gf-cypher parser expr around any FunctionCall handling that could rewrite size/length
rg -n -S 'FunctionCall' crates/gf-cypher/src/parser/expr.rsRepository: DecisionNerd/graphforge
Length of output: 1339
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Inspect gf-ir FunctionCall definition for any type/signature info
sed -n '280,380p' crates/gf-ir/src/expr.rs
# Inspect gf-ast FunctionCall struct for any typing hints
sed -n '540,640p' crates/gf-ast/src/ast.rs
# Inspect binder around the FunctionCall lowering site (approx where fn_name is joined)
sed -n '420,590p' crates/gf-ir/src/binder.rs
# Search for any typechecking/validation pass mentioning FunctionCall or built-in functions
rg -n -S -i 'typecheck|type check|signature|built-in|builtin|function call|FunctionCall' crates/gf-ir/src crates/gf-plan/src crates/gf-core/src || true
# Also search in gf-plan for any DataType/Uft8 usage
rg -n -S 'DataType::|Utf8' crates/gf-plan/src crates/gf-ir/src crates/gf-core/src crates/gf-rel/src || trueRepository: DecisionNerd/graphforge
Length of output: 14797
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# 1) Any tests or special-cases in gf-rel for size/length/char_length
rg -n -S -F '"size"' crates/gf-rel/src/expr.rs crates/gf-rel/src || true
rg -n -S -F '"length"' crates/gf-rel/src/expr.rs crates/gf-rel/src || true
rg -n -S -F 'char_length' crates/gf-rel/src || true
# 2) Search for any semantic/typechecking around FunctionCall in gf-ir/gf-plan
rg -n -S -i 'functioncall.*type|type.*functioncall|typecheck.*function|built-in.*function|signature.*function' crates/gf-ir/src crates/gf-plan/src crates/gf-core/src || true
# 3) Find any occurrence of size/length as raw function names in code (without requiring quotes)
rg -n -S -i '\bsize\s*\(' crates/gf-* || true
rg -n -S -i '\blength\s*\(' crates/gf-* || true
rg -n -S -i '\bchar_length\s*\(' crates/gf-* || trueRepository: DecisionNerd/graphforge
Length of output: 769
🌐 Web query:
DataFusion unicode expr_fn char_length expected input type
💡 Result:
In Apache DataFusion, the char_length function (also available as character_length and length) is designed to operate on string expressions [1][2]. The expected input type for these functions is a string-compatible type, primarily Utf8 or LargeUtf8 [1][3][2]. Key points regarding the input type and usage include: 1. Input Requirement: The function expects a text/string expression [1][2]. Attempting to use it on non-string types (such as binary types) without explicit casting may result in errors or unexpected behavior, as the function is specifically intended to count the number of characters in a string, not the number of bytes in a binary buffer [1][4][2]. 2. Function Aliases: char_length, character_length, and length are functional equivalents in DataFusion and perform the same operation [2]. 3. Implementation: As of recent versions, these functions are maintained within the unicode module of the datafusion-functions crate [5][6][7]. 4. Best Practice for Non-String Data: If you need to calculate the length of a binary (bytea) field, it is recommended to explicitly cast the input to text (e.g., URL::TEXT) if you require character length, or use octet_length if you require the byte count [4].
Citations:
- 1: https://datafusion.apache.org/user-guide/sql/scalar_functions.html
- 2: https://github.com/apache/datafusion/blob/main/docs/source/user-guide/expressions.md
- 3: https://dev.to/alexmercedcoder/single-node-data-engineering-duckdb-datafusion-polars-and-lakesail-mai
- 4: Internal error: The "character_length" function can only accept strings apache/datafusion#7344
- 5: Create unicode module in datafusion/functions/src/unicode and unicode_expressions feature flag, move char_length function apache/datafusion#9816
- 6: digadeesh/datafusion@2cca371
- 7: move the Translate, SubstrIndex, FindInSet functions to datafusion-functions apache/datafusion#9864
Fix Cypher size/length lowering to be type-aware
crates/gf-rel/src/expr.rs currently maps "size" | "length" | "char_length" to DataFusion unicode::expr_fn::char_length(...) in resolve_builtin, and ExprLowerer forwards IrExpr::FunctionCall arguments without any typing/casts. This makes size()/length() incorrect for non-string inputs (e.g. Cypher collections/paths where size/length are element-count semantics) and can produce DataFusion type errors because char_length is defined for string (Utf8/LargeUtf8) expressions. Implement a type-aware dispatch: keep char_length for strings, and lower size/length to the appropriate collection/path-length semantics.
- Split char_length (string-only, keep mapped) from size/length (type-polymorphic: strings OR lists — defer to M13 type inference rather than silently mapping to char_length and producing DataFusion type errors on collection inputs) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Fixes Applied SuccessfullyFixed 1 file based on 1 CodeRabbit feedback item. Files modified:
Change: Split the Commit: |
Summary
ExprLowereringf-rel/src/expr.rs— pure, I/O-free transformation fromExprArena→ DataFusionExprVarMap(VarId → column name) andLoweringError(UnknownFunction, UnsupportedExpr, UnboundVar)IrLiteralvariants (Null, Bool, Int, Float, Str, Duration, DateTime)BinaryOpKindvariants: comparison, logical, arithmetic (Add/Sub/Mul/Div/Mod/Pow), collection (In), string predicates (StartsWith/EndsWith/Contains/RegexMatch)UnaryOpKindvariants: Not, Neg, IsNull, IsNotNullFunctionCallvia a built-in lookup table: toUpper/toLower/trim/ltrim/rtrim/concat, toString/toInteger/toFloat/toBoolean, abs/ceil/floor/round/sqrt/power, char_lengthParameter→Expr::Placeholder;Case→ DataFusion Case expr;ListLiteral/MapLiteral→UnsupportedExpr(deferred)ListLiteralis stubbed asUnsupportedExpr— full DataFusion array support deferred to rust: lower NodeScan and fixed-hop Expand to DataFusion LogicalPlan #576Closes #574
🤖 Generated with Claude Code
Note
Add IR expression lowering to DataFusion
Expringf-relExprLowererincrates/gf-rel/src/expr.rsto convert IR expressions from anExprArenainto executable DataFusionExprvalues.VarMapto map IRVarIds to DataFusion column name strings, andLoweringErrorfor unbound variables, unknown functions, and unsupported IR variants.ExprLowerer,VarMap, andLoweringErrorfrom thegf-relcrate root for downstream consumers.ListLiteralandMapLiteralare not supported and returnUnsupportedExprerrors.Macroscope summarized 110ea62.
Summary by CodeRabbit
New Features
Documentation