So with literal constants, ART is clearly being used and is much faster. The prepared benchmark is still mostly tied, which suggests the prepared/parameter path or Python execution overhead is masking the index difference. Hash being slower than base in the literal run is also notable. https://gist.github.com/adsharma/83a1b7c9c320d829e349135d396ab4d3

ART index also enables range scans. Previously ladybug/kuzu didn't support range scans via indexes.
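As a rough illustration of why an ordered index makes range scans possible where a hash index does not, here is a minimal sketch. It is not ladybug/kuzu code: `std::map` only stands in for the ART index, and `rangeScan` and `Row` are hypothetical names made up for this example. The idea is simply to seek to the lower bound once and then walk keys in order until the upper bound is passed.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

using Row = std::pair<std::int64_t, std::string>;

// Inclusive range scan [lo, hi] over an ordered index.
// Assumption: std::map is a stand-in for ART; both keep keys in sorted order,
// which is what lets us seek once and then iterate forward.
std::vector<Row> rangeScan(const std::map<std::int64_t, std::string>& index,
                           std::int64_t lo, std::int64_t hi) {
    std::vector<Row> out;
    // lower_bound finds the first key >= lo without touching smaller keys.
    for (auto it = index.lower_bound(lo); it != index.end() && it->first <= hi; ++it) {
        out.emplace_back(it->first, it->second);
    }
    return out;
}
```

A hash index has no key order to walk, so the same predicate would need a scan over every entry.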
aheev left a comment
Queries after COPY FROM are failing:
-CASE ArtIndexCopyFrom
-STATEMENT CALL enable_default_hash_index=false;
---- ok
-STATEMENT CREATE NODE TABLE art_copy_person (ID INT64, name STRING, PRIMARY KEY(ID));
---- 1
Table art_copy_person has been created.
-STATEMENT CREATE ART INDEX art_copy_person_pk FOR (p:art_copy_person) ON (p.ID);
---- 1
Index art_copy_person_pk has been created.
-STATEMENT COPY art_copy_person FROM "${LBUG_ROOT_DIRECTORY}/dataset/art-index-test/person.csv";
---- 1
4 tuples have been copied to the art_copy_person table.
-STATEMENT MATCH (p:art_copy_person) WHERE p.ID = 2 RETURN p.name;
---- 1
Grace
-STATEMENT MATCH (p:art_copy_person) WHERE p.ID >= 1 AND p.ID <= 10 RETURN p.ID, p.name ORDER BY p.ID;
---- 3
1|Ada
2|Grace
10|Barbara
-STATEMENT MATCH (p:art_copy_person) WHERE p.ID < 2 RETURN p.ID, p.name ORDER BY p.ID;
---- 2
-1|Edsger
1|Ada
-STATEMENT CALL enable_default_hash_index=true;
---- ok
We should also look at thread safety
Key changes:
Need to refactor the pathological unique_ptr<> deallocation pattern in this PR. Eventually we should use a more efficient, purpose-designed memory allocator for the nodes, but for now I'll try to avoid the long pause on shell exit.
- Allocation: one new Node() per inserted trie child.
- Storage: each Node has a tagged active child layout: small, node48, or node256. It only keeps the child array for its current kind.
- Ownership: parent owns its child pointers. There is no shared ownership.
- Destruction: ArtPrimaryKeyIndex::clear() walks the tree iteratively with an explicit stack and deletes every reachable node. This avoids recursive destructor chains.
- Child deletion: removeChild() calls deleteTree(child) before removing the pointer from the parent, so removed subtrees are freed immediately.
- Growth/rebalancing: when NODE16 -> NODE48 or NODE48 -> NODE256, the code moves raw child pointers into the new active layout and then switches kind. It does not delete child nodes during growth, because ownership is transferred in-place to the new layout.
What prevents leaks is the invariant that every allocated child pointer is either reachable from exactly one parent layout or is immediately passed to deleteTree() during deletion. Growth only relocates pointers; deletion nulls/compacts after freeing.
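For reviewers, the ownership rules above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the code in this PR: it models only two layouts (a compact one and a 256-slot one) instead of small/node48/node256, and `Node`, `Kind`, `addChild`, `grow`, and the `g_liveNodes` counter are hypothetical names. `deleteTree` and `removeChild` borrow their names from the description, but their bodies here are guesses at the shape of the logic.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Illustration only: counts live nodes so a leak would be observable.
std::size_t g_liveNodes = 0;

enum class Kind { Small, Node256 };

struct Node {
    Kind kind = Kind::Small;
    std::vector<std::uint8_t> keys;  // Small only: key byte for children[i]
    std::vector<Node*> children;     // Small: parallel to keys; Node256: 256 slots
    Node() { ++g_liveNodes; }
    ~Node() { --g_liveNodes; }       // deliberately does not touch children
    Node(const Node&) = delete;
    Node& operator=(const Node&) = delete;
};

// Allocation: one new Node per inserted child; the parent is its sole owner.
// Assumes the key is not already present in the parent.
Node* addChild(Node& parent, std::uint8_t key) {
    Node* child = new Node();
    if (parent.kind == Kind::Small) {
        parent.keys.push_back(key);
        parent.children.push_back(child);
    } else {
        assert(parent.children[key] == nullptr);
        parent.children[key] = child;
    }
    return child;
}

// Growth: relocate raw pointers into the new layout, then switch kind.
// No child is deleted; ownership simply moves with the pointer.
void grow(Node& n) {
    assert(n.kind == Kind::Small);
    std::vector<Node*> slots(256, nullptr);
    for (std::size_t i = 0; i < n.keys.size(); ++i) {
        slots[n.keys[i]] = n.children[i];
    }
    n.children = std::move(slots);
    n.keys.clear();
    n.kind = Kind::Node256;
}

// Destruction: explicit stack instead of recursive destructors, so tree depth
// does not translate into call-stack depth.
void deleteTree(Node* root) {
    if (root == nullptr) return;
    std::vector<Node*> stack{root};
    while (!stack.empty()) {
        Node* n = stack.back();
        stack.pop_back();
        for (Node* c : n->children) {
            if (c != nullptr) stack.push_back(c);
        }
        delete n;
    }
}

// Child deletion: free the subtree first, then null or compact the parent slot.
bool removeChild(Node& parent, std::uint8_t key) {
    if (parent.kind == Kind::Node256) {
        if (parent.children[key] == nullptr) return false;
        deleteTree(parent.children[key]);
        parent.children[key] = nullptr;
        return true;
    }
    for (std::size_t i = 0; i < parent.keys.size(); ++i) {
        if (parent.keys[i] != key) continue;
        deleteTree(parent.children[i]);
        const auto off = static_cast<std::ptrdiff_t>(i);
        parent.keys.erase(parent.keys.begin() + off);
        parent.children.erase(parent.children.begin() + off);
        return true;
    }
    return false;
}
```

In this sketch the live-node counter stays unchanged across `grow()` and returns to zero after `deleteTree(root)`, which is the invariant described above: each pointer has exactly one owner until it is handed to `deleteTree()`.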
Details in the included docs.