Implement Adaptive Radix Tree (ART) Indexes #492

Merged
adsharma merged 12 commits into main from art_index
May 17, 2026

@adsharma
Contributor

@adsharma adsharma commented May 16, 2026

Details in the included docs.

du -sh /tmp/test1*.db
634M    /tmp/test1-hash.db
512M    /tmp/test1-noindex.db
604M    /tmp/test1-art.db
➜  ladybug git:(art_index) ✗ ./build/release/tools/shell/lbug -r /tmp/test1-hash.db
Opening the database at path: /tmp/test1-hash.db in read-only mode.
Enter ":help" for usage hints.
lbug> CALL show_indexes() return *;
┌────────────┬──────────────┬────────────┬────────────────┬──────────────────┬──────────────────────────────────────────────────────────────┐
│ table_name │ index_name   │ index_type │ property_names │ extension_loaded │ index_definition                                             │
│ STRING     │ STRING       │ STRING     │ STRING[]       │ BOOL             │ STRING                                                       │
├────────────┼──────────────┼────────────┼────────────────┼──────────────────┼──────────────────────────────────────────────────────────────┤
│ User       │ user_hash_pk │ HASH       │ [id]           │ True             │ CREATE HASH INDEX `user_hash_pk` FOR (n:`User`) ON (n.`id`); │
└────────────┴──────────────┴────────────┴────────────────┴──────────────────┴──────────────────────────────────────────────────────────────┘
(1 tuple)
(6 columns)
Time: 5.56ms (compiling), 1.69ms (executing)
lbug>
➜  ladybug git:(art_index) ✗ ./build/release/tools/shell/lbug -r /tmp/test1-art.db
Opening the database at path: /tmp/test1.db in read-only mode.
Enter ":help" for usage hints.
lbug> CALL show_indexes() return *;
┌────────────┬───────────────┬────────────┬────────────────┬──────────────────┬──────────────────────────────────────────────────────────────┐
│ table_name │ index_name    │ index_type │ property_names │ extension_loaded │ index_definition                                             │
│ STRING     │ STRING        │ STRING     │ STRING[]       │ BOOL             │ STRING                                                       │
├────────────┼───────────────┼────────────┼────────────────┼──────────────────┼──────────────────────────────────────────────────────────────┤
│ User       │ idx_person_pk │ ART        │ [id]           │ True             │ CREATE ART INDEX `idx_person_pk` FOR (n:`User`) ON (n.`id`); │
└────────────┴───────────────┴────────────┴────────────────┴──────────────────┴──────────────────────────────────────────────────────────────┘
(1 tuple)
(6 columns)
Time: 2.08ms (compiling), 0.21ms (executing)
lbug>
➜  ladybug git:(art_index) ✗ ./build/release/tools/shell/lbug -r /tmp/test1-noindex.db
Opening the database at path: /tmp/test1-save.db in read-only mode.
Enter ":help" for usage hints.
lbug> CALL show_indexes() return *;
┌────────────┬────────────┬────────────┬────────────────┬──────────────────┬──────────────────┐
│ table_name │ index_name │ index_type │ property_names │ extension_loaded │ index_definition │
│ STRING     │ STRING     │ STRING     │ STRING[]       │ BOOL             │ STRING           │
├────────────┼────────────┼────────────┼────────────────┼──────────────────┼──────────────────┤
└────────────┴────────────┴────────────┴────────────────┴──────────────────┴──────────────────┘
(0 tuples)
(6 columns)
Time: 1.50ms (compiling), 0.16ms (executing)

@adsharma
Contributor Author

adsharma commented May 16, 2026

random_lookup_bench.py --backend pybind --literal --lookups 500 --warmup 50


  Literal benchmark cross-check, 500 lookups:

  base  1996.8/s  avg=0.501ms p95=0.615ms
  hash  1227.3/s  avg=0.815ms p95=1.802ms
  art  12029.0/s  avg=0.083ms p95=0.095ms

Without --literal  
  
  base  1405.7/s avg=0.711ms p95=0.842ms
  hash  1350.4/s avg=0.740ms p95=0.959ms
  art   1424.0/s avg=0.702ms p95=0.817ms

So with literal constants, ART is clearly being used and is much faster: roughly 6x the throughput of base. The prepared-statement run (without --literal) is still mostly tied, which suggests the prepared/parameter path or Python execution overhead is masking the index difference. Hash being slower than base in the literal run is also notable.
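
As a quick arithmetic check on the figures above, the sketch below computes the throughput implied by an average latency and the ratio between two throughputs. The helper names are hypothetical and are not part of `random_lookup_bench.py`; the assumption that lookups run back to back on a single thread is mine.

```cpp
#include <cassert>

// Lookups per second implied by an average latency in milliseconds,
// assuming lookups are issued one after another on a single thread.
inline double impliedThroughput(double avgMs) {
    assert(avgMs > 0.0);
    return 1000.0 / avgMs;
}

// How many times higher throughput `a` is than throughput `b`.
inline double speedup(double a, double b) {
    assert(b > 0.0);
    return a / b;
}
```

With the reported numbers, `speedup(12029.0, 1996.8)` is about 6.0 for the literal run and `speedup(1424.0, 1405.7)` is about 1.01 without `--literal`. `impliedThroughput(0.083)` is about 12048/s, close to the reported 12029.0/s, which is consistent with the sequential-lookup assumption.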

https://gist.github.com/adsharma/83a1b7c9c320d829e349135d396ab4d3

@adsharma
Contributor Author

ART index also enables range scans. Previously ladybug/kuzu didn't support range scans via indexes.
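
To illustrate why an ordered index makes range scans possible where a hash index does not, here is a minimal sketch that uses `std::map` as a stand-in for an ordered structure such as an ART. It is not the ladybug implementation; `OrderedIndex` and `rangeScan` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Stand-in for an ordered index: std::map keeps its keys sorted.
using OrderedIndex = std::map<int64_t, std::string>;

// Return the values whose keys fall in [lo, hi], in key order.
// Only entries inside the range are visited; a hash index has no key
// order to exploit and would need to examine every entry instead.
inline std::vector<std::string> rangeScan(const OrderedIndex& index,
                                          int64_t lo, int64_t hi) {
    std::vector<std::string> out;
    if (lo > hi) {
        return out;  // empty range
    }
    const auto end = index.upper_bound(hi);  // first key > hi
    for (auto it = index.lower_bound(lo); it != end; ++it) {
        out.push_back(it->second);
    }
    return out;
}
```

The scan seeks to the first key at or above `lo` and stops at the first key above `hi`, so its cost tracks the size of the result rather than the size of the table.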

Contributor

@aheev aheev left a comment

Queries after COPY FROM are failing:

-CASE ArtIndexCopyFrom
-STATEMENT CALL enable_default_hash_index=false;
---- ok
-STATEMENT CREATE NODE TABLE art_copy_person (ID INT64, name STRING, PRIMARY KEY(ID));
---- 1
Table art_copy_person has been created.
-STATEMENT CREATE ART INDEX art_copy_person_pk FOR (p:art_copy_person) ON (p.ID);
---- 1
Index art_copy_person_pk has been created.
-STATEMENT COPY art_copy_person FROM "${LBUG_ROOT_DIRECTORY}/dataset/art-index-test/person.csv";
---- 1
4 tuples have been copied to the art_copy_person table.
-STATEMENT MATCH (p:art_copy_person) WHERE p.ID = 2 RETURN p.name;
---- 1
Grace
-STATEMENT MATCH (p:art_copy_person) WHERE p.ID >= 1 AND p.ID <= 10 RETURN p.ID, p.name ORDER BY p.ID;
---- 3
1|Ada
2|Grace
10|Barbara
-STATEMENT MATCH (p:art_copy_person) WHERE p.ID < 2 RETURN p.ID, p.name ORDER BY p.ID;
---- 2
-1|Edsger
1|Ada
-STATEMENT CALL enable_default_hash_index=true;
---- ok

person.csv
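
The `p.ID < 2` case above expects `-1` to sort before `1`. A radix tree compares keys byte by byte, so signed integers need an order-preserving byte encoding for that to hold. The thread does not say how this PR encodes keys; the sketch below shows the commonly used technique (big-endian bytes with the sign bit flipped) purely as an assumption, and `encodeKey` is a hypothetical name.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Encode a signed 64-bit key as 8 big-endian bytes with the sign bit
// flipped, so that lexicographic byte order matches numeric order.
inline std::array<uint8_t, 8> encodeKey(int64_t key) {
    // Flipping the top bit maps INT64_MIN to 0x00... and INT64_MAX to 0xFF...
    const uint64_t bits = static_cast<uint64_t>(key) ^ (uint64_t{1} << 63);
    std::array<uint8_t, 8> out{};
    for (int i = 0; i < 8; ++i) {
        // Most significant byte first.
        out[i] = static_cast<uint8_t>(bits >> (56 - 8 * i));
    }
    return out;
}
```

Because `std::array` compares lexicographically, `encodeKey(-1) < encodeKey(1)` holds under this encoding, whereas a plain two's-complement byte dump would place every negative key after the positive ones.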

We should also look at thread safety.

Comment thread src/include/storage/index/art_index.h
Comment thread src/optimizer/filter_push_down_optimizer.cpp
Comment thread src/storage/index/art_index.cpp Outdated
Comment thread src/include/storage/index/art_index.h Outdated
Comment thread docs/art_index.md Outdated
Comment thread src/include/storage/index/art_index.h Outdated
Comment thread src/storage/index/art_index.cpp Outdated
@adsharma
Contributor Author

Key changes:

  • ART index creation now validates the physical contract: built-in primary-key index, exactly one indexed column, exactly one key type, supported scalar key type.
  • Range pushdown now only uses ART range scan for validated simple shapes: at most one constant lower bound and one constant upper bound. Duplicate same-side bounds or complex/non-constant predicates stay as normal filters.
  • Consumed valid range predicates are now removed from the residual predicate set.
  • ART public lookup/scan/insert/discard/checkpoint paths now take a coarse mutex for thread safety.
  • Moved initInsertState out of the header and marked ArtPrimaryKeyIndex::load with LBUG_API.
  • Reworked range traversal to iterate actual children instead of probing all 256 byte values through getChild.
  • Rollback/discard cleanup now prunes empty child nodes; it does not shrink node kind layouts, which I’d keep as a separate memory-layout refactor.
  • Updated docs/art_index.md with unsupported shapes and optimizer limitations.
  • Added e2e coverage for unsupported ART creation, duplicate/complex range predicate behavior, and kept the COPY regression.
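
The range-pushdown rule in the list above can be sketched roughly as follows. This is not the code in `filter_push_down_optimizer.cpp`; all type and function names are hypothetical. It also assumes that one unsupported predicate disables pushdown for the whole set, which the summary does not state explicitly.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

enum class CmpOp { LT, LE, GT, GE };

// One comparison on the indexed column. `constant` is empty when the other
// side is not a literal (for example a parameter or another expression).
struct Predicate {
    CmpOp op;
    std::optional<int64_t> constant;
};

struct Bound {
    int64_t value;
    bool inclusive;
};

struct RangePlan {
    std::optional<Bound> lower;
    std::optional<Bound> upper;
};

// Returns a plan only for the simple shape: every predicate is constant and
// there is at most one lower and at most one upper bound. Otherwise returns
// nullopt, meaning all predicates stay as ordinary filters.
inline std::optional<RangePlan> tryBuildRangePlan(
        const std::vector<Predicate>& preds) {
    if (preds.empty()) {
        return std::nullopt;  // nothing to push down
    }
    RangePlan plan;
    for (const auto& p : preds) {
        if (!p.constant) {
            return std::nullopt;  // non-constant predicate
        }
        const bool isLower = p.op == CmpOp::GT || p.op == CmpOp::GE;
        const bool inclusive = p.op == CmpOp::GE || p.op == CmpOp::LE;
        auto& slot = isLower ? plan.lower : plan.upper;
        if (slot) {
            return std::nullopt;  // duplicate same-side bound
        }
        slot = Bound{*p.constant, inclusive};
    }
    return plan;
}
```

For `p.ID >= 1 AND p.ID <= 10` this yields an inclusive lower bound of 1 and an inclusive upper bound of 10. For `p.ID >= 1 AND p.ID > 3` it returns no plan, so both predicates remain normal filters.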

@adsharma
Contributor Author

Need to refactor the pathological unique_ptr<> deallocation pattern in this PR. Eventually we should use a more efficient, purpose-designed memory allocator for the nodes. But for now, I'll try to avoid the long pause on shell exit.

adsharma added 2 commits May 16, 2026 22:17
  - Allocation: one new Node() per inserted trie child.
  - Storage: each Node has a tagged active child layout: small, node48, or node256. It only keeps the child array for its current kind.
  - Ownership: parent owns its child pointers. There is no shared ownership.
  - Destruction: ArtPrimaryKeyIndex::clear() walks the tree iteratively with an explicit stack and deletes every reachable node. This avoids recursive destructor chains.
  - Child deletion: removeChild() calls deleteTree(child) before removing the pointer from the parent, so removed subtrees are freed immediately.
  - Growth/rebalancing: when NODE16 -> NODE48 or NODE48 -> NODE256, the code moves raw child pointers into the new active layout and then switches kind. It does not delete child nodes during growth, because ownership is transferred in-place to the new layout.

What prevents leaks is the invariant that every allocated child pointer is either reachable from exactly one parent layout or is immediately passed to deleteTree() during deletion. Growth only relocates pointers; deletion nulls/compacts after freeing.
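
The ownership and destruction rules described above can be sketched as follows. This is a simplified illustration, not the code in `art_index.cpp`: the `Node` type here uses one vector of (byte, child) pairs in place of the small/node48/node256 layouts, and the signatures and return values of `deleteTree` and `removeChild` are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Simplified trie node: the parent holds raw owning pointers to its
// children, and each child has exactly one parent.
struct Node {
    std::vector<std::pair<uint8_t, Node*>> children;
};

// Free `root` and everything below it using an explicit stack instead of
// recursion, so tree depth does not translate into call-stack depth.
// Returns the number of nodes freed.
inline std::size_t deleteTree(Node* root) {
    if (root == nullptr) {
        return 0;
    }
    std::size_t freed = 0;
    std::vector<Node*> stack{root};
    while (!stack.empty()) {
        Node* node = stack.back();
        stack.pop_back();
        assert(node != nullptr);
        // Queue the children before freeing the node that owns them.
        for (const auto& entry : node->children) {
            stack.push_back(entry.second);
        }
        delete node;
        ++freed;
    }
    return freed;
}

// Remove the child stored under `byte`: free its subtree first, then drop
// the slot from the parent. Returns the number of nodes freed (0 if absent).
inline std::size_t removeChild(Node& parent, uint8_t byte) {
    for (auto it = parent.children.begin(); it != parent.children.end(); ++it) {
        if (it->first == byte) {
            const std::size_t freed = deleteTree(it->second);
            parent.children.erase(it);
            return freed;
        }
    }
    return 0;
}
```

Because every pointer lives in exactly one parent's child list, each node is deleted exactly once, either by `removeChild` or by the final `deleteTree` on the root.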
@adsharma adsharma merged commit 7d6335a into main May 17, 2026
4 checks passed
@adsharma adsharma deleted the art_index branch May 17, 2026 05:59