Fix ice-disk table scans by aheev · Pull Request #491 · LadybugDB/ladybug

aheev · 2026-05-15T11:51:54Z

Fixed the nodeID offset in node table scans by computing the global row offset from Parquet metadata.
Fixed an early-break issue in rel table scans by refactoring the ice-disk internal scan to be full-table based rather than rowGroup based.
Converted STORAGE_FORMAT to an enum.

aheev · 2026-05-15T11:54:49Z

@adsharma could you PTAL?

Re: duplicate boundNodes in unordered_map

Two cases:

 1. Source mode (MATCH (a:user)-[:follows]->(b) — direct node scan child): fetchNextBoundNodeBatch generates unique sequential offsets [nextOffset, nextOffset+N). No duplicates by construction.
 2. Non-source mode (multi-hop (a)-[r1]->(b)-[r2]->(c)):
 - r1's scan processes one source node a at a time (the break when boundOffset != activeBoundOffset)
 - So each call to r1.getNextTuple produces neighbors of exactly one a
 - A single source node's neighbor list has no duplicates in a well-formed CSR file
 - The IceDisk node table emits 1 node per scan call. Even if it emitted a batch, the nodes in it would be distinct
 - Therefore r2's bound node vector always has distinct b values in each batch
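
To illustrate the argument above, here is a minimal sketch over a toy CSR layout. The `CSR` struct, `neighborsOf`, and `allDistinct` are hypothetical names made up for this example and are not the actual ice-disk types. It shows only that a single source node's neighbor slice is distinct in a well-formed file, while a batch that mixed neighbors of several source nodes could contain repeats, which is why processing one bound node at a time matters.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

// Hypothetical in-memory CSR: neighbors of node n live in
// indices[indptr[n] .. indptr[n + 1]).
struct CSR {
    std::vector<uint64_t> indptr;
    std::vector<uint64_t> indices;
};

// Copy out the neighbor list of a single source node.
std::vector<uint64_t> neighborsOf(const CSR& csr, uint64_t src) {
    auto begin = csr.indices.begin() + static_cast<std::ptrdiff_t>(csr.indptr[src]);
    auto end = csr.indices.begin() + static_cast<std::ptrdiff_t>(csr.indptr[src + 1]);
    return std::vector<uint64_t>(begin, end);
}

// True if no value appears twice in the batch.
bool allDistinct(const std::vector<uint64_t>& batch) {
    std::unordered_set<uint64_t> seen;
    for (auto value : batch) {
        if (!seen.insert(value).second) {
            return false;
        }
    }
    return true;
}
```

With `indptr = {0, 2, 3, 5}` and `indices = {1, 2, 2, 0, 1}`, the neighbors of node 0 are `{1, 2}` and of node 1 are `{2}`. Each list is distinct on its own, but concatenating them into one batch would repeat node 2.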

aheev · 2026-05-15T11:56:57Z

dataset PR: LadybugDB/dataset#3

aheev · 2026-05-16T03:05:45Z

@adsharma should we add a get_icebug_disk_supported_version CALL?

adsharma · 2026-05-16T15:31:21Z

We already have db_version() and storage_version(). Users can detect if icebug-disk is supported by trying ATTACH.

adsharma · 2026-05-16T16:09:33Z

-    }
-
-    // Load shared indptr data - thread-safe to read
-    if (!indptrFilePath.empty()) {


Was this guard significant?

Validation of the indptr and indices paths is done during the table creation phase.

adsharma · 2026-05-16T16:11:19Z

+    // calc current global row index based on assigned row group and local row index within that
+    // group
+    auto metadata = iceDiskScanState.parquetReader->getMetadata();
+    offset_t startOffset = 0;


startOffset for a given nodeGroupIdx is constant?

The startOffset for a nodeGroupIdx (rowGroup) is calculated just below. We can avoid this repeated calculation by populating startOffsets in initGlobalStateInternal. I will add it in a refactor after the release; keeping the changes minimal for now.
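
A rough sketch of what that precomputation could look like, assuming the per-row-group row counts have already been read from the Parquet metadata into a plain vector. `computeStartOffsets`, `globalRowOffset`, and `rowGroupNumRows` are hypothetical names for this example, not functions or fields from the PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Exclusive prefix sum over per-row-group row counts:
// startOffsets[g] is the number of rows in all row groups before g.
std::vector<uint64_t> computeStartOffsets(const std::vector<uint64_t>& rowGroupNumRows) {
    std::vector<uint64_t> startOffsets(rowGroupNumRows.size());
    uint64_t running = 0;
    for (std::size_t g = 0; g < rowGroupNumRows.size(); ++g) {
        startOffsets[g] = running;
        running += rowGroupNumRows[g];
    }
    return startOffsets;
}

// Global row offset = rows before the assigned row group
// + local row index within that group.
uint64_t globalRowOffset(const std::vector<uint64_t>& startOffsets, std::size_t rowGroupIdx,
    uint64_t localRowIdx) {
    assert(rowGroupIdx < startOffsets.size());
    return startOffsets[rowGroupIdx] + localRowIdx;
}
```

Computing the offsets once turns the per-scan lookup into a single indexed read instead of a loop over all preceding row groups.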

adsharma · 2026-05-16T16:17:08Z

+
+    // Create DataChunk matching the indices parquet file schema
+    auto numIndicesColumns = indicesReader->getNumColumns();
+    cachedBatchData = std::make_unique<DataChunk>(numIndicesColumns);


Can these allocations be done once on reset() and reused?

DataChunk doesn't offer a reset out of the box; all it offers is resetAuxiliaryBuffer. We would need to manually reset the state in DataChunk and in the other state objects in the ValueVectors, which requires tinkering with ParquetReader and/or ValueVector. Maybe refactor it later?

adsharma · 2026-05-16T16:17:22Z

+    for (uint32_t colIdx = 0; colIdx < numIndicesColumns; ++colIdx) {
+        const auto& columnTypeRef = indicesReader->getColumnType(colIdx);
+        auto columnType = columnTypeRef.copy();
+        auto vector = std::make_shared<ValueVector>(std::move(columnType), memoryManager);


same as above

aheev · 2026-05-17T05:43:14Z

new dataset PR: LadybugDB/dataset#4

adsharma · 2026-05-18T17:00:02Z

Nice improvements! Ok to handle minor unresolved comments separately.

aheev mentioned this pull request May 15, 2026

update icebug-disk demo-db datasets LadybugDB/dataset#3

Merged

adsharma reviewed May 16, 2026


aheev requested a review from adsharma May 17, 2026 05:48

aheev added 10 commits May 18, 2026 19:11

fix minor issues

851f47f

add self-join tests

56929d3

fix rowIndex in icebug node table scan

0ab3158

fix ice-disk rel table scan

6e7529d

enumize storage format

1d6b9fe

update dataset submodule

cc2ec2e

fix scans

305fcd8

add ice_disk complex_queries tests

fb80f5c

update dataset submodule

336a9c1

re-add support for object store files

fe19240

aheev force-pushed the fix-icedisk-scans branch from dc0195c to fe19240 Compare May 18, 2026 15:04

adsharma merged commit 61dedb9 into LadybugDB:main May 18, 2026
4 checks passed
