Fix ice-disk table scans #491

Merged
adsharma merged 10 commits into LadybugDB:main from aheev:fix-icedisk-scans on May 18, 2026

Conversation

@aheev
Contributor

@aheev aheev commented May 15, 2026

  • Fixed the nodeID offset in the node table scan by calculating the global row offset from Parquet metadata
  • Fixed the early-break issue in rel table scans by refactoring the ice-disk internal scan to be full-table based rather than rowGroup based
  • Converted STORAGE_FORMAT to an enum

context: #476 (review)
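
A minimal sketch of the global row offset calculation from the first bullet, assuming the Parquet metadata exposes a row count per row group. `RowGroupSizes` and `globalRowOffset` are hypothetical names for illustration, not the project's API:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the per-row-group row counts found in Parquet
// metadata; the real reader's metadata type is not reproduced here.
using RowGroupSizes = std::vector<uint64_t>;

// Returns the table-wide row index for row `localRowIdx` of row group
// `rowGroupIdx`: the rows in all earlier groups plus the local index.
inline uint64_t globalRowOffset(const RowGroupSizes& rowsPerGroup,
    std::size_t rowGroupIdx, uint64_t localRowIdx) {
    assert(rowGroupIdx < rowsPerGroup.size());
    assert(localRowIdx < rowsPerGroup[rowGroupIdx]);
    uint64_t startOffset = 0;
    for (std::size_t i = 0; i < rowGroupIdx; ++i) {
        startOffset += rowsPerGroup[i];
    }
    return startOffset + localRowIdx;
}
```

With row groups of 100, 50, and 75 rows, local row 10 of group 2 maps to global row 160, which is the value a node scan would need to use as the nodeID offset under this assumption.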

@aheev
Contributor Author

aheev commented May 15, 2026

@adsharma could you PTAL?

Re: duplicate boundNodes in unordered_map

Two cases:

 1. Source mode (MATCH (a:user)-[:follows]->(b), where the child is a direct node scan): fetchNextBoundNodeBatch generates unique sequential offsets [nextOffset, nextOffset+N), so there are no duplicates by construction.
 2. Non-source mode (multi-hop (a)-[r1]->(b)-[r2]->(c)):
    - r1's scan processes one source node a at a time (it breaks when boundOffset != activeBoundOffset).
    - So each call to r1.getNextTuple produces the neighbors of exactly one a.
    - A single source node's neighbor list has no duplicates in a well-formed CSR file.
    - The IceDisk node table emits 1 node per scan call. Even if it emitted nodes in a batch, they would be distinct.
    - Therefore r2's bound node vector always has distinct b values in each batch.
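
The non-source argument can be sketched with a tiny in-memory CSR, assuming the usual layout where the neighbors of node a are indices[indptr[a] .. indptr[a+1]). The `Csr` struct and both helpers are hypothetical illustrations, not code from this PR:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

// Hypothetical in-memory CSR; the PR reads indptr and indices from Parquet files.
struct Csr {
    std::vector<uint64_t> indptr;  // size = numNodes + 1
    std::vector<uint64_t> indices; // concatenated neighbor lists
};

// Returns the neighbor list of exactly one source node, which is what a
// single scan call produces when it stops at a bound-node change.
inline std::vector<uint64_t> neighborsOf(const Csr& csr, std::size_t src) {
    assert(src + 1 < csr.indptr.size());
    auto first = csr.indices.begin() + static_cast<std::ptrdiff_t>(csr.indptr[src]);
    auto last = csr.indices.begin() + static_cast<std::ptrdiff_t>(csr.indptr[src + 1]);
    return std::vector<uint64_t>(first, last);
}

// Checks the property the next hop relies on: no repeated bound node in a batch.
inline bool hasDuplicates(const std::vector<uint64_t>& batch) {
    std::unordered_set<uint64_t> seen;
    for (auto value : batch) {
        if (!seen.insert(value).second) {
            return true;
        }
    }
    return false;
}
```

If the CSR file has no parallel edges, `hasDuplicates(neighborsOf(csr, a))` is false for every a, so a map keyed by the bound nodes of one batch never sees a repeated key.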

@aheev
Contributor Author

aheev commented May 15, 2026

dataset PR: LadybugDB/dataset#3

@aheev
Contributor Author

aheev commented May 16, 2026

@adsharma should we add a get_icebug_disk_supported_version CALL?

@adsharma
Contributor

We already have db_version() and storage_version(). Users can detect if icebug-disk is supported by trying ATTACH.

Comment thread src/include/catalog/catalog_entry/node_table_catalog_entry.h
Comment thread src/storage/table/ice_disk_rel_table.cpp
}

// Load shared indptr data - thread-safe to read
if (!indptrFilePath.empty()) {
Contributor

Was this guard significant?

Contributor Author

indptr and indices path validation is done during the table creation phase, so the guard is no longer needed here.

// calc current global row index based on assigned row group and local row index within that
// group
auto metadata = iceDiskScanState.parquetReader->getMetadata();
offset_t startOffset = 0;
Contributor

startOffset for a given nodeGroupIdx is constant?

Contributor Author

startOffset for a nodeGroupIdx (rowGroup) is calculated just below. We can avoid this repeated calculation by populating startOffsets in initGlobalStateInternal. I will add that in a post-release refactor; keeping changes minimal for now.
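
A rough sketch of that follow-up, assuming the row count of each row group is available once at initialization. `buildStartOffsets` is a hypothetical helper, and how it would be wired into initGlobalStateInternal is not shown in this thread:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Builds startOffsets[i] = total rows in all row groups before group i,
// so each scan call can look up its start offset instead of re-summing.
inline std::vector<uint64_t> buildStartOffsets(const std::vector<uint64_t>& rowsPerGroup) {
    std::vector<uint64_t> startOffsets(rowsPerGroup.size());
    uint64_t running = 0;
    for (std::size_t i = 0; i < rowsPerGroup.size(); ++i) {
        startOffsets[i] = running;
        running += rowsPerGroup[i];
    }
    return startOffsets;
}
```

The table is read-only after construction, so under this assumption it could be shared across scan threads without locking, and the per-call cost drops from a loop over earlier row groups to a single indexed read.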

Comment thread src/processor/operator/scan/scan_rel_table.cpp Outdated
Comment thread docs/icebug-disk.md
Comment thread src/common/enums/storage_format.cpp Outdated

// Create DataChunk matching the indices parquet file schema
auto numIndicesColumns = indicesReader->getNumColumns();
cachedBatchData = std::make_unique<DataChunk>(numIndicesColumns);
Contributor

Can these allocations be done once on reset() and reused?

Contributor Author

DataChunk doesn't offer a reset out of the box; all it offers is resetAuxiliaryBuffer. We would need to manually reset the state in DataChunk and in the state objects of its ValueVectors, which requires tinkering with ParquetReader and/or ValueVector. Maybe refactor it later?
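
For reference, the allocate-once pattern being discussed looks roughly like the sketch below. `ReusableBatch` is a hypothetical stand-in built on std::vector and is NOT the project's DataChunk or ValueVector, whose internal state is the part that makes a real reset harder:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-scan batch buffer that is allocated once and reused.
class ReusableBatch {
public:
    ReusableBatch(std::size_t numColumns, std::size_t capacity) : columns(numColumns) {
        for (auto& col : columns) {
            col.reserve(capacity); // one allocation per column, up front
        }
    }

    // Drops the values of the previous batch but keeps the allocated storage.
    void reset() {
        for (auto& col : columns) {
            col.clear();
        }
    }

    std::vector<uint64_t>& column(std::size_t idx) { return columns.at(idx); }
    std::size_t numColumns() const { return columns.size(); }

private:
    std::vector<std::vector<uint64_t>> columns;
};
```

A scan loop would call `reset()` before filling each batch instead of constructing new buffers. Whether the same can be done for DataChunk depends on the extra state mentioned above, which this sketch does not model.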

for (uint32_t colIdx = 0; colIdx < numIndicesColumns; ++colIdx) {
const auto& columnTypeRef = indicesReader->getColumnType(colIdx);
auto columnType = columnTypeRef.copy();
auto vector = std::make_shared<ValueVector>(std::move(columnType), memoryManager);
Contributor

Ditto

Contributor Author

same as above

@aheev
Contributor Author

aheev commented May 17, 2026

new dataset PR: LadybugDB/dataset#4

@aheev aheev requested a review from adsharma May 17, 2026 05:48
@aheev aheev force-pushed the fix-icedisk-scans branch from dc0195c to fe19240 Compare May 18, 2026 15:04
@adsharma adsharma merged commit 61dedb9 into LadybugDB:main May 18, 2026
4 checks passed
@adsharma
Contributor

Nice improvements! Ok to handle minor unresolved comments separately.
