Skip to content

[WIP] Add icebug mem impl#496

Open
aheev wants to merge 2 commits into
LadybugDB:mainfrom
aheev:add-icebug-mem-impl
Open

[WIP] Add icebug mem impl#496
aheev wants to merge 2 commits into
LadybugDB:mainfrom
aheev:add-icebug-mem-impl

Conversation

@aheev
Copy link
Copy Markdown
Contributor

@aheev aheev commented May 18, 2026

Closes: #329

@adsharma
Copy link
Copy Markdown
Contributor

I realize this is not ready for review. But thanks for sharing it early for feedback!

Two high level concerns and details:

  • Adds a lot of new code. One of my concerns as a maintainer is to minimize the amount of code to maintain
  • Ice-mem branding. These things change and I would keep them on the docs/communication and use more technical names such as arrow-csr in the code.

Recommendations:

  • Keep ArrowNodeTable; do not add IceMemNodeTable.
  • Extend ArrowRelTable with a layout enum, e.g. TRIPLES vs CSR.
  • Put the layout-specific edge cursor behind a small helper, not a new table class.
  • For TRIPLES, keep current behavior: read from/to, lookup node PKs, emit matches.
  • For CSR, store indices and indptr Arrow arrays in the Arrow rel table data. For FWD scans, use the bound node offset to directly slice CSR. For BWD scans, either require/provide reverse CSR or explicitly fall back to a full scan.
  • Extend ArrowTableSupport registration with a tagged rel payload, rather than inventing icebug-memory catalog format.

@aheev
Copy link
Copy Markdown
Contributor Author

aheev commented May 19, 2026

  • Ice-mem branding. These things change and I would keep them on the docs/communication and use more technical names such as arrow-csr in the code.

Brand names tend to change far less frequently than technical details, wouldn’t you agree? Perhaps only on rare occasions. Also, if we were to name it arrow-csr, it might give the impression that we are confining the ice-mem format within both arrow and CSR

Recommendations:

  • Keep ArrowNodeTable; do not add IceMemNodeTable.
  • Extend ArrowRelTable with a layout enum, e.g. TRIPLES vs CSR.
  • Put the layout-specific edge cursor behind a small helper, not a new table class.
  • For TRIPLES, keep current behavior: read from/to, lookup node PKs, emit matches.
  • For CSR, store indices and indptr Arrow arrays in the Arrow rel table data. For FWD scans, use the bound node offset to directly slice CSR. For BWD scans, either require/provide reverse CSR or explicitly fall back to a full scan.

There is already a significant amount of redundant state and data in both the table and scan state classes due to inheritance from the native classes. For instance, if you look at setToTable, much of the code is not strictly necessary but is retained to remain compliant with the native class structure. Consequently, any changes to the native classes can propagate and affect the external storage classes as well. I have faced these issues while working on icebug-disk implementation

void ArrowRelTableScanState::setToTable(const transaction::Transaction* transaction, Table* table_,
    std::vector<column_id_t> columnIDs_, std::vector<ColumnPredicateSet> columnPredicateSets_,
    RelDataDirection direction_) {
    // Same behavior as IceDiskRelTable: no local table for external data sources.
    TableScanState::setToTable(transaction, table_, std::move(columnIDs_),
        std::move(columnPredicateSets_));
    columns.resize(columnIDs.size());
    direction = direction_;
    for (size_t i = 0; i < columnIDs.size(); ++i) {
        auto columnID = columnIDs[i];
        if (columnID == INVALID_COLUMN_ID || columnID == ROW_IDX_COLUMN_ID) {
            columns[i] = nullptr;
        } else {
            columns[i] = table->cast<RelTable>().getColumn(columnID, direction);
        }
    }
    csrOffsetColumn = table->cast<RelTable>().getCSROffsetColumn(direction);
    csrLengthColumn = table->cast<RelTable>().getCSRLengthColumn(direction);
    nodeGroupIdx = INVALID_NODE_GROUP_IDX;
}

Even with the ColumnarTableBase abstraction and the majority of common utilities in ArrowUtils, the Arrow classes remain large and complex. Introducing new data and states from icebug-memory only adds to this complexity, increasing redundant compliance code, more conditional checks, and potential risk.

  • Extend ArrowTableSupport registration with a tagged rel payload, rather than inventing icebug-memory catalog format.

didn't get this part

@adsharma
Copy link
Copy Markdown
Contributor

Extend ArrowTableSupport registration with a tagged rel payload, rather than inventing icebug-memory catalog format.

The Tag in this context is TRIPLES or CSR (feel free to rename). The status right now:

Parquet Arrow
Triples
CSR

In the future there may be a legit use case for running cypher over a triple table stored in parquet (we support this for duckdb/sqlite/postgres).

I'm thinking about how to implement this matrix with the most code reuse and maintainability. I agree that sometimes its necessary to duplicate code for clarity. Inheritance is not the best way to share code etc.

But I'm not convinced that we should have 4 cases above * {node, rel} = 8 classes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Feature: Implement icebug-memory tables

2 participants