feat: blob v2 descriptor read support by geruh · Pull Request #548 · lance-format/lance-spark

geruh · 2026-05-21T00:51:32Z

Related to #539 and #505.

Adds read support for blob v2 columns. Instead of hiding blob metadata behind virtual columns, we surface the raw descriptor struct directly to Spark, as the original issue proposed.

struct<kind: short, position: long, size: long, blob_id: long, blob_uri: string>

Querying blob metadata is just a column projection now, no byte fetch:

SELECT id, payload.size, payload.kind FROM lance.ns.tbl;

A column is treated as blob v2 when either of these holds:

The Arrow extension name lance.blob.v2 is set in Lance
The metadata key lance-encoding:blob-v2 is set to true
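
The two detection rules above can be sketched as a single predicate. This is a minimal illustration, not the connector's BlobUtils code. It assumes the field metadata is available as a plain string-to-string dict, that the extension name is read from Arrow's standard ARROW:extension:name metadata key, and that the flag value is compared case-insensitively. The name is_blob_v2 is hypothetical.

```python
# Arrow records a field's extension type name under this well-known metadata key.
ARROW_EXTENSION_NAME_KEY = "ARROW:extension:name"
BLOB_V2_EXTENSION_NAME = "lance.blob.v2"
BLOB_V2_ENCODING_KEY = "lance-encoding:blob-v2"


def is_blob_v2(field_metadata: dict) -> bool:
    """Return True if either blob v2 marker is present on the field (hypothetical helper)."""
    has_extension = field_metadata.get(ARROW_EXTENSION_NAME_KEY) == BLOB_V2_EXTENSION_NAME
    # Assumption: the PR only says "= true"; case-insensitive matching is a guess.
    has_encoding_flag = field_metadata.get(BLOB_V2_ENCODING_KEY, "").lower() == "true"
    return has_extension or has_encoding_flag
```

Either marker alone is enough, so a field with only the metadata key and no extension name still counts as blob v2.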

Schema rewrite lives in BlobUtils.applyBlobV2DescriptorSchema(...), called from LanceDataset.schema(), LanceDataSource.inferSchema(), and the LanceScanBuilder constructor.
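
The idea of the schema rewrite can be sketched as follows. This is an assumption-laden model rather than the actual applyBlobV2DescriptorSchema implementation: the Field class, the is_blob_v2 flag, and the "binary" placeholder type are all hypothetical, and only the descriptor struct string is taken from this PR.

```python
from dataclasses import dataclass

# Descriptor struct exactly as given in the PR description.
DESCRIPTOR_TYPE = (
    "struct<kind: short, position: long, size: long, blob_id: long, blob_uri: string>"
)


@dataclass(frozen=True)
class Field:
    """Hypothetical stand-in for a schema field; not a Spark or Arrow class."""
    name: str
    type: str
    is_blob_v2: bool = False


def apply_descriptor_schema(fields):
    """Swap the type of every blob v2 field for the descriptor struct; leave others as-is."""
    return [
        Field(f.name, DESCRIPTOR_TYPE, f.is_blob_v2) if f.is_blob_v2 else f
        for f in fields
    ]
```

For example, apply_descriptor_schema([Field("id", "int"), Field("payload", "binary", True)]) leaves id untouched and gives payload the descriptor type.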

Filter pushdown

Filter pushdown is suppressed for any table with a blob v2 column. The previous per-predicate gate had to be reverted because calc_eager_projection panics even on filters that never reference blob columns. Zonemap fragment pruning still runs. This needs more investigation.
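
The table-level gate described above can be modeled roughly like this. It is a simplified sketch, not the LanceScanBuilder code: split_filters is a hypothetical helper, filters are plain strings, and the sketch assumes every filter would otherwise be pushable. The return shape loosely mirrors Spark's SupportsPushDownFilters contract, where filters that are not pushed are handed back for evaluation after the scan.

```python
def split_filters(filters, has_blob_v2_column):
    """Return (pushed, post_scan) for a scan (hypothetical helper).

    With any blob v2 column in the table, nothing is pushed and Spark
    evaluates every filter after the scan.
    """
    if has_blob_v2_column:
        return [], list(filters)
    # Assumption: without blob v2 columns, all filters are pushable.
    return list(filters), []
```

Results stay correct either way because the unpushed filters are still applied by Spark; only the scan-side pruning from pushed predicates is lost.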

Testing

Added some tests for BlobUtils
Added some tests for LanceArrowColumnVector

No end-to-end pylance interop tests here. The connector can't write v2 blobs yet, so producing test data requires pylance as an external dep. E2E coverage lands naturally once the write path exists, extending BaseBlobCreateTableTest's pattern.

Local verification

Wrote a v2 dataset with pylance hitting all four BlobKind tiers (Inline, Packed, Dedicated, External) plus a null row, then read it back through Spark on this branch:

write_blob_v2.py

import lance, pyarrow as pa

with open("/tmp/blob_v2_external.bin", "wb") as f:
    f.write(b"external blob payload " * 64)

values = [
    b"hi",                                    # 2 B → Inline (kind=0, ≤ 64KB)
    b"x" * (200 * 1024),                      # 200KB → Packed (kind=1, > 64KB, ≤ 4MB)
    b"y" * (5 * 1024 * 1024),                 # 5 MB → Dedicated (kind=2, > 4MB)
    "file:///tmp/blob_v2_external.bin",        # str → External (kind=3)
    None,                                      # null
]

table = pa.table({
    "id": pa.array([0, 1, 2, 3, 4], type=pa.int32()),
    "label": pa.array(["inline", "packed", "dedicated", "external", "null"]),
    "payload": lance.blob_array(values),
})

lance.write_dataset(table, "/tmp/blob_v2_kinds.lance",
    data_storage_version="2.2",
    allow_external_blob_outside_bases=True)
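
The kind each value is expected to land in can be predicted with a small helper. This sketch uses only the thresholds stated in the comments of the script above (64KB and 4MB); they are not verified against Lance itself, and expected_kind is a hypothetical name.

```python
INLINE_MAX = 64 * 1024          # from the script comment: Inline is <= 64KB
PACKED_MAX = 4 * 1024 * 1024    # from the script comment: Packed is <= 4MB


def expected_kind(value):
    """Predict the BlobKind code for a value passed to the write script (hypothetical helper)."""
    if value is None:
        return None              # null row, no blob written
    if isinstance(value, str):
        return 3                 # a URI string is stored as External
    size = len(value)
    if size <= INLINE_MAX:
        return 0                 # Inline
    if size <= PACKED_MAX:
        return 1                 # Packed
    return 2                     # Dedicated
```

Applied to the five entries in values, this yields 0, 1, 2, 3, and None in that order.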

Spark output

> printSchema
root
 |-- id: integer (nullable = true)
 |-- label: string (nullable = true)
 |-- payload: struct (nullable = true)
 | |-- kind: short (nullable = true)
 | |-- position: long (nullable = true)
 | |-- size: long (nullable = true)
 | |-- blob_id: long (nullable = true)
 | |-- blob_uri: string (nullable = true)
 
 
> SELECT *
+---+---------+----------------------------------------------+
|id |label    |payload                                       |
+---+---------+----------------------------------------------+
|0  |inline   |{0, 0, 2, 0, }                                |
|1  |packed   |{1, 0, 204800, 1, }                           |
|2  |dedicated|{2, 0, 5242880, 2, }                          |
|3  |external |{3, 0, 0, 0, file:///tmp/blob_v2_external.bin}|
|4  |null     |{0, 0, 0, 0, }                                |
+---+---------+----------------------------------------------+

> projected fields
+---+---------+----+-------+-------+--------+--------------------------------+
|id |label    |kind|size   |blob_id|position|blob_uri                        |
+---+---------+----+-------+-------+--------+--------------------------------+
|0  |inline   |0   |2      |0      |0       |                                |
|1  |packed   |1   |204800 |1      |0       |                                |
|2  |dedicated|2   |5242880|2      |0       |                                |
|3  |external |3   |0      |0      |0       |file:///tmp/blob_v2_external.bin|
|4  |null     |0   |0      |0      |0       |                                |
+---+---------+----+-------+-------+--------+--------------------------------+

> filter id >= 2
+---+---------+----+
|id |label    |kind|
+---+---------+----+
|2  |dedicated|2   |
|3  |external |3   |
|4  |null     |0   |
+---+---------+----+
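
Because the descriptor carries kind and size, storage can be summarized without fetching any blob bytes. The sketch below does this in plain Python over the (id, kind, size) values copied from the projected-fields output above. The BlobKind enum is a hypothetical mirror of the four tiers and their codes as listed in this PR, and bytes_by_kind is a hypothetical helper.

```python
from collections import defaultdict
from enum import IntEnum


class BlobKind(IntEnum):
    """Kind codes as they appear in this PR's write script and Spark output."""
    INLINE = 0
    PACKED = 1
    DEDICATED = 2
    EXTERNAL = 3


# (id, kind, size) copied from the "projected fields" output.
rows = [(0, 0, 2), (1, 1, 204800), (2, 2, 5242880), (3, 3, 0), (4, 0, 0)]


def bytes_by_kind(descriptor_rows):
    """Sum descriptor sizes per kind name using only descriptor fields."""
    totals = defaultdict(int)
    for _row_id, kind, size in descriptor_rows:
        totals[BlobKind(kind).name] += size
    return dict(totals)
```

Note that the null row appears in the output as kind 0 with size 0, so this sketch counts it under INLINE without changing the total.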

feat: blob v2 descriptor read support

0fc1d66

github-actions bot added the enhancement label on May 21, 2026

geruh added 2 commits May 20, 2026 18:19

remove old table properties logic

1cbb116

fix tests

df9fa03

geruh wants to merge 3 commits into lance-format:main from geruh:v2read-descriptor
