feat: blob v2 descriptor read support #548

Open
geruh wants to merge 3 commits into lance-format:main from geruh:v2read-descriptor

@geruh (Collaborator) commented May 21, 2026

Related to #539 and #505.

Adds read support for blob v2 columns. Instead of hiding blob metadata behind virtual columns, we surface the raw descriptor struct directly to Spark, as the original issue proposed.

struct<kind: short, position: long, size: long, blob_id: long, blob_uri: string>

Querying blob metadata is just a column projection now, no byte fetch:

SELECT id, payload.size, payload.kind FROM lance.ns.tbl;
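Once a descriptor row reaches client code, it can be modeled as a small typed record. The sketch below is illustrative only: `BlobDescriptor` and `describe` are hypothetical names, the field order follows the struct above, and the kind values (0 to 3) are taken from the local verification output in this PR rather than from a published spec.

```python
from enum import IntEnum
from typing import NamedTuple


class BlobKind(IntEnum):
    # Values as they appear in this PR's verification output (assumption).
    INLINE = 0
    PACKED = 1
    DEDICATED = 2
    EXTERNAL = 3


class BlobDescriptor(NamedTuple):
    # Same field order as the struct surfaced to Spark.
    kind: int
    position: int
    size: int
    blob_id: int
    blob_uri: str

    def describe(self) -> str:
        """Summarize the descriptor without touching any blob bytes."""
        name = BlobKind(self.kind).name.lower()
        if self.kind == BlobKind.EXTERNAL:
            return f"{name} -> {self.blob_uri}"
        return f"{name}, {self.size} bytes"
```

For example, `BlobDescriptor(1, 0, 204800, 1, "").describe()` summarizes a packed blob from its metadata alone.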

A column is treated as blob v2 when either of these holds:

  • The Arrow extension name on the field is lance.blob.v2
  • The metadata key lance-encoding:blob-v2 is set to true
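The detection rule can be sketched in a few lines of Python. This is a minimal illustration, not the connector's code: it assumes field metadata arrives as a bytes-to-bytes mapping (the shape pyarrow exposes through `Field.metadata`) and that the extension name is carried under the standard Arrow key `ARROW:extension:name`. `is_blob_v2` is a hypothetical helper name, and the case-insensitive comparison of the flag value is an assumption.

```python
from typing import Mapping, Optional

# Standard Arrow field-metadata key that carries an extension type's name.
ARROW_EXT_NAME_KEY = b"ARROW:extension:name"
BLOB_V2_EXT_NAME = b"lance.blob.v2"
BLOB_V2_META_KEY = b"lance-encoding:blob-v2"


def is_blob_v2(metadata: Optional[Mapping[bytes, bytes]]) -> bool:
    """Return True if either blob v2 marker is present (hypothetical helper)."""
    if not metadata:
        return False
    if metadata.get(ARROW_EXT_NAME_KEY) == BLOB_V2_EXT_NAME:
        return True
    # Assumption: the flag value is compared case-insensitively.
    return metadata.get(BLOB_V2_META_KEY, b"").lower() == b"true"
```

With pyarrow available, a call would look like `is_blob_v2(schema.field("payload").metadata)`.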

Schema rewrite lives in BlobUtils.applyBlobV2DescriptorSchema(...), called from LanceDataset.schema(), LanceDataSource.inferSchema(), and the LanceScanBuilder constructor.

Filter pushdown

Filter pushdown is suppressed for any table with a blob v2 column. The previous per-predicate gate had to be reverted because calc_eager_projection panics even on filters that never reference blob columns. Zonemap fragment pruning still runs. This needs more investigation.
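The table-wide gate amounts to the following logic, shown here as a Python sketch rather than the connector's Java code. `filters_to_push` is a hypothetical name, filters are represented as plain strings for illustration, and the input mapping (column name to "is blob v2") is an assumed shape.

```python
from typing import List, Mapping, Sequence


def filters_to_push(blob_v2_by_column: Mapping[str, bool],
                    filters: Sequence[str]) -> List[str]:
    """Table-wide gate: push nothing if any column in the table is blob v2."""
    if any(blob_v2_by_column.values()):
        # Nothing is pushed down, so every filter is left for the engine
        # to evaluate after the scan.
        return []
    return list(filters)
```

The gate is deliberately coarse: it ignores which columns a filter references, because the per-predicate variant still hit the panic described above.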

Testing

  • Added some tests for BlobUtils
  • Added some tests for LanceArrowColumnVector

No end-to-end pylance interop tests here. The connector can't write v2 blobs yet, so producing test data requires pylance as an external dep. E2E coverage lands naturally once the write path exists, extending BaseBlobCreateTableTest's pattern.

Local verification

Wrote a v2 dataset with pylance hitting all four BlobKind tiers (Inline, Packed, Dedicated, External) plus a null row, then read it back through Spark on this branch:

write_blob_v2.py
import lance
import pyarrow as pa

with open("/tmp/blob_v2_external.bin", "wb") as f:
    f.write(b"external blob payload " * 64)

values = [
    b"hi",                                    # 2 B → Inline (kind=0, ≤ 64KB)
    b"x" * (200 * 1024),                      # 200KB → Packed (kind=1, > 64KB, ≤ 4MB)
    b"y" * (5 * 1024 * 1024),                 # 5 MB → Dedicated (kind=2, > 4MB)
    "file:///tmp/blob_v2_external.bin",        # str → External (kind=3)
    None,                                      # null
]

table = pa.table({
    "id": pa.array([0, 1, 2, 3, 4], type=pa.int32()),
    "label": pa.array(["inline", "packed", "dedicated", "external", "null"]),
    "payload": lance.blob_array(values),
})

lance.write_dataset(table, "/tmp/blob_v2_kinds.lance",
    data_storage_version="2.2",
    allow_external_blob_outside_bases=True)
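To make the size tiers in the script comments explicit, here is a small sketch that predicts the kind for each input value. The thresholds (64KB and 4MB, both inclusive upper bounds) are copied from those comments and have not been checked against Lance itself. `expected_kind` is a hypothetical helper, and it returns None for a null input rather than guessing how nulls are encoded.

```python
from typing import Optional, Union

INLINE_MAX = 64 * 1024        # "<= 64KB" per the script comments
PACKED_MAX = 4 * 1024 * 1024  # "<= 4MB" per the script comments


def expected_kind(value: Union[bytes, str, None]) -> Optional[int]:
    """Predict the blob kind for a value passed to the write script."""
    if value is None:
        return None           # null row: no prediction made here
    if isinstance(value, str):
        return 3              # a URI string is stored as External
    n = len(value)
    if n <= INLINE_MAX:
        return 0              # Inline
    if n <= PACKED_MAX:
        return 1              # Packed
    return 2                  # Dedicated
```

Applied to the five values in the script, this yields 0, 1, 2, 3, and None.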
Spark output
> printSchema
root
 |-- id: integer (nullable = true)
 |-- label: string (nullable = true)
 |-- payload: struct (nullable = true)
 | |-- kind: short (nullable = true)
 | |-- position: long (nullable = true)
 | |-- size: long (nullable = true)
 | |-- blob_id: long (nullable = true)
 | |-- blob_uri: string (nullable = true)
 
 
> SELECT *
+---+---------+----------------------------------------------+
|id |label    |payload                                       |
+---+---------+----------------------------------------------+
|0  |inline   |{0, 0, 2, 0, }                               |
|1  |packed   |{1, 0, 204800, 1, }                          |
|2  |dedicated|{2, 0, 5242880, 2, }                         |
|3  |external |{3, 0, 0, 0, file:///tmp/blob_v2_external.bin}|
|4  |null     |{0, 0, 0, 0, }                               |
+---+---------+----------------------------------------------+

> projected fields
+---+---------+----+-------+-------+--------+--------------------------------+
|id |label    |kind|size   |blob_id|position|blob_uri                        |
+---+---------+----+-------+-------+--------+--------------------------------+
|0  |inline   |0   |2      |0      |0       |                                |
|1  |packed   |1   |204800 |1      |0       |                                |
|2  |dedicated|2   |5242880|2      |0       |                                |
|3  |external |3   |0      |0      |0       |file:///tmp/blob_v2_external.bin|
|4  |null     |0   |0      |0      |0       |                                |
+---+---------+----+-------+-------+--------+--------------------------------+

> filter id >= 2
+---+---------+----+
|id |label    |kind|
+---+---------+----+
|2  |dedicated|2   |
|3  |external |3   |
|4  |null     |0   |
+---+---------+----+

github-actions bot added the enhancement label May 21, 2026