Skip to content

refactor: remove Java-side dataset cache, rely on Rust-side Session#353

Merged
hamersaw merged 12 commits into
lance-format:mainfrom
LuciferYang:consolidate-read-caches-335
Apr 14, 2026
Merged

refactor: remove Java-side dataset cache, rely on Rust-side Session#353
hamersaw merged 12 commits into
lance-format:mainfrom
LuciferYang:consolidate-read-caches-335

Conversation

@LuciferYang
Copy link
Copy Markdown
Contributor

@LuciferYang LuciferYang commented Mar 27, 2026

Summary

  • Delete LanceDatasetCache; the Rust-side Session already caches metadata and index blocks, so a Java-side Dataset cache only saves manifest deserialization (microseconds).
  • Load fragment metadata on demand via dataset.getFragment(id) per partition, instead of eagerly building Map<Integer, Fragment> from dataset.getFragments() on every cache miss — O(N) in total fragment count, even though each partition reads one fragment.
  • Each LanceFragmentScanner opens, owns, and closes its own Dataset through a new LanceRuntime.openDataset() helper.

Changes

File Change
LanceDatasetCache.java Deleted (~291 lines).
LanceRuntime.java Add openDataset() — wires the catalog Session, reconstructs the namespace from the (namespaceImpl, namespaceProperties) strings carried in LanceInputPartition, merges storage options, and pins the version.
LanceFragmentScanner.java create() opens its own Dataset and loads only the target fragment; close() closes both scanner and dataset, using Throwable.addSuppressed so the first failure isn't masked by cleanup.

Before / after

Before (main)                          After
-------------                          -----
LanceFragmentScanner.create()          LanceFragmentScanner.create()
  LanceDatasetCache.getFragment()        LanceRuntime.openDataset()   // per partition
    Guava cache.get()                      (Rust Session: metadata + index)
      new CachedDataset(dataset)           dataset.getFragment(id)    // single fragment
        dataset.getFragments()  ← O(N)
    cached.getFragment(id)

The per-catalog Rust Session (configured via LANCE_INDEX_CACHE_SIZE / LANCE_METADATA_CACHE_SIZE) and the global Arrow BufferAllocator are unchanged.

@github-actions github-actions Bot added the performance Features that improves performance label Mar 27, 2026
Copy link
Copy Markdown
Collaborator

@hamersaw hamersaw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think it would make sense to just remove the LanceDatasetCache and house fragment caching in the LanceRuntime? It feels like there's just multiple layers of cache here.

@LuciferYang
Copy link
Copy Markdown
Contributor Author

Do you think it would make sense to just remove the LanceDatasetCache and house fragment caching in the LanceRuntime? It feels like there's just multiple layers of cache here.

625df7c followed this approach

@hamersaw
Copy link
Copy Markdown
Collaborator

I think overall this looks reasonable, but we need some relatively comprehensive benchmarks to support any performance implications.

…ches-335

# Conflicts:
#	lance-spark-base_2.12/src/main/java/org/lance/spark/internal/LanceDatasetCache.java
#	lance-spark-base_2.12/src/main/java/org/lance/spark/read/LanceCountStarPartitionReader.java
Replace non-existent LanceNamespaceStorageOptionsProvider with
LanceNamespace from getOrCreateNamespace(), matching the namespace
API used by OpenDatasetBuilder. Add missing java.util.List import.
Adds FragmentLoadingBenchmarkTest to quantify the performance
difference between getFragments() (old eager approach, O(N)) and
getFragment(id) (new lazy approach, O(1)).

Results on datasets with 10-1000 fragments show 10x-609x speedup
for the lazy approach, confirming the motivation for PR lance-format#353.

Tagged with @tag("benchmark") to exclude from normal test runs.
Add test.excludedGroups property (default: benchmark) to surefire
config so @tag("benchmark") tests are excluded from normal mvn test
runs. Override with -Dtest.excludedGroups= to include them.

Run benchmarks with:
  mvn test -Dtest=FragmentLoadingBenchmarkTest \
    -Dtest.excludedGroups= -Dgroups=benchmark
* <p>Tagged with "benchmark" so it is excluded from normal test runs.
*/
@Tag("benchmark")
public class FragmentLoadingBenchmarkTest {
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this microbenchmark sufficient to demonstrate the effect? We can run

mvn test -pl lance-spark-base_2.12 \
  -Dtest=FragmentLoadingBenchmarkTest \
  -Dtest.excludedGroups= \
  -Dgroups=benchmark

to verify:

=== Fragment Loading Benchmark ===
Fragments    |  getFragments() (ms) | getFragment(id) (ms) |    Speedup
----------------------------------------------------------------------
10           |             0.082 ms |             0.007 ms |      11.0x
50           |             0.329 ms |             0.008 ms |      43.2x
100          |             0.630 ms |             0.010 ms |      65.9x
500          |             2.929 ms |             0.008 ms |     380.9x
1000         |             6.064 ms |             0.008 ms |     760.4x

Notes:

  • getFragments(): loads ALL fragment metadata (old eager approach)
  • getFragment(id): loads ONE fragment by ID (new lazy approach)
  • Each worker partition only needs one fragment, so the lazy approach avoids
    loading metadata for all other fragments in the dataset.

The project does not integrate JMH yet, so this benchmark is relatively simple.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If end-to-end impact is needed, we are running tests on a 1TB TPC-DS dataset, but this improvement likely won’t show a noticeable difference there.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@summaryzb Do you have any test results about this ?

BatchScanExec.equals() compares batch objects via equals(), which
delegates to LanceScan since it implements Batch. Without overriding
equals/hashCode, Object identity is used, so two scans of the same
table are never equal and Spark cannot reuse exchanges.

Compare schema, readOptions, filters, limit, offset, topN, and
aggregation. Exclude scanId (per-instance UUID for tracing only).
@hamersaw
Copy link
Copy Markdown
Collaborator

Ok, so diving a little deeper. The rust-side implementation has it's own caches; for metadata (LANCE_METADATA_CACHE_SIZE) and indexes (LANCE_INDEX_CACHE_SIZE). By caching the actual dataset handle, we're really only saving time in deserializing the dataset manifest bytes, which should be measureable in microseconds (or milliseconds at worst). I'm wondering if we should just remove the notion of Spark-side caches for everything (dataset + fragment) and rely on the rust-side Lance cache. This is basically what the LanceRuntime handles, rust-size Lance cache.

@LuciferYang
Copy link
Copy Markdown
Contributor Author

That makes sense. Since the Session already caches metadata and index blocks on the Rust side, the Java-side LoadingCache<DatasetCacheKey, Dataset> only saves manifest deserialization time. I agree we should remove it and let each scanner open/close its own Dataset handle, relying on the Rust-side cache for the heavy lifting. I'll update the pr to:

  1. Remove the Guava LoadingCache, DatasetCacheKey, getCachedDataset(), and getFragment()
  2. Have LanceFragmentScanner open its own Dataset (via openDataset()) and close it in close()
  3. Keep the per-catalog Session caching as-is

I'll also enhance openDataset() to honor per-query blockSize/indexCacheSize/metadataCacheSize settings in the same change — the cache path currently silently ignores these, and since we're already rewiring the open path, this is the natural place to fix it.

Do you think this is reasonable?

@hamersaw
Copy link
Copy Markdown
Collaborator

I'll also enhance openDataset() to honor per-query blockSize/indexCacheSize/metadataCacheSize settings in the same change — the cache path currently silently ignores these, and since we're already rewiring the open path, this is the natural place to fix it.

Do you think this is reasonable?

I'm not following exactly where these are going to be inserted. This gets a little bit tricky because these a global session configuration options and may not be reasonably applied to a single dataset? Maybe it's worth hacking together a proposal and we can iterate?

@LuciferYang LuciferYang changed the title perf: consolidate read caches and remove eager fragment pre-loading refactor: remove Java-side dataset cache, rely on Rust-side Session Apr 14, 2026
@LuciferYang
Copy link
Copy Markdown
Contributor Author

@hamersaw I have refactored the code, removed LanceDatasetCache, and updated the PR description. Can we do further iterations based on this version?

readOptions,
fragmentId,
dataset =
LanceRuntime.openDataset(
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me check if we can use the utility methods from Utils here.

inputPartition.getInitialStorageOptions(),
inputPartition.getNamespaceImpl(),
inputPartition.getNamespaceProperties());
dataset =
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hamersaw On your earlier concern about blockSize/indexCacheSize/metadataCacheSize: this pr doesn't add new API for them now, but switching to Utils.openDatasetBuilder does mean the fragment-scan path now passes these through (previously they were dropped by LanceDatasetCache). This matches what LanceCountStarPartitionReader and ~27 other call sites already do. WDYT?

The refactor to remove LanceDatasetCache dropped namespace reconstruction
on executors. Vended credentials (STS tokens) need the namespace client
to refresh during long-running scans.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Collaborator

@hamersaw hamersaw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I think this looks great! I added the namespace build back in so that we can ensure credential refresh on long-running operations.

@hamersaw hamersaw merged commit be64654 into lance-format:main Apr 14, 2026
16 checks passed
@LuciferYang
Copy link
Copy Markdown
Contributor Author

Thank you @hamersaw

fangbo pushed a commit that referenced this pull request May 7, 2026
…ead options not applied (#476)

## Summary

Fixes `org.apache.thrift.transport.TTransportException: GSS initiate
failed` thrown on Spark executors when reading Lance tables registered
in a Kerberized Hive Metastore (both plain `SELECT` and SQL DML).

This PR does two things:

1. **Stops executors from rebuilding the namespace client
unconditionally.** Adds a new read option `executor_credential_refresh`
(default `true`, preserving current behavior). When set to `false`,
executors skip the eager `namespace.describeTable()` RPC and open the
dataset directly by URI using the storage options the driver already
obtained.
2. **Makes catalog-level read options actually reach the typed fields.**
Catalog-level conf (`--conf
spark.sql.catalog.<name>.executor_credential_refresh=false`) is now
parsed in `withCatalogDefaults()`, so `spark.sql(...)` queries
(including `SELECT` and SQL DML) — which have no
`spark.read.option(...)` attach point — pick up the flag the same way as
DataFrameReader-based reads.

## Root Cause

Since #353 removed `LanceDatasetCache`, `LanceFragmentScanner.create()`
unconditionally rebuilds the `LanceNamespace` client on each Spark
executor and binds it back onto `LanceSparkReadOptions`. This forces the
dataset open through `Utils.OpenDatasetBuilder`'s `namespaceClient`
branch, which in turn calls
`OpenDatasetBuilder.buildFromNamespaceClient()` in the Lance Java SDK —
and that path issues an eager `namespace.describeTable()` RPC before
handing off to Rust.

For catalogs where the backing service authenticates per-call (HMS over
Kerberos, some REST catalogs), Spark executors typically do not have a
Kerberos TGT — the `--keytab` / `--principal` credentials only reach the
driver / ApplicationMaster, while executors run with Hadoop delegation
tokens that cannot be used for HMS Thrift SASL. The `describeTable` RPC
therefore fails with:

```
org.apache.thrift.transport.TTransportException: GSS initiate failed
  at org.lance.namespace.hive2.Hive2ClientPool.newClient(Hive2ClientPool.java:42)
  at org.lance.namespace.hive2.Hive2Namespace.describeTable(Hive2Namespace.java:285)
  at org.lance.OpenDatasetBuilder.buildFromNamespaceClient(OpenDatasetBuilder.java:205)
  at org.lance.OpenDatasetBuilder.build(OpenDatasetBuilder.java:191)
  at org.lance.spark.utils.Utils$OpenDatasetBuilder.build(Utils.java:140)
  at org.lance.spark.internal.LanceFragmentScanner.create(LanceFragmentScanner.java:67)
```

Driver-side operations (metadata-only queries, count-via-manifest)
succeed because the driver has the TGT. The failure only manifests
during fragment scans.

## Why the Existing Behavior Exists

Rebuilding the namespace client on the executor is not dead code — it
keeps the Rust `LanceNamespaceStorageOptionsProvider` attached so that
short-lived vended credentials (STS tokens for S3 / GCS / Azure)
returned by `describeTable()` can be refreshed when they expire
mid-scan. Simply removing the rebuild would break long-running scans
against object stores that use credential vending.

## Fix

### 1. Gate the executor-side rebuild behind a new option

Add a boolean read option `executor_credential_refresh`, defaulting to
`true`:

- `true` (default): unchanged — executor rebuilds the namespace client
and routes through the `namespaceClient` branch, preserving credential
refresh. Safe for all existing users.
- `false`: executor skips the rebuild, reads remain open via URI using
the `initialStorageOptions` the driver already obtained from
`describeTable()` at scan-plan time.

### 2. Make catalog-level conf actually reach the typed field

Before this PR, `Builder.withCatalogDefaults(catalogConfig)` only merged
the storage-options map and never parsed typed flags. As a result, the
catalog-conf syntax looked like it should work but silently ignored the
flag. This PR extracts a `parseTypedFlags(Map<String, String>)` helper
and calls it from both `fromOptions()` and `withCatalogDefaults()`, so
every recognized read option (not just `executor_credential_refresh`)
now flows from catalog conf into the typed field.

This is what makes the fix usable from SQL DML. Without the
`withCatalogDefaults` parse, a user running `DELETE FROM
kerberized_hms_lance_table WHERE id = 1` has no way to disable the
rebuild — SQL DML has no per-statement `.option(...)` attach point.

### Configuration surfaces after this PR

| Surface | Example | Works? |
|---|---|---|
| Per-read option | `spark.read.option("executor_credential_refresh",
"false").table(...)` (DataFrameReader) | Yes (already worked before this
PR; not available for `spark.sql("SELECT ...")`) |
| Catalog conf + plain SELECT | `--conf
spark.sql.catalog.lance.executor_credential_refresh=false` +
`spark.sql("SELECT * FROM lance.db.t")` | Yes (fixed by this PR) |
| Catalog conf + SQL DML | `--conf
spark.sql.catalog.lance.executor_credential_refresh=false` +
`spark.sql("DELETE FROM lance.db.t WHERE id=1")` | Yes (fixed by this
PR) |

Intended usage for HMS + Kerberos deployments:

```
spark-submit ... \
  --keytab /etc/keytabs/my.keytab \
  --principal my/principal@REALM \
  --conf spark.sql.catalog.lance.executor_credential_refresh=false
```

## Per-Namespace Trade-off Analysis

The refresh callback is meaningful only for namespaces that actually
return `storage_options` from `describeTable()`. Survey of the impls in
`lance-namespace-impls`:

| Namespace | `describeTable()` populates `storage_options`? | Cost of
`executor_credential_refresh=false` |

|----------------------------|------------------------------------------------|---------------------------------------------------------------------------------|
| `Hive2Namespace` | No — `setLocation` only | **None.** The refresh
callback is a no-op for HMS regardless of underlying storage. |
| `Hive3Namespace` | No — `setLocation` only | **None.** Same as Hive2.
|
| `GlueNamespace` | Static `config.getStorageOptions()` | Effectively
none for plain Glue. Use the default if you rely on LakeFormation vended
creds. |
| `IcebergNamespace` (REST) | Yes — vended creds typical | Long scans
against vended creds will fail when the credential expires. |
| `PolarisNamespace` | Yes — vended creds typical | Same as Iceberg
REST. |
| `UnityNamespace` | Yes — Databricks-vended temp creds | Same as
Iceberg REST. |

Concretely, for the HMS + S3 case: HMS does **not** vend S3 credentials
(`describeTable()` only sets `location`), so the executor's S3 access is
governed entirely by the AWS SDK credential chain (instance profile /
`hive-site.xml` / env vars / `~/.aws/credentials`) and the AWS SDK
handles all STS rotation independently. The Lance refresh callback would
have nothing to refresh, so disabling it costs nothing in practice.

## Scope of Change

- `LanceSparkReadOptions.java`:
- New constant `CONFIG_EXECUTOR_CREDENTIAL_REFRESH`, new field
`executorCredentialRefresh` (default `true`), builder / getter /
`withVersion` propagation / `equals` / `hashCode`, Javadoc covering
per-namespace trade-off.
- Extracted `Builder.parseTypedFlags(Map<String, String>)` helper from
the previously duplicated `fromOptions` body, now called from both
`fromOptions()` and `withCatalogDefaults()`. This incidentally also
fixes silent ignores of `push_down_filters`, `batch_size`,
`topN_push_down`, etc. when set at the catalog level — a pre-existing
latent issue uncovered while fixing the primary bug.
- `LanceFragmentScanner.java`: add `&&
readOptions.isExecutorCredentialRefresh()` to the existing rebuild `if`,
inline comment explaining the trade-off.
- `LanceSparkReadOptionsSerializationTest.java`: six new tests covering
default value, map parsing, serialization round-trip, `withVersion`
propagation, catalog-defaults path, and per-read override precedence.

No public API signature is changed; no existing behavior is altered for
users who do not set the new option.

## Test Plan

New unit tests (all 305 tests in `lance-spark-base_2.12` pass locally):

- `testExecutorCredentialRefreshDefaultsToTrue` — default value
preserved.
- `testExecutorCredentialRefreshParsedFromOptions` — flag honored from
both `"true"` and `"false"` map entries.
- `testExecutorCredentialRefreshSurvivesSerialization` — flag survives
Java serialization (critical: it must reach the executor).
- `testExecutorCredentialRefreshPreservedByWithVersion` — flag
propagated by `withVersion()` used during scan-plan version pinning.
- `testExecutorCredentialRefreshFromCatalogDefaults` — new; guards the
catalog-conf path used by SQL DML.
- `testPerReadOptionOverridesCatalogDefaults` — new; pins the precedence
rule "per-read `.option(...)` wins over catalog default".
- `LanceFragmentScannerTest.java`: one new end-to-end test that locks in
the executor-branch contract through `LanceFragmentScanner.create` (per
reviewer suggestion).


Integration test (out-of-band, on internal YARN + HMS + Kerberos + HDFS
cluster):

1. `spark-submit` with `--keytab` / `--principal`, Kerberized HMS,
`SELECT * FROM lance_hms_table` via `lance-namespace-hive2`.
2. **Before fix**: executor task fails with `GSS initiate failed` on
`describeTable`. Reproducible across multiple partitions / runs.
3. **After fix** with `--conf
spark.sql.catalog.lance.executor_credential_refresh=false`: scan
completes, returns expected row count and sample rows.
4. Default path (`true`) with the same fix jar: behavior unchanged.

## Backward Compatibility

Default is `true`, so every existing job behaves identically without
touching configs. Only users who explicitly set the new option to
`false` opt into the new path.

Co-authored-by: xiaguanglei <xiaguanglei@qiyi.com>
@LuciferYang
Copy link
Copy Markdown
Contributor Author

LuciferYang commented May 19, 2026

@hamersaw — following up on this. Removing the Java-side dataset cache to rely on Rust-side Session reuse exposed a thundering-herd issue in ObjectStoreRegistry::get_store: concurrent default-opens against the same URI under a shared Session all fall through to fresh ObjectStore builds because the registry's Weak<Arc<ObjectStore>> dedup is post-build.

Repro on a 34 GiB MinIO TPC-DS store_sales workload (Spark local-cluster[4,3,4096], 234 Lance fragments), three full-scan SQL queries via spark.read.format("lance"):

  • q1SELECT sum(ss_quantity) FROM store_sales (single int32 column, full scan)
  • q2SELECT sum(ss_item_sk), sum(ss_ext_sales_price), sum(ss_quantity) FROM store_sales (multi-column incl. decimal128, full scan)
  • q3SELECT sum(ss_net_profit) FROM store_sales WHERE ss_sold_date_sk BETWEEN <a> AND <b> (date-range predicate; Lance has no row-group stats so still scans)

Without the fix, every JMH iteration pays the full cold open cost (q1-lance 12.7 s ± 146 ms, CV ≈ 1% — i.e. all 10 iterations identical because each one rebuilds the credential chain + HTTP client).

Fix is in lance-format/lance#6839: per-CacheKey build-lock in get_store + a process-wide LazyLock<Arc<ObjectStoreRegistry>> for the JNI default-open path. After the fix:

query baseline fixed speedup
q1 sum(ss_quantity) 12.7 s 2.5 s 5.0x
q2 multi-col incl. decimal128 7.3 s 4.3 s 1.7x
q3 date-range filter 12.9 s 2.9 s 4.5x

Standalone concurrent_open_bench (no Spark, 144 concurrent opens under one Session) drops 9.50 s -> 0.18 s (53x).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

performance Features that improves performance

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants