Skip to content

Views, Spark: Add support for Materialized Views; Integrate with Spark SQL#9830

Open
wmoustafa wants to merge 18 commits into
apache:mainfrom
wmoustafa:materialized-views
Open

Views, Spark: Add support for Materialized Views; Integrate with Spark SQL#9830
wmoustafa wants to merge 18 commits into
apache:mainfrom
wmoustafa:materialized-views

Conversation

@wmoustafa

@wmoustafa wmoustafa commented Feb 29, 2024

Copy link
Copy Markdown
Contributor

Summary

This PR adds support for materialized views in Iceberg and integrates the implementation with Spark SQL.

Spec

Full Materialized View Spec can be found in #11041. A materialized view is an Iceberg view whose current version has a storage-table field: a struct with namespace and name identifying an Iceberg table that holds the precomputed results. The storage table is used to return the precomputed results of the view as long as the results are "fresh".

Freshness is tracked through a refresh-state JSON string stored in the storage table's snapshot summary. The refresh state captures:

  • The view version ID at the time of refresh
  • The state of each source table or view (snapshot ID, version ID, UUID)
  • The refresh start timestamp

A materialized view is considered fresh when the view version ID and all source snapshot/version IDs in the refresh state match their current values.

Core

New model classes:

  • ViewVersion.storageTable() — nullable TableIdentifier on the view version; non-null indicates a materialized view
  • RefreshState / RefreshStateParser — model and JSON serialization for refresh state stored in snapshot summaries
  • SourceState / SourceTableState / SourceViewState — polymorphic source state model discriminated by a type field (table or view)

Spark SQL

This PR adds support for CREATE MATERIALIZED VIEW and extends DROP VIEW to handle materialized views:

  • CREATE MATERIALIZED VIEW creates the storage table first, then registers the view metadata with a storage-table reference on the view version. The storage table identifier can be specified via a STORED AS '<identifier>' clause; otherwise a default <name>__storage identifier is used.
  • DROP VIEW on a materialized view removes both the view metadata and its associated storage table.
  • REFRESH MATERIALIZED VIEW is left as a future enhancement.

Spark Catalog

The SparkCatalog determines whether to serve precomputed data from the storage table or fall back to the view's SQL query:

  • loadTable() checks if the requested identifier corresponds to a fresh materialized view. If so, it returns a SparkMaterializedView backed by the storage
    table, allowing queries to read the precomputed data directly.
  • loadView() checks if the materialized view is fresh. If fresh, it defers to loadTable(). If stale, it returns a SparkView, triggering the usual Spark view logic that re-executes the query against the current state of the source tables.

Notes

  • The InMemoryCatalog has been extended with a test LocalFileIO to support data file operations required by the storage table.

Comment thread core/src/main/java/org/apache/iceberg/view/ViewVersionReplace.java Outdated
Comment thread spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/SparkCatalog.java Outdated
Comment thread spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/MaterializedViewUtil.java Outdated

override protected def run(): Seq[InternalRow] = {
catalog.loadTable(ident) match {
catalog

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Redundant change

Comment thread spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/SparkCatalog.java Outdated
@wmoustafa wmoustafa mentioned this pull request Mar 28, 2024
6 tasks
Comment thread spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/SparkCatalog.java Outdated

@singhpk234 singhpk234 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, if the materialized view is stale, the method simply returns to allow SparkCatalog's loadView to run. In turn, loadView returns the metadata for the virtual view itself, triggering the usual Spark view logic that computes the result set based on the current state of the base tables.

1/ was wondering if auto-refresh of MV on staleness detection should be an opt-in feature ?
2/ Any ideas / plans for incremental refresh ?

@wmoustafa

Copy link
Copy Markdown
Contributor Author

However, if the materialized view is stale, the method simply returns to allow SparkCatalog's loadView to run. In turn, loadView returns the metadata for the virtual view itself, triggering the usual Spark view logic that computes the result set based on the current state of the base tables.

1/ was wondering if auto-refresh of MV on staleness detection should be an opt-in feature ? 2/ Any ideas / plans for incremental refresh ?

These are very good questions. To me looks like if there is an external process that guarantees the freshness, then the current implementation still holds. Manual REFRESH will boil down to no-op, and isFresh will always return true.

For (2): We have not discussed incremental refresh plans in the Iceberg community, but there is some relevant work here. You can review some of the test cases here.

@singhpk234

Copy link
Copy Markdown
Contributor

For (2): We have not discussed incremental refresh plans in the Iceberg community, but there is some relevant work here. You can review some of the test cases here.

@wmoustafa, Read this today, was wondering if there is something we can utilize from CDC (considering iceberg has support for that) perspective ? how expensive the refreshes of a PB size tables are and what is the ideal frequency of updates in this model, if you can share some datapoints ? rewrite to get incremental refresh by computing deltas between the snapshots and then joining it with other deltas and having union of those does seems user-friendly though

@wmoustafa

Copy link
Copy Markdown
Contributor Author

@wmoustafa, Read this today, was wondering if there is something we can utilize from CDC (considering iceberg has support for that) perspective ? how expensive the refreshes of a PB size tables are and what is the ideal frequency of updates in this model, if you can share some datapoints ? rewrite to get incremental refresh by computing deltas between the snapshots and then joining it with other deltas and having union of those does seems user-friendly though

It really depends on the query and the size of the delta and whole table etc. There is an extension of that work that is currently taking place to get an idea about the cost of some basic queries (e.g., a few joins/aggregations + filters & projections), and coming up with a reasonable cost model (including choosing to not perform incremental at all if incremental is deemed more expensive).

@github-actions

Copy link
Copy Markdown

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

@github-actions github-actions Bot added the stale label Oct 21, 2024
@github-actions

Copy link
Copy Markdown

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

@github-actions github-actions Bot closed this Oct 28, 2024
@wmoustafa

Copy link
Copy Markdown
Contributor Author

Pushed updated changes to the upstream branch. They may not be reflected here since the PR was closed.

@manuzhang manuzhang reopened this Mar 19, 2026
@github-actions github-actions Bot added API Specification Issues that may introduce spec changes. labels Mar 19, 2026
@manuzhang

Copy link
Copy Markdown
Member

@wmoustafa I reopened it. Can you rebase to resolve the conflicts?

@github-actions github-actions Bot removed the stale label Mar 20, 2026
- Use static imports for assertj Assertions in TestRefreshStateParser
- Rename parameter to avoid hidden field in BaseMetastoreViewCatalog
- Use ExtensionsTestBase instead of SparkExtensionsTestBase
- Replace JUnit 4 annotations with JUnit 5 (TestTemplate, BeforeEach, AfterEach)
- Use ParameterizedTestExtension and Parameters instead of Parameterized
- Remove JUnit 4 constructor-based parameter injection

View view = loadIcebergView();
// storage-table should be set on the view version, not as a property
assertThat(view.currentVersion().storageTable()).isNotNull();

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

executing the following statement

sql("SHOW TABLES")

returns:

  • default.table
  • default.materialized_view__storage

What is the added value of seeing default.materialized_view__storage in the table listing ?
For reference the Trino MV do not store the MV storage table in the metastore - see trinodb/trino#18853

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a direction that was aligned by the community (to express MVs as separate view and table objects). This implementation should be compliant with the spec from that perspective.

Checkstyle requires assertThatThrownBy to include a .hasMessage() check.
Applied to both v3.5 and v4.1 TestMaterializedViews.
Comment thread api/src/main/java/org/apache/iceberg/view/ViewVersion.java
Set up validationCatalog manually instead of calling super.before(),
matching the v4.1 approach, to avoid IllegalArgumentException from
the base class not recognizing InMemoryCatalogWithLocalFileIO.
tableState.name());
try {
org.apache.iceberg.Table sourceTable =
((org.apache.iceberg.catalog.Catalog) icebergCatalog()).loadTable(sourceId);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove (org.apache.iceberg.catalog.Catalog)

* this identifies the storage table that holds the precomputed data. The storage table must be in
* the same catalog as the materialized view.
*/
default TableIdentifier storageTable() {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of exposing the storage table in the metastore, this could be storageMetadataLocation
However, before refreshing the MV, the metadata location would not exist, so we'd need something to be able to distinguish whether we're dealing with a regular view or a materialized view.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed in the other comment, the spec treats the storage table as a first class object. I think we should keep the abstraction as TableIdentifier.

Comment thread api/src/main/java/org/apache/iceberg/view/ViewBuilder.java
Comment on lines +252 to +263
// Write data to storage table with refresh-state in the snapshot summary
String storageTableRef =
String.format("%s.%s.%s", catalogName, NAMESPACE, storageTableId.name());
try {
spark
.sql(String.format("SELECT id, data FROM %s.%s.%s", catalogName, NAMESPACE, tableName))
.writeTo(storageTableRef)
.option("snapshot-property." + RefreshState.REFRESH_STATE_SUMMARY_KEY, refreshStateJson)
.append();
} catch (NoSuchTableException e) {
throw new RuntimeException("Storage table not found during simulated refresh", e);
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this will eventually be integrated in RESTCatalog logic right?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which aspect? Overall, I think yes, we should implement the MV spec in the REST catalog too.

findinpath added a commit to findinpath/trino that referenced this pull request Mar 25, 2026
This contribution serves as a proof of concept for integrating
with the draft reference implementation for Iceberg
materialized view specification done in the PR
apache/iceberg#9830
findinpath added a commit to findinpath/trino that referenced this pull request Apr 15, 2026
This contribution serves as a proof of concept for integrating
with the draft reference implementation for Iceberg
materialized view specification done in the PR
apache/iceberg#9830
findinpath added a commit to findinpath/trino that referenced this pull request Apr 15, 2026
This contribution serves as a proof of concept for integrating
with the draft reference implementation for Iceberg
materialized view specification done in the PR
apache/iceberg#9830
findinpath added a commit to findinpath/trino that referenced this pull request Apr 15, 2026
This contribution serves as a proof of concept for integrating
with the draft reference implementation for Iceberg
materialized view specification done in the PR
apache/iceberg#9830

(cherry picked from commit 16c41d94fdfbea03b9e066f90cee9ada13785f11)
Comment on lines +56 to +59
val storageTable = v.currentVersion().storageTable()
if (storageTable != null) {
val storageIdent = Identifier.of(storageTable.namespace().levels(), storageTable.name())
sparkCatalog.dropTable(storageIdent)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What needs to be done in RESTCatalog to ensure that both the view and the storage table get dropped?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you clarify if this is a feedback to this particular line or overall direction on supporting MVs in the REST catalog? I think we need to have a standalone discussion on the latter first.

@github-actions

Copy link
Copy Markdown

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

@github-actions github-actions Bot added the stale label May 17, 2026
@github-actions

Copy link
Copy Markdown

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

@github-actions github-actions Bot closed this May 24, 2026
@stevenzwu stevenzwu reopened this Jun 11, 2026
@stevenzwu stevenzwu removed the stale label Jun 11, 2026
@wmoustafa

Copy link
Copy Markdown
Contributor Author

Just added the view refresh logic.

return new SparkView(catalogName, view);
// Check if the view is a materialized view. If it is, and storage table is fresh, throw
// IllegalStateException
if (isMaterializedView(view) && isFresh(view)) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't seem right to me. Why would we throw if it's fresh? I feel like we should be returning a MV implementation with the associated table, not using exceptions as control flow.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this might be the only option for a DSv2-based implementation (I understand Spark 4.2 has a loadTableOrView method that may alleviate this, but for now this is the 4.1 behavior). For this particular reason, I have also put together an alternate routing method here wmoustafa#2 that does not use the exception as the control flow, by leveraging analysis rules instead of DSv2 routing. Let me know your thoughts on the alternate method.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @danielcweeks too. I don't see why we need to do a freshness check here when the purpose of this method is to return the view definition if it exists.

val table = icebergCatalog.loadTable(icebergId)
val snapshotId =
if (table.currentSnapshot() != null) {
table.currentSnapshot().snapshotId()

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the source table is modified after planning, could this return the wrong snapshot id?

} else {
-1L
}
states += new SourceTableState(

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about the view logical nodes?

return new SparkView(catalogName, view);
// Check if the view is a materialized view. If it is, and storage table is fresh, throw
// IllegalStateException
if (isMaterializedView(view) && isFresh(view)) {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @danielcweeks too. I don't see why we need to do a freshness check here when the purpose of this method is to return the view definition if it exists.

return false;
}
} catch (Exception e) {
return false;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be logged with stacktrace?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

API core spark Specification Issues that may introduce spec changes.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants