Views, Spark: Add support for Materialized Views; Integrate with Spark SQL by wmoustafa · Pull Request #9830 · apache/iceberg

wmoustafa · 2024-02-29T03:06:36Z

Summary

This PR adds support for materialized views in Iceberg and integrates the implementation with Spark SQL.

Spec

The full materialized view spec can be found in #11041. A materialized view is an Iceberg view whose current version has a storage-table field: a struct with namespace and name identifying an Iceberg table that holds the precomputed results. Queries are served from the storage table as long as its results are "fresh".

Freshness is tracked through a refresh-state JSON string stored in the storage table's snapshot summary. The refresh state captures:

The view version ID at the time of refresh
The state of each source table or view (snapshot ID, version ID, UUID)
The refresh start timestamp

A materialized view is considered fresh when the view version ID and all source snapshot/version IDs in the refresh state match their current values.
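The freshness rule above can be sketched as a plain comparison. This is only an illustration under stated assumptions: RefreshStateSketch and FreshnessSketch are hypothetical stand-ins, not the PR's RefreshState class, and each source is reduced to a single ID (snapshot ID for tables, version ID for views) keyed by its UUID.

```java
import java.util.Map;

/** Hypothetical, simplified stand-in for the refresh state; not the PR's RefreshState class. */
final class RefreshStateSketch {
  final int viewVersionId;
  // Source UUID -> snapshot ID (tables) or version ID (views) recorded at refresh time.
  final Map<String, Long> sourceIds;

  RefreshStateSketch(int viewVersionId, Map<String, Long> sourceIds) {
    this.viewVersionId = viewVersionId;
    this.sourceIds = Map.copyOf(sourceIds);
  }
}

final class FreshnessSketch {
  private FreshnessSketch() {}

  /** Fresh only if the view version and every recorded source ID match the current values. */
  static boolean isFresh(
      RefreshStateSketch recorded, int currentViewVersionId, Map<String, Long> currentSourceIds) {
    if (recorded == null || recorded.viewVersionId != currentViewVersionId) {
      return false;
    }
    for (Map.Entry<String, Long> entry : recorded.sourceIds.entrySet()) {
      Long current = currentSourceIds.get(entry.getKey());
      // A missing source (null) or a changed ID both count as stale.
      if (!entry.getValue().equals(current)) {
        return false;
      }
    }
    return true;
  }
}
```

A missing refresh state is treated as stale here; that is a choice made for the sketch, not something the description above specifies.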

Core

New API and model classes:

ViewVersion.storageTable() — nullable TableIdentifier on the view version; non-null indicates a materialized view
RefreshState / RefreshStateParser — model and JSON serialization for refresh state stored in snapshot summaries
SourceState / SourceTableState / SourceViewState — polymorphic source state model discriminated by a type field (table or view)
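To make the type-discriminated model concrete, here is a minimal sketch of dispatching on a type field. The class names and the field keys ("uuid", "snapshot-id", "version-id") are assumptions made for illustration; the PR's SourceState classes and RefreshStateParser may use different names and a real JSON parser.

```java
import java.util.Map;

/** Hypothetical sketch of a source state discriminated by a "type" field. */
interface SourceStateSketch {
  String type();

  String uuid();
}

final class TableStateSketch implements SourceStateSketch {
  final String uuid;
  final long snapshotId;

  TableStateSketch(String uuid, long snapshotId) {
    this.uuid = uuid;
    this.snapshotId = snapshotId;
  }

  public String type() { return "table"; }

  public String uuid() { return uuid; }
}

final class ViewStateSketch implements SourceStateSketch {
  final String uuid;
  final int versionId;

  ViewStateSketch(String uuid, int versionId) {
    this.uuid = uuid;
    this.versionId = versionId;
  }

  public String type() { return "view"; }

  public String uuid() { return uuid; }
}

final class SourceStatesSketch {
  private SourceStatesSketch() {}

  /** Builds the matching subtype from already-parsed fields; key names are assumed. */
  static SourceStateSketch fromFields(Map<String, Object> fields) {
    String type = (String) fields.get("type");
    String uuid = (String) fields.get("uuid");
    if ("table".equals(type)) {
      return new TableStateSketch(uuid, ((Number) fields.get("snapshot-id")).longValue());
    } else if ("view".equals(type)) {
      return new ViewStateSketch(uuid, ((Number) fields.get("version-id")).intValue());
    }
    throw new IllegalArgumentException("Unknown source state type: " + type);
  }
}
```

Rejecting unknown type values keeps a reader from silently treating an unrecognized source as fresh.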

Spark SQL

This PR adds support for CREATE MATERIALIZED VIEW and extends DROP VIEW to handle materialized views:

CREATE MATERIALIZED VIEW creates the storage table first, then registers the view metadata with a storage-table reference on the view version. The storage table identifier can be specified via a STORED AS '<identifier>' clause; otherwise a default <name>__storage identifier is used.
DROP VIEW on a materialized view removes both the view metadata and its associated storage table.
REFRESH MATERIALIZED VIEW is left as a future enhancement.
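The storage table naming rule described above can be sketched as follows. StorageIdentifierSketch and resolve are hypothetical names; the only behavior taken from the description is that an explicit STORED AS identifier wins and that the default is the view name followed by __storage.

```java
/** Hypothetical sketch of how the storage table name could be chosen. */
final class StorageIdentifierSketch {
  static final String DEFAULT_SUFFIX = "__storage";

  private StorageIdentifierSketch() {}

  /**
   * Returns the identifier from the STORED AS clause if one was given,
   * otherwise the view name with the default suffix appended.
   */
  static String resolve(String viewName, String storedAs) {
    if (storedAs != null && !storedAs.isEmpty()) {
      return storedAs;
    }
    return viewName + DEFAULT_SUFFIX;
  }
}
```

For a view named materialized_view with no STORED AS clause, this yields materialized_view__storage.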

Spark Catalog

The SparkCatalog determines whether to serve precomputed data from the storage table or fall back to the view's SQL query:

loadTable() checks if the requested identifier corresponds to a fresh materialized view. If so, it returns a SparkMaterializedView backed by the storage table, allowing queries to read the precomputed data directly.
loadView() checks if the materialized view is fresh. If fresh, it defers to loadTable(). If stale, it returns a SparkView, triggering the usual Spark view logic that re-executes the query against the current state of the source tables.
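The routing described above reduces to a two-input decision. The sketch below is an assumption-laden illustration, not the SparkCatalog code: ReadPathSketch and RoutingSketch are hypothetical names, and the two booleans stand in for the storage-table check and the freshness check.

```java
/** Hypothetical outcome of the catalog's routing decision. */
enum ReadPathSketch {
  STORAGE_TABLE, // read precomputed results from the storage table
  VIEW_QUERY // re-execute the view's SQL against the source tables
}

final class RoutingSketch {
  private RoutingSketch() {}

  /** Serve from storage only for a materialized view whose refresh state is fresh. */
  static ReadPathSketch choose(boolean hasStorageTable, boolean fresh) {
    return hasStorageTable && fresh ? ReadPathSketch.STORAGE_TABLE : ReadPathSketch.VIEW_QUERY;
  }
}
```

A regular view (no storage table) and a stale materialized view both take the VIEW_QUERY path.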

Notes

The InMemoryCatalog has been extended with a test LocalFileIO to support data file operations required by the storage table.

manuzhang · 2024-03-14T03:18:27Z


  override protected def run(): Seq[InternalRow] = {
-    catalog.loadTable(ident) match {
+    catalog


Redundant change

singhpk234

However, if the materialized view is stale, the method simply returns to allow SparkCatalog's loadView to run. In turn, loadView returns the metadata for the virtual view itself, triggering the usual Spark view logic that computes the result set based on the current state of the base tables.

1/ I was wondering whether auto-refresh of the MV on staleness detection should be an opt-in feature?
2/ Any ideas or plans for incremental refresh?

wmoustafa · 2024-04-20T22:01:16Z

These are very good questions. To me, it looks like if an external process guarantees freshness, then the current implementation still holds: a manual REFRESH boils down to a no-op, and isFresh always returns true.

For (2): We have not discussed incremental refresh plans in the Iceberg community, but there is some relevant work here. You can review some of the test cases here.

singhpk234 · 2024-05-02T21:27:47Z

@wmoustafa, I read this today and was wondering whether there is something we can utilize from a CDC perspective (considering Iceberg has support for that). How expensive are refreshes of PB-sized tables, and what is the ideal frequency of updates in this model? Could you share some data points? The rewrite that gets incremental refresh by computing deltas between the snapshots, joining them with the other deltas, and taking the union of those does seem user-friendly, though.

wmoustafa · 2024-05-22T23:35:27Z

It really depends on the query, the size of the delta, the size of the whole table, and so on. An extension of that work is under way to estimate the cost of some basic queries (e.g., a few joins/aggregations plus filters and projections) and to come up with a reasonable cost model, including choosing not to refresh incrementally at all when incremental refresh is deemed more expensive.

github-actions · 2024-10-21T00:16:14Z

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

github-actions · 2024-10-28T00:16:20Z

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

wmoustafa · 2026-03-19T06:37:32Z

Pushed updated changes to the upstream branch. They may not be reflected here since the PR was closed.

manuzhang · 2026-03-19T08:15:02Z

@wmoustafa I reopened it. Can you rebase to resolve the conflicts?

- Use ExtensionsTestBase instead of SparkExtensionsTestBase
- Replace JUnit 4 annotations with JUnit 5 (TestTemplate, BeforeEach, AfterEach)
- Use ParameterizedTestExtension and Parameters instead of Parameterized
- Remove JUnit 4 constructor-based parameter injection

findinpath · 2026-03-25T05:33:32Z

+
+    View view = loadIcebergView();
+    // storage-table should be set on the view version, not as a property
+    assertThat(view.currentVersion().storageTable()).isNotNull();


executing the following statement

sql("SHOW TABLES")

returns:

default.table

default.materialized_view__storage

What is the added value of seeing default.materialized_view__storage in the table listing?
For reference, Trino MVs do not store the MV storage table in the metastore; see trinodb/trino#18853

Relevant logic in TrinoHiveCatalog

https://github.com/trinodb/trino/blob/0e9abf3d052e9ef913988a99069d4fe66fd4f676/plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/catalog/hms/TrinoHiveCatalog.java#L771-L792

This is the direction the community aligned on (expressing MVs as separate view and table objects). This implementation should be compliant with the spec from that perspective.

findinpath · 2026-03-25T14:50:43Z

+                tableState.name());
+        try {
+          org.apache.iceberg.Table sourceTable =
+              ((org.apache.iceberg.catalog.Catalog) icebergCatalog()).loadTable(sourceId);


remove (org.apache.iceberg.catalog.Catalog)

findinpath · 2026-03-25T15:01:29Z

+   * this identifies the storage table that holds the precomputed data. The storage table must be in
+   * the same catalog as the materialized view.
+   */
+  default TableIdentifier storageTable() {


Instead of exposing the storage table in the metastore, this could be storageMetadataLocation.
However, before the MV is refreshed, the metadata location would not exist, so we'd need something to distinguish whether we're dealing with a regular view or a materialized view.

As discussed in the other comment, the spec treats the storage table as a first class object. I think we should keep the abstraction as TableIdentifier.

findinpath · 2026-03-25T22:19:45Z

+    // Write data to storage table with refresh-state in the snapshot summary
+    String storageTableRef =
+        String.format("%s.%s.%s", catalogName, NAMESPACE, storageTableId.name());
+    try {
+      spark
+          .sql(String.format("SELECT id, data FROM %s.%s.%s", catalogName, NAMESPACE, tableName))
+          .writeTo(storageTableRef)
+          .option("snapshot-property." + RefreshState.REFRESH_STATE_SUMMARY_KEY, refreshStateJson)
+          .append();
+    } catch (NoSuchTableException e) {
+      throw new RuntimeException("Storage table not found during simulated refresh", e);
+    }


This will eventually be integrated into the RESTCatalog logic, right?

Which aspect? Overall, I think yes, we should implement the MV spec in the REST catalog too.

This contribution serves as a proof of concept for integrating with the draft reference implementation for Iceberg materialized view specification done in the PR apache/iceberg#9830 (cherry picked from commit 16c41d94fdfbea03b9e066f90cee9ada13785f11)

findinpath · 2026-04-16T10:00:41Z

+          val storageTable = v.currentVersion().storageTable()
+          if (storageTable != null) {
+            val storageIdent = Identifier.of(storageTable.namespace().levels(), storageTable.name())
+            sparkCatalog.dropTable(storageIdent)


What needs to be done in RESTCatalog to ensure that both the view and the storage table get dropped?

Could you clarify whether this is feedback on this particular line or on the overall direction of supporting MVs in the REST catalog? I think we need to have a standalone discussion on the latter first.

github-actions · 2026-05-17T00:41:46Z

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

github-actions · 2026-05-24T00:43:58Z

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

wmoustafa · 2026-06-23T18:11:37Z

Just added the view refresh logic.

danielcweeks · 2026-06-24T21:18:21Z

-        return new SparkView(catalogName, view);
+        // Check if the view is a materialized view. If it is, and storage table is fresh, throw
+        // IllegalStateException
+        if (isMaterializedView(view) && isFresh(view)) {


This doesn't seem right to me. Why would we throw if it's fresh? I feel like we should be returning a MV implementation with the associated table, not using exceptions as control flow.

I believe this might be the only option for a DSv2-based implementation (I understand Spark 4.2 has a loadTableOrView method that may alleviate this, but for now this is the 4.1 behavior). For this reason, I have also put together an alternate routing method in wmoustafa#2 that does not use exceptions as control flow; it leverages analysis rules instead of DSv2 routing. Let me know your thoughts on the alternate method.

bennychow · 2026-06-25T03:36:08Z

+            val table = icebergCatalog.loadTable(icebergId)
+            val snapshotId =
+              if (table.currentSnapshot() != null) {
+                table.currentSnapshot().snapshotId()


If the source table is modified after planning, could this return the wrong snapshot id?

bennychow · 2026-06-25T03:39:29Z

+              } else {
+                -1L
+              }
+            states += new SourceTableState(


What about the view logical nodes?

bennychow · 2026-06-25T03:45:15Z

-        return new SparkView(catalogName, view);
+        // Check if the view is a materialized view. If it is, and storage table is fresh, throw
+        // IllegalStateException
+        if (isMaterializedView(view) && isFresh(view)) {


I agree with @danielcweeks too. I don't see why we need to do a freshness check here when the purpose of this method is to return the view definition if it exists.

bennychow · 2026-06-25T03:49:05Z

+            return false;
+          }
+        } catch (Exception e) {
+          return false;


Should this be logged with stacktrace?

github-actions Bot added the spark and core labels Feb 29, 2024

nastra self-requested a review February 29, 2024 08:14

rdblue reviewed Mar 11, 2024

Comment thread core/src/main/java/org/apache/iceberg/view/ViewVersionReplace.java Outdated

bennychow reviewed Mar 13, 2024

Comment thread spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/SparkCatalog.java Outdated

wmoustafa mentioned this pull request Mar 13, 2024

[Proposal] Iceberg Materialized View Spec #6420

Closed

manuzhang reviewed Mar 14, 2024

Comment thread spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/MaterializedViewUtil.java Outdated

manuzhang reviewed Mar 14, 2024

bennychow reviewed Mar 25, 2024

Comment thread spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/SparkCatalog.java Outdated

wmoustafa mentioned this pull request Mar 28, 2024

Iceberg Materialized Views #10043

Open

6 tasks

singhpk234 reviewed Apr 20, 2024

Comment thread spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/SparkCatalog.java Outdated

singhpk234 reviewed Apr 20, 2024

wmoustafa mentioned this pull request May 8, 2024

[Spec] Add Iceberg Materialized View Spec #10280

Closed

github-actions Bot added the stale label Oct 21, 2024

github-actions Bot closed this Oct 28, 2024

manuzhang reopened this Mar 19, 2026

github-actions Bot added the API and Specification labels Mar 19, 2026

github-actions Bot removed the stale label Mar 20, 2026

wmoustafa and others added 4 commits March 24, 2026 01:26

[Views] Implement Materialized Views; Integrate with Spark SQL

6a72ddc

Represent the storage table using its catalog identifier

72aa497

Add support for replacing view version

ad5f435

Update MV implementation to use new spec elements

59740f1

wmoustafa added 3 commits March 24, 2026 17:15

Fix spotless Scala formatting in v3.5 RewriteViewCommands and parser

26d1c01

Fix checkstyle violations in core module

191001e

- Use static imports for assertj Assertions in TestRefreshStateParser
- Rename parameter to avoid hidden field in BaseMetastoreViewCatalog

findinpath reviewed Mar 25, 2026

Add message checks to assertThatThrownBy in MV tests

dd3764a

Checkstyle requires assertThatThrownBy to include a .hasMessage() check. Applied to both v3.5 and v4.1 TestMaterializedViews.

findinpath reviewed Mar 25, 2026

Comment thread api/src/main/java/org/apache/iceberg/view/ViewVersion.java

Fix v3.5 TestMaterializedViews to skip configureValidationCatalog

e21bace

Set up validationCatalog manually instead of calling super.before(), matching the v4.1 approach, to avoid IllegalArgumentException from the base class not recognizing InMemoryCatalogWithLocalFileIO.

findinpath reviewed Mar 25, 2026

Comment thread api/src/main/java/org/apache/iceberg/view/ViewBuilder.java

findinpath reviewed Mar 25, 2026

findinpath mentioned this pull request Mar 25, 2026

PoC: Add support for materialized views in Iceberg REST Catalog trinodb/trino#28866

Closed

Propagate storage table identifier in REST CatalogHandlers.createView

2a1faa1

findinpath reviewed Apr 16, 2026

github-actions Bot added the stale label May 17, 2026

github-actions Bot closed this May 24, 2026

stevenzwu reopened this Jun 11, 2026

stevenzwu removed the stale label Jun 11, 2026

Spark: Add REFRESH MATERIALIZED VIEW support in v3.5 and v4.1 extensions

c63cbc1

Address review comments

e0a1b8b

wmoustafa mentioned this pull request Jun 23, 2026

Spark: Route materialized view reads at the analyzer level wmoustafa/iceberg-1#2

Open

danielcweeks reviewed Jun 24, 2026

bennychow reviewed Jun 25, 2026
