Views, Spark: Add support for Materialized Views; Integrate with Spark SQL#9830
Views, Spark: Add support for Materialized Views; Integrate with Spark SQL#9830wmoustafa wants to merge 18 commits into
Conversation
|
|
||
| override protected def run(): Seq[InternalRow] = { | ||
| catalog.loadTable(ident) match { | ||
| catalog |
singhpk234
left a comment
There was a problem hiding this comment.
However, if the materialized view is stale, the method simply returns to allow SparkCatalog's loadView to run. In turn, loadView returns the metadata for the virtual view itself, triggering the usual Spark view logic that computes the result set based on the current state of the base tables.
1/ was wondering if auto-refresh of MV on staleness detection should be an opt-in feature ?
2/ Any ideas / plans for incremental refresh ?
These are very good questions. To me looks like if there is an external process that guarantees the freshness, then the current implementation still holds. Manual For (2): We have not discussed incremental refresh plans in the Iceberg community, but there is some relevant work here. You can review some of the test cases here. |
@wmoustafa, Read this today, was wondering if there is something we can utilize from CDC (considering iceberg has support for that) perspective ? how expensive the refreshes of a PB size tables are and what is the ideal frequency of updates in this model, if you can share some datapoints ? rewrite to get incremental refresh by computing deltas between the snapshots and then joining it with other deltas and having union of those does seems user-friendly though |
It really depends on the query and the size of the delta and whole table etc. There is an extension of that work that is currently taking place to get an idea about the cost of some basic queries (e.g., a few joins/aggregations + filters & projections), and coming up with a reasonable cost model (including choosing to not perform incremental at all if incremental is deemed more expensive). |
|
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions. |
|
This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time. |
|
Pushed updated changes to the upstream branch. They may not be reflected here since the PR was closed. |
|
@wmoustafa I reopened it. Can you rebase to resolve the conflicts? |
- Use static imports for assertj Assertions in TestRefreshStateParser - Rename parameter to avoid hidden field in BaseMetastoreViewCatalog
- Use ExtensionsTestBase instead of SparkExtensionsTestBase - Replace JUnit 4 annotations with JUnit 5 (TestTemplate, BeforeEach, AfterEach) - Use ParameterizedTestExtension and Parameters instead of Parameterized - Remove JUnit 4 constructor-based parameter injection
|
|
||
| View view = loadIcebergView(); | ||
| // storage-table should be set on the view version, not as a property | ||
| assertThat(view.currentVersion().storageTable()).isNotNull(); |
There was a problem hiding this comment.
executing the following statement
sql("SHOW TABLES")
returns:
default.tabledefault.materialized_view__storage
What is the added value of seeing default.materialized_view__storage in the table listing ?
For reference the Trino MV do not store the MV storage table in the metastore - see trinodb/trino#18853
There was a problem hiding this comment.
There was a problem hiding this comment.
This is a direction that was aligned by the community (to express MVs as separate view and table objects). This implementation should be compliant with the spec from that perspective.
Checkstyle requires assertThatThrownBy to include a .hasMessage() check. Applied to both v3.5 and v4.1 TestMaterializedViews.
Set up validationCatalog manually instead of calling super.before(), matching the v4.1 approach, to avoid IllegalArgumentException from the base class not recognizing InMemoryCatalogWithLocalFileIO.
| tableState.name()); | ||
| try { | ||
| org.apache.iceberg.Table sourceTable = | ||
| ((org.apache.iceberg.catalog.Catalog) icebergCatalog()).loadTable(sourceId); |
There was a problem hiding this comment.
remove (org.apache.iceberg.catalog.Catalog)
| * this identifies the storage table that holds the precomputed data. The storage table must be in | ||
| * the same catalog as the materialized view. | ||
| */ | ||
| default TableIdentifier storageTable() { |
There was a problem hiding this comment.
Instead of exposing the storage table in the metastore, this could be storageMetadataLocation
However, before refreshing the MV, the metadata location would not exist, so we'd need something to be able to distinguish whether we're dealing with a regular view or a materialized view.
There was a problem hiding this comment.
As discussed in the other comment, the spec treats the storage table as a first class object. I think we should keep the abstraction as TableIdentifier.
| // Write data to storage table with refresh-state in the snapshot summary | ||
| String storageTableRef = | ||
| String.format("%s.%s.%s", catalogName, NAMESPACE, storageTableId.name()); | ||
| try { | ||
| spark | ||
| .sql(String.format("SELECT id, data FROM %s.%s.%s", catalogName, NAMESPACE, tableName)) | ||
| .writeTo(storageTableRef) | ||
| .option("snapshot-property." + RefreshState.REFRESH_STATE_SUMMARY_KEY, refreshStateJson) | ||
| .append(); | ||
| } catch (NoSuchTableException e) { | ||
| throw new RuntimeException("Storage table not found during simulated refresh", e); | ||
| } |
There was a problem hiding this comment.
this will eventually be integrated in RESTCatalog logic right?
There was a problem hiding this comment.
Which aspect? Overall, I think yes, we should implement the MV spec in the REST catalog too.
This contribution serves as a proof of concept for integrating with the draft reference implementation for Iceberg materialized view specification done in the PR apache/iceberg#9830
This contribution serves as a proof of concept for integrating with the draft reference implementation for Iceberg materialized view specification done in the PR apache/iceberg#9830
This contribution serves as a proof of concept for integrating with the draft reference implementation for Iceberg materialized view specification done in the PR apache/iceberg#9830
This contribution serves as a proof of concept for integrating with the draft reference implementation for Iceberg materialized view specification done in the PR apache/iceberg#9830 (cherry picked from commit 16c41d94fdfbea03b9e066f90cee9ada13785f11)
| val storageTable = v.currentVersion().storageTable() | ||
| if (storageTable != null) { | ||
| val storageIdent = Identifier.of(storageTable.namespace().levels(), storageTable.name()) | ||
| sparkCatalog.dropTable(storageIdent) |
There was a problem hiding this comment.
What needs to be done in RESTCatalog to ensure that both the view and the storage table get dropped?
There was a problem hiding this comment.
Could you clarify if this is a feedback to this particular line or overall direction on supporting MVs in the REST catalog? I think we need to have a standalone discussion on the latter first.
|
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions. |
|
This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time. |
|
Just added the view refresh logic. |
| return new SparkView(catalogName, view); | ||
| // Check if the view is a materialized view. If it is, and storage table is fresh, throw | ||
| // IllegalStateException | ||
| if (isMaterializedView(view) && isFresh(view)) { |
There was a problem hiding this comment.
This doesn't seem right to me. Why would we throw if it's fresh? I feel like we should be returning a MV implementation with the associated table, not using exceptions as control flow.
There was a problem hiding this comment.
I believe this might be the only option for a DSv2-based implementation (I understand Spark 4.2 has a loadTableOrView method that may alleviate this, but for now this is the 4.1 behavior). For this particular reason, I have also put together an alternate routing method here wmoustafa#2 that does not use the exception as the control flow, by leveraging analysis rules instead of DSv2 routing. Let me know your thoughts on the alternate method.
There was a problem hiding this comment.
I agree with @danielcweeks too. I don't see why we need to do a freshness check here when the purpose of this method is to return the view definition if it exists.
| val table = icebergCatalog.loadTable(icebergId) | ||
| val snapshotId = | ||
| if (table.currentSnapshot() != null) { | ||
| table.currentSnapshot().snapshotId() |
There was a problem hiding this comment.
If the source table is modified after planning, could this return the wrong snapshot id?
| } else { | ||
| -1L | ||
| } | ||
| states += new SourceTableState( |
| return new SparkView(catalogName, view); | ||
| // Check if the view is a materialized view. If it is, and storage table is fresh, throw | ||
| // IllegalStateException | ||
| if (isMaterializedView(view) && isFresh(view)) { |
There was a problem hiding this comment.
I agree with @danielcweeks too. I don't see why we need to do a freshness check here when the purpose of this method is to return the view definition if it exists.
| return false; | ||
| } | ||
| } catch (Exception e) { | ||
| return false; |
There was a problem hiding this comment.
Should this be logged with stacktrace?
Summary
This PR adds support for materialized views in Iceberg and integrates the implementation with Spark SQL.
Spec
Full Materialized View Spec can be found in #11041. A materialized view is an Iceberg view whose current version has a
storage-tablefield: a struct withnamespaceandnameidentifying an Iceberg table that holds the precomputed results. The storage table is used to return the precomputed results of the view as long as the results are "fresh".Freshness is tracked through a
refresh-stateJSON string stored in the storage table's snapshot summary. The refresh state captures:A materialized view is considered fresh when the view version ID and all source snapshot/version IDs in the refresh state match their current values.
Core
New model classes:
ViewVersion.storageTable()— nullableTableIdentifieron the view version; non-null indicates a materialized viewRefreshState/RefreshStateParser— model and JSON serialization for refresh state stored in snapshot summariesSourceState/SourceTableState/SourceViewState— polymorphic source state model discriminated by atypefield (tableorview)Spark SQL
This PR adds support for
CREATE MATERIALIZED VIEWand extendsDROP VIEWto handle materialized views:CREATE MATERIALIZED VIEWcreates the storage table first, then registers the view metadata with astorage-tablereference on the view version. The storage table identifier can be specified via aSTORED AS '<identifier>'clause; otherwise a default<name>__storageidentifier is used.DROP VIEWon a materialized view removes both the view metadata and its associated storage table.REFRESH MATERIALIZED VIEWis left as a future enhancement.Spark Catalog
The
SparkCatalogdetermines whether to serve precomputed data from the storage table or fall back to the view's SQL query:loadTable()checks if the requested identifier corresponds to a fresh materialized view. If so, it returns aSparkMaterializedViewbacked by the storagetable, allowing queries to read the precomputed data directly.
loadView()checks if the materialized view is fresh. If fresh, it defers toloadTable(). If stale, it returns aSparkView, triggering the usual Spark view logic that re-executes the query against the current state of the source tables.Notes
InMemoryCataloghas been extended with a testLocalFileIOto support data file operations required by the storage table.