Materialized View Spec #11041
62ba549 to b986cf0
bennychow
left a comment
Hi @JanKaul
I left some minor comments around wording. Otherwise, I believe your changes here capture everything we need for a minimum MV spec.
The mailing list did talk about including the partial identifiers for the source table and source view records to improve usability. While not absolutely necessary, I think it's a pretty good addition to include too.
https://lists.apache.org/thread/9lc3t4k0hw4d0hn07lgy9t2vgp2fm0om
Thanks
90f517e to 2790b0d
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.
This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.
2790b0d to af76391
szehon-ho
left a comment
Thanks, one more minor suggestion
af76391 to 521477f
When a materialized view is created, two entities are added to the catalog: a view and a storage table. From the engine's perspective, it is important to expose them as a single object during listing. Are there any rules for how catalog implementors should deal with these objects? E.g., shall we expose the view via
Yes, correct. The idea is to filter out the storage table for the
Consumers may use any combination of the following to assess the storage table:

- **Recency policy.** Accept the storage table when `refresh-start-timestamp-ms` falls within a staleness window. A recency policy bounds data age but does not establish freshness.
- **Recency policy.** Accept the storage table when `refresh-start-timestamp-ms` falls within a staleness window. A recency policy bounds data age but does not establish freshness.
- **Refresh Staleness.** Accept the storage table when `refresh-start-timestamp-ms` falls within a staleness window. Staleness bounds data age but does not establish freshness.
I like "Recency Policy". "Refresh Staleness" feels less clear to me.
Nit: "Recency" doesn't seem to be a common term used by databases (commonly used terms are staleness, freshness, and lag). Unless we are trying to make the point that this is somehow a distinct feature, I would prefer not to use an "uncommon" term here.
I agree with Igor that we should generally try to reuse existing terms.
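Whatever name the check ends up with, the rule under discussion can be sketched in a few lines. This is only an illustration: the function name, the `max_staleness_ms` parameter, and the injectable `now_ms` clock are assumptions and not part of the proposed spec; only `refresh-start-timestamp-ms` comes from the text above.

```python
import time
from typing import Optional


def within_staleness_window(refresh_start_timestamp_ms: int,
                            max_staleness_ms: int,
                            now_ms: Optional[int] = None) -> bool:
    """Accept the storage table when the last refresh started recently enough.

    This bounds the age of the data; it does not prove the data is fresh.
    Clock skew between producer and consumer is not handled in this sketch.
    """
    if now_ms is None:
        # Hypothetical default: use the consumer's wall clock in milliseconds.
        now_ms = int(time.time() * 1000)
    age_ms = now_ms - refresh_start_timestamp_ms
    return age_ms <= max_staleness_ms
```

A consumer with a five-minute tolerance would call `within_staleness_window(ts, 5 * 60 * 1000)` and fall back to another assessment when it returns `False`.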
- **Must** include all distinct source states for the inputs they chose to track.
- **May** leave `source-states` empty (e.g., when sources are non-Iceberg or freshness is determined by a mechanism outside this spec).

A snapshot whose refresh state violates a `Must` rule is invalid; consumers may treat it as if it had no `refresh-state`.

Consumers may use any combination of the following to assess the storage table:

- **Recency policy.** Accept the storage table when `refresh-start-timestamp-ms` falls within a staleness window. A recency policy bounds data age but does not establish freshness.
`F`, `G`, and `H` do not appear in `A`'s `source-states` directly; they belong to `C` and `D`'s dependency sets and are reached recursively through `C` and `D`'s refresh states.

A consumer establishes `A`'s freshness by checking each entry in `source-states` against the current catalog state. For `C` and `D`, the consumer compares the recorded storage-table snapshot to the current snapshot, then recurses into their `refresh-state` to verify each is itself fresh.
For C and D, it would be really nice if the consumer could know up front whether the table was a base table or storage table.
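To make the recursion concrete, here is a rough sketch of the consumer-side check. The `SourceState` and `TableInfo` types, the in-memory `catalog` dict, all snapshot ids, and the assignment of `F`/`G` to `C` and `H` to `D` are assumptions made for illustration; the spec text only says that `F`, `G`, and `H` are reached through `C` and `D`. In this sketch a storage table is recognized by having a recorded refresh state, which is one possible answer to the "know up front" concern. The dependency graph is assumed to be acyclic.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class SourceState:
    name: str         # identifier of the source table or storage table
    snapshot_id: int  # snapshot recorded at refresh time


@dataclass
class TableInfo:
    current_snapshot_id: int
    # Set only for storage tables (hypothetical convention); None for base tables.
    refresh_state: Optional[List[SourceState]] = None


def is_fresh(source_states: List[SourceState], catalog: Dict[str, TableInfo]) -> bool:
    """Check each recorded entry against the catalog, recursing into storage tables."""
    for state in source_states:
        table = catalog.get(state.name)
        if table is None or table.current_snapshot_id != state.snapshot_id:
            return False
        # A storage table must itself be fresh with respect to its own sources.
        if table.refresh_state is not None and not is_fresh(table.refresh_state, catalog):
            return False
    return True


# Assumed layout: C depends on F and G, D depends on H.
catalog = {
    "F": TableInfo(102),
    "G": TableInfo(103),
    "H": TableInfo(104),
    "C": TableInfo(201, [SourceState("F", 102), SourceState("G", 103)]),
    "D": TableInfo(202, [SourceState("H", 104)]),
}
a_source_states = [SourceState("C", 201), SourceState("D", 202)]
```

With this data, `is_fresh(a_source_states, catalog)` passes; advancing `H` to a new snapshot makes `D`, and therefore `A`, fail the check even though `D`'s own snapshot id is unchanged.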
│   ├── E [TABLE] <-- recorded in A: snapshot-id: 101
│   └── D [MV — expanded as view] <-- recorded in A: version-id: 9
│       └── H [TABLE] <-- recorded in A: snapshot-id: 104
└── C [MV — expanded as view] <-- recorded in A: version-id: 7
This example would be more interesting if C's storage table was stale. Otherwise, the refresh state here is the same as the refresh state from Strategy 1.
I think the main point here is the semantics of recursively expanding the dependencies just as if it were a view. The whole point is that the MV is treated like a view, so it is naturally similar to Strategy 1. I don't think that's an issue. The difference is that Strategy 1 covers common views, while here it is materialized views.
Yes, the point is to illustrate the case when the MVs are treated as views, and require nested expansion, similar to Strategy 1 with virtual views.
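The nested expansion could be sketched roughly as follows. The `Node` type, the `collect_source_states` helper, and the flattened graph are hypothetical: the ids for `E`, `D`, `H`, and `C` are taken from the example tree, the parent of `E` and `D` is elided in the excerpt and therefore left out, and `C`'s child `F` with snapshot 102 is invented purely so that `C` has something to expand into.

```python
from typing import Dict, List, NamedTuple


class Node(NamedTuple):
    kind: str            # "table", or "mv" when the MV is expanded as a view
    state_id: int        # snapshot-id for tables, version-id for expanded MVs
    children: List[str]  # names referenced by the node's query


def collect_source_states(roots: List[str], graph: Dict[str, Node]) -> Dict[str, dict]:
    """Walk the dependencies, recording a version-id for each expanded MV
    and a snapshot-id for each table reached through it."""
    recorded: Dict[str, dict] = {}
    stack = list(roots)
    while stack:
        name = stack.pop()
        if name in recorded:
            continue  # already recorded; shared dependencies are visited once
        node = graph[name]
        if node.kind == "table":
            recorded[name] = {"snapshot-id": node.state_id}
        else:
            recorded[name] = {"version-id": node.state_id}
            stack.extend(node.children)  # expand the MV like a view
    return recorded


graph = {
    "E": Node("table", 101, []),
    "D": Node("mv", 9, ["H"]),
    "H": Node("table", 104, []),
    "C": Node("mv", 7, ["F"]),    # F is a hypothetical child
    "F": Node("table", 102, []),  # hypothetical snapshot id
}
states = collect_source_states(["E", "D", "C"], graph)
```

Here `H` ends up recorded in `A` directly, which is the difference from the storage-table strategy, where `H` would only be reachable through `D`'s own refresh state.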
- Define freshness as a universal property: storage table equals the result of the current view query (at the MV's current view-version-id) over the current state of dependencies.
- Define dependencies objectively by parsing the SQL: base tables, views (transitively expanded), and intermediate materialized views (treated as their storage tables with recursive freshness via their own refresh-state). Drop the producer-relative dependency framing.
- Reorganize the freshness section into producer flexibility and consumer options. Producer chooses what to record; consumer chooses what to verify. Producer-first ordering matches the chronological flow.
- Tighten the schema: intermediate materialized views are recorded as a single table entry referencing the storage table; recording as a view entry is not permitted.
- Crisp read-decision rule: if a consumer's assessment passes, it reads from the storage table; otherwise it evaluates the view query in place of the storage table.
- Replace the "consumer/producer behavior" prose under Freshness with Producer flexibility and Consumer options subsections that match the three reference verification paths (recency policy, trust, verify).
- Add Appendix B "What counts as a dependency" with the SQL-derived dependency rules and a worked example showing the recursive boundary for intermediate materialized views.
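The read-decision rule in this commit message reduces to a small dispatch. The sketch below is an assumption about how a consumer might wire it up: `choose_read_path`, the string return values, and the two stand-in checks are all hypothetical names, and "any combination" is interpreted here as "every check the consumer selected must pass".

```python
from typing import Callable, Sequence


def choose_read_path(checks: Sequence[Callable[[], bool]]) -> str:
    """Return "storage-table" if every selected check passes, else "view-query"."""
    # Note: an empty sequence passes vacuously, so callers decide whether
    # running no checks at all is acceptable for them.
    if all(check() for check in checks):
        return "storage-table"
    return "view-query"


def recent_enough() -> bool:
    # Stand-in for a recency check on the last refresh timestamp.
    return True


def sources_unchanged() -> bool:
    # Stand-in for verifying the recorded source states against the catalog.
    return False
```

For example, `choose_read_path([recent_enough])` selects the storage table, while `choose_read_path([recent_enough, sources_unchanged])` falls back to evaluating the view query.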
Co-authored-by: Daniel Weeks <daniel.weeks@databricks.com>
0a8957e to 00fb392
* **Storage table** -- Iceberg table that stores the precomputed data of a materialized view.
* **Refresh state** -- A record stored in the storage table's snapshot summary that captures the state of source tables and views at the time of the last refresh operation.
* **Dependency graph** -- The graph of all source tables, views, and materialized views that a materialized view depends on, including nested dependencies.
* **Source table** -- A table reference that is used in the computation of the query results of a materialized view.
* **Source view** -- A view reference that is used in the computation of the query results of a materialized view.
* **Source materialized view** -- A materialized view reference that is used in the computation of the query results of a materialized view.
I am not sure we really need that section. It might bring more questions than answers since all of that is discussed in detail below. Some of the terms are also standard terms (e.g., Schema, Version) that are not necessarily scoped to MVs. My recommendation is to remove this section.
This PR implements the Iceberg Materialized View Proposal #10043 by adding a section for Materialized Views to the View spec. It follows the design of the proposal document.
The idea is to resolve any remaining questions before starting the voting process on the dev list.