Materialized View Spec#11041

Open
JanKaul wants to merge 82 commits into
apache:mainfrom
JanKaul:materialized-view-spec

Conversation

@JanKaul

@JanKaul JanKaul commented Aug 29, 2024

This PR implements the Iceberg Materialized View Proposal #10043 by adding a section for Materialized Views to the View spec. It follows the design of the proposal document.

The idea is to resolve any remaining questions before starting the voting process on the dev list.

@JanKaul

JanKaul commented Aug 29, 2024

Author

@github-actions github-actions Bot added the Specification Issues that may introduce spec changes. label Aug 29, 2024
@JanKaul JanKaul force-pushed the materialized-view-spec branch from 62ba549 to b986cf0 Compare September 19, 2024 09:08

@bennychow bennychow left a comment

Hi @JanKaul

I left some minor comments around wording. Otherwise, I believe your changes here capture everything we need for a minimum MV spec.

The mailing list did discuss including the partial identifiers for the source table and source view records to improve usability. While not strictly necessary, I think it's a good addition to include as well.

https://lists.apache.org/thread/9lc3t4k0hw4d0hn07lgy9t2vgp2fm0om

Thanks

@JanKaul JanKaul force-pushed the materialized-view-spec branch from 90f517e to 2790b0d Compare October 4, 2024 14:13
@JanKaul JanKaul mentioned this pull request Nov 8, 2024
6 tasks
@github-actions

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

@github-actions github-actions Bot added the stale label Nov 12, 2024
@github-actions

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

@github-actions github-actions Bot closed this Nov 19, 2024

@szehon-ho szehon-ho left a comment

Member

I think this is the right direction, thanks @JanKaul. I think we should reopen. Also left some suggestions.

@szehon-ho szehon-ho reopened this Nov 21, 2024
@JanKaul JanKaul force-pushed the materialized-view-spec branch from 2790b0d to af76391 Compare November 21, 2024 10:53

@szehon-ho szehon-ho left a comment

Member

Thanks, one more minor suggestion

@github-actions github-actions Bot removed the stale label Nov 22, 2024
@JanKaul JanKaul force-pushed the materialized-view-spec branch from af76391 to 521477f Compare November 25, 2024 07:33
@devozerov

Member

When a materialized view is created, two entities are added to the catalog: a view and a storage table. From the engine's perspective, it is important to expose them as a single object during listing. Are there any rules for how catalog implementors should deal with these objects? For example, should we expose the view via listViews but filter out the storage table for listTables?

@JanKaul

JanKaul commented Nov 25, 2024

Author

Yes, correct: the idea is to filter out the storage table for the listTables operation. However, I would regard this as outside the scope of the table/view spec; it should rather be included in the REST catalog specification.
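
As an illustration only, the listing behavior discussed above might be sketched as follows. This is a toy in-memory model, not the Iceberg or REST catalog API: `InMemoryCatalog`, `create_materialized_view`, `list_tables`, `list_views`, and the example identifiers are all hypothetical names chosen for this sketch.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class InMemoryCatalog:
    """Toy catalog; every name here is hypothetical, not an Iceberg API."""

    tables: set[str] = field(default_factory=set)
    views: set[str] = field(default_factory=set)
    # Maps a materialized view name to the name of its storage table.
    storage_tables: dict[str, str] = field(default_factory=dict)

    def create_materialized_view(self, view: str, storage_table: str) -> None:
        # Creating an MV registers two entities: the view and its storage table.
        self.views.add(view)
        self.tables.add(storage_table)
        self.storage_tables[view] = storage_table

    def list_views(self) -> list[str]:
        return sorted(self.views)

    def list_tables(self) -> list[str]:
        # Hide storage tables so each MV surfaces as a single object.
        hidden = set(self.storage_tables.values())
        return sorted(self.tables - hidden)


catalog = InMemoryCatalog(tables={"db.orders"})
catalog.create_materialized_view("db.orders_mv", "db.orders_mv_storage")
print(catalog.list_tables())  # ['db.orders']
print(catalog.list_views())   # ['db.orders_mv']
```

Whether the filtering belongs in the catalog or in the engine is exactly the open question in this thread; the sketch only shows the catalog-side variant.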

Comment thread format/view-spec.md

Consumers may use any combination of the following to assess the storage table:

- **Recency policy.** Accept the storage table when `refresh-start-timestamp-ms` falls within a staleness window. A recency policy bounds data age but does not establish freshness.
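
For illustration, the staleness-window check quoted above could be sketched like this. Only `refresh-start-timestamp-ms` comes from the quoted spec text; the function name, the `max_staleness_ms` parameter, and the use of the local wall clock as the reference time are assumptions of this sketch.

```python
import time
from typing import Optional


def within_staleness_window(
    refresh_start_timestamp_ms: int,
    max_staleness_ms: int,
    now_ms: Optional[int] = None,
) -> bool:
    """Return True when the last refresh started within the allowed window.

    This bounds the age of the data only; it says nothing about whether
    the sources have changed since the refresh.
    """
    if now_ms is None:
        # Assumption: the consumer's wall clock is the reference time.
        now_ms = int(time.time() * 1000)
    age_ms = now_ms - refresh_start_timestamp_ms
    return age_ms <= max_staleness_ms


# A refresh that started 60 s before "now", checked against two windows.
print(within_staleness_window(940_000, 60_000, now_ms=1_000_000))  # True
print(within_staleness_window(940_000, 30_000, now_ms=1_000_000))  # False
```

A consumer that gets `False` here would fall back to another assessment, or evaluate the view query instead of reading the storage table.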

Contributor

Suggested change
- **Recency policy.** Accept the storage table when `refresh-start-timestamp-ms` falls within a staleness window. A recency policy bounds data age but does not establish freshness.
- **Refresh Staleness.** Accept the storage table when `refresh-start-timestamp-ms` falls within a staleness window. Staleness bounds data age but does not establish freshness.

I like "Recency Policy". "Refresh Staleness" feels less clear to me.

NIT: "Recency" doesn't seem to be a common term used by databases (commonly used terms are staleness, freshness, and lag). Unless we are trying to make the point that this is somehow a distinct feature, I would prefer not to use an "uncommon" term here.

Author

I agree with Igor that we should generally try to reuse existing terms.

Comment thread format/view-spec.md Outdated
- **Must** include all distinct source states for the inputs they chose to track.
- **May** leave `source-states` empty (e.g., when sources are non-Iceberg or freshness is determined by a mechanism outside this spec).

A snapshot whose refresh state violates a `Must` rule is invalid; consumers may treat it as if it had no `refresh-state`.

Agree here too.

Comment thread format/view-spec.md Outdated

`F`, `G`, and `H` do not appear in `A`'s `source-states` directly; they belong to `C` and `D`'s dependency sets and are reached recursively through `C` and `D`'s refresh states.

A consumer establishes `A`'s freshness by checking each entry in `source-states` against the current catalog state. For `C` and `D`, the consumer compares the recorded storage-table snapshot to the current snapshot, then recurses into their `refresh-state` to verify each is itself fresh.
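
The recursive check described above might look roughly like the following sketch. The data model is an assumption made for illustration: `TableState`, the flat name-to-state dictionary standing in for the catalog, and all snapshot IDs are hypothetical, and the sketch assumes an acyclic dependency graph and ignores view versions.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class TableState:
    current_snapshot_id: int
    # Set only for storage tables: source name -> snapshot ID recorded at refresh.
    source_states: Optional[Dict[str, int]] = None


def is_fresh(name: str, catalog: Dict[str, TableState]) -> bool:
    """Check every recorded source state, recursing into storage tables."""
    state = catalog[name]
    if state.source_states is None:
        return True  # base table: nothing further to verify
    for source, recorded_id in state.source_states.items():
        # The recorded snapshot must still be the source's current snapshot.
        if catalog[source].current_snapshot_id != recorded_id:
            return False
        # A source that is itself a storage table must also be fresh.
        if not is_fresh(source, catalog):
            return False
    return True


catalog = {
    "E": TableState(101),
    "F": TableState(102),
    "G": TableState(103),
    "H": TableState(104),
    "C": TableState(7, {"F": 102, "G": 103}),
    "D": TableState(9, {"H": 104}),
    "A": TableState(1, {"E": 101, "C": 7, "D": 9}),
}
print(is_fresh("A", catalog))  # True

# H advances, so D's recorded state is stale, which makes A stale too.
catalog["H"].current_snapshot_id = 105
print(is_fresh("A", catalog))  # False
```

Note that `A` never lists `H` directly; the staleness is found only by recursing through `D`, which is the behavior the quoted text describes.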

For C and D, it would be really nice if the consumer could know up front whether the table was a base table or storage table.

@igorbelianski-cyber igorbelianski-cyber left a comment

PTAL

Comment thread format/view-spec.md
│ ├── E [TABLE] <-- recorded in A: snapshot-id: 101
│ └── D [MV — expanded as view] <-- recorded in A: version-id: 9
│ └── H [TABLE] <-- recorded in A: snapshot-id: 104
└── C [MV — expanded as view] <-- recorded in A: version-id: 7

This example would be more interesting if C's storage table was stale. Otherwise, the refresh state here is the same as the refresh state from Strategy 1.

Author

I think the main point here is the semantics of recursively expanding the dependencies just as if it were a view. The whole point is that the MV is treated like a view, so it is naturally similar to Strategy 1. I don't think that's an issue. The difference is that Strategy 1 involves common views, while here they are materialized views.

Contributor

Yes, the point is to illustrate the case where the MVs are treated as views and require nested expansion, similar to Strategy 1 with virtual views.

wmoustafa and others added 9 commits June 2, 2026 13:36
- Define freshness as a universal property: storage table equals the
  result of the current view query (at the MV's current view-version-id)
  over the current state of dependencies.
- Define dependencies objectively by parsing the SQL: base tables, views
  (transitively expanded), and intermediate materialized views (treated
  as their storage tables with recursive freshness via their own
  refresh-state). Drop the producer-relative dependency framing.
- Reorganize the freshness section into producer flexibility and
  consumer options. Producer chooses what to record; consumer chooses
  what to verify. Producer-first ordering matches the chronological flow.
- Tighten the schema: intermediate materialized views are recorded as a
  single table entry referencing the storage table; recording as a view
  entry is not permitted.
- Crisp read-decision rule: if a consumer's assessment passes, it reads
  from the storage table; otherwise it evaluates the view query in place
  of the storage table.
- Replace the "consumer/producer behavior" prose under Freshness with
  Producer flexibility and Consumer options subsections that match the
  three reference verification paths (recency policy, trust, verify).
- Add Appendix B "What counts as a dependency" with the SQL-derived
  dependency rules and a worked example showing the recursive boundary
  for intermediate materialized views.
Co-authored-by: Daniel Weeks <daniel.weeks@databricks.com>
Co-authored-by: Daniel Weeks <daniel.weeks@databricks.com>
@JanKaul JanKaul force-pushed the materialized-view-spec branch from 0a8957e to 00fb392 Compare June 3, 2026 20:12
Comment thread format/view-spec.md
Comment on lines +60 to +65
* **Storage table** -- Iceberg table that stores the precomputed data of a materialized view.
* **Refresh state** -- A record stored in the storage table's snapshot summary that captures the state of source tables and views at the time of the last refresh operation.
* **Dependency graph** -- The graph of all source tables, views, and materialized views that a materialized view depends on, including nested dependencies.
* **Source table** -- A table reference that is used in the computation of the query results of a materialized view.
* **Source view** -- A view reference that is used in the computation of the query results of a materialized view.
* **Source materialized view** -- A materialized view reference that is used in the computation of the query results of a materialized view.

Contributor

I am not sure we really need that section. It might bring more questions than answers since all of that is discussed in detail below. Some of the terms are also standard terms (e.g., Schema, Version) that are not necessarily scoped to MVs. My recommendation is to remove this section.

Labels

not-stale Specification Issues that may introduce spec changes.
