Add Iceberg optimization research playbook #66
Open
itamarwe wants to merge 4 commits into
A standalone methodology framework for optimizing Iceberg tables: an analysis-first workflow (query logs, table metadata, ingestion logs, user interviews) feeding per-table optimization profiles, then recommendations split by table-properties / ingestion / maintenance and mapped across platforms (Databricks, Snowflake, bespoke, NiFi, Flink, dbt, managed ELT). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01VSERJhCnioX19isLH4BfDa
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01VSERJhCnioX19isLH4BfDa
…aybook Run the Stanford STORM workflow as a five-agent scan (Practitioner, Skeptic, Economist, Historian, Academic), then a contradiction map, synthesis briefing, and peer review. Fold peer-review corrections back into the playbook: target-file-size verification caveat, Z-order locality decay + Hilbert, streaming/compaction commit-conflict warning, partial-progress option, and a cost/scale 'when not to optimize' gate (maintenance + Snowflake managed clustering). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01VSERJhCnioX19isLH4BfDa
…ked by network allowlist Re-fetched flagged sources. Only GitHub is reachable in this environment (arXiv/AWS/Snowflake/Databricks/VLDB/HN and api.firecrawl.dev all return 403 via the egress allowlist). Verified four GitHub-sourced claims by direct read; vendor cost/benchmark figures remain snippet-only and are labeled as such. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01VSERJhCnioX19isLH4BfDa
Summary
Add a comprehensive research-backed playbook for optimizing Apache Iceberg tables across engines and platforms. This is a new research document collection, not a blog post, organized as a structured guide with methodology, recommendations, and platform-specific guidance.
Changes
Part I (Methodology & Analysis):
- 01-methodology-and-analysis.md: Five-stage workflow for deriving optimization decisions from evidence (query logs, table metadata, ingestion logs, user interviews). Includes reusable SQL queries and health-check templates.

Part II (Recommendations): four chapters covering:
- 02-recommendations-table-properties.md: Partitioning, sort/clustering, file size, write distribution, COW vs MOR, format/compression
- 03-recommendations-ingestion.md: Ingestion models, commit cadence, write distribution control
- 04-recommendations-maintenance.md: Compaction, snapshot expiration, orphan-file removal, manifest rewriting
- 05-platform-playbooks.md: Platform-specific guidance (Databricks, Snowflake, bespoke Spark/Trino/PyIceberg, Flink, dbt)

Part III (Reference):
- 06-decision-matrices.md: One-page cheat sheets mapping workload archetypes and diagnostic signals to settings.

STORM Research Layer:
- storm/ directory documenting the research methodology:
  - 00-method.md: Overview of the Stanford STORM multi-perspective research method
  - 01-perspectives.md: Five independent expert perspectives (Practitioner, Skeptic, Economist, Historian, Academic) with sources
  - 02-contradiction-map.md: Where perspectives disagree and how well-supported each side is
  - 03-synthesis-briefing.md: Findings, tradeoffs, reliability ranking, and recommended actions
  - 04-peer-review.md: Self-critique of strong/weak claims and corrections folded back into the playbook

Supporting files:
- README.md: Navigation guide and how to use the playbook
- SOURCES.md: Source attribution and verification notes

Implementation notes
https://claude.ai/code/session_01VSERJhCnioX19isLH4BfDa
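For reviewers who want a concrete picture of the kind of health check the methodology chapter refers to, here is a minimal sketch. It is not taken from the playbook files. It assumes data-file sizes in bytes have already been read from table metadata (for example the `file_size_in_bytes` column of Iceberg's `files` metadata table). The function name, the 128 MiB target, and the 25% "small file" cutoff are hypothetical values chosen for illustration only.

```python
from dataclasses import dataclass
from typing import Iterable

MIB = 1024 * 1024


@dataclass(frozen=True)
class FileSizeHealth:
    """Summary of data-file sizes for one table (hypothetical structure)."""
    file_count: int
    small_file_count: int
    small_file_ratio: float
    avg_file_size_bytes: float


def summarize_file_sizes(
    sizes_bytes: Iterable[int],
    target_bytes: int = 128 * MIB,   # assumed target file size, not from the playbook
    small_fraction: float = 0.25,    # assumed cutoff: "small" means below 25% of target
) -> FileSizeHealth:
    """Count files that fall well below the target size."""
    sizes = list(sizes_bytes)
    if not sizes:
        # An empty table has nothing to compact.
        return FileSizeHealth(0, 0, 0.0, 0.0)
    threshold = target_bytes * small_fraction
    small = sum(1 for s in sizes if s < threshold)
    return FileSizeHealth(
        file_count=len(sizes),
        small_file_count=small,
        small_file_ratio=small / len(sizes),
        avg_file_size_bytes=sum(sizes) / len(sizes),
    )


if __name__ == "__main__":
    sample = [1 * MIB, 2 * MIB, 200 * MIB, 130 * MIB]
    print(summarize_file_sizes(sample))
```

A high small-file ratio is only a signal to investigate compaction, not a decision on its own; the cost/scale "when not to optimize" gate mentioned in the commit messages still applies.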