Add Iceberg optimization research playbook by itamarwe · Pull Request #66 · itamarwe/itamarwe.github.io

itamarwe · 2026-06-21T06:14:26Z

Summary

Add a comprehensive research-backed playbook for optimizing Apache Iceberg tables across engines and platforms. This is a new research document collection, not a blog post, organized as a structured guide with methodology, recommendations, and platform-specific guidance.

Changes

Part I (Methodology & Analysis) — 01-methodology-and-analysis.md: Five-stage workflow for deriving optimization decisions from evidence (query logs, table metadata, ingestion logs, user interviews). Includes reusable SQL queries and health-check templates.
Part II (Recommendations) — Four chapters covering:
- 02-recommendations-table-properties.md: Partitioning, sort/clustering, file size, write distribution, COW vs MOR, format/compression
- 03-recommendations-ingestion.md: Ingestion models, commit cadence, write distribution control
- 04-recommendations-maintenance.md: Compaction, snapshot expiration, orphan-file removal, manifest rewriting
- 05-platform-playbooks.md: Platform-specific guidance (Databricks, Snowflake, bespoke Spark/Trino/PyIceberg, Flink, dbt)
Part III (Reference) — 06-decision-matrices.md: One-page cheat sheets mapping workload archetypes and diagnostic signals to settings.
STORM Research Layer — storm/ directory documenting the research methodology:
- 00-method.md: Overview of the Stanford STORM multi-perspective research method
- 01-perspectives.md: Five independent expert perspectives (Practitioner, Skeptic, Economist, Historian, Academic) with sources
- 02-contradiction-map.md: Where perspectives disagree and how well-supported each side is
- 03-synthesis-briefing.md: Findings, tradeoffs, reliability ranking, and recommended actions
- 04-peer-review.md: Self-critique of strong/weak claims and corrections folded back into the playbook
Supporting files:
- README.md: Navigation guide and how to use the playbook
- SOURCES.md: Source attribution and verification notes
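
To illustrate the kind of health check Part I describes, here is a minimal Python sketch. It is not taken from the playbook. It assumes the data-file sizes have already been collected (for example from the `file_size_in_bytes` column of an Iceberg `files` metadata table). The names `FileSizeReport` and `summarize_file_sizes` are hypothetical, and the 128 MB threshold is only the lower end of the default file-size range quoted in this PR.

```python
from dataclasses import dataclass
from statistics import median

MB = 1024 * 1024


@dataclass
class FileSizeReport:
    file_count: int
    median_mb: float
    small_file_fraction: float


def summarize_file_sizes(sizes_bytes, small_threshold_mb=128):
    """Summarize data-file sizes for one table (hypothetical helper).

    small_threshold_mb is an assumed default to validate per table, not a rule.
    """
    if not sizes_bytes:
        return FileSizeReport(0, 0.0, 0.0)
    # Count files strictly below the threshold.
    small = sum(1 for s in sizes_bytes if s < small_threshold_mb * MB)
    return FileSizeReport(
        file_count=len(sizes_bytes),
        median_mb=median(sizes_bytes) / MB,
        small_file_fraction=small / len(sizes_bytes),
    )


report = summarize_file_sizes([1 * MB, 2 * MB, 256 * MB, 300 * MB])
print(report)
```

A high `small_file_fraction` is only one signal that compaction may be worth evaluating. The playbook's analysis-first approach would weigh it against query patterns and ingestion cadence.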

Implementation notes

The playbook is analysis-first: it derives optimization settings from per-table evidence (Stage 0–4 profile) rather than prescribing universal defaults.
All numeric recommendations (e.g., "128–512 MB file size", "compact every 1–4 h") are labeled as informed defaults to validate, not laws, with caveats about when they don't apply.
The STORM research layer is included verbatim to show the multi-perspective reasoning and contradiction resolution that informed the main playbook.
Platform guidance is conditional on whether the platform manages the table (auto-maintains it) or merely reads it (you maintain it).
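
The managed-versus-read distinction in the last note can be sketched as a simple gate. This is an assumption-laden illustration, not the playbook's actual logic: the function name `tasks_to_schedule` is hypothetical, and the task list is just the four maintenance topics named for 04-recommendations-maintenance.md.

```python
# The four maintenance topics listed for 04-recommendations-maintenance.md.
MAINTENANCE_TASKS = (
    "compaction",
    "snapshot expiration",
    "orphan-file removal",
    "manifest rewriting",
)


def tasks_to_schedule(platform_manages_table: bool) -> list[str]:
    """Return the maintenance tasks the table owner has to schedule.

    Assumption: a platform that manages the table runs all four tasks itself;
    a platform that merely reads the table runs none of them.
    """
    if platform_manages_table:
        return []
    return list(MAINTENANCE_TASKS)


print(tasks_to_schedule(platform_manages_table=False))
```

Real platforms may automate only some of these tasks, so a per-task check would be more realistic than this all-or-nothing gate.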

https://claude.ai/code/session_01VSERJhCnioX19isLH4BfDa

A standalone methodology framework for optimizing Iceberg tables: an analysis-first workflow (query logs, table metadata, ingestion logs, user interviews) feeding per-table optimization profiles, then recommendations split by table-properties / ingestion / maintenance and mapped across platforms (Databricks, Snowflake, bespoke, NiFi, Flink, dbt, managed ELT).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01VSERJhCnioX19isLH4BfDa

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01VSERJhCnioX19isLH4BfDa

…aybook

Run the Stanford STORM workflow as a five-agent scan (Practitioner, Skeptic, Economist, Historian, Academic), then a contradiction map, synthesis briefing, and peer review. Fold peer-review corrections back into the playbook: target-file-size verification caveat, Z-order locality decay + Hilbert, streaming/compaction commit-conflict warning, partial-progress option, and a cost/scale 'when not to optimize' gate (maintenance + Snowflake managed clustering).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01VSERJhCnioX19isLH4BfDa

…ked by network allowlist

Re-fetched flagged sources. Only GitHub is reachable in this environment (arXiv/AWS/Snowflake/Databricks/VLDB/HN and api.firecrawl.dev all return 403 via the egress allowlist). Verified four GitHub-sourced claims by direct read; vendor cost/benchmark figures remain snippet-only and are labeled as such.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01VSERJhCnioX19isLH4BfDa

vercel · 2026-06-21T06:14:32Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
itamarwe-github-io	Ready	Preview, Comment	Jun 21, 2026 6:14am

claude added 4 commits June 20, 2026 19:40

Add STORM research-method overview for Iceberg book

6453172

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01VSERJhCnioX19isLH4BfDa

itamarwe wants to merge 4 commits into master from claude/iceberg-optimization-research-h5qi22

2 participants
