Add Iceberg optimization research playbook #66

Open
itamarwe wants to merge 4 commits into master from claude/iceberg-optimization-research-h5qi22

@itamarwe

Owner

Summary

Add a comprehensive research-backed playbook for optimizing Apache Iceberg tables across engines and platforms. This is a new research document collection, not a blog post, organized as a structured guide with methodology, recommendations, and platform-specific guidance.

Changes

  • Part I (Methodology & Analysis), 01-methodology-and-analysis.md: Five-stage workflow for deriving optimization decisions from evidence (query logs, table metadata, ingestion logs, user interviews). Includes reusable SQL queries and health-check templates.

  • Part II (Recommendations) — Four chapters covering:

    • 02-recommendations-table-properties.md: Partitioning, sort/clustering, file size, write distribution, COW vs MOR, format/compression
    • 03-recommendations-ingestion.md: Ingestion models, commit cadence, write distribution control
    • 04-recommendations-maintenance.md: Compaction, snapshot expiration, orphan-file removal, manifest rewriting
    • 05-platform-playbooks.md: Platform-specific guidance (Databricks, Snowflake, bespoke Spark/Trino/PyIceberg, Flink, dbt)
  • Part III (Reference), 06-decision-matrices.md: One-page cheat sheets mapping workload archetypes and diagnostic signals to settings.

  • STORM Research Layer (storm/ directory), documenting the research methodology:

    • 00-method.md: Overview of the Stanford STORM multi-perspective research method
    • 01-perspectives.md: Five independent expert perspectives (Practitioner, Skeptic, Economist, Historian, Academic) with sources
    • 02-contradiction-map.md: Where perspectives disagree and how well-supported each side is
    • 03-synthesis-briefing.md: Findings, tradeoffs, reliability ranking, and recommended actions
    • 04-peer-review.md: Self-critique of strong/weak claims and corrections folded back into the playbook
  • Supporting files:

    • README.md: Navigation guide and how to use the playbook
    • SOURCES.md: Source attribution and verification notes
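
For orientation, the table-property topics listed for 02-recommendations-table-properties.md (file size, write distribution, COW vs MOR, format/compression) correspond to standard Iceberg table properties. The sketch below is illustrative only: the property names are standard Iceberg ones, but the values, the table name `db.events`, and the helper `alter_table_sql` are assumptions for this example and are not taken from the playbook.

```python
# Illustrative sketch. Property names are standard Iceberg table properties;
# the values and the table name are assumptions, not playbook recommendations.
TABLE_PROPERTIES = {
    # 256 MB, inside the 128-512 MB range the implementation notes mention
    "write.target-file-size-bytes": str(256 * 1024 * 1024),
    "write.distribution-mode": "hash",
    # COW vs MOR is chosen per operation type
    "write.delete.mode": "merge-on-read",
    "write.update.mode": "merge-on-read",
    "write.merge.mode": "copy-on-write",
    "write.format.default": "parquet",
    "write.parquet.compression-codec": "zstd",
}


def alter_table_sql(table: str, props: dict[str, str]) -> str:
    """Render a Spark SQL ALTER TABLE ... SET TBLPROPERTIES statement (hypothetical helper)."""
    pairs = ",\n  ".join(f"'{key}' = '{value}'" for key, value in sorted(props.items()))
    return f"ALTER TABLE {table} SET TBLPROPERTIES (\n  {pairs}\n)"


print(alter_table_sql("db.events", TABLE_PROPERTIES))
```

Keeping the properties in one mapping makes it easy to diff a table's current settings against a per-table profile before applying anything.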

Implementation notes

  • The playbook is analysis-first: it derives optimization settings from per-table evidence (Stage 0–4 profile) rather than prescribing universal defaults.
  • All numeric recommendations (e.g., "128–512 MB file size", "compact every 1–4 h") are labeled as informed defaults to validate, not laws, with caveats about when they don't apply.
  • The STORM research layer is included verbatim to show the multi-perspective reasoning and contradiction resolution that informed the main playbook.
  • Platform guidance is conditional on whether the platform manages the table (auto-maintains it) or merely reads it (you maintain it).
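
To make the "informed defaults to validate" idea concrete, here is a minimal sketch of a file-size health check. It assumes you have already collected data-file sizes in bytes (for example from the `file_size_in_bytes` column of Iceberg's `files` metadata table). The function name, the 256 MB target, and both thresholds are hypothetical choices for illustration, not values from the playbook.

```python
from dataclasses import dataclass

MB = 1024 * 1024


@dataclass
class FileSizeReport:
    file_count: int
    small_file_fraction: float
    needs_review: bool


def file_size_report(
    sizes_bytes: list[int],
    target_bytes: int = 256 * MB,       # assumed target, inside the 128-512 MB range
    small_ratio: float = 0.5,           # assumed: "small" means under half the target
    max_small_fraction: float = 0.3,    # assumed review threshold
) -> FileSizeReport:
    """Flag a table for compaction review when too many data files are small."""
    if not sizes_bytes:
        return FileSizeReport(0, 0.0, False)
    small = sum(1 for size in sizes_bytes if size < small_ratio * target_bytes)
    fraction = small / len(sizes_bytes)
    return FileSizeReport(len(sizes_bytes), fraction, fraction > max_small_fraction)


# Two tiny files and two near-target files: half the files are "small".
report = file_size_report([8 * MB, 12 * MB, 300 * MB, 260 * MB])
print(report)
```

The check only flags a table for review; whether compaction is worth running still depends on the per-table evidence and on whether the platform already maintains the table.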

https://claude.ai/code/session_01VSERJhCnioX19isLH4BfDa

claude added 4 commits June 20, 2026 19:40
A standalone methodology framework for optimizing Iceberg tables: an
analysis-first workflow (query logs, table metadata, ingestion logs, user
interviews) feeding per-table optimization profiles, then recommendations
split by table-properties / ingestion / maintenance and mapped across
platforms (Databricks, Snowflake, bespoke, NiFi, Flink, dbt, managed ELT).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01VSERJhCnioX19isLH4BfDa
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01VSERJhCnioX19isLH4BfDa
…aybook

Run the Stanford STORM workflow as a five-agent scan (Practitioner, Skeptic,
Economist, Historian, Academic), then a contradiction map, synthesis briefing,
and peer review. Fold peer-review corrections back into the playbook: target-
file-size verification caveat, Z-order locality decay + Hilbert, streaming/
compaction commit-conflict warning, partial-progress option, and a cost/scale
'when not to optimize' gate (maintenance + Snowflake managed clustering).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01VSERJhCnioX19isLH4BfDa
…ked by network allowlist

Re-fetched flagged sources. Only GitHub is reachable in this environment
(arXiv/AWS/Snowflake/Databricks/VLDB/HN and api.firecrawl.dev all return 403
via the egress allowlist). Verified four GitHub-sourced claims by direct read;
vendor cost/benchmark figures remain snippet-only and are labeled as such.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01VSERJhCnioX19isLH4BfDa
@vercel

vercel Bot commented Jun 21, 2026


The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| itamarwe-github-io | Ready | Preview, Comment | Jun 21, 2026 6:14am |
