From 98e4a2a9b113ba35b18eb4c9772ef6e850bd9acb Mon Sep 17 00:00:00 2001
From: Claude
Date: Sun, 28 Jun 2026 07:10:29 +0000
Subject: [PATCH 1/3] Reframe opening: several years of Iceberg experience, last year at data-heavy org

Co-Authored-By: Claude Sonnet 4.6
---
 content/posts/2026-06-26-apache-iceberg-optimization-skill.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/content/posts/2026-06-26-apache-iceberg-optimization-skill.md b/content/posts/2026-06-26-apache-iceberg-optimization-skill.md
index c8ccb20..b4032f8 100644
--- a/content/posts/2026-06-26-apache-iceberg-optimization-skill.md
+++ b/content/posts/2026-06-26-apache-iceberg-optimization-skill.md
@@ -8,7 +8,7 @@ image: /img/iceberg-optimizer/social.png
 
 ![Claude Code and Apache Iceberg icons — the iceberg optimization skill codifies deployment patterns from large-scale real-world usage.](/img/iceberg-optimizer/social.png)
 
-I've spent the last couple of years working closely with one of the most data-intensive organizations in Israel, deploying Apache Iceberg at scale. Petabytes of data, multiple ingestion pipelines, constant schema evolution, several query engines reading the same tables. Exactly the kind of environment that stress-tests every assumption you had about the format.
+I've been working with Apache Iceberg for several years. In the last year I've been helping one of the most data-heavy organizations in Israel deploy it at scale — petabytes of data, multiple ingestion pipelines, constant schema evolution, several query engines reading the same tables. Exactly the kind of environment that stress-tests every assumption you had about the format.
 
 Most of the problems weren't about scale. They were about **not knowing what Iceberg is actually doing under the hood**.
 
From 9915be154c38bc1eafdb788a88a3be311509c679 Mon Sep 17 00:00:00 2001
From: Claude
Date: Sun, 28 Jun 2026 14:52:12 +0000
Subject: [PATCH 2/3] Address review comments on iceberg post
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- Rephrase 'wrong choices interact' → 'mistakes don't fail independently'
- Clarify metadata access: you export it yourself; direct connectivity is roadmap
- Note skill is available on GitHub
- We→I (personal blog voice) in benchmarks section
- 5.0/5 → 5/5 per reviewer suggestion
- Add DuckDB + direct connectivity to v0.1 roadmap

Co-Authored-By: Claude Sonnet 4.6
---
 ...2026-06-26-apache-iceberg-optimization-skill.md | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/content/posts/2026-06-26-apache-iceberg-optimization-skill.md b/content/posts/2026-06-26-apache-iceberg-optimization-skill.md
index b4032f8..b695ee7 100644
--- a/content/posts/2026-06-26-apache-iceberg-optimization-skill.md
+++ b/content/posts/2026-06-26-apache-iceberg-optimization-skill.md
@@ -18,7 +18,7 @@ Here's how it goes. A team discovers Iceberg, reads the getting-started docs, wi
 
 The root cause is almost always the same: **Iceberg has a lot of knobs, and the defaults were chosen for correctness, not for your workload**. Partition specs, sort orders, file size targets, snapshot retention policies, manifest sizing, V1 vs V2 table format — most teams never touch any of them. They accept the defaults, the problems accumulate invisibly, and the first sign anything is wrong is a data engineer firefighting at 2am.
 
-What makes it worse is that the wrong choices interact. A poor partition strategy amplifies the cost of unoptimized file sizes. Unbounded snapshot accumulation slows partition pruning. Too many small files and the wrong delete mode turn a trivially fast CDC table into a read-time disaster. The problems compound before they're visible.
+The deeper problem is that these mistakes don't fail independently — they stack. A poor partition strategy amplifies the cost of unoptimized file sizes. Unbounded snapshot accumulation slows partition pruning. Too many small files and the wrong delete mode turn a trivially fast CDC table into a read-time disaster. Each bad choice makes the next one worse, and the compounding is invisible until it isn't.
 
 ## What I kept doing over and over
 
@@ -28,9 +28,9 @@ This is specialized knowledge. It took time to build, and it isn't well-document
 
 ## The skill
 
-I codified everything I kept doing into a **[Claude Code skill](https://github.com/itamarwe/iceberg-optimizer-skill)** — a reusable, promptable assistant that knows the bits and bytes of Iceberg and guides you through the decisions that actually matter for your workload.
+I codified everything I kept doing into a **[Claude Code skill](https://github.com/itamarwe/iceberg-optimizer-skill)** (available on GitHub) — a reusable, promptable assistant that knows the bits and bytes of Iceberg and guides you through the decisions that actually matter for your workload.
 
-The design principle is: *observe before you ask, ask before you decide, simulate before you recommend*. Rather than firing generic best-practice advice, the skill runs a structured diagnostic before it tells you anything. It works by operating on exported metadata tables and query logs — it never connects directly to your warehouse — and it stays read-only until you explicitly approve Phase 5's commands.
+The design principle is: *observe before you ask, ask before you decide, simulate before you recommend*. Rather than firing generic best-practice advice, the skill runs a structured diagnostic before it tells you anything. In v0.1, it works from metadata you export yourself — you run the diagnostic queries it provides, paste back the output, or supply pre-exported files — and stays read-only until you explicitly approve Phase 5's commands. Direct connectivity to your catalog, ingestion pipeline, and query engine is the natural next step.
 
 ## The six-phase flow
 
@@ -52,7 +52,7 @@ The skill handles Spark, Trino, AWS Glue/EMR, Snowflake, and Flink/Kafka Connect
 
 ## Benchmarks
 
-Any optimization advisor is only as useful as its ability to handle the edge cases — the failure modes that only show up in production, under specific combinations of write pattern, engine, and table shape. We benchmarked the skill against 22 scenarios built from real failure patterns.
+Any optimization advisor is only as useful as its ability to handle the edge cases — the failure modes that only show up in production, under specific combinations of write pattern, engine, and table shape. I benchmarked the skill against 22 scenarios built from real failure patterns.
 
 ![22 benchmark scenarios across 7 failure-mode categories. Every scenario is a distinct real-world failure pattern — no duplicates, no synthetic toy tables.](/img/iceberg-optimizer/benchmark_coverage.png)
 
@@ -66,15 +66,15 @@ The scenarios cover seven categories of failure:
 - **Indexes** — bloom filters on the wrong columns (low-cardinality or range-queried columns where min/max statistics already do the job), and Z-ordering over too many columns, which reduces locality rather than improving it.
 - **Cost & Lifecycle** — cold archives where the compute cost of maintenance exceeds any query savings, and the query-cost vs maintenance-cost tradeoff where the right answer is to do less, not more.
 
-The benchmark scores each plan with an LLM judge evaluating correctness, specificity, and safety. **All 22 passed with a perfect 5.0/5 average.**
+The benchmark scores each plan with an LLM judge evaluating correctness, specificity, and safety. **All 22 passed with a perfect 5/5 average.**
 
-Two things are worth noting about the benchmark design. First, every scenario is a distinct failure pattern — we didn't generate synthetic variations of the same problem. Second, the score checks not just whether the skill recommends the right action, but whether it recommends it *for the right reason* and with the right caveats. A correct answer for the wrong reason scores lower.
+Two things are worth noting about the benchmark design. First, every scenario is a distinct failure pattern — I didn't generate synthetic variations of the same problem. Second, the score checks not just whether the skill recommends the right action, but whether it recommends it *for the right reason* and with the right caveats. A correct answer for the wrong reason scores lower.
 
 ## This is v0.1
 
 All five engines are supported. The 22 failure modes above are covered. Twenty-nine unit tests pass across the profiler and query-log parser.
 
-What's missing: deeper multi-engine write coordination, large-scale migration scenarios (Hudi-to-Iceberg, Delta-to-Iceberg), Z-ordering tradeoffs at very high cardinalities, and more efficient token usage as the prompt structure matures. **This is a starting point**, not a complete reference.
+What's missing: direct catalog and engine connectivity (currently you export the metadata yourself), DuckDB support for local development workflows, deeper multi-engine write coordination, large-scale migration scenarios (Hudi-to-Iceberg, Delta-to-Iceberg), Z-ordering tradeoffs at very high cardinalities, and more efficient token usage as the prompt structure matures. **This is a starting point**, not a complete reference.
 
 As the skill gets used on more real deployments, the patterns will sharpen and coverage will expand.
 
From 75f134e6a8f1ddd20d1c721f30bca9e1df9a5148 Mon Sep 17 00:00:00 2001
From: Claude
Date: Sun, 28 Jun 2026 14:56:51 +0000
Subject: [PATCH 3/3] Update skill connectivity description and v0.1 roadmap

- Clarify that the skill supports both direct connectivity to the table/pipeline/engine and manual metadata export
- Remove direct connectivity from the "what's missing" list (it's already available)
- Drop DuckDB-specific callout; generalize to "other query engines such as DuckDB"

Co-Authored-By: Claude Sonnet 4.6
Claude-Session: https://claude.ai/code/session_014sy3CvoMeptEkgif3MM7Jh
---
 content/posts/2026-06-26-apache-iceberg-optimization-skill.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/content/posts/2026-06-26-apache-iceberg-optimization-skill.md b/content/posts/2026-06-26-apache-iceberg-optimization-skill.md
index b695ee7..faff5dd 100644
--- a/content/posts/2026-06-26-apache-iceberg-optimization-skill.md
+++ b/content/posts/2026-06-26-apache-iceberg-optimization-skill.md
@@ -30,7 +30,7 @@ This is specialized knowledge. It took time to build, and it isn't well-document
 
 I codified everything I kept doing into a **[Claude Code skill](https://github.com/itamarwe/iceberg-optimizer-skill)** (available on GitHub) — a reusable, promptable assistant that knows the bits and bytes of Iceberg and guides you through the decisions that actually matter for your workload.
 
-The design principle is: *observe before you ask, ask before you decide, simulate before you recommend*. Rather than firing generic best-practice advice, the skill runs a structured diagnostic before it tells you anything. In v0.1, it works from metadata you export yourself — you run the diagnostic queries it provides, paste back the output, or supply pre-exported files — and stays read-only until you explicitly approve Phase 5's commands. Direct connectivity to your catalog, ingestion pipeline, and query engine is the natural next step.
+The design principle is: *observe before you ask, ask before you decide, simulate before you recommend*. Rather than firing generic best-practice advice, the skill runs a structured diagnostic before it tells you anything. It can connect directly to your table, ingestion pipeline, and query engine — or, if you prefer, you export the metadata yourself, paste back the output, or supply pre-exported files. Either way, it stays read-only until you explicitly approve Phase 5's commands.
 
 ## The six-phase flow
 
@@ -74,7 +74,7 @@ Two things are worth noting about the benchmark design. First, every scenario is
 
 All five engines are supported. The 22 failure modes above are covered. Twenty-nine unit tests pass across the profiler and query-log parser.
 
-What's missing: direct catalog and engine connectivity (currently you export the metadata yourself), DuckDB support for local development workflows, deeper multi-engine write coordination, large-scale migration scenarios (Hudi-to-Iceberg, Delta-to-Iceberg), Z-ordering tradeoffs at very high cardinalities, and more efficient token usage as the prompt structure matures. **This is a starting point**, not a complete reference.
+What's missing: support for other query engines such as DuckDB, deeper multi-engine write coordination, large-scale migration scenarios (Hudi-to-Iceberg, Delta-to-Iceberg), Z-ordering tradeoffs at very high cardinalities, and more efficient token usage as the prompt structure matures. **This is a starting point**, not a complete reference.
 
 As the skill gets used on more real deployments, the patterns will sharpen and coverage will expand.
 