From 19bd65eede3b3190be3fe3852cd490e44376c808 Mon Sep 17 00:00:00 2001
From: Davide Angelocola <davide.angelocola@gmail.com>
Date: Sun, 28 Jun 2026 09:43:26 +0200
Subject: [PATCH] docs: lead the README with examples + performance
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Restructure the landing page for first-time visitors:
- Quickstart: byte[] round-trip, dictionary train+compress, zero-copy
  MemorySegment (signatures verified against the 0.6 API)
- Performance: surface the publication-grade golden-corpus numbers from
  docs/benchmarks.md. Best-vs-best against zstd-jni's own zero-copy path:
  +9.8% to +22.9% throughput on the small http payload, a tie at 200 KiB,
  and an allocation tie (~0 B/op on both). Against the convenient byte[]
  APIs the segment path is allocation-free. Honest ties included.
- Sharpen the pitch (dictionary + zero-copy, the two real differentiators)
  and note the JDK 25 framing (first LTS with stable FFM)
- Move Install above Documentation; link the release smoke matrix as
  per-arch proof

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
---
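Note for reviewers (outside the commit message): the "edge" column in the new
Performance table appears to be the relative throughput difference,
(ours / theirs - 1) * 100, rounded to one decimal. That formula is an
assumption inferred from the quoted ops/ms figures, not something stated in
docs/benchmarks.md. A minimal sketch to recompute it; the `EdgeCheck` class is
hypothetical and not part of zstd-java:

```java
import java.util.Locale;

// Reviewer-side sanity check, not part of zstd-java: recomputes the "edge"
// column from the ops/ms figures quoted in the README table.
public class EdgeCheck {

    // Relative throughput advantage of `ours` over `theirs`, in percent.
    public static double edgePercent(double ours, double theirs) {
        return (ours / theirs - 1.0) * 100.0;
    }

    // Signed, one-decimal rendering, matching the style of the table.
    public static String format(double ours, double theirs) {
        return String.format(Locale.ROOT, "%+.1f%%", edgePercent(ours, theirs));
    }

    public static void main(String[] args) {
        System.out.println("compress http:            " + format(353.6, 322.1)); // +9.8%
        System.out.println("decompress http:          " + format(922.7, 750.8)); // +22.9%
        System.out.println("decompress large-literal: " + format(56.1, 55.6));   // +0.9%
    }
}
```

Under that assumption the two `http` rows reproduce the published +9.8% and
+22.9%, and the `large-literal` row comes out under 1%, consistent with the
table calling it a tie.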
 README.md | 108 ++++++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 93 insertions(+), 15 deletions(-)

diff --git a/README.md b/README.md
index 074b2bf..67b165c 100644
--- a/README.md
+++ b/README.md
@@ -10,27 +10,90 @@
 
 **zstd-java** is a Java wrapper for [Zstandard](https://github.com/facebook/zstd)
 built on the **Foreign Function & Memory (FFM) API** — no JNI, no `sun.misc.Unsafe`.
-It targets **JDK 25+** (for stable `java.lang.foreign`) and leads with the
-feature missing from most JVM zstd bindings: **dictionary compression**, trained
-straight from your own data.
+It targets **JDK 25+** (the first LTS with stable `java.lang.foreign`) and leads
+with the two features most JVM zstd bindings lack:
+
+- **Dictionary compression**, trained straight from your own data — the big win on
+  small, repetitive records (logs, market-data ticks, JSON/Avro rows, FIX messages).
+- A **zero-copy `MemorySegment` API** — compress/decompress off-heap buffers (an
+  mmap'd slice in, an arena buffer out) with no heap copy and no per-call allocation.
 
 > **AI-assisted development:** This project uses Claude Code for implementation —
 > C header mapping, test generation, docs. Architecture, API design, and all
 > decisions are human-driven.
 
-## Documentation
+## Quickstart
 
-The docs follow the [Diátaxis](https://diataxis.fr) framework:
+One-shot round-trip with `byte[]` — the convenient path:
 
-| | Purpose | Start here |
-|---|---|---|
-| **[Tutorial](docs/tutorial.md)** | Learning by doing | Clean checkout → first round-trip |
-| **[How-to guides](docs/how-to.md)** | Solving a specific task | Hot paths, dictionaries, zero-copy, self-built lib |
-| **[Reference](docs/reference.md)** | Looking up facts | Platforms, API surface, symbol coverage, build |
-| **[Explanation](docs/explanation.md)** | Understanding the why | Why FFM + Zig, when zero-copy pays, benchmarks |
+```java
+import io.github.dfa1.zstd.Zstd;
 
-Architecture decisions are recorded as [ADRs](adr/ADR.md) (MADR 3.0) — the
-foundational choices and their trade-offs, one file per decision.
+byte[] data = ...;
+byte[] frame = Zstd.compress(data);        // or Zstd.compress(data, level)
+byte[] back  = Zstd.decompress(frame);     // size read from the frame header
+```
+
+**Dictionary** — train on a sample of your records, then compress each one against
+the dictionary (huge ratio gains on small, similar messages):
+
+```java
+import io.github.dfa1.zstd.*;
+import java.util.List;
+
+List<byte[]> samples = ...;                       // representative records
+ZstdDictionary dict = ZstdDictionary.train(samples, 8 * 1024);
+
+byte[] message = ...;
+try (ZstdCompressCtx cctx = new ZstdCompressCtx();
+     ZstdDecompressCtx dctx = new ZstdDecompressCtx()) {
+    byte[] frame = cctx.compress(message, dict);
+    byte[] back  = dctx.decompress(frame, message.length, dict);
+}
+```
+
+**Zero-copy** — off-heap in, off-heap out, no `byte[]`, no per-call allocation:
+
+```java
+import io.github.dfa1.zstd.*;
+import java.lang.foreign.*;
+
+try (Arena arena = Arena.ofConfined();
+     ZstdCompressCtx cctx = new ZstdCompressCtx();
+     ZstdDecompressCtx dctx = new ZstdDecompressCtx()) {
+
+    MemorySegment src      = ...;                       // e.g. an mmap'd file slice
+    MemorySegment frame    = cctx.compress(arena, src); // off-heap → off-heap
+    MemorySegment restored = dctx.decompress(arena, frame);
+}
+```
+
+Run with `--enable-native-access=ALL-UNNAMED`. Full walkthrough in the
+[tutorial](docs/tutorial.md); hot-path and dictionary recipes in the
+[how-to guides](docs/how-to.md).
+
+## Performance
+
+Microbenchmarks against the common JVM zstd options (JMH; Apple M5, JDK 25, all
+linking the same zstd 1.5.7). Full methodology and tables in
+[docs/benchmarks.md](docs/benchmarks.md) — including the honest ties.
+
+**Best vs best** — our zero-copy `MemorySegment` path vs **zstd-jni's own**
+zero-copy direct-`ByteBuffer` path (golden-corpus fixtures, publication-grade run):
+
+| operation (payload) | zstd-java `MemorySegment` | zstd-jni `ByteBuffer` | edge |
+|---|---:|---:|---:|
+| compress `http` (1.2 KiB) | **353.6** | 322.1 | +9.8% |
+| decompress `http` | **922.7** | 750.8 | +22.9% |
+| decompress `large-literal` (200 KiB) | 56.1 | 55.6 | tie |
+
+*(throughput, ops/ms, higher is better; allocation is **~0 B/op on both** — both genuinely zero-copy)*
+
+The edge is FFM's lower per-call overhead — **largest on small payloads**,
+converging to a tie when codec/bandwidth dominates. Against the *convenient*
+`byte[]` / JNI APIs (which allocate the output every call), the segment path is
+additionally **allocation-free**: flat ~0 B/op at any size vs MB/op that scales
+with the payload — no GC pressure on the hot path.
 
 ## Install
 
@@ -79,11 +142,26 @@ plus only the `zstd-native-<classifier>` you target.
 ```
 
 Classifiers: `osx-aarch64`, `osx-x86_64`, `linux-x86_64`, `linux-aarch64`,
-`windows-x86_64`, `windows-aarch64`. Gradle and more detail in the
-[tutorial](docs/tutorial.md). Requires JDK 25+ and
+`windows-x86_64`, `windows-aarch64` — each verified on real hardware by the
+[release smoke matrix](.github/workflows/release-smoke.yml). Gradle and more
+detail in the [tutorial](docs/tutorial.md). Requires JDK 25+ and
 `--enable-native-access=ALL-UNNAMED` at runtime. Building from source is for
 contributors — see the [reference](docs/reference.md).
 
+## Documentation
+
+The docs follow the [Diátaxis](https://diataxis.fr) framework:
+
+| | Purpose | Start here |
+|---|---|---|
+| **[Tutorial](docs/tutorial.md)** | Learning by doing | Clean checkout → first round-trip |
+| **[How-to guides](docs/how-to.md)** | Solving a specific task | Hot paths, dictionaries, zero-copy, self-built lib |
+| **[Reference](docs/reference.md)** | Looking up facts | Platforms, API surface, symbol coverage, build |
+| **[Explanation](docs/explanation.md)** | Understanding the why | Why FFM + Zig, when zero-copy pays, benchmarks |
+
+Architecture decisions are recorded as [ADRs](adr/ADR.md) (MADR 3.0) — the
+foundational choices and their trade-offs, one file per decision.
+
 ## License
 
 [BSD 3-Clause](LICENSE) — the same primary license as zstd, which is bundled