Commit 57f6adb

claudepasrom authored and committed
docs: update documentation to match current codebase
CLAUDE.md: - Add FluidVAD.swift to project structure - Add SampleRateQuery.swift + audiotap Tests/ to project structure - Update FluidDiarizer.swift description (Sortformer mode) - Update Parakeet description (custom vocabulary via CTC boosting) - Add VAD preprocessing architecture section - Add Diarization section (Sortformer / AppSettings.diarizerMode) - Update build variant test count (~795) docs/architecture-macos.md: - Add FluidVAD.swift row to Audio Processing table - Update AudioTapLib output (mono or stereo, actualChannels) - Update DualSourceRecorder processing (actual channel count) - Update FluidDiarizer diarization section (two modes: OfflineDiarizer + Sortformer) README.md: - Add Parakeet custom vocabulary support (CTC boosting) - Add Sortformer overlap-aware diarizer mode - Add VAD preprocessing feature - Add .none protocol provider (save transcript only, no LLM) https://claude.ai/code/session_017weHGLJ3mD4APmtbpLU1jH
1 parent aadfc80 commit 57f6adb

File tree

3 files changed: +26 −11 lines changed


CLAUDE.md

Lines changed: 13 additions & 4 deletions
@@ -25,7 +25,8 @@ app/MeetingTranscriber/ # Swift macOS menu bar app (SPM)
     WhisperKitEngine.swift # WhisperKit transcription engine (CoreML/ANE, 99+ languages)
     ParakeetEngine.swift # NVIDIA Parakeet TDT v3 engine via FluidAudio (CoreML/ANE, 25 EU languages)
     Qwen3AsrEngine.swift # Qwen3-ASR 0.6B engine via FluidAudio (CoreML/ANE, 30 languages, macOS 15+)
-    FluidDiarizer.swift # CoreML-based speaker diarization via FluidAudio (on-device)
+    FluidDiarizer.swift # CoreML-based speaker diarization via FluidAudio (on-device, OfflineDiarizer + Sortformer modes)
+    FluidVAD.swift # VAD preprocessing via FluidAudio Silero v6 (silence trimming + timeline remapping)
     SpeakerMatcher.swift # Speaker embedding DB + cosine similarity matching
     DiarizationProcess.swift # DiarizationProvider protocol + result types
     PipelineQueue.swift # Decouples recording from post-processing (transcription → diarization → protocol)
@@ -62,6 +63,10 @@ tools/audiotap/ # AudioTapLib — CATapDescription-based app audio ca
     AudioCaptureResult.swift # Result struct
     Helpers.swift # machTicksToSeconds, getDefaultOutputDeviceUID, writeAllToFileHandle
     MicRestartPolicy.swift # Pure decision logic for mic engine restart on device change
+    SampleRateQuery.swift # Pure functions for sample rate detection and cross-validation
+    Tests/
+        MicRestartPolicyTests.swift
+        SampleRateQueryTests.swift
 tools/meeting-simulator/ # Meeting simulator tool for testing
     Package.swift
     Sources/main.swift
@@ -170,7 +175,7 @@ Use the `/git-workflow` skill. Commit proactively after every logical unit of wo
 **Transcription engines:**
 - `TranscribingEngine` protocol abstracts ASR backends. Three implementations: `WhisperKitEngine` (99+ languages, ~1 GB model), `ParakeetEngine` (25 EU languages, ~50 MB model, ~10× faster), and `Qwen3AsrEngine` (30 languages, ~1.75 GB model, macOS 15+).
 - `AppSettings.transcriptionEngine` enum (`.whisperKit` / `.parakeet` / `.qwen3`) selects the engine. Settings UI shows engine picker; engine-specific options hidden when not selected. `availableCases` filters by macOS version.
-- Parakeet auto-detects language (no parameter). WhisperKit and Qwen3 support explicit language selection.
+- Parakeet auto-detects language (no parameter) and supports custom vocabulary via CTC boosting (`ParakeetEngine.customVocabularyPath`). WhisperKit and Qwen3 support explicit language selection.
 - `Qwen3AsrEngine` requires macOS 15+ (`@available`). Returns plain text (no timestamps) — emits single `TimestampedSegment`. Chunks audio into <=30s windows (`Qwen3AsrConfig.maxAudioSeconds`). Type-erased in AppState via `_qwen3Engine: AnyObject?` for macOS <15 compatibility.
 - `AppState.activeTranscriptionEngine` returns the selected engine, used by `PipelineQueue`.

@@ -199,11 +204,15 @@ Use the `/git-workflow` skill. Commit proactively after every logical unit of wo
 - `MeetingDetector` counts each pattern once per poll — prevents over-counting when multiple windows match the same app.
 
 **Diarization:**
-- `FluidDiarizer` uses FluidAudio (CoreML/ANE) for on-device speaker diarization — no HuggingFace token needed.
+- `FluidDiarizer` uses FluidAudio (CoreML/ANE) for on-device speaker diarization — no HuggingFace token needed. Two modes: `.offlineDiarizer` (default) and `.sortformer` (overlap-aware, via `SortformerDiarizer`). Selected via `AppSettings.diarizerMode`.
 - **Dual-track diarization:** App and mic tracks are diarized separately. Speaker IDs are prefixed (`R_` for remote/app, `M_` for mic/local), merged, and assigned via `assignSpeakersDualTrack`. Single-source recordings fall back to diarizing the mix with `assignSpeakers`.
 - `SpeakerMatcher` stores speaker embeddings in `speakers.json` and matches via cosine similarity (multi-embedding, max 5 per speaker, confidence margin 0.10).
 - `DiarizationProvider` protocol enables mock injection in tests.
 
+**VAD preprocessing:**
+- `FluidVAD` wraps FluidAudio Silero v6 for voice activity detection. When enabled (`AppSettings.vadEnabled`), silence is trimmed before transcription and timestamps are remapped back to the original timeline via `VadSegmentMap`.
+- `PipelineQueue` holds a cached `FluidVAD` instance (reused across jobs). Pass `vadConfig: nil` to disable.
+
 **Protocol generation:**
 - `ProtocolGenerating` protocol with two implementations: `ClaudeCLIProtocolGenerator` and `OpenAIProtocolGenerator`.
 - `AppSettings.protocolProvider` enum (`.claudeCLI` / `.openAICompatible` / `.none`) selects the provider. `.none` skips LLM generation and saves the transcript only.
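The timestamp remapping that `VadSegmentMap` performs can be sketched in a few lines. This is a minimal illustration only: the struct's fields and lookup logic here are assumptions for clarity, not the actual FluidAudio or app implementation.

```swift
// Hypothetical sketch of VAD timeline remapping. Each kept speech
// segment records where it started in the original recording and
// where it landed in the silence-trimmed stream.
struct VadSegment {
    let originalStart: Double  // seconds in the untrimmed recording
    let trimmedStart: Double   // seconds in the trimmed stream
    let duration: Double
}

struct VadSegmentMap {
    let segments: [VadSegment]  // sorted by trimmedStart

    /// Map a timestamp from the trimmed stream back to the original timeline.
    func remap(_ trimmedTime: Double) -> Double {
        // Find the last segment starting at or before the timestamp.
        guard let seg = segments.last(where: { $0.trimmedStart <= trimmedTime }) else {
            return trimmedTime
        }
        // Clamp to the segment so times never land in trimmed silence.
        let offset = min(trimmedTime - seg.trimmedStart, seg.duration)
        return seg.originalStart + offset
    }
}

// Speech at 0–2 s and 10–13 s of the original survives trimming;
// everything in between was silence.
let map = VadSegmentMap(segments: [
    VadSegment(originalStart: 0.0, trimmedStart: 0.0, duration: 2.0),
    VadSegment(originalStart: 10.0, trimmedStart: 2.0, duration: 3.0),
])
print(map.remap(1.0))  // prints 1.0 — inside the first segment
print(map.remap(3.0))  // prints 11.0 — 1 s into the second segment
```

Transcription segment timestamps produced on the trimmed audio can then be passed through `remap` so the final transcript aligns with the original recording.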
@@ -233,7 +242,7 @@ Two build variants controlled by compile-time flag `APPSTORE` (`-Xswiftc -DAPPST
 | **OpenAI API** | Yes | Yes (only LLM option) |
 | **Entitlements** | Mic only | Sandbox + mic + network + file picker |
 | **Build** | `./scripts/build_release.sh` | `./scripts/build_release.sh --appstore` |
-| **Tests** | 618 | ~604 (14 CLI tests excluded) |
+| **Tests** | ~795 | fewer (CLI tests excluded via `#if !APPSTORE`) |
 
 - CLI-specific code lives in `ClaudeCLIProtocolGenerator.swift` (entire file `#if !APPSTORE`)
 - `ProtocolProvider` enum uses `CaseIterable` — `.claudeCLI` case excluded at compile time, picker adapts automatically
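The cosine-similarity matching with a 0.10 confidence margin that `SpeakerMatcher` is described as doing can be illustrated with a short sketch. Function names, data layout, and the exact margin semantics below are hypothetical, not the app's real code.

```swift
// Cosine similarity between two embedding vectors (pure stdlib).
func cosineSimilarity(_ a: [Float], _ b: [Float]) -> Float {
    var dot: Float = 0, normA: Float = 0, normB: Float = 0
    for (x, y) in zip(a, b) {
        dot += x * y
        normA += x * x
        normB += y * y
    }
    guard normA > 0, normB > 0 else { return 0 }
    return dot / (normA.squareRoot() * normB.squareRoot())
}

/// Return the best-matching speaker name, but only when the best score
/// beats the runner-up by the confidence margin (0.10 per the docs).
/// Each speaker keeps several embeddings; the best per-speaker score wins.
func matchSpeaker(embedding: [Float],
                  known: [String: [[Float]]],
                  margin: Float = 0.10) -> String? {
    let scored = known
        .map { entry in (entry.key, entry.value.map { cosineSimilarity(embedding, $0) }.max() ?? -1) }
        .sorted { $0.1 > $1.1 }
    guard let best = scored.first else { return nil }
    let runnerUp = scored.dropFirst().first?.1 ?? -1
    return best.1 - runnerUp >= margin ? best.0 : nil
}

let known: [String: [[Float]]] = [
    "alice": [[1, 0, 0]],
    "bob":   [[0, 1, 0]],
]
print(matchSpeaker(embedding: [0.9, 0.1, 0], known: known) ?? "unknown")  // prints alice
```

Returning `nil` when the margin is not met lets the caller register a new speaker embedding instead of forcing an ambiguous match.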

README.md

Lines changed: 4 additions & 3 deletions
@@ -55,12 +55,13 @@ A native macOS menu bar app that automatically detects, records, transcribes, an
 - **Dual audio recording** — App audio ([CATapDescription](https://developer.apple.com/documentation/coreaudio/catap)) + microphone simultaneously
 - **On-device transcription** — Three engines, selectable in Settings:
   - [WhisperKit](https://github.com/argmaxinc/WhisperKit) — 99+ languages, ~1 GB model
-  - [Parakeet TDT v3](https://github.com/FluidInference/FluidAudio) (NVIDIA) — 25 EU languages, ~50 MB model, ~10× faster
+  - [Parakeet TDT v3](https://github.com/FluidInference/FluidAudio) (NVIDIA) — 25 EU languages, ~50 MB model, ~10× faster, custom vocabulary support (CTC boosting)
   - [Qwen3-ASR](https://github.com/FluidInference/FluidAudio) (Alibaba) — 30 languages, ~1.75 GB model, macOS 15+
-- **On-device speaker diarization** — [FluidAudio](https://github.com/FluidInference/FluidAudio) via CoreML/ANE — no HuggingFace token needed
+- **On-device speaker diarization** — [FluidAudio](https://github.com/FluidInference/FluidAudio) via CoreML/ANE — no HuggingFace token needed; two modes: standard (`OfflineDiarizer`) and overlap-aware (`Sortformer`)
 - **Dual-track diarization** — App and mic tracks diarized separately for clean speaker separation without echo interference
 - **Speaker recognition** — Voice embeddings stored across meetings, matched via cosine similarity
-- **AI protocol generation** — Structured Markdown via [Claude Code CLI](https://docs.anthropic.com/en/docs/claude-code) or OpenAI-compatible APIs (Ollama, LM Studio, etc.)
+- **VAD preprocessing** — Optional silence trimming via FluidAudio Silero v6 before transcription, with automatic timestamp remapping
+- **AI protocol generation** — Structured Markdown via [Claude Code CLI](https://docs.anthropic.com/en/docs/claude-code), OpenAI-compatible APIs (Ollama, LM Studio, etc.), or disabled (save transcript only)
 - **Configurable protocol prompt** — Custom prompt file support (`~/Library/Application Support/MeetingTranscriber/protocol_prompt.md`)
 - **Manual recording** — Record any app via app picker, not just detected meetings
 - **Multi-format input** — Supports WAV, MP3, M4A, MP4, and with ffmpeg also MKV, WebM, OGG

docs/architecture-macos.md

Lines changed: 9 additions & 4 deletions
@@ -64,6 +64,7 @@ Meeting Window Detected (CGWindowListCopyWindowInfo)
 | `AudioMixer.swift` | Resampling, mixing, echo suppression, mute masking, WAV I/O |
 | `AudioConstants.swift` | Shared audio pipeline constants (target sample rate) |
 | `MicRecorder.swift` | Microphone recording via AVAudioEngine |
+| `FluidVAD.swift` | VAD preprocessing via FluidAudio Silero v6 — silence trimming + `VadSegmentMap` timeline remapping |
 | `tools/audiotap/Sources/` | AudioTapLib — CATapDescription-based app audio capture (SPM library) |
 
 ### Support
@@ -102,17 +103,17 @@ PipelineQueue: waiting → transcribing → [diarizing] → generatingProtocol
 ```
 AudioTapLib (CATapDescription)
 ├─ Input: App PID → CoreAudio process tap → aggregate device
-├─ Output: Interleaved float32 stereo → FileHandle (raw PCM)
+├─ Output: Interleaved float32 (mono or stereo) → FileHandle (raw PCM)
 ├─ Mic: AVAudioEngine → mono WAV file (MicCaptureHandler)
-└─ Metadata: micDelay, actualSampleRate via AudioCaptureResult
+└─ Metadata: micDelay, actualSampleRate, actualChannels via AudioCaptureResult
 ```
 
 **Key:** CATapDescription requires NO Screen Recording permission (purple dot indicator only). Handles output device changes by recreating tap automatically.
 
 ### Processing (DualSourceRecorder.stop())
 
 ```
-Raw float32 stereo → mono (channel average)
+Raw float32 (mono or stereo, actual channel count from AudioCaptureResult) → mono
 → Resample to 16kHz
 → Save app.wav (16kHz mono)
 → Load mic.wav (already 16kHz from MicCaptureHandler)
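The channel-average downmix step above is straightforward to sketch. This is a hedged illustration using the actual channel count the metadata reports; `downmixToMono` is a hypothetical helper, not the app's real code.

```swift
// Downmix interleaved float32 samples to mono by averaging channels,
// honoring the channel count reported at capture time (cf. the
// `actualChannels` field added to AudioCaptureResult).
func downmixToMono(_ interleaved: [Float], channels: Int) -> [Float] {
    precondition(channels > 0, "channel count must be positive")
    if channels == 1 { return interleaved }  // mono taps pass through
    let frames = interleaved.count / channels
    return (0..<frames).map { frame in
        var sum: Float = 0
        for ch in 0..<channels {
            sum += interleaved[frame * channels + ch]
        }
        return sum / Float(channels)
    }
}

// Two stereo frames (L, R): (0.25, 0.75) and (1.0, 0.0).
print(downmixToMono([0.25, 0.75, 1.0, 0.0], channels: 2))  // prints [0.5, 0.5]
```

Averaging (rather than summing) keeps the mono signal inside the original amplitude range, so no extra limiting pass is needed before resampling to 16 kHz.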
@@ -185,7 +186,11 @@ All recordings are normalized to 16kHz at capture time — no resampling needed
 
 On-device speaker diarization using FluidAudio (CoreML/ANE). No HuggingFace token or Python subprocess needed. Models downloaded automatically on first run (~50 MB).
 
-Flow: `FluidDiarizer.run(audioPath, numSpeakers)` → `OfflineDiarizerManager` → `DiarizationResult` with segments, speaking times, and speaker embeddings.
+Two modes selected via `AppSettings.diarizerMode`:
+- **`.offlineDiarizer`** (default) — `OfflineDiarizerManager`, standard speaker segmentation
+- **`.sortformer`** — `SortformerDiarizer`, overlap-aware diarization (handles simultaneous speech)
+
+Flow: `FluidDiarizer.run(audioPath, numSpeakers)` → selected diarizer → `DiarizationResult` with segments, speaking times, and speaker embeddings.
 
 ### Speaker Matching
191196
