feat(cli): add word-level LRC output with UTF-8 fix #3619

SakuzyPeng · 2026-01-21T18:26:58Z

Summary

Add a new -olrcw/--output-lrc-word option that writes word-level LRC output with an inline timestamp per token, and fix multi-byte UTF-8 characters being split across tokens.

Changes

- Add output_lrc_word parameter and CLI option -olrcw
- Implement output_lrc_word() function with per-token timestamps
- Fix UTF-8 multi-byte character handling by merging continuation bytes
- Enable token_timestamps when output_lrc_word is set
- Handle diarize speaker prefix without breaking LRC format
- Update README.md with new option

UTF-8 Fix (addresses #1798)

CJK characters (typically 3 bytes each in UTF-8) were being split across tokens, so timestamps were inserted between the bytes of a single character:

Before (broken):

[00:00.50]ã[00:00.52]ª[00:00.54]ã...

After (fixed):

[00:00.50]私[00:00.80]は[00:01.10]歌...

The fix detects UTF-8 continuation bytes (10xxxxxx) and merges them with the previous token.
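
The merging step can be illustrated with a minimal Python sketch. This is not the PR's C++ code: the function names and the `(start_time, raw_bytes)` token shape are assumptions made only for illustration. A byte is a continuation byte when its top two bits are `10`, which is the check `(b & 0xC0) == 0x80`.

```python
def is_continuation_byte(b: int) -> bool:
    """True if b matches the UTF-8 continuation pattern 10xxxxxx."""
    return (b & 0xC0) == 0x80


def merge_split_tokens(tokens):
    """Merge tokens that begin with a continuation byte into the previous token.

    tokens: list of (start_time, raw_bytes) pairs (hypothetical shape).
    Returns a list of (start_time, text) pairs; each merged token keeps
    the start time of the token that opened the character.
    """
    merged = []  # list of [start_time, bytearray]
    for start, raw in tokens:
        if merged and raw and is_continuation_byte(raw[0]):
            # This token starts mid-character: append it to the previous one.
            merged[-1][1].extend(raw)
        else:
            merged.append([start, bytearray(raw)])
    # errors="replace" keeps the sketch from raising on leftover stray bytes.
    return [(start, bytes(buf).decode("utf-8", errors="replace"))
            for start, buf in merged]
```

In this sketch the timestamp of the merged token is the one attached to the leading byte, so no timestamp is emitted in the middle of a character.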

Output Format

[by:whisper.cpp]
[00:00.50]私[00:00.80]は[00:01.10]歌[00:01.40]を[00:01.70]歌[00:02.00]います
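
As a rough illustration of how a line in this format can be assembled, the Python sketch below builds `[mm:ss.xx]` tags and joins them with the token text. The function names are hypothetical, and it assumes token start times are given as integers in hundredths of a second; the PR's actual C++ implementation may differ.

```python
def format_lrc_timestamp(centiseconds: int) -> str:
    """Format a time given in hundredths of a second as [mm:ss.xx]."""
    minutes, rest = divmod(centiseconds, 6000)
    seconds, hundredths = divmod(rest, 100)
    return f"[{minutes:02d}:{seconds:02d}.{hundredths:02d}]"


def format_word_level_line(tokens) -> str:
    """Join (start_centiseconds, text) pairs into one word-level LRC line."""
    return "".join(format_lrc_timestamp(t) + text for t, text in tokens)


line = format_word_level_line([(50, "私"), (80, "は"), (110, "歌")])
print(line)
```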

Test Plan

- Tested with Japanese songs (CJK characters)
- Verified UTF-8 characters are not split
- Verified timestamps are accurate (with DTW)
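
One way to check for split characters automatically is to cut an output line at every timestamp tag and confirm that each piece of text is valid UTF-8 on its own. The Python sketch below is a hypothetical helper written for illustration, not part of the PR; it assumes timestamps have the `[mm:ss.xx]` shape shown in the output format.

```python
import re

# Matches inline tags such as [00:00.50]; header tags like [by:...] do not match.
TIMESTAMP = re.compile(rb"\[\d{2,}:\d{2}\.\d{2}\]")


def find_split_characters(line: bytes):
    """Return the text chunks between timestamps that are not valid UTF-8."""
    bad = []
    for chunk in TIMESTAMP.split(line):
        try:
            chunk.decode("utf-8")  # strict decoding rejects partial characters
        except UnicodeDecodeError:
            bad.append(chunk)
    return bad
```

An empty result means no timestamp was inserted inside a multi-byte character on that line.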

Add new -olrcw/--output-lrc-word option for word-level LRC output with inline timestamps per token.

Key changes:
- Add output_lrc_word parameter and CLI option
- Implement output_lrc_word() function with per-token timestamps
- Fix UTF-8 multi-byte character handling (merge continuation bytes)
- Enable token_timestamps when output_lrc_word is set
- Handle diarize speaker prefix without breaking LRC format
- Update README.md with new option

The UTF-8 fix addresses issue ggml-org#1798 where CJK characters were split across tokens with timestamps inserted between bytes.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
