@SakuzyPeng SakuzyPeng commented Jan 21, 2026

Summary

Add new -olrcw/--output-lrc-word option for word-level LRC output with inline timestamps per token, and fix UTF-8 character handling issues.

Changes

  • Add output_lrc_word parameter and CLI option -olrcw
  • Implement output_lrc_word() function with per-token timestamps
  • Fix UTF-8 multi-byte character handling by merging continuation bytes
  • Enable token_timestamps when output_lrc_word is set
  • Handle diarize speaker prefix without breaking LRC format
  • Update README.md with new option

UTF-8 Fix (addresses #1798)

CJK characters (typically 3 bytes each in UTF-8) were being split across tokens, so timestamps were inserted between the bytes of a single character:

Before (broken):

[00:00.50]ã[00:00.52]ª[00:00.54]ã...

After (fixed):

[00:00.50]私[00:00.80]は[00:01.10]歌...

The fix detects UTF-8 continuation bytes (10xxxxxx) and merges them with the previous token.
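The merging step can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the `TimedToken` struct and the `is_utf8_continuation` and `merge_split_utf8` helpers are hypothetical names, and the sketch assumes each token carries its raw bytes plus a start time in centiseconds. A token whose first byte matches `10xxxxxx` is appended to the previous token, which keeps its earlier timestamp.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical token record for illustration; not the struct used in the PR.
struct TimedToken {
    std::string text; // raw UTF-8 bytes of the token (may be a partial character)
    int64_t     t0;   // start time, assumed here to be in centiseconds
};

// True when the byte has the bit pattern 10xxxxxx (a UTF-8 continuation byte).
static bool is_utf8_continuation(unsigned char c) {
    return (c & 0xC0) == 0x80;
}

// Append every token that starts with a continuation byte to the token before it,
// so that no timestamp ends up between the bytes of one character.
std::vector<TimedToken> merge_split_utf8(const std::vector<TimedToken> & tokens) {
    std::vector<TimedToken> out;
    for (const auto & tok : tokens) {
        const bool starts_mid_char =
            !tok.text.empty() &&
            is_utf8_continuation(static_cast<unsigned char>(tok.text[0]));
        if (starts_mid_char && !out.empty()) {
            out.back().text += tok.text; // keep the earlier token's timestamp
        } else {
            out.push_back(tok);
        }
    }
    return out;
}
```

With this approach, the three byte-level tokens that make up 私 (`E7`, `A7`, `81`) collapse into a single token carrying the timestamp of the first byte.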

Output Format

[by:whisper.cpp]
[00:00.50]私[00:00.80]は[00:01.10]歌[00:01.40]を[00:01.70]歌[00:02.00]います
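For reference, a line in this format could be assembled along these lines. This is only a sketch under stated assumptions: `lrc_timestamp` and `lrc_word_line` are hypothetical helper names rather than functions from this PR, and times are assumed to be given in centiseconds (10 ms units).

```cpp
#include <cstdint>
#include <cstdio>
#include <string>
#include <utility>
#include <vector>

// Format a time in centiseconds as an LRC tag of the form [mm:ss.xx].
// Hypothetical helper; negative inputs are clamped to zero.
std::string lrc_timestamp(int64_t cs) {
    if (cs < 0) {
        cs = 0;
    }
    char buf[32];
    std::snprintf(buf, sizeof(buf), "[%02lld:%02lld.%02lld]",
                  static_cast<long long>(cs / 6000),       // minutes
                  static_cast<long long>((cs / 100) % 60), // seconds
                  static_cast<long long>(cs % 100));       // hundredths
    return std::string(buf);
}

// Build one word-level LRC line: each (start time, text) pair becomes
// an inline timestamp immediately followed by the token text.
std::string lrc_word_line(const std::vector<std::pair<int64_t, std::string>> & words) {
    std::string line;
    for (const auto & w : words) {
        line += lrc_timestamp(w.first);
        line += w.second;
    }
    return line;
}
```

Token text is passed through untouched, so the sketch relies on multi-byte characters having been merged into whole tokens beforehand.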

Test Plan

  • Tested with Japanese songs (CJK characters)
  • Verified UTF-8 characters are not split
  • Verified timestamps are accurate (with DTW)

Add new -olrcw/--output-lrc-word option for word-level LRC output with
inline timestamps per token.

Key changes:
- Add output_lrc_word parameter and CLI option
- Implement output_lrc_word() function with per-token timestamps
- Fix UTF-8 multi-byte character handling (merge continuation bytes)
- Enable token_timestamps when output_lrc_word is set
- Handle diarize speaker prefix without breaking LRC format
- Update README.md with new option

The UTF-8 fix addresses issue ggml-org#1798 where CJK characters were split
across tokens with timestamps inserted between bytes.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>