Skip to content

Commit 92755e0

Browse files
authored
feat: add multilingual G2P model and benchmark CLI command (#367)
## Summary - Add CharsiuG2P ByT5 CoreML multilingual G2P model (`MultilingualG2PModel`, `MultilingualG2PLanguage`, `MultilingualG2PError`) supporting 9 Kokoro-mapped languages - Add `g2p-benchmark` CLI command measuring PER/WER/speed against CharsiuG2P test set with JSON output - Switch both English and multilingual G2P models to `cpuOnly` compute units (benchmarked 2-3x faster than GPU/ANE for autoregressive decoding) - Add `LevenshteinDistance` utility and `MultilingualG2PTests` (9 tests) ### Benchmark Results (M2, CPU-only, 500 words/language) | Language | PER | WER | ms/word | |---|---|---|---| | Spanish | 0.1% | 0.8% | 32.6 | | French | 0.8% | 2.0% | 26.5 | | Italian | 2.8% | 20.0% | 20.9 | | Hindi | 4.5% | 21.4% | 45.4 | | Japanese | 10.5% | 23.8% | 31.7 | | Portuguese | 8.9% | 43.2% | 24.0 | | British English | 13.6% | 29.4% | 34.0 | | American English | 19.0% | 38.8% | 28.2 | | Chinese | 86.2% | 95.0% | 53.9 | ### Compute Unit Benchmarks (English BART G2P) | Config | ms/word | |---|---| | cpuOnly | **13.0** | | all (ANE+GPU+CPU) | 17.3 | | cpuAndGPU | 23.4 | ## Test plan - [ ] `swift build` compiles clean - [ ] `swift test --filter MultilingualG2PTests` passes (9 tests) - [ ] `fluidaudiocli g2p-benchmark --languages eng-us --max-words 10 --data-dir <path>` produces results - [ ] Verify JSON output file is written correctly <!-- devin-review-badge-begin --> --- <a href="https://app.devin.ai/review/fluidinference/fluidaudio/pull/367" target="_blank"> <picture> <source media="(prefers-color-scheme: dark)" srcset="https://static.devin.ai/assets/gh-open-in-devin-review-dark.svg?v=1"> <img src="https://static.devin.ai/assets/gh-open-in-devin-review-light.svg?v=1" alt="Open with Devin"> </picture> </a> <!-- devin-review-badge-end -->
1 parent d83c9e2 commit 92755e0

File tree

12 files changed

+930
-20
lines changed

12 files changed

+930
-20
lines changed

Documentation/Benchmarks.md

Lines changed: 52 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -615,3 +615,55 @@ TS3003a 41.8 36.8 0.7 4.3 4/4 125.7
615615
AVERAGE 31.7 21.5 0.5 9.7 - 126.7
616616
======================================================================
617617
```
618+
619+
## Multilingual G2P (Grapheme-to-Phoneme)
620+
621+
CharsiuG2P ByT5 encoder-decoder model converted to CoreML for multilingual grapheme-to-phoneme conversion. Used by Kokoro TTS for non-English phonemization.
622+
623+
Model: [FluidInference/charsiu-g2p-byt5-coreml](https://huggingface.co/FluidInference/charsiu-g2p-byt5-coreml)
624+
625+
Hardware: Apple M2, 2022, macOS 26
626+
627+
### CharsiuG2P Test Set (500 words/language)
628+
629+
```bash
630+
swift run -c release fluidaudiocli g2p-benchmark --data-dir /path/to/CharsiuG2P/data/test
631+
```
632+
633+
| Language | PER | WER | ms/word |
634+
|---|---|---|---|
635+
| Spanish | 0.1% | 0.8% | 32.6 |
636+
| French | 0.8% | 2.0% | 26.5 |
637+
| Italian | 2.8% | 20.0% | 20.9 |
638+
| Hindi | 4.5% | 21.4% | 45.4 |
639+
| Japanese | 10.5% | 23.8% | 31.7 |
640+
| Portuguese (BR) | 8.9% | 43.2% | 24.0 |
641+
| British English | 13.6% | 29.4% | 34.0 |
642+
| American English | 19.0% | 38.8% | 28.2 |
643+
| Chinese | 86.2%* | 95.0%* | 53.9 |
644+
| **Average** | **16.3%** | **30.5%** | **33.0** |
645+
646+
*\*Chinese PER is inflated due to tone notation mismatch between model output and reference data (tone contour marks vs model format), not a model accuracy issue.*
647+
648+
- **PER** (Phoneme Error Rate): Character-level Levenshtein distance / reference length, stress marks stripped
649+
- **WER** (Word Error Rate): Fraction of words with any phoneme error
650+
651+
### Compute Unit Comparison
652+
653+
Both the English BART G2P and multilingual ByT5 G2P models run fastest on CPU-only due to GPU/ANE dispatch overhead on small autoregressive decoder steps.
654+
655+
**Multilingual G2P (ByT5)**
656+
657+
| Compute Units | ms/word |
658+
|---|---|
659+
| cpuOnly | **38.7** |
660+
| cpuAndGPU | 94.7 |
661+
| all (ANE+GPU+CPU) | 95.2 |
662+
663+
**English G2P (BART)**
664+
665+
| Compute Units | ms/word |
666+
|---|---|
667+
| cpuOnly | **13.0** |
668+
| all (ANE+GPU+CPU) | 17.3 |
669+
| cpuAndGPU | 23.4 |

Sources/FluidAudio/ModelNames.swift

Lines changed: 21 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -15,6 +15,7 @@ public enum Repo: String, CaseIterable {
1515
case pocketTts = "FluidInference/pocket-tts-coreml"
1616
case qwen3Asr = "FluidInference/qwen3-asr-0.6b-coreml/f32"
1717
case qwen3AsrInt8 = "FluidInference/qwen3-asr-0.6b-coreml/int8"
18+
case multilingualG2p = "FluidInference/charsiu-g2p-byt5-coreml"
1819

1920
/// Repository slug (without owner)
2021
public var name: String {
@@ -45,6 +46,8 @@ public enum Repo: String, CaseIterable {
4546
return "qwen3-asr-0.6b-coreml/f32"
4647
case .qwen3AsrInt8:
4748
return "qwen3-asr-0.6b-coreml/int8"
49+
case .multilingualG2p:
50+
return "charsiu-g2p-byt5-coreml"
4851
}
4952
}
5053

@@ -95,6 +98,8 @@ public enum Repo: String, CaseIterable {
9598
return "sortformer"
9699
case .pocketTts:
97100
return "pocket-tts"
101+
case .multilingualG2p:
102+
return "charsiu-g2p-byt5"
98103
default:
99104
return name
100105
}
@@ -339,6 +344,20 @@ public enum ModelNames {
339344
]
340345
}
341346

347+
/// Multilingual G2P (CharsiuG2P ByT5) model names
348+
public enum MultilingualG2P {
349+
public static let encoder = "MultilingualG2PEncoder"
350+
public static let decoder = "MultilingualG2PDecoder"
351+
352+
public static let encoderFile = encoder + ".mlmodelc"
353+
public static let decoderFile = decoder + ".mlmodelc"
354+
355+
public static let requiredModels: Set<String> = [
356+
encoderFile,
357+
decoderFile,
358+
]
359+
}
360+
342361
/// G2P (grapheme-to-phoneme) model names
343362
public enum G2P {
344363
public static let encoder = "G2PEncoder"
@@ -436,6 +455,8 @@ public enum ModelNames {
436455
return ModelNames.Sortformer.requiredModels
437456
case .qwen3Asr, .qwen3AsrInt8:
438457
return ModelNames.Qwen3ASR.requiredModelsFull
458+
case .multilingualG2p:
459+
return ModelNames.MultilingualG2P.requiredModels
439460
}
440461
}
441462
}
Lines changed: 19 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,19 @@
1+
import Foundation
2+
3+
/// Errors raised by ``MultilingualG2PModel``.
4+
public enum MultilingualG2PError: Error, LocalizedError {
5+
case modelLoadFailed(String)
6+
case encoderPredictionFailed
7+
case decoderPredictionFailed
8+
9+
public var errorDescription: String? {
10+
switch self {
11+
case .modelLoadFailed(let detail):
12+
return "Failed to load multilingual G2P CoreML model: \(detail)"
13+
case .encoderPredictionFailed:
14+
return "Multilingual G2P encoder prediction failed."
15+
case .decoderPredictionFailed:
16+
return "Multilingual G2P decoder prediction failed."
17+
}
18+
}
19+
}
Lines changed: 53 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,53 @@
1+
import Foundation
2+
3+
/// Languages supported by the CharsiuG2P ByT5 multilingual model,
4+
/// mapped to Kokoro voice prefixes.
5+
public enum MultilingualG2PLanguage: String, CaseIterable, Sendable {
6+
case americanEnglish = "eng-us"
7+
case britishEnglish = "eng-uk"
8+
case spanish = "spa"
9+
case french = "fra"
10+
case hindi = "hin"
11+
case italian = "ita"
12+
case japanese = "jpn"
13+
case brazilianPortuguese = "por-bz"
14+
case mandarinChinese = "cmn"
15+
16+
/// The CharsiuG2P language code used in the model input prefix.
17+
public var charsiuCode: String { rawValue }
18+
19+
/// The formatted prefix prepended to input words (e.g. `"<eng-us>: "`).
20+
public var prefix: String { "<\(charsiuCode)>: " }
21+
22+
/// Infer the language from a Kokoro voice identifier.
23+
///
24+
/// Kokoro voices use a two-character prefix indicating language and gender
25+
/// (e.g. `"af_heart"` → American English female). Returns `nil` for
26+
/// unrecognized prefixes.
27+
public static func fromKokoroVoice(_ voiceId: String) -> MultilingualG2PLanguage? {
28+
guard voiceId.count >= 2 else { return nil }
29+
let prefix = String(voiceId.prefix(2))
30+
switch prefix {
31+
case "af", "am":
32+
return .americanEnglish
33+
case "bf", "bm":
34+
return .britishEnglish
35+
case "ef", "em":
36+
return .spanish
37+
case "ff", "fm":
38+
return .french
39+
case "hf", "hm":
40+
return .hindi
41+
case "if", "im":
42+
return .italian
43+
case "jf", "jm":
44+
return .japanese
45+
case "pf", "pm":
46+
return .brazilianPortuguese
47+
case "zf", "zm":
48+
return .mandarinChinese
49+
default:
50+
return nil
51+
}
52+
}
53+
}
Lines changed: 174 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,174 @@
1+
import CoreML
2+
import Foundation
3+
4+
/// Thread-safe CoreML-based multilingual grapheme-to-phoneme converter.
5+
///
6+
/// Uses the CharsiuG2P ByT5 encoder-decoder model to convert words in multiple
7+
/// languages to IPA phonemes. The model uses byte-level tokenization (no vocab
8+
/// file required).
9+
public actor MultilingualG2PModel {
10+
11+
public static let shared = MultilingualG2PModel()
12+
13+
private let logger = AppLogger(subsystem: "com.fluidaudio.tts", category: "MultilingualG2PModel")
14+
15+
// ByT5 special token IDs
16+
private let padTokenId: Int32 = 0
17+
private let eosTokenId: Int32 = 1
18+
19+
// Byte offset: byte value b maps to token ID b + 3
20+
private let byteOffset: Int32 = 3
21+
22+
private let maxDecodeSteps = 128
23+
24+
// CoreML models (lazy-loaded)
25+
private var encoder: MLModel?
26+
private var decoder: MLModel?
27+
28+
private init() {}
29+
30+
/// Convert a word to IPA phonemes using the multilingual G2P model.
31+
///
32+
/// - Parameters:
33+
/// - word: The word to convert.
34+
/// - language: The target language for phonemization.
35+
/// - Returns: An array of IPA phoneme strings, or `nil` if the model is
36+
/// unavailable (e.g. in CI).
37+
public func phonemize(word: String, language: MultilingualG2PLanguage) throws -> [String]? {
38+
do {
39+
try loadIfNeeded()
40+
} catch {
41+
if ProcessInfo.processInfo.environment["CI"] != nil {
42+
logger.warning(
43+
"Multilingual G2P unavailable in CI, returning nil for word: \(word)")
44+
return nil
45+
}
46+
throw error
47+
}
48+
49+
guard let encoder, let decoder else { return nil }
50+
51+
// Build input: "<lang-code>: word" encoded as UTF-8 bytes → token IDs
52+
let inputText = "\(language.prefix)\(word)"
53+
let inputBytes = Array(inputText.utf8)
54+
let inputIds = inputBytes.map { Int32($0) + byteOffset }
55+
56+
let encLen = inputIds.count
57+
58+
// Encoder input arrays
59+
let encoderInputIds = try MLMultiArray(shape: [1, NSNumber(value: encLen)], dataType: .int32)
60+
let attentionMask = try MLMultiArray(shape: [1, NSNumber(value: encLen)], dataType: .int32)
61+
for i in 0..<encLen {
62+
encoderInputIds[[0, i] as [NSNumber]] = NSNumber(value: inputIds[i])
63+
attentionMask[[0, i] as [NSNumber]] = NSNumber(value: Int32(1))
64+
}
65+
66+
// Run encoder
67+
let encoderProvider = try MLDictionaryFeatureProvider(
68+
dictionary: [
69+
"input_ids": MLFeatureValue(multiArray: encoderInputIds),
70+
"attention_mask": MLFeatureValue(multiArray: attentionMask),
71+
]
72+
)
73+
guard let encoderOutput = try? encoder.prediction(from: encoderProvider),
74+
let encoderHidden = encoderOutput.featureValue(for: "last_hidden_state")?.multiArrayValue
75+
else {
76+
throw MultilingualG2PError.encoderPredictionFailed
77+
}
78+
79+
// Greedy autoregressive decode
80+
var outputTokens: [Int32] = []
81+
var decoderIds: [Int32] = [padTokenId] // decoder start token
82+
83+
for _ in 0..<maxDecodeSteps {
84+
let decLen = decoderIds.count
85+
86+
let decInput = try MLMultiArray(
87+
shape: [1, NSNumber(value: decLen)], dataType: .int32)
88+
for i in 0..<decLen {
89+
decInput[[0, i] as [NSNumber]] = NSNumber(value: decoderIds[i])
90+
}
91+
92+
let decoderProvider = try MLDictionaryFeatureProvider(
93+
dictionary: [
94+
"decoder_input_ids": MLFeatureValue(multiArray: decInput),
95+
"encoder_hidden_states": MLFeatureValue(multiArray: encoderHidden),
96+
"encoder_attention_mask": MLFeatureValue(multiArray: attentionMask),
97+
]
98+
)
99+
100+
guard let decoderOutput = try? decoder.prediction(from: decoderProvider),
101+
let logits = decoderOutput.featureValue(for: "logits")?.multiArrayValue
102+
else {
103+
throw MultilingualG2PError.decoderPredictionFailed
104+
}
105+
106+
// Argmax over last position
107+
let vocabSize = logits.shape.last!.intValue
108+
let lastPos = decLen - 1
109+
var bestId: Int32 = 0
110+
var bestVal: Float = -.infinity
111+
for v in 0..<vocabSize {
112+
let val = logits[[0, lastPos, v] as [NSNumber]].floatValue
113+
if val > bestVal {
114+
bestVal = val
115+
bestId = Int32(v)
116+
}
117+
}
118+
119+
if bestId == eosTokenId { break }
120+
121+
outputTokens.append(bestId)
122+
decoderIds = [padTokenId] + outputTokens
123+
}
124+
125+
// Decode token IDs back to UTF-8 string
126+
let outputBytes = outputTokens.compactMap { tokenId -> UInt8? in
127+
let byteVal = tokenId - byteOffset
128+
guard byteVal >= 0, byteVal <= 255 else { return nil }
129+
return UInt8(byteVal)
130+
}
131+
132+
guard let ipaString = String(bytes: outputBytes, encoding: .utf8), !ipaString.isEmpty else {
133+
return nil
134+
}
135+
136+
// Split IPA string into individual phoneme characters
137+
return ipaString.map { String($0) }.filter { !$0.trimmingCharacters(in: .whitespaces).isEmpty }
138+
}
139+
140+
/// Verifies that CoreML models can be loaded.
141+
public func ensureModelsAvailable() throws {
142+
try loadIfNeeded()
143+
}
144+
145+
// MARK: - Private
146+
147+
private func loadIfNeeded() throws {
148+
if encoder != nil && decoder != nil { return }
149+
150+
let modelsDir = try TtsModels.cacheDirectoryURL()
151+
.appendingPathComponent("Models")
152+
.appendingPathComponent(Repo.multilingualG2p.folderName)
153+
154+
let encoderURL = modelsDir.appendingPathComponent(ModelNames.MultilingualG2P.encoderFile)
155+
guard FileManager.default.fileExists(atPath: encoderURL.path) else {
156+
throw MultilingualG2PError.modelLoadFailed(
157+
"\(ModelNames.MultilingualG2P.encoderFile) not found at \(encoderURL.path)")
158+
}
159+
160+
let decoderURL = modelsDir.appendingPathComponent(ModelNames.MultilingualG2P.decoderFile)
161+
guard FileManager.default.fileExists(atPath: decoderURL.path) else {
162+
throw MultilingualG2PError.modelLoadFailed(
163+
"\(ModelNames.MultilingualG2P.decoderFile) not found at \(decoderURL.path)")
164+
}
165+
166+
let config = MLModelConfiguration()
167+
config.computeUnits = .cpuOnly
168+
169+
encoder = try MLModel(contentsOf: encoderURL, configuration: config)
170+
decoder = try MLModel(contentsOf: decoderURL, configuration: config)
171+
172+
logger.info("Loaded multilingual G2P CoreML models from \(modelsDir.path)")
173+
}
174+
}

Sources/FluidAudio/TTS/Kokoro/Assets/Lexicon/G2PModel.swift

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -214,7 +214,7 @@ actor G2PModel {
214214
}
215215

216216
let config = MLModelConfiguration()
217-
config.computeUnits = .cpuAndGPU
217+
config.computeUnits = .cpuOnly
218218

219219
encoder = try MLModel(contentsOf: encoderURL, configuration: config)
220220
decoder = try MLModel(contentsOf: decoderURL, configuration: config)

0 commit comments

Comments
 (0)