🌐 Official Website | 🖥️ GitHub | 🤗 Model | 📑 Blog |
Advanced forced alignment and subtitle generation powered by 🤗 Lattice-1 model.
- Features
- Installation
- Quick Start
- CLI Reference
- Python SDK
- Advanced Features
- Text Processing
- Supported Formats & Languages
- Roadmap
- Development
| Feature | Description |
|---|---|
| Forced Alignment | Word-level and segment-level audio-text synchronization powered by Lattice-1 |
| Multi-Model Transcription | Gemini, Parakeet, SenseVoice, Fun-ASR, Qwen3-ASR, Whisper, and any vLLM/SGLang-served model |
| Speaker Diarization | Multi-speaker identification with label preservation |
| Caption Translation | LLM-powered translation with terminology consistency and bilingual output |
| Streaming Mode | Process audio up to 20 hours with minimal memory |
| Universal Format Support | 30+ caption/subtitle formats |
| Model | Links | Languages | Description |
|---|---|---|---|
| Lattice-1 | 🤗 HF • 🤖 MS | English, Chinese, German | Production model with mixed-language alignment support |
| Lattice-1-Alpha | 🤗 HF • 🤖 MS | English | Initial release with English forced alignment |
Model Hub: Models can be downloaded from Hugging Face (default) or ModelScope (recommended for users in China):
# Use ModelScope (faster in China)
lai alignment align audio.wav caption.srt output.srt alignment.model_hub=modelscope

from lattifai.client import LattifAI
from lattifai.config import AlignmentConfig

client = LattifAI(alignment_config=AlignmentConfig(model_hub="modelscope"))

Requires Python 3.10 – 3.14
uv is a fast Python package manager (10-100x faster than pip).
# Install uv (skip if already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh

As a CLI tool (recommended for most users):
# Install globally — lai command available everywhere
uv tool install "lattifai[all]" --extra-index-url https://lattifai.github.io/pypi/simple/
# Quick test without installing
uvx --from lattifai --extra-index-url https://lattifai.github.io/pypi/simple/ lai --help

As a project dependency (for Python SDK usage):
# Add to an existing project
uv add "lattifai[all]" --extra-index-url https://lattifai.github.io/pypi/simple/

# Full installation (recommended)
pip install "lattifai[all]" --extra-index-url https://lattifai.github.io/pypi/simple/

Configure pip globally (optional, to avoid --extra-index-url each time):
# Add to ~/.pip/pip.conf (Linux/macOS) or %APPDATA%\pip\pip.ini (Windows)
[global]
extra-index-url = https://lattifai.github.io/pypi/simple/

| Extra | Includes |
|---|---|
| (base) | Forced alignment, Gemini transcription, YouTube, captions |
| transcription | Local ASR models (Parakeet, SenseVoice, Fun-ASR) |
| diarization | Speaker diarization (NeMo, pyannote) |
| translation | LLM-powered caption translation (OpenAI-compatible) |
| event | Audio event detection |
| all | Base + transcription + diarization + translation + event |
Note: Base installation includes alignment, Gemini transcription, and YouTube. Use [all] for local ASR models and all optional features.
Caption/subtitle format parsing is provided by lattifai-captions, a separate package supporting 30+ formats (SRT, VTT, ASS, TTML, TextGrid, NLE formats, etc.). It is automatically installed with lattifai.
LattifAI API Key (Required) - Get your free key at lattifai.com/dashboard/api-keys, or try instantly with lai auth trial.
Gemini API Key (Optional) - For transcription with Gemini models, get key at aistudio.google.com/apikey
Keys and URLs are resolved in this order (first match wins):
1. Environment variable — export LATTIFAI_API_KEY=lf_xxx
2. CLI session (~/.lattifai/config.toml) — written by lai auth login / lai auth trial; device-bound obfuscated storage
3. .env file — auto-discovered from the current directory upward
# Option 1: Environment variable
export LATTIFAI_API_KEY="lf_your_api_key_here"
export GEMINI_API_KEY="your_gemini_api_key_here"
# Option 2: CLI login (opens browser, stores key securely)
lai auth login
# Option 3: Free trial (no sign-up, 120 minutes)
lai auth trial
# Option 4: .env file in project root
cat > .env <<EOF
LATTIFAI_API_KEY=lf_your_api_key_here
LATTIFAI_BASE_URL=https://api.lattifai.com/v1
GEMINI_API_KEY=your_gemini_api_key_here
EOF

The same resolution order applies to LATTIFAI_BASE_URL and LATTIFAI_SITE_URL.
# Align audio with subtitle
lai alignment align audio.wav subtitle.srt output.srt
# YouTube video
lai youtube align "https://youtube.com/watch?v=VIDEO_ID"
# Start local browser playground (4 tabs)
lai serve run

from lattifai.client import LattifAI

client = LattifAI()
caption = client.alignment(
    input_media="audio.wav",
    input_caption="subtitle.srt",
    output_caption_path="aligned.srt",
)

| Command | Description | Example |
|---|---|---|
| lai alignment align | Align audio/video with caption | lai alignment align audio.wav caption.srt output.srt |
| lai youtube align | Download & align YouTube | lai youtube align "https://youtube.com/watch?v=ID" |
| lai transcribe run | Transcribe audio/video | lai transcribe run audio.wav output.srt |
| lai transcribe align | Transcribe and align | lai transcribe align audio.wav output.srt |
| lai translate caption | Translate captions | lai translate caption input.srt output.srt translation.target_lang=zh |
| lai caption convert | Convert caption formats | lai caption convert input.srt output.vtt |
| lai caption shift | Shift timestamps | lai caption shift input.srt output.srt 2.0 |
| lai serve run | Start local web UI playground | lai serve run |
| lai doctor | Run environment diagnostics | lai doctor |
| lai update | Update to latest version | lai update or lai update --force |
| lai config | Manage API keys & settings | lai config set lattifai_api_key lf_xxx |
# Device selection
alignment.device=cuda # cuda, mps, cpu
# Caption options
caption.split_sentence=true # Smart sentence splitting
caption.word_level=true # Word-level timestamps
# Streaming for long audio
media.streaming_chunk_secs=300
# Channel selection
media.channel_selector=left      # left, right, average, or index

LattifAI supports a wide range of ASR models — from cloud APIs to local inference to self-hosted servers:
| Model | Type | Languages | Install Extra |
|---|---|---|---|
| Gemini 2.5 Pro/Flash | Cloud API | 100+ | (base) |
| NVIDIA Parakeet | Local | 24 (European) | [transcription] |
| SenseVoice | Local | zh, en, ja, ko, yue | [transcription] |
| Fun-ASR-Nano | Local | 31 (incl. zh dialects) | [transcription] |
| Fun-ASR-MLT-Nano | Local | 31 (incl. zh dialects) | [transcription] |
| Qwen3-ASR | Local / vLLM/SGLang | 52 (30 lang + 22 zh dialects) | [transcription] |
| Whisper | vLLM/SGLang | 99 | — |
| Voxtral | vLLM/SGLang | 13 (European) | — |
| Voxtral Realtime | vLLM (realtime) | 13 (European) | — |
| Gemma-3n | vLLM (chat) | 140+ | — |
⚠️ Gemma-3n is a general-purpose multimodal LLM, not a dedicated ASR model. It has a hard 30s audio encoder limit, ~3x higher WER than Whisper, and weaker multilingual transcription. Best suited for transcription + downstream understanding (summarization, translation) rather than pure ASR accuracy.
# Gemini (cloud API, requires GEMINI_API_KEY)
transcription.model_name=gemini-2.5-pro
# Local models (requires [transcription] extra)
transcription.model_name=nvidia/parakeet-tdt-0.6b-v3
transcription.model_name=iic/SenseVoiceSmall
transcription.model_name=FunAudioLLM/Fun-ASR-MLT-Nano-2512
transcription.model_name=Qwen/Qwen3-ASR-1.7B
# vLLM/SGLang-served models (requires a running vLLM server)
transcription.model_name=Qwen/Qwen3-ASR-1.7B \
transcription.api_base_url=http://localhost:8081/v1

Transcribe audio/video files or YouTube URLs to generate timestamped captions.
# Local file
lai transcribe run audio.wav output.srt
# YouTube URL
lai transcribe run "https://youtube.com/watch?v=VIDEO_ID" output_dir=./output
# With model selection
lai transcribe run audio.wav output.srt \
transcription.model_name=gemini-2.5-pro \
transcription.device=cuda

Parameters:

- input: Path to audio/video file or YouTube URL
- output_caption: Output caption file path (for local files)
- output_dir: Output directory (for YouTube URLs; defaults to current directory)
- channel_selector: Audio channel - average (default), left, right, or channel index
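The channel_selector options above can be illustrated with a small sketch. This is a conceptual model only, not the library's audio loader; plain lists stand in for per-channel sample arrays, and select_channel is a hypothetical name.

```python
def select_channel(channels: list[list[float]], selector: str = "average"):
    """Pick or mix channels the way the documented options describe."""
    if selector == "left":
        return channels[0]
    if selector == "right":
        return channels[1]
    if selector == "average":
        # Sample-wise mean across all channels
        return [sum(samples) / len(channels) for samples in zip(*channels)]
    return channels[int(selector)]  # numeric channel index

stereo = [[2.0, 4.0], [0.0, 2.0]]  # [left, right]
print(select_channel(stereo, "average"))  # [1.0, 3.0]
```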
Transcribe and align in a single step - produces precisely aligned captions.
# Basic usage
lai transcribe align audio.wav output.srt
# With options
lai transcribe align audio.wav output.srt \
transcription.model_name=nvidia/parakeet-tdt-0.6b-v3 \
alignment.device=cuda \
caption.split_sentence=true \
caption.word_level=true

Translate caption files to any target language using LLM providers (Gemini, OpenAI-compatible).
Three translation modes with increasing quality:
| Mode | Pipeline | LLM Calls | Use Case |
|---|---|---|---|
| quick | Translate | ~1x | Quick draft, informal review |
| normal | Analyze → Translate | ~2x | Default — terminology-consistent, context-aware |
| refined | Analyze → Translate → Review → Revise | ~3x | Publication-quality professional subtitles |
What each stage does:
- Analyze (normal/refined): Scans the source text to identify domain, terminology, speaker style, and tone. Extracts a glossary of key terms with recommended translations, ensuring consistency across all segments (e.g., "forced alignment" → "强制对齐" everywhere).
- Translate: Batch-translates segments with context windows (surrounding lines for coherence). In quick mode, uses only the raw text. In normal/refined, the translation prompt includes the analysis results and glossary.
- Review (refined only): A separate reviewer pass compares each translation against the original, checking for mistranslations, omissions, tone shifts, and glossary violations. Outputs per-segment critiques.
- Revise (refined only): Applies reviewer feedback to produce a polished final version. All intermediate artifacts (analysis, prompts, drafts, critiques, revisions) can be saved with save_artifacts=true.
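The glossary extracted by the Analyze stage acts as a consistency contract for later stages. A minimal sketch of that kind of check (GLOSSARY and check_glossary are hypothetical names for illustration, not the library's API):

```python
# Example glossary entry of the kind the Analyze stage might extract.
GLOSSARY = {"forced alignment": "强制对齐"}

def check_glossary(source: str, translation: str) -> list[str]:
    """Return glossary terms whose agreed translation is missing."""
    violations = []
    for term, target in GLOSSARY.items():
        if term in source.lower() and target not in translation:
            violations.append(term)
    return violations

print(check_glossary("Forced alignment maps words to audio.",
                     "强制对齐将词映射到音频。"))  # [] — consistent
```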
# Basic (default: normal mode, bilingual, target=zh)
lai translate caption input.srt output.srt
# Quick mode to English
lai translate caption input.srt output.srt \
translation.target_lang=en \
translation.mode=quick
# Refined mode with artifacts saved
lai translate caption input.srt output.srt \
translation.target_lang=ja \
translation.mode=refined \
translation.save_artifacts=true
# Bilingual output with translation on top
lai translate caption input.srt output.srt \
translation.target_lang=zh \
caption.translation_first=true
# OpenAI-compatible API (local or third-party)
lai translate caption input.srt output.srt \
translation.llm.provider=openai \
translation.llm.api_base_url=http://localhost:8000/v1 \
translation.llm.model=qwen3
# With custom glossary
lai translate caption input.srt output.srt \
translation.glossary_file=glossary.yaml

TranslationConfig Options:

| Option | Default | Description |
|---|---|---|
| target_lang | zh | Target language code (see supported languages) |
| source_lang | auto | Source language (auto-detected if not set) |
| approach | rewrite | rewrite: natural expression, idiom adaptation; translate: accuracy, source fidelity |
| mode | normal | Translation mode: quick, normal, refined |
| bilingual | true | Output bilingual captions (original + translation) |
| style | technical | Style hint: storytelling, formal, casual, technical |
| llm.model | gemini-3-flash-preview | LLM model name |
| llm.provider | gemini | LLM provider: gemini or openai |
| llm.api_base_url | — | Base URL for OpenAI-compatible endpoint (vLLM, SGLang, Ollama) |
| batch_size | 30 | Segments per API call |
| max_concurrent | 5 | Max concurrent batch requests |
| glossary_file | — | Path to custom glossary (YAML or Markdown) |
| save_artifacts | false | Save intermediate files (analysis, prompts, critiques, revisions) |
55+ languages supported. Common codes:
| Region | Languages |
|---|---|
| East Asian | zh Chinese (Simplified), zh-TW Traditional, ja Japanese, ko Korean |
| South/SE Asian | hi Hindi, bn Bengali, th Thai, vi Vietnamese, id Indonesian, ms Malay |
| Western European | en English, es Spanish, fr French, de German, pt Portuguese, it Italian, nl Dutch |
| Northern European | sv Swedish, da Danish, no Norwegian, fi Finnish |
| Eastern European | ru Russian, uk Ukrainian, pl Polish, cs Czech, ro Romanian, hu Hungarian |
| Middle Eastern | ar Arabic, fa Persian, he Hebrew, tr Turkish |
Full list: lattifai.languages.SUPPORTED_LANGUAGES
Translation approach inspired by Baoyu (宝玉)'s AI translation methodology.
from lattifai.client import LattifAI
from lattifai.config import (
    ClientConfig,
    AlignmentConfig,
    CaptionConfig,
    CaptionInputConfig,
    DiarizationConfig,
    MediaConfig,
    RenderConfig,
)

client = LattifAI(
    client_config=ClientConfig(api_key="lf_xxx", timeout=60.0),
    alignment_config=AlignmentConfig(device="cuda"),
    caption_config=CaptionConfig(
        input=CaptionInputConfig(split_sentence=True),
        render=RenderConfig(word_level=True),
    ),
)

caption = client.alignment(
    input_media="audio.wav",
    input_caption="subtitle.srt",
    output_caption_path="output.json",
)
# Access results
for segment in caption.supervisions:
    print(f"{segment.start:.2f}s - {segment.end:.2f}s: {segment.text}")

caption = client.youtube(
    url="https://youtube.com/watch?v=VIDEO_ID",
    output_dir="./downloads",
    output_caption_path="aligned.srt",
)

| Sub-config | Option | Default | Description |
|---|---|---|---|
| input | split_sentence | False | Smart sentence splitting, separates non-speech elements |
| input | normalize_text | True | Clean HTML entities and special characters |
| input | source_lang | None | Source language code (e.g., "en", "zh") |
| render | word_level | False | Include word-level timestamps in output |
| render | include_speaker_in_text | True | Include speaker labels in text output |
| render | translation_first | False | Place translation above original in bilingual output |
| ass | speaker_color | "" | Speaker name color in ASS output: "" (off), "auto" (10-color palette), "#RRGGBB", or comma-separated list |
from lattifai.client import LattifAI
from lattifai.config import CaptionConfig, CaptionInputConfig, RenderConfig
client = LattifAI(
    caption_config=CaptionConfig(
        input=CaptionInputConfig(split_sentence=True, normalize_text=True),
        render=RenderConfig(word_level=True, include_speaker_in_text=False),
    )
)

Process audio up to 20 hours with minimal memory:
caption = client.alignment(
    input_media="long_audio.wav",
    input_caption="subtitle.srt",
    streaming_chunk_secs=300.0,  # 5-minute chunks
)

from lattifai.client import LattifAI
from lattifai.config import CaptionConfig, RenderConfig
client = LattifAI(caption_config=CaptionConfig(render=RenderConfig(word_level=True)))
caption = client.alignment(
    input_media="audio.wav",
    input_caption="subtitle.srt",
    output_caption_path="output.json",  # JSON preserves word-level data
)

Automatically identify and label different speakers in audio.
Capabilities:
- Multi-Speaker Detection: Automatically detect speaker changes
- Smart Labeling: Assign labels (SPEAKER_00, SPEAKER_01, etc.)
- Label Preservation: Maintain existing speaker names from input captions
- Gemini Integration: Extract speaker names from transcription context
Label Handling:
- Without existing labels → Generic labels (SPEAKER_00, SPEAKER_01)
- With existing labels ([Alice], >> Bob:, SPEAKER_01:) → Preserved during alignment
- Gemini transcription → Names extracted from context (e.g., "Hi, I'm Alice" → Alice)
from lattifai.client import LattifAI
from lattifai.config import DiarizationConfig
client = LattifAI(
    diarization_config=DiarizationConfig(
        enabled=True,
        device="cuda",
        min_speakers=2,
        max_speakers=4,
    )
)
caption = client.alignment(...)
for segment in caption.supervisions:
    print(f"[{segment.speaker}] {segment.text}")

LLM Speaker Name Inference:
When speakers remain as SPEAKER_XX after acoustic diarization, enable LLM inference to identify real names from dialogue content:
DiarizationConfig(
    enabled=True,
    infer_speakers=True,  # Use LLM to infer speaker names
)

# Pass context as a per-call parameter to speaker_diarization()
client.speaker_diarization(
    input_media=audio,
    caption=caption,
    output_caption_path="output.srt",
    speaker_context="podcast, host is Alice, guest is Bob",  # Optional hint
)

DiarizationConfig Options:
| Option | Default | Description |
|---|---|---|
| enabled | False | Enable speaker diarization |
| device | auto | cpu, cuda, mps, or auto |
| num_speakers | — | Exact number of speakers (overrides min/max) |
| min_speakers | — | Minimum speakers to detect |
| max_speakers | — | Maximum speakers to detect |
| infer_speakers | False | Use LLM to infer real names from dialogue |
CLI:
lai alignment align audio.wav subtitle.srt output.srt \
diarization.enabled=true \
diarization.device=cuda
# With LLM speaker name inference
lai alignment align audio.wav subtitle.srt output.srt \
diarization.enabled=true \
diarization.infer_speakers=true
# Diarize subcommand with speaker context
lai diarize run audio.wav subtitle.srt output.srt \
--context "interview with Dr. Smith"

Input Media → AudioLoader → Aligner → (Diarizer) → Caption
                               ↑
Input Caption → Reader → Tokenizer
The tokenizer handles various text patterns for forced alignment.
Visual captions and annotations in brackets are treated specially - they get two pronunciation paths so the aligner can choose:
- Silence path - skip when content doesn't appear in audio
- Inner text pronunciation - match if someone actually says the words
| Bracket Type | Symbol | Example | Alignment Behavior |
|---|---|---|---|
| Half-width square | [] | [APPLAUSE] | Skip or match "applause" |
| Half-width paren | () | (music) | Skip or match "music" |
| Full-width square | 【】 | 【笑声】 | Skip or match "笑声" |
| Full-width paren | () | (音乐) | Skip or match "音乐" |
| Angle brackets | <> | <intro> | Skip or match "intro" |
| Book title marks | 《》 | 《开场白》 | Skip or match "开场白" |
This allows proper handling of:

- Visual descriptions: [Barret adjusts the camera and smiles] → skipped if not spoken
- Sound effects: [APPLAUSE], (music) → matched if audible
- Chinese annotations: 【笑声】, (鼓掌) → flexible alignment
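The two-path idea can be sketched in a few lines. This is illustrative only, not the library's tokenizer; BRACKETED, alignment_paths, and the "&lt;silence&gt;" marker are names invented for the example.

```python
import re

# Opening and closing bracket characters from the table above.
BRACKETED = re.compile(r"^[\[(【(<《](.+)[\])】)>》]$")

def alignment_paths(token: str) -> list[str]:
    """A bracketed annotation yields two candidate paths: silence (skip
    when not spoken) or the inner text's pronunciation."""
    match = BRACKETED.match(token)
    if match:
        return ["<silence>", match.group(1)]
    return [token]

print(alignment_paths("[APPLAUSE]"))  # ['<silence>', 'APPLAUSE']
print(alignment_paths("hello"))       # ['hello']
```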
| Pattern | Handling | Example |
|---|---|---|
| CJK characters | Split individually | 你好 → ["你", "好"] |
| Latin words | Grouped with accents | Kühlschrank → ["Kühlschrank"] |
| Contractions | Kept together | I'm, don't, we'll |
| Punctuation | Attached to words | Hello, world! |
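The splitting rules above can be approximated with a single regex. This is a rough sketch for intuition, not the library's tokenizer; it handles CJK-per-character splitting, accented Latin words, and ASCII-apostrophe contractions, and omits punctuation attachment.

```python
import re

TOKEN_RE = re.compile(
    r"[\u4e00-\u9fff]"             # one CJK character per token
    r"|[^\W\d_]+(?:'[^\W\d_]+)*"   # Latin word; accents and contractions kept
)

def tokenize(text: str) -> list[str]:
    return TOKEN_RE.findall(text)

print(tokenize("你好"))           # ['你', '好']
print(tokenize("I'm fine"))       # ["I'm", 'fine']
print(tokenize("Kühlschrank"))    # ['Kühlschrank']
```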
Recognized speaker patterns are preserved during alignment:
| Format | Example | Output |
|---|---|---|
| Arrow prefix | >> Alice: | [Alice] |
| LattifAI format | [SPEAKER_01]: | [SPEAKER_01] |
| Uppercase name | SPEAKER NAME: | [SPEAKER NAME] |
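A minimal normalizer for the three formats above might look like this. It is an illustrative sketch, not the library's implementation; PATTERNS and normalize_speaker are names invented for the example.

```python
import re

PATTERNS = [
    (re.compile(r"^>>\s*([^:]+):"), r"[\1]"),    # >> Alice:      -> [Alice]
    (re.compile(r"^\[([^\]]+)\]:"), r"[\1]"),    # [SPEAKER_01]:  -> [SPEAKER_01]
    (re.compile(r"^([A-Z][A-Z ]+):"), r"[\1]"),  # SPEAKER NAME:  -> [SPEAKER NAME]
]

def normalize_speaker(line: str) -> str:
    for pattern, replacement in PATTERNS:
        if pattern.match(line):
            return pattern.sub(replacement, line, count=1)
    return line

print(normalize_speaker(">> Alice: hi there"))  # [Alice] hi there
```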
| Type | Formats |
|---|---|
| Audio | WAV, MP3, M4A, AAC, FLAC, OGG, OPUS, AIFF, and more |
| Video | MP4, MKV, MOV, WEBM, AVI, and more |
| Caption | SRT, VTT, ASS, SSA, SRV3, JSON, TextGrid, TSV, CSV, LRC, TTML, and more |
Note: Caption format handling is provided by lattifai-captions, which is automatically installed as a dependency. For standalone caption processing without alignment features, install it directly with pip install lattifai-captions.
JSON is the most flexible format for storing caption data with full word-level timing support:
[
  {
    "text": "Hello beautiful world",
    "start": 0.0,
    "end": 2.5,
    "speaker": "Speaker 1",
    "words": [
      {"word": "Hello", "start": 0.0, "end": 0.5},
      {"word": "beautiful", "start": 0.6, "end": 1.4},
      {"word": "world", "start": 1.5, "end": 2.5}
    ]
  }
]

Features:

- Word-level timestamps preserved in the words array
- Round-trip compatible (read/write without data loss)
- Optional speaker field for multi-speaker content
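Because this is plain JSON, word timings are directly usable with the standard library alone, no lattifai import needed. A small example computing per-word durations from the schema shown above:

```python
import json

raw = '''[{"text": "Hello beautiful world", "start": 0.0, "end": 2.5,
           "speaker": "Speaker 1",
           "words": [{"word": "Hello", "start": 0.0, "end": 0.5},
                     {"word": "beautiful", "start": 0.6, "end": 1.4},
                     {"word": "world", "start": 1.5, "end": 2.5}]}]'''

segments = json.loads(raw)
durations = {w["word"]: round(w["end"] - w["start"], 3)
             for w in segments[0]["words"]}
print(durations)  # {'Hello': 0.5, 'beautiful': 0.8, 'world': 1.0}
```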
| Format | word_level=True | word_level=True + karaoke_effect |
|---|---|---|
| JSON | Includes words array | Same as word_level=True |
| SRT | One word per segment | One word per segment |
| VTT | One word per segment | YouTube VTT style: <00:00:00.000><c> word</c> |
| ASS | One word per segment | {\kf} karaoke tags (sweep effect) |
| LRC | One word per line | Enhanced <timestamp> tags |
| TTML | One word per <p> element | <span> with itunes:timing="Word" |
The speaker_color option colorizes speaker names in ASS output (works with both karaoke and non-karaoke modes):
| Value | Behavior |
|---|---|
| "" (default) | No speaker coloring |
| "auto" | Assigns from a built-in 10-color palette |
| "#RRGGBB" | Single color for all speakers |
| "#RRGGBB,#00BFFF,..." | Comma-separated list, one per speaker (cycles if more speakers than colors) |
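The cycling behavior for a comma-separated list works like round-robin assignment. An illustrative sketch (assign_speaker_colors is a hypothetical name, not the library's API):

```python
def assign_speaker_colors(spec: str, speakers: list[str]) -> dict[str, str]:
    """Assign colors from a comma-separated spec, cycling when there are
    more speakers than colors."""
    palette = spec.split(",")
    return {spk: palette[i % len(palette)] for i, spk in enumerate(speakers)}

colors = assign_speaker_colors("#1387C0,#00BFFF",
                               ["SPEAKER_00", "SPEAKER_01", "SPEAKER_02"])
print(colors)
# {'SPEAKER_00': '#1387C0', 'SPEAKER_01': '#00BFFF', 'SPEAKER_02': '#1387C0'}
```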
# Auto-color speakers in ASS output
lai caption convert input.json output.ass \
render.include_speaker_in_text=true \
ass.speaker_color=auto
# Custom single color
lai caption convert input.json output.ass \
render.include_speaker_in_text=true \
ass.speaker_color="#1387C0"

Use ass.karaoke_color_scheme to apply a predefined color scheme for karaoke ASS output. Each scheme sets primary_color, secondary_color, outline_color, and back_color.
12 schemes available: azure-gold, sakura-purple, mint-ocean, gardenia-green, sunset-warm, prussian-elegant, burgundy-classic, langgan-spring, mars-teal, spring-field, navy-pink, apricot-dark
# Karaoke with color scheme + auto speaker colors
lai caption convert input.json output.ass \
ass.karaoke_effect=sweep \
ass.karaoke_color_scheme=azure-gold \
ass.speaker_color=auto

The VTT format handler supports both standard WebVTT and YouTube VTT with word-level timestamps.
Reading: VTT automatically detects YouTube VTT format (with <timestamp><c> tags) and extracts word-level alignment data:
WEBVTT
00:00:00.000 --> 00:00:02.000
<00:00:00.000><c> Hello</c><00:00:00.500><c> world</c>
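The word-tag syntax above is easy to parse by hand, which can be handy for quick inspection. An illustrative stdlib parser (not the library's implementation; WORD_TAG and parse_word_tags are names invented here):

```python
import re

WORD_TAG = re.compile(r"<(\d{2}):(\d{2}):(\d{2})\.(\d{3})><c>([^<]+)</c>")

def parse_word_tags(line: str) -> list[tuple[float, str]]:
    """Extract (start_seconds, word) pairs from a YouTube-style VTT line."""
    words = []
    for h, m, s, ms, word in WORD_TAG.findall(line):
        t = int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000
        words.append((t, word.strip()))
    return words

line = "<00:00:00.000><c> Hello</c><00:00:00.500><c> world</c>"
print(parse_word_tags(line))  # [(0.0, 'Hello'), (0.5, 'world')]
```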
Writing: Use render.word_level=True to output YouTube VTT style with word timestamps:
from lattifai.data import Caption
from lattifai.caption.config import ASSConfig, RenderConfig
caption = Caption.read("input.vtt")
caption.write(
    "output.ass",
    format_config=ASSConfig(karaoke_effect="sweep"),
    render=RenderConfig(word_level=True),
)

# CLI: Convert to VTT with word-level timestamps
lai caption convert input.json output.vtt \
render.word_level=true

Models: gemini-2.5-pro, gemini-2.5-flash, gemini-2.5-flash-lite, gemini-3-pro-preview, gemini-3-flash-preview, gemini-3.1-pro-preview
English, Chinese (Mandarin & Cantonese), Spanish, French, German, Italian, Portuguese, Japanese, Korean, Arabic, Russian, Hindi, Bengali, Turkish, Dutch, Polish, Swedish, Danish, Norwegian, Finnish, Greek, Hebrew, Thai, Vietnamese, Indonesian, Malay, Filipino, Ukrainian, Czech, Romanian, Hungarian, and 70+ more.
Requires Gemini API key from Google AI Studio
Model: nvidia/parakeet-tdt-0.6b-v3
| Region | Languages |
|---|---|
| Western Europe | English (en), French (fr), German (de), Spanish (es), Italian (it), Portuguese (pt), Dutch (nl) |
| Nordic | Danish (da), Swedish (sv), Norwegian (no), Finnish (fi) |
| Eastern Europe | Polish (pl), Czech (cs), Slovak (sk), Hungarian (hu), Romanian (ro), Bulgarian (bg), Ukrainian (uk), Russian (ru) |
| Others | Croatian (hr), Estonian (et), Latvian (lv), Lithuanian (lt), Slovenian (sl), Maltese (mt), Greek (el) |
Model: iic/SenseVoiceSmall
Chinese/Mandarin (zh), English (en), Japanese (ja), Korean (ko), Cantonese (yue)
Models: FunAudioLLM/Fun-ASR-Nano-2512, FunAudioLLM/Fun-ASR-MLT-Nano-2512
800M parameter end-to-end ASR model from Tongyi Lab, excelling at far-field, high-noise, dialect/accent, and music lyric recognition.
| Region | Languages |
|---|---|
| East Asia | Chinese (+ 7 dialects, 26 accents), Japanese, Korean, Cantonese |
| Southeast Asia | Vietnamese, Indonesian, Thai, Malay, Filipino |
| South Asia | Hindi |
| Middle East | Arabic |
| Europe | English, Bulgarian, Croatian, Czech, Danish, Dutch, Estonian, Finnish, Greek, Hungarian, Irish, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovenian, Swedish |
# Use ModelScope (default for China)
lai transcribe run audio.wav output.srt \
transcription.model_name=FunAudioLLM/Fun-ASR-MLT-Nano-2512 \
transcription.model_hub=modelscope
# Use HuggingFace
lai transcribe run audio.wav output.srt \
transcription.model_name=FunAudioLLM/Fun-ASR-MLT-Nano-2512 \
transcription.model_hub=huggingface

Any ASR model served via vLLM or SGLang with an OpenAI-compatible API.
Supported models and limitations:
| Model | Audio tok/s | Max Audio | API Mode | Batch | Notes |
|---|---|---|---|---|---|
| Qwen3-ASR (0.6B/1.7B) | 25 | auto | transcriptions | Yes | Best for zh/en/ja/ko |
| Whisper | 50 | 30s | transcriptions | Yes | Fixed 30s context window |
| Voxtral | 12.5 | auto | transcriptions | Yes | European languages |
| Voxtral Realtime | 12.5 | auto | realtime | Yes | WebSocket, <500ms latency |
| Ultravox | 6.25 | auto | transcriptions | Yes | Confirmed in vLLM source |
| Gemma-3n | 6.25 | 30s | chat (auto) | No | Not a dedicated ASR model (~3x Whisper WER), 30s encoder limit, no concurrent requests |
- Max Audio: "auto" = estimated from max_model_len; bold values are hard encoder limits
- Batch: Whether batch_size>1 concurrent requests are supported
- API Mode: transcriptions is the default; general-purpose LLMs auto-switch to chat
API modes:
| Mode | Endpoint | Use Case |
|---|---|---|
| transcriptions (default) | /v1/audio/transcriptions | Dedicated ASR models (Qwen3-ASR, Whisper, GLM-ASR, etc.) |
| chat | /v1/chat/completions | General-purpose LLMs (Gemma-3n, etc.) — auto-selected for non-ASR models |
| realtime | /v1/realtime (WebSocket) | Voxtral Realtime |
# 1. Install vLLM with audio support (requires CUDA GPU)
pip install vllm "vllm[audio]"
# 2. Start vLLM server on a Linux GPU machine (auto-downloads the model)
vllm serve Qwen/Qwen3-ASR-1.7B --gpu-memory-utilization 0.8 --host 0.0.0.0 --port 8081
# Other models:
# vllm serve openai/whisper-large-v3-turbo
# vllm serve google/gemma-3n-E4B-it --max-model-len 32000 --enforce-eager
# 3. Transcribe (default: transcriptions mode)
lai transcribe run audio.wav output.srt \
transcription.model_name=Qwen/Qwen3-ASR-1.7B \
transcription.api_base_url=http://localhost:8081/v1
# Batch mode for faster processing (4 concurrent requests)
lai transcribe run audio.wav output.srt \
transcription.model_name=Qwen/Qwen3-ASR-1.7B \
transcription.api_base_url=http://localhost:8081/v1 \
transcription.batch_size=4
# General-purpose LLM (auto-switches to chat mode with ASR system prompt)
lai transcribe run audio.wav output.srt \
transcription.model_name=google/gemma-3n-E4B-it \
transcription.api_base_url=http://localhost:8084/v1 \
transcription.language=zh
# Voxtral Realtime (streaming WebSocket, <500ms latency)
# Server: VLLM_DISABLE_COMPILE_CACHE=1 vllm serve mistralai/Voxtral-Mini-4B-Realtime-2602 \
# --host 0.0.0.0 --port 8086 --compilation_config '{"cudagraph_mode": "PIECEWISE"}'
lai transcribe run audio.wav output.srt \
transcription.model_name=mistralai/Voxtral-Mini-4B-Realtime-2602 \
transcription.api_base_url=http://localhost:8086/v1 \
transcription.api_mode=realtime

Visit lattifai.com/roadmap for updates.
| Date | Release | Features |
|---|---|---|
| Oct 2025 | Lattice-1-Alpha | ✅ English forced alignment, multi-format support |
| Nov 2025 | Lattice-1 | ✅ EN+ZH+DE, speaker diarization, multi-model transcription |
| Q2 2026 | Lattice-2 | ✅ Streaming mode, 🔮 40+ languages, real-time alignment |
git clone https://github.com/lattifai/lattifai-python.git
cd lattifai-python
# Using uv (recommended, auto-configures extra index)
uv sync && source .venv/bin/activate
# Or pip (requires extra-index-url for lattifai-core)
pip install -e ".[all,dev]" --extra-index-url https://lattifai.github.io/pypi/simple/
# Run tests
pytest
# Install pre-commit hooks
pre-commit install

1. Fork the repository
2. Create a feature branch (git checkout -b feature/amazing-feature)
3. Make changes and add tests
4. Run pytest and pre-commit run --all-files
5. Commit your changes (git commit -m 'Add amazing feature')
6. Push to branch (git push origin feature/amazing-feature)
7. Open a Pull Request
- Issues: GitHub Issues
- Discord: Join our community
Apache License 2.0


