This benchmark evaluates the performance and accuracy (Word Error Rate, WER) of Automatic Speech Recognition (ASR) models served via SGLang.
- openai/whisper-large-v3
- openai/whisper-large-v3-turbo
Install the required dependencies:
apt install ffmpeg
pip install librosa soundfile datasets evaluate jiwer transformers openai torchcodec torch

Launch the SGLang server with a Whisper model:
python -m sglang.launch_server --model-path openai/whisper-large-v3 --port 30000

Basic usage (using chat completions API):
python bench_sglang.py --base-url http://localhost:30000 --model openai/whisper-large-v3 --n-examples 10

Using the OpenAI-compatible transcription API:
python bench_sglang.py \
--base-url http://localhost:30000 \
--model openai/whisper-large-v3 \
--api-type transcription \
--language English \
--n-examples 10

Run with streaming and show real-time output:
python bench_sglang.py \
--base-url http://localhost:30000 \
--model openai/whisper-large-v3 \
--api-type transcription \
--stream \
--show-predictions \
--concurrency 1

Run with higher concurrency and save results:
python bench_sglang.py \
--base-url http://localhost:30000 \
--model openai/whisper-large-v3 \
--concurrency 8 \
--n-examples 100 \
--output results.json \
--show-predictions

| Argument | Description | Default |
|---|---|---|
| `--base-url` | SGLang server URL | `http://localhost:30000` |
| `--model` | Model name on the server | `openai/whisper-large-v3` |
| `--dataset` | HuggingFace dataset for evaluation | `D4nt3/esb-datasets-earnings22-validation-tiny-filtered` |
| `--split` | Dataset split to use | `validation` |
| `--concurrency` | Number of concurrent requests | `4` |
| `--n-examples` | Number of examples to process (`-1` for all) | `-1` |
| `--output` | Path to save results as JSON | `None` |
| `--show-predictions` | Display sample predictions | `False` |
| `--print-n` | Number of samples to display | `5` |
| `--api-type` | API to use: `chat` (chat completions) or `transcription` (audio transcriptions) | `chat` |
| `--language` | Language for transcription API (e.g., `English`, `en`) | `None` |
| `--stream` | Enable streaming mode for transcription API | `False` |
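To illustrate what a "number of concurrent requests" limit means in practice, here is a minimal sketch of bounded concurrency with `asyncio.Semaphore`. It is not taken from `bench_sglang.py`, whose internals are not shown here; `run_bounded` and `fake_request` are hypothetical names, and `fake_request` only simulates a request with a short sleep.

```python
import asyncio


async def run_bounded(items, worker, concurrency=4):
    """Run worker(item) for every item with at most `concurrency` in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async def guarded(item):
        # Each task waits here until one of the `concurrency` slots is free.
        async with semaphore:
            return await worker(item)

    # asyncio.gather returns results in the same order as the inputs.
    return await asyncio.gather(*(guarded(item) for item in items))


async def fake_request(sample_id):
    # Hypothetical stand-in for one ASR request to the server.
    await asyncio.sleep(0.01)
    return f"transcript-{sample_id}"


if __name__ == "__main__":
    results = asyncio.run(run_bounded(range(8), fake_request, concurrency=4))
    print(results)
```

With a cap of 1, requests run strictly one after another, which is why a low value gives cleaner sequential output, while a higher cap keeps more requests in flight on the server at once.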
The benchmark outputs:
| Metric | Description |
|---|---|
| Total Requests | Number of successful ASR requests processed |
| WER | Word Error Rate (lower is better), computed using the evaluate library |
| Average Latency | Mean time per request (seconds) |
| Median Latency | 50th percentile latency (seconds) |
| 95th Latency | 95th percentile latency (seconds) |
| Throughput | Requests processed per second |
| Token Throughput | Output tokens per second |
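The metrics in this table can be sketched in plain Python. This is an illustrative approximation only: the benchmark computes WER with the `evaluate` library, whereas `word_error_rate` below is a dependency-free stand-in for a single reference/prediction pair, and `summarize` is a hypothetical helper whose percentile method (nearest rank) is an assumption that may differ from the script's.

```python
import math
import statistics


def word_error_rate(reference, prediction):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), prediction.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # reference word dropped
                            curr[j - 1] + 1,      # extra predicted word
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def summarize(latencies, total_time, output_tokens):
    """Aggregate per-request latencies (seconds) into the reported metrics."""
    ordered = sorted(latencies)
    # Nearest-rank 95th percentile (assumed; the script may interpolate instead).
    p95_index = max(0, math.ceil(0.95 * len(ordered)) - 1)
    return {
        "total_requests": len(ordered),
        "average_latency": statistics.mean(ordered),
        "median_latency": statistics.median(ordered),
        "p95_latency": ordered[p95_index],
        "throughput": len(ordered) / total_time,         # requests per second
        "token_throughput": output_tokens / total_time,  # output tokens per second
    }


if __name__ == "__main__":
    print(word_error_rate("we talked about 4.7 gigawatts",
                          "we talked about 4.7 gigawatts"))
    print(summarize([1.0, 2.0, 3.0, 4.0], total_time=2.0, output_tokens=100))
```

Throughput here is the request count divided by the total wall-clock test time, not the inverse of the average latency, so it rises with concurrency even when individual requests get slower.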
python bench_sglang.py --api-type transcription --concurrency 128 --model openai/whisper-large-v3 --show-predictions
Loading dataset: D4nt3/esb-datasets-earnings22-validation-tiny-filtered...
Using API type: transcription
Repo card metadata block was not found. Setting CardData to empty.
WARNING:huggingface_hub.repocard:Repo card metadata block was not found. Setting CardData to empty.
Performing warmup...
Processing 511 samples...
------------------------------
Results for openai/whisper-large-v3:
Total Requests: 511
WER: 12.7690
Average Latency: 1.3602s
Median Latency: 1.2090s
95th Latency: 2.9986s
Throughput: 19.02 req/s
Token Throughput: 354.19 tok/s
Total Test Time: 26.8726s
------------------------------
==================== Sample Predictions ====================
Sample 1:
REF: on the use of taxonomy i you know i think it is it is early days for us to to make any clear indications to the market about the proportion that would fall under that requirement
PRED: on the eu taxonomy i think it is early days for us to make any clear indications to the market about the proportion that would fall under that requirement
----------------------------------------
Sample 2:
REF: so within fiscal year 2021 say 120 a 100 depending on what the micro will do and next year it is not necessarily payable in q one is we will look at what the cash flows for 2022 look like
PRED: so within fiscal year 2021 say $120000 $100000 depending on what the macro will do and next year it is not necessarily payable in q one is we will look at what the cash flows for 2022 look like
----------------------------------------
Sample 3:
REF: we talked about 4.7 gigawatts
PRED: we talked about 4.7 gigawatts
----------------------------------------
Sample 4:
REF: and you know depending on that working capital build we will we will see what that yields
PRED: and depending on that working capital build we will see what that yields what
----------------------------------------
Sample 5:
REF: so on on sinopec what we have agreed with sinopec way back then is that free cash flows after paying all capexs are distributed out 30 70%
PRED: so on sinopec what we have agreed with sinopec way back then is that free cash flows after paying all capexes are distributed out 30% 70%
----------------------------------------
============================================================

- Audio samples longer than 30 seconds are automatically filtered out (Whisper limitation)
- The benchmark performs a warmup request before measuring performance
- Results are normalized using the model's tokenizer when available
- When using `--stream` with `--show-predictions`, use `--concurrency 1` for clean sequential output
- The `--language` option accepts both full names (e.g., `English`) and ISO 639-1 codes (e.g., `en`)
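The 30-second filter mentioned in the notes can be pictured with a small sketch. It assumes each clip is described by a sample count and a sampling rate; the dictionary keys `num_samples` and `sampling_rate` and the helper names are hypothetical, not the benchmark's actual data layout.

```python
MAX_SECONDS = 30.0  # limit stated in the notes above


def duration_seconds(num_samples, sampling_rate):
    """Clip length in seconds: samples divided by samples per second."""
    return num_samples / sampling_rate


def keep_short_clips(clips, max_seconds=MAX_SECONDS):
    """Keep clips whose duration does not exceed max_seconds."""
    return [
        clip for clip in clips
        if duration_seconds(clip["num_samples"], clip["sampling_rate"]) <= max_seconds
    ]


if __name__ == "__main__":
    clips = [
        {"id": "a", "num_samples": 160_000, "sampling_rate": 16_000},  # 10 s
        {"id": "b", "num_samples": 640_000, "sampling_rate": 16_000},  # 40 s
    ]
    print([clip["id"] for clip in keep_short_clips(clips)])
```

Because of this filter, the number of processed samples can be smaller than the size of the dataset split.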
Server connection refused
- Ensure the SGLang server is running and accessible at the specified `--base-url`
- Check that the port is not blocked by a firewall
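A quick way to check both points is to attempt a plain TCP connection to the host and port in the base URL. The sketch below uses only the standard library; `server_reachable` is a hypothetical helper, and a successful connection only shows that something is listening on the port, not that the SGLang server is healthy.

```python
import socket
from urllib.parse import urlparse


def server_reachable(base_url, timeout=2.0):
    """Return True if a TCP connection to the URL's host and port succeeds."""
    parsed = urlparse(base_url)
    host = parsed.hostname
    # Fall back to the scheme's default port when the URL does not name one.
    port = parsed.port or (443 if parsed.scheme == "https" else 80)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    print(server_reachable("http://localhost:30000"))
```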
Out of memory errors
- Reduce `--concurrency` to lower GPU memory usage
- Use a smaller Whisper model variant