Aigle-2/WhisperAttackAPI

 
 

WhisperAttackAPI - STT Backends for VoiceAttack

This repository provides a single-server approach for using modern speech-to-text (STT) backends with VoiceAttack, replacing Windows Speech Recognition with accurate push-to-talk transcription.

This fork keeps the WhisperAttack workflow but adds a provider-agnostic STT backend layer. The default backend is ElevenLabs Scribe v2 for API-based transcription that does not consume the GPU DCS needs. The original local faster_whisper workflow remains available as a configurable fallback.

This is a fork for further integration of KneeboardWhisper by the amazing creator @BojoteX. Special thanks go to @hradec, whose original script used Google Voice Recognition; to @SeaTechNerd83, for helping combine the two approaches and creating a VoiceAttack plugin; and to @sleighzy, for the VAICOM implementation and a list of bug fixes and enhancements long enough to fill this page.

In short, SeaTechNerd83 and I combined the two scripts to run voice commands through Whisper using BojoteX's code and then pushed the result into VoiceAttack using hradec's code. To speed this up, I unified the codebase into one file and made it run a server that sends commands to VoiceAttack. With the local faster_whisper backend, WhisperAttack will run on any Nvidia GPU with 6 GB or more of VRAM alongside DCS (performance tuning may be required for lower-VRAM cards). The absolute minimum GPU spec has not yet been confirmed, but the RTX 2060 6 GB and GTX 1070 8 GB have been confirmed to work stutter-free alongside DCS in VR.


Features

  • Provider-agnostic STT backends:

    • Records mic audio on demand (via socket commands).
    • Transcribes the .wav file using the configured backend.
    • Sends recognized text into VoiceAttack.
    • Pushes transcribed text to clipboard - (perfect for voice to text DCS Chat...)
    • Supports the original local faster_whisper backend.
    • Supports ElevenLabs Scribe v2 via API.
    • Supports OpenAI gpt-4o-transcribe via API.
    • Supports Deepgram Nova-3 via API.
  • VoiceAttack Command Plugin

    • Sends "start", "stop", or "shutdown" commands to the server directly through VoiceAttack.
  • Advantages:

    • API-backed STT avoids using GPU resources needed by DCS.
    • Local Whisper can still be used offline when preferred.
    • Push-to-Talk style workflow with VoiceAttack press & release.
    • STT keyterms can bias recognition toward DCS, VAICOM, ATC, callsigns, and airfields.
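
The start/stop/shutdown socket commands can also be sent without VoiceAttack, which is handy when testing the server. The following is a minimal sketch, not the plugin's actual code: the address 127.0.0.1:65432 is the one this README names, but the wire format (one UTF-8 command string per TCP connection) and the function name `send_command` are assumptions.

```python
import socket

# Address from this README; the wire format used below is an assumption.
HOST, PORT = "127.0.0.1", 65432
VALID_COMMANDS = {"start", "stop", "shutdown"}


def send_command(command: str, host: str = HOST, port: int = PORT,
                 timeout: float = 2.0) -> None:
    """Send one command as UTF-8 text, then close the connection."""
    if command not in VALID_COMMANDS:
        raise ValueError(f"unknown command: {command!r}")
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(command.encode("utf-8"))


# Typical push-to-talk sequence (requires a running server):
#   send_command("start")   # button pressed: begin recording
#   send_command("stop")    # button released: stop and transcribe
```

If the server expects a different framing (for example a trailing newline), only the `sendall` line would need to change.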

VAICOM integration

Instructions for integrating with VAICOM can be located in the VAICOM INTEGRATION documentation.


Requirements

  • VoiceAttack

  • GPU (Optional, but Recommended)

    • Only required when using the faster_whisper backend.
    • API-backed providers do not use local GPU resources.
  • API key (API backends)

    • Create an API key for the provider configured in settings.cfg.
    • Set it with Set STT API Key.cmd from the release folder.
    • Do not put API keys in settings.cfg or commit them to the repository.

Installation

These instructions are for normal users. You do not need Python, Git, Visual Studio, CUDA, or any developer tooling when using the release ZIP.

  1. Download the latest WhisperAttackAPI release ZIP from GitHub Releases.
  2. Extract the ZIP anywhere on your computer, for example:
C:\Program Files\WhisperAttackAPI

or:

C:\Users\yourname\Desktop\WhisperAttackAPI
  3. Open the extracted folder.
  4. Double-click Set STT API Key.cmd once and paste your provider API key.
  5. Double-click WhisperAttackAPI.exe.
  6. Create a shortcut to WhisperAttackAPI.exe if desired.

Keep the folder structure intact. Do not move only the .exe file elsewhere; it must stay beside _internal, settings.cfg, fuzzy_words.txt, word_mappings.txt, and the icon files.

The release folder is expected to look like this:

WhisperAttackAPI v1.2.2-api.1\
  _internal\
  WhisperAttackAPI.exe
  settings.cfg
  fuzzy_words.txt
  word_mappings.txt
  whisper_attack_icon.png
  add_icon.png
  Set STT API Key.cmd
  Set ElevenLabs API Key.cmd
  README_FIRST.txt
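
If the application fails to start after the folder was moved, a quick check of the layout can help. This is a small sketch and not part of the release: the function name `missing_release_entries` is hypothetical, and the list of required entries is simply the set this README says must stay beside the .exe.

```python
from pathlib import Path

# Entries this README says must stay beside WhisperAttackAPI.exe.
REQUIRED_ENTRIES = (
    "_internal",
    "WhisperAttackAPI.exe",
    "settings.cfg",
    "fuzzy_words.txt",
    "word_mappings.txt",
    "whisper_attack_icon.png",
    "add_icon.png",
)


def missing_release_entries(release_dir: str) -> list[str]:
    """Return the required files or folders that are absent from release_dir."""
    root = Path(release_dir)
    return [name for name in REQUIRED_ENTRIES if not (root / name).exists()]
```

An empty list means the folder structure is intact; any names returned are the entries that were left behind or deleted.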

VoiceAttack and VAICOM setup stays the same as WhisperAttack when the VoiceAttack plugin connects to 127.0.0.1:65432.


Configuration

The default configuration files are stored beside the WhisperAttack application. Custom configuration can be kept in files of the same name in the C:\Users\username\AppData\Local\WhisperAttack directory. These custom files can be created if they do not exist and can be used to override (or add to for word mappings) the default configuration.

Keeping custom configuration at that location means it will not be overwritten when installing later versions of WhisperAttack.
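
The override behaviour described above can be pictured as "load the defaults, then apply the custom file on top". The sketch below illustrates that idea only; it assumes settings.cfg is a flat list of key=value lines (as in the examples in this README) and that lines starting with # are comments. The function names are hypothetical and this is not the application's actual loader.

```python
import os
from pathlib import Path


def parse_key_values(path: Path) -> dict[str, str]:
    """Read key=value lines, skipping blanks and '#' comments (comment syntax is assumed)."""
    values: dict[str, str] = {}
    if not path.is_file():
        return values
    for line in path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def load_settings(app_dir: Path, custom_dir: Path) -> dict[str, str]:
    """Start from the defaults beside the application, then apply custom overrides."""
    settings = parse_key_values(app_dir / "settings.cfg")
    settings.update(parse_key_values(custom_dir / "settings.cfg"))
    return settings


def default_custom_dir() -> Path:
    """Per-user override directory named in this README (%LOCALAPPDATA%\\WhisperAttack)."""
    return Path(os.environ.get("LOCALAPPDATA", "")) / "WhisperAttack"
```

With this layout, a custom settings.cfg only needs to contain the keys that differ from the defaults.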

See below for the list of configuration files.

settings.cfg

The settings.cfg file contains configuration for WhisperAttack.

The default values should cover most cases but can be changed:

  • stt_backend - The speech-to-text backend to use, elevenlabs by default in this fork.
    • Supported values: elevenlabs, openai, deepgram, faster_whisper
  • stt_language - Language hint for transcription, en by default for VAICOM English commands.
  • stt_timeout_seconds - API request timeout in seconds.
  • stt_keyterm_sources - Comma-separated sources used to build provider keyterms without duplicating vocabulary in settings.cfg.
    • Supported values: custom, phonetic_alphabet, fuzzy_words, word_mapping_replacements, word_mapping_aliases, dcs_default, vaicom
  • stt_keyterms_extra - Optional comma-separated extra provider keyterms. Prefer fuzzy_words.txt for domain vocabulary.
  • elevenlabs_api_key_env - Environment variable containing the ElevenLabs API key. Defaults to ELEVENLABS_API_KEY.
  • elevenlabs_model - ElevenLabs model ID, scribe_v2 by default.
  • elevenlabs_no_verbatim - Removes filler words and false starts when supported. Defaults to true.
  • elevenlabs_tag_audio_events - Enables or disables audio event tags. Defaults to false.
  • elevenlabs_timestamps_granularity - Timestamp granularity. Defaults to none because VoiceAttack only needs text.
  • elevenlabs_max_keyterms - Maximum generated keyterms to send to ElevenLabs. Defaults to 900.
  • elevenlabs_max_keyterm_chars - Maximum characters per ElevenLabs keyterm. Defaults to 50.
  • openai_api_key_env - Environment variable containing the OpenAI API key. Defaults to OPENAI_API_KEY.
  • openai_model - OpenAI transcription model ID, gpt-4o-transcribe by default.
  • openai_include_keyterms_in_prompt - Adds generated DCS/VAICOM keyterms to the OpenAI transcription prompt.
  • openai_max_prompt_keyterms - Maximum generated keyterms to include in the OpenAI prompt.
  • openai_prompt_keyterm_char_budget - Maximum generated keyterm text length to add to the OpenAI prompt.
  • deepgram_api_key_env - Environment variable containing the Deepgram API key. Defaults to DEEPGRAM_API_KEY.
  • deepgram_model - Deepgram model ID, nova-3 by default.
  • deepgram_smart_format - Enables Deepgram smart formatting. Defaults to true.
  • deepgram_detect_language - Lets Deepgram detect the spoken language instead of sending stt_language.
  • deepgram_max_keyterms - Maximum generated keyterms to send as Deepgram keyterm parameters.
  • whisper_model - The Whisper model to use, small.en by default. See the table at the bottom of the README file for options.
    • A smaller size can be specified for reducing the amount of VRAM used, e.g. base.en or tiny.en
  • whisper_device - Which device to run the Whisper transcription process on, GPU (default) or CPU
  • theme - To display the WhisperAttack UI in light or dark mode. Valid values:
    • default - this will use the current theme you have set for Windows
    • dark - dark mode
    • light - light mode

API key setup

For release users, use the helper included beside the exe:

Set STT API Key.cmd

This stores the selected provider key in your Windows user environment. The key is not written to settings.cfg.

PowerShell alternative:

setx ELEVENLABS_API_KEY "your-api-key"
setx OPENAI_API_KEY "your-api-key"
setx DEEPGRAM_API_KEY "your-api-key"

Restart WhisperAttackAPI after setting the environment variable.
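
To illustrate how the *_api_key_env settings relate to these environment variables, here is a hedged sketch of a key lookup. The setting names and default variable names come from this README; the function `resolve_api_key` and its error messages are hypothetical and may differ from what the application does internally.

```python
import os

# Default environment variable per backend, as listed for the *_api_key_env settings.
DEFAULT_KEY_ENV = {
    "elevenlabs": "ELEVENLABS_API_KEY",
    "openai": "OPENAI_API_KEY",
    "deepgram": "DEEPGRAM_API_KEY",
}


def resolve_api_key(settings: dict[str, str]) -> str:
    """Return the API key for the configured backend, read from the environment."""
    backend = settings.get("stt_backend", "elevenlabs")
    if backend not in DEFAULT_KEY_ENV:
        raise ValueError(f"{backend!r} does not use an API key")
    env_name = settings.get(f"{backend}_api_key_env", DEFAULT_KEY_ENV[backend])
    key = os.environ.get(env_name, "").strip()
    if not key:
        raise RuntimeError(f"{env_name} is not set; set it and restart the application")
    return key
```

Because setx only affects newly started processes, a lookup like this will not see the key until the application has been restarted.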

VAICOM keyterms

The generated VAICOM vocabulary lives in stt_backends/vaicom_keyterms.txt. It was built from the local VAICOMPRO install at:

E:\Jeux\steamapps\common\VoiceAttack 2\Apps\VAICOMPRO

The checked-in list is a curated provider shortlist capped at 850 terms. It is post-processed into unique words: composed phrases, numeric tokens, low-value UI words, and code-only terms such as ICAO identifiers are removed. Technical acronyms such as IFF, TV, and TACAN are placed first, high-value command words such as boresight, clearance, and wheelchocks follow, then callsigns, common DCS terms, and selected proper names. It is generated from VAICOM command phrases, recipients, callsigns, ATC/airfield aliases, RIO/WSO/George commands, and current F10/mission menu terms where available.

Spelled aviation codes are normalized after transcription, so the keyterm list does not need to include every code. For example, U L M B, U-L-M-B, or E.S.N.J are compacted to ULMB and ESNJ before text is sent to VoiceAttack.
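
The compaction step can be approximated with a single regular expression. The sketch below is an illustration rather than the project's actual normalizer: it assumes the letters arrive in upper case, that each separator is a single space, hyphen, or dot, and that a run needs at least three letters to count as a code.

```python
import re

# Three or more single capital letters, each separated by one space, hyphen, or dot.
_SPELLED_CODE = re.compile(r"\b(?:[A-Z][ .\-]){2,}[A-Z]\b")


def compact_spelled_codes(text: str) -> str:
    """Join spelled-out letter runs such as 'U L M B' into 'ULMB'."""
    return _SPELLED_CODE.sub(lambda m: re.sub(r"[^A-Z]", "", m.group(0)), text)
```

For example, `compact_spelled_codes("contact U L M B tower")` yields `"contact ULMB tower"`, while ordinary words are left untouched because the pattern only matches isolated capital letters.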

To refresh it from a local VAICOM install:

python tools\generate_vaicom_keyterms.py --vaicom-root "E:\Jeux\steamapps\common\VoiceAttack 2\Apps\VAICOMPRO" --saved-games "C:\Users\esteb\Saved Games\DCS"

Use --max-terms to raise or lower the generated shortlist size.
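
The general shape of building a capped, de-duplicated shortlist from several sources looks roughly like the sketch below. It is not the code in tools\generate_vaicom_keyterms.py; the function name is hypothetical, and the defaults of 900 terms and 50 characters are borrowed from the elevenlabs_max_keyterms and elevenlabs_max_keyterm_chars settings purely for illustration.

```python
def build_keyterms(sources: list[list[str]], max_terms: int = 900,
                   max_chars: int = 50) -> list[str]:
    """Merge sources in priority order, drop duplicates and over-long terms, then cap."""
    if max_terms <= 0:
        return []
    seen: set[str] = set()
    merged: list[str] = []
    for source in sources:  # earlier sources win when the cap is reached
        for term in source:
            term = term.strip()
            key = term.casefold()  # compare case-insensitively
            if not term or len(term) > max_chars or key in seen:
                continue
            seen.add(key)
            merged.append(term)
            if len(merged) >= max_terms:
                return merged
    return merged
```

Ordering the sources by value (for instance acronyms first, then command words, then callsigns) means the cap trims the least important terms.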

Optional STT providers

ElevenLabs remains the default because it has worked well for DCS/VAICOM push-to-talk with French-accented English. Users can switch providers by editing settings.cfg:

stt_backend=openai

OpenAI uses the official transcription endpoint with gpt-4o-transcribe by default. WhisperAttackAPI sends the DCS/VAICOM prompt and a budgeted set of generated keyterms as transcription context. See the official OpenAI Speech-to-Text guide and transcription API reference.

stt_backend=deepgram

Deepgram uses prerecorded transcription with nova-3 by default. WhisperAttackAPI sends a budgeted set of generated DCS/VAICOM keyterms as Deepgram keyterm query parameters. See the official Deepgram prerecorded audio guide, Nova-3 model overview, and Keyterm Prompting docs.
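
As an illustration of keyterms sent as query parameters, the sketch below builds a Deepgram prerecorded-audio request URL with one `keyterm` parameter per term. It only constructs the URL and makes no network call. The helper name and the `max_keyterms` default of 100 are assumptions (this README does not state the default for deepgram_max_keyterms), so check the Deepgram docs before relying on the exact parameters.

```python
from urllib.parse import quote, urlencode

DEEPGRAM_LISTEN_URL = "https://api.deepgram.com/v1/listen"


def build_deepgram_url(keyterms: list[str], model: str = "nova-3",
                       language: str = "en", smart_format: bool = True,
                       max_keyterms: int = 100) -> str:
    """Build the request URL with a budgeted set of repeated keyterm parameters."""
    params = [
        ("model", model),
        ("language", language),
        ("smart_format", str(smart_format).lower()),
    ]
    # One 'keyterm' query parameter per term, capped at max_keyterms (assumed default).
    params += [("keyterm", term) for term in keyterms[:max_keyterms]]
    return f"{DEEPGRAM_LISTEN_URL}?{urlencode(params, quote_via=quote)}"
```

Keeping the keyterm budget small matters here because every term lengthens the request URL.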

ElevenLabs cost estimate

Pricing can change, so check the official ElevenLabs API pricing page before publishing guidance to users. The ElevenLabs Speech-to-Text docs describe Scribe v2, language support, and keyterm prompting. The estimate below was checked on 2026-06-18.

WhisperAttackAPI currently uses scribe_v2 in batch Speech-to-Text mode, not Scribe v2 Realtime. The default configuration sends DCS/VAICOM keyterms, so the estimate includes keyterm prompting.

Assumptions:

  • Scribe v1/v2 Speech-to-Text: $0.22 per transcribed audio hour.
  • Keyterm prompting: +$0.05 per transcribed audio hour.
  • Entity detection is not used.
  • Realtime transcription is not used.
  • Estimated total: $0.27 per transcribed audio hour, before taxes.

With $5:

$5 / $0.27 = 18.5 hours of transcribed audio

This is not the same as 18.5 hours of gameplay. WhisperAttackAPI only sends audio while push-to-talk is recording.

| Usage style | Transcribed audio per gameplay hour | Estimated cost per gameplay hour | $5 covers about |
| --- | --- | --- | --- |
| Light radio use | 30 seconds | $0.00225 | 2200 gameplay hours |
| Normal VAICOM use | 2 minutes | $0.009 | 555 gameplay hours |
| Intensive radio use | 5 minutes | $0.0225 | 222 gameplay hours |
| Very chatty / dictation | 15 minutes | $0.0675 | 74 gameplay hours |
| Push-to-talk nearly always held | 60 minutes | $0.27 | 18.5 gameplay hours |

A typical 3-second command costs roughly:

$0.27 / 3600 * 3 = $0.000225

So $5 covers about 22,000 short 3-second commands under the straight audio-duration estimate.

Important caveat: ElevenLabs documents Speech-to-Text as billed per audio minute, but the public pricing page does not clearly state whether many very short API requests are rounded up individually. If each short push-to-talk clip were rounded up to one full minute, $5 would cover about 1,111 commands instead. The safest validation is to send a small number of test commands, then check usage in the ElevenLabs developer dashboard.
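
The arithmetic in this section can be reproduced with a few lines of Python, which makes it easy to re-run the estimate if pricing changes. This is a sketch under the assumptions listed above ($0.22 plus $0.05 per transcribed audio hour); the function names are hypothetical and the per-request rounding rule is modelled only as a what-if.

```python
import math

# USD per transcribed audio hour: batch rate plus keyterm prompting (assumed rates).
RATE_PER_HOUR = 0.22 + 0.05


def cost_for_seconds(seconds: float, rate_per_hour: float = RATE_PER_HOUR) -> float:
    """Cost of a clip under straight audio-duration billing."""
    return rate_per_hour / 3600 * seconds


def commands_covered(budget: float, clip_seconds: float,
                     round_up_to_minute: bool = False) -> int:
    """How many clips of clip_seconds a budget pays for."""
    billed = clip_seconds
    if round_up_to_minute:
        # What-if: each request is billed in whole minutes.
        billed = math.ceil(clip_seconds / 60) * 60
    return int(budget / cost_for_seconds(billed))
```

`commands_covered(5, 3)` gives 22222 and `commands_covered(5, 3, round_up_to_minute=True)` gives 1111, matching the "about 22,000" and "about 1,111" figures above.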

Building the executable (maintainers only)

Normal users should download the release ZIP and should not run this step. The recommended maintainer build is the API-only executable. It avoids bundling Torch and faster-whisper, so the package is smaller and DCS keeps priority on the GPU.

Double-click:

build_api_only.cmd

The executable is created at:

dist\release\WhisperAttackAPI v1.2.2-api.1\WhisperAttackAPI.exe

The distributable ZIP is created beside it:

dist\release\WhisperAttackAPI v1.2.2-api.1.zip

Any intermediate PyInstaller output is kept under build; only dist\release is meant to be published.

The release folder follows the original WhisperAttack layout: the exe, _internal, settings.cfg, fuzzy_words.txt, word_mappings.txt, icons, and a small API-key helper are all at the top level.

To build the larger offline-capable executable that includes the local faster_whisper backend, double-click:

build_full.cmd

Local Whisper setup

To run fully offline, update settings.cfg:

stt_backend=faster_whisper
whisper_model=small.en

This requires the full executable built with build_full.cmd or a Python environment installed from requirements.txt.

word_mappings.txt

The word_mappings.txt file contains keys and values that can be used to replace a spoken word with another word. For example, if the transcription often outputs "Inter" when you are saying "Enter", this can be added as a word replacement.

The word replacement configuration also supports replacing several words with a single word; the aliases are separated by a semicolon (;). In the example below, saying either "gulf" or "gold" would be replaced with "Golf".

gulf;gold=Golf
inter=Enter

WhisperAttack needs to be restarted after making changes to this file. New word mappings can be added via the configuration screen and do not require a restart. When adding new word mappings they will be created in your custom configuration file, C:\Users\username\AppData\Local\WhisperAttack\word_mappings.txt
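
To make the file format concrete, the sketch below parses alias lines such as gulf;gold=Golf and applies them to a transcription. It is an illustration only: the function names are hypothetical, and the case-insensitive, whole-word matching is an assumption about how replacements behave rather than a description of the application's code.

```python
import re


def parse_word_mappings(lines: list[str]) -> dict[str, str]:
    """Map each lower-cased alias to its replacement ('gulf;gold=Golf' yields two aliases)."""
    mapping: dict[str, str] = {}
    for line in lines:
        line = line.strip()
        if not line or "=" not in line:
            continue
        aliases, _, replacement = line.partition("=")
        for alias in aliases.split(";"):
            alias = alias.strip().lower()
            if alias:
                mapping[alias] = replacement.strip()
    return mapping


def apply_word_mappings(text: str, mapping: dict[str, str]) -> str:
    """Replace whole-word alias matches, ignoring case (matching rules are assumed)."""
    if not mapping:
        return text
    # Longest aliases first so longer aliases win over their prefixes.
    aliases = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(a) for a in aliases) + r")\b", re.IGNORECASE
    )
    return pattern.sub(lambda m: mapping[m.group(0).lower()], text)
```

With the mappings gulf;gold=Golf and inter=Enter, the text "Inter the gulf pattern" becomes "Enter the Golf pattern", while a word like "international" is left alone.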


Running the Whisper Server

Double click the WhisperAttackAPI.exe file or shortcut. This will open an application window and start the server.

The application window will display startup logging information, including the effective STT keyterm context, the raw text transcribed from the speech, and the final cleaned up command text that was sent to VoiceAttack or DCS. The window can be closed, and then shown again from the menu in the WhisperAttack icon in the Windows system tray. WhisperAttack will continue running even when the window is closed.

WhisperAttack will have completed loading once the "Server started and listening" message is displayed.

Loaded STT keyterm context:
provider: elevenlabs
sources: custom=0, phonetic_alphabet=26, fuzzy_words=..., word_mapping_replacements=..., dcs_default=24, vaicom=850
available: ... unique terms
effective: ... terms sent to elevenlabs
Loading STT backend (elevenlabs) ...
Server started and listening on 127.0.0.1:65432...

whisperattack_voiceattack

A WhisperAttack icon will be placed in your Windows system tray. Right-clicking this will give options to show the WhisperAttack window, or to exit the application.

whisperattack_systemtrayicon

Closing VoiceAttack will also stop and close WhisperAttack.

NOTE: When using the local faster_whisper backend, the first startup may be slow while the Whisper model downloads. This only needs to happen once (unless you change the Whisper model to be used).

The Whisper server will output logs to the C:\Users\username\AppData\Local\WhisperAttack\WhisperAttack.log file.


Configuring VoiceAttack

A pre-configured VoiceAttack profile is included in the release for your convenience. It is still recommended to read through the steps below to understand how Whisper injections actually work.

1. Disable all speech recognition within VoiceAttack

Disable_speech_recognition VoiceAttack_startup

2. Enable Plugin support in VoiceAttack

Go to Options → General → Enable Plugin Support.

EnablePluginsVA

3. Place Plugin in VoiceAttack Apps folder

After extracting the .zip file, locate the WhisperAttackServerCommand folder and copy the entire folder.

image

Locate the VoiceAttack Apps Folder

image

Paste the entire WhisperAttackServerCommand folder into the Apps folder

image

If the plugin is enabled and active and everything is set up correctly, VoiceAttack should give these messages on startup:

image

4. Create Recording commands

In VoiceAttack, go to Edit Profile.

New Command for "Start Whisper Recording":

  • When this command executes:
    • Go to Other → Advanced → Execute an External Plugin Function.
    • Plugin: Point it to 'WASC V0.1beta'
    • Plugin Context:
Start Whisper Recording

Assign a joystick or key press to this command (e.g., "Joystick Button 14 (pressed)").

Whisperattackreadme

Another Command for "Stop Whisper Recording":

Follow the same steps, except that the Plugin Context is:

Stop Whisper Recording

Assign the same joystick button but check "Shortcut is invoked only when released."

Whisperattackreadme1


Adding new word mappings

Word mappings can be added to WhisperAttack so that when these words are found within transcribed sentences they will be replaced with the replacement word you provide. This can aid with replacing words that are consistently transcribed incorrectly into the word you actually want.

Click the Add word mapping button to open this configuration screen. Multiple aliases can be entered, separated by semicolons, for a single replacement.

whisperattack_addwordmapping


Clipboard & DCS Kneeboard Integration - Optional

This script preserves BojoteX's original vision for the code and copies the commands to the clipboard for use with the Kneeboard. The original repo can be found here: https://github.com/BojoteX/KneeboardWhisper

Do the following to enable the DCS Kneeboard to transcribe what you say. Once completed, say "Note" followed by what you would like transcribed to the kneeboard/clipboard.

assignments

kneeboardwhisper


Troubleshooting

Library cublas64_12.dll is not found

If the error below is displayed in the logs, ensure that CUDA 12 is available, e.g. by installing the CUDA Toolkit 12.

ERROR - Failed to transcribe audio: Library cublas64_12.dll is not found or cannot be loaded

ValueError: Requested int8_float16 compute type

For some GPUs which do not support certain compute types, i.e. do not have tensor cores, the below message will be output to the logs:

WARNING - GPU does not have tensor cores, major=6, minor=1

WhisperAttack can detect this and will fall back to values supported by CUDA cores.

If however the below error message is displayed then the settings.cfg file can be updated.

ValueError: Requested int8_float16 compute type, but the target device or backend do not support efficient int8_float16 computation.

The settings.cfg file can be updated to add the below entry:

whisper_core_type=standard
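
The major/minor values in the warning are the GPU's CUDA compute capability. NVIDIA introduced tensor cores with compute capability 7.0, so a card reporting 6.1 has none. The sketch below shows the kind of decision involved; it is not WhisperAttack's detection code, the function names are hypothetical, and the float32 fallback is an assumption about what a "standard" core type might select.

```python
def has_tensor_cores(major: int, minor: int) -> bool:
    """Tensor cores are available from CUDA compute capability 7.0 onwards."""
    return (major, minor) >= (7, 0)


def pick_compute_type(major: int, minor: int) -> str:
    """Choose a compute type for the GPU; the float32 fallback is an assumption."""
    return "int8_float16" if has_tensor_cores(major, minor) else "float32"
```

For the warning shown above (major=6, minor=1), `pick_compute_type(6, 1)` returns the fallback rather than int8_float16.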

Performance (AI Model)

If DCS is GPU constrained, use an API backend such as ElevenLabs so transcription does not consume VRAM. This is the default in WhisperAttackAPI:

stt_backend=elevenlabs
elevenlabs_model=scribe_v2

If you use the local faster_whisper backend and WhisperAttack is causing significant stutters, it is likely that the current model is overloading your VRAM. In that case, reduce the local Whisper model size:

stt_backend=faster_whisper
whisper_model=base.en
  • Using smaller models will reduce VRAM and compute costs. See the model table below for a full speed breakdown.
  • First activation with a new AI model will prompt the model to be downloaded which may take an extended amount of time depending on internet speed.

Available models and languages

There are six model sizes, four with English-only versions, offering speed and accuracy tradeoffs. Below are the names of the available models and their approximate memory requirements and inference speed relative to the large model. The relative speeds below are measured by transcribing English speech on an A100, and the real-world speed may vary significantly depending on many factors including the language, the speaking speed, and the available hardware.

| Size | Parameters | English-only model | Multilingual model | Required VRAM | Relative speed |
| --- | --- | --- | --- | --- | --- |
| tiny | 39 M | tiny.en | tiny | ~1 GB | ~10x |
| base | 74 M | base.en | base | ~1 GB | ~7x |
| small | 244 M | small.en | small | ~2 GB | ~4x |
| medium | 769 M | medium.en | medium | ~5 GB | ~2x |
| large | 1550 M | N/A | large | ~10 GB | 1x |
| turbo | 809 M | N/A | turbo | ~6 GB | ~8x |

The .en models for English-only applications tend to perform better, especially for the tiny.en and base.en models. We observed that the difference becomes less significant for the small.en and medium.en models. Additionally, the turbo model is an optimized version of large-v3 that offers faster transcription speed with a minimal degradation in accuracy.

Whisper's performance varies widely depending on the language. The figure below shows a performance breakdown of large-v3 and large-v2 models by language, using WERs (word error rates) or CER (character error rates, shown in Italic) evaluated on the Common Voice 15 and Fleurs datasets. Additional WER/CER metrics corresponding to the other models and datasets can be found in Appendix D.1, D.2, and D.4 of the paper, as well as the BLEU (Bilingual Evaluation Understudy) scores for translation in Appendix D.3.

Enjoy your local (offline) speech recognition with OpenAI Whisper + VoiceAttack! If you run into issues, open an issue or check the logs for clues.

About

Whisper AI API cloud based Speech to text computing for VoiceAttack
