Extend GitHub Copilot with open-source language models running on your own infrastructure.
GitHub Copilot LLM Gateway is a companion extension for GitHub Copilot that adds support for self-hosted open-source models. It seamlessly integrates with the Copilot chat experience, allowing you to use models like Qwen, Llama, and Mistral alongside—or instead of—the default Copilot models.
This extension connects to any OpenAI-compatible inference server, giving you complete control over your AI-assisted development environment.
| Benefit | Description |
|---|---|
| Data Sovereignty | Your code never leaves your network. All inference happens on your own hardware. |
| Zero API Costs | No per-token fees. Use your GPU resources without usage limits. |
| Model Choice | Access thousands of open-source models from Hugging Face and beyond. |
| Offline Capable | Work without internet once models are downloaded. |
| Full Customization | Fine-tune models for your specific codebase or domain. |
- vLLM — High-performance inference (recommended)
- Ollama — Easy local deployment
- llama.cpp — CPU and GPU inference
- Text Generation Inference — Hugging Face's server
- LocalAI — OpenAI API drop-in replacement
- Any OpenAI Chat Completions API-compatible endpoint
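"OpenAI-compatible" here means the server exposes the standard `/v1/chat/completions` route. A quick way to sanity-check any candidate endpoint is a minimal request by hand — the host, port, and model name below are placeholders for your own setup:

```shell
# Placeholder endpoint and model name -- substitute your own server's values.
SERVER_URL="http://localhost:8000"
PAYLOAD='{"model": "Qwen/Qwen3-8B", "messages": [{"role": "user", "content": "Say hello"}], "max_tokens": 32}'

# Any OpenAI-compatible server should accept this request shape
# and return a JSON body with a "choices" array.
curl -s "${SERVER_URL}/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d "${PAYLOAD}" || true
```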
- VS Code 1.106.0 or later
- GitHub Copilot extension installed and signed in
- Inference server running with an OpenAI-compatible API
Install GitHub Copilot LLM Gateway from the VS Code Marketplace.
Launch your inference server with tool calling enabled. Here's an example using vLLM:
```shell
vllm serve Qwen/Qwen3-8B \
  --enable-auto-tool-choice \
  --tool-call-parser hermes \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.95 \
  --host 0.0.0.0 \
  --port 42069
```

Verify the server is running:

```shell
curl http://localhost:42069/v1/models
```

1. Open VS Code Settings (`Ctrl+,` / `Cmd+,`)
2. Search for "Copilot LLM Gateway"
3. Set Server URL to your inference server address (e.g., `http://localhost:8000`)
4. Configure other settings as needed (token limits, tool calling, etc.)
Note: If the server is unreachable, you'll see an error notification with a quick link to settings.
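The same options can be pinned in `settings.json`. The exact setting keys below are assumptions extrapolated from the `github.copilot.llm-gateway.enableToolCalling` key mentioned under Troubleshooting; check the extension's settings UI for the authoritative names:

```jsonc
{
  // Hypothetical keys -- verify against the extension's settings UI.
  "github.copilot.llm-gateway.serverUrl": "http://localhost:42069",
  "github.copilot.llm-gateway.requestTimeout": 60000,
  "github.copilot.llm-gateway.enableToolCalling": true
}
```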
1. Open GitHub Copilot Chat (`Ctrl+Alt+I` / `Cmd+Alt+I`)
2. Click the model selector dropdown at the bottom of the chat panel
3. Click "Manage Models..." to open the model manager
4. Select "LLM Gateway" from the provider list
5. Enable the models you want to use from your inference server
Your self-hosted models now appear alongside the default Copilot models. Select one and start coding with AI assistance!
The model integrates seamlessly with Copilot's features including:
- Agent mode for autonomous coding tasks
- Tool calling for file operations, terminal commands, and more
- Context awareness with `@workspace` and file references
Configure the extension through VS Code Settings (`Ctrl+,` / `Cmd+,`) → search "Copilot LLM Gateway".
| Setting | Default | Description |
|---|---|---|
| Server URL | `http://localhost:8000` | Base URL of your OpenAI-compatible inference server |
| API Key | (empty) | Authentication key if your server requires one |
| Request Timeout | `60000` | Request timeout in milliseconds |
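If your server enforces authentication, the extension presumably sends the key as a bearer token (the OpenAI convention). You can reproduce that by hand to confirm a key works before entering it in settings — the URL and key below are placeholder values:

```shell
# Placeholder values -- use your real server URL and key.
SERVER_URL="http://localhost:8000"
API_KEY="sk-example"

# OpenAI-style servers expect the key in an "Authorization: Bearer" header;
# a 401/403 response here means the key (not the extension) is the problem.
curl -s "${SERVER_URL}/v1/models" \
  -H "Authorization: Bearer ${API_KEY}" || true
```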
| Setting | Default | Description |
|---|---|---|
| Default Max Tokens | `32768` | Context window size (input tokens). Match to your model's capability. |
| Default Max Output Tokens | `4096` | Maximum tokens the model can generate per response |
These settings control how the extension handles agentic features like code editing and file operations.
| Setting | Default | Description |
|---|---|---|
| Enable Tool Calling | `true` | Allow models to use Copilot's tools (file read/write, terminal, etc.) |
| Parallel Tool Calling | `true` | Allow multiple tools to be called simultaneously. Disable if your model struggles with parallel calls. |
| Agent Temperature | `0.0` | Temperature for tool calling mode. Lower values produce more consistent tool call formatting. |
Tip: If your model outputs tool descriptions as text instead of actually calling tools, try setting Agent Temperature to `0.0` and disabling Parallel Tool Calling.
These models have been tested with good tool calling support:
| Model | VRAM | Tool Support | Best For |
|---|---|---|---|
| Qwen/Qwen3-8B | ~16GB | Excellent | General coding, 32GB GPU |
| Qwen/Qwen2.5-7B-Instruct | ~14GB | Excellent | Balanced performance |
| Qwen/Qwen2.5-14B-Instruct | ~28GB | Excellent | Higher quality (48GB GPU) |
| meta-llama/Llama-3.1-8B-Instruct | ~16GB | Good | Alternative to Qwen |
Important: Avoid Qwen2.5-Coder models for tool calling—they have known issues with vLLM's tool parser. Use standard Qwen2.5-Instruct or Qwen3 models instead.
```shell
pip install vllm
```

Each model family requires a specific parser:
| Model Family | Parser | Example |
|---|---|---|
| Qwen2.5, Qwen3 | `hermes` | `--tool-call-parser hermes` |
| Qwen3-Coder | `qwen3_coder` | `--tool-call-parser qwen3_coder` |
| Llama 3.1/3.2 | `llama3_json` | `--tool-call-parser llama3_json` |
| Mistral | `mistral` | `--tool-call-parser mistral` |
Approximate memory for BF16 (full precision) inference:
| Model Size | Model VRAM | 32K Context Total |
|---|---|---|
| 7-8B | ~16GB | ~22GB |
| 14B | ~28GB | ~34GB |
| 30B+ | ~60GB | Requires quantization |
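The weight column follows from simple arithmetic: BF16 stores 2 bytes per parameter, and the KV cache for a 32K context adds several GB on top (the gap between the two columns). A back-of-the-envelope check, with 2 bytes per parameter as the only assumption:

```shell
# BF16 = 2 bytes per parameter, so an 8B-parameter model needs
# roughly 16 GB for the weights alone, before KV cache.
PARAMS_B=8
WEIGHT_GB=$(( PARAMS_B * 2 ))
echo "~${WEIGHT_GB} GB of weights for a ${PARAMS_B}B-parameter model"
```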
Qwen3-8B (Recommended):
```shell
vllm serve Qwen/Qwen3-8B \
  --enable-auto-tool-choice \
  --tool-call-parser hermes \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.95 \
  --host 0.0.0.0 \
  --port 42069
```

Llama 3.1 8B:

```shell
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --enable-auto-tool-choice \
  --tool-call-parser llama3_json \
  --max-model-len 32768 \
  --host 0.0.0.0 \
  --port 42069
```

Quantized Model (limited VRAM):

```shell
vllm serve Qwen/Qwen2.5-14B-Instruct-AWQ \
  --enable-auto-tool-choice \
  --tool-call-parser hermes \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.95 \
  --host 0.0.0.0 \
  --port 42069
```

- Verify the server is running: `curl http://your-server:port/v1/models`
- Check that the Server URL in settings matches exactly
- Run command "Copilot LLM Gateway: Test Server Connection" from the Command Palette
The model failed to generate output. Try:
- Check tool parser — Ensure `--tool-call-parser` matches your model family
- Disable tool calling — Set `github.copilot.llm-gateway.enableToolCalling` to `false` to test basic chat
- Reduce context — Your conversation may exceed the model's limit
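To isolate whether tool calling works at the server level, independent of the extension, you can send a minimal request with a `tools` array; a correctly configured parser should return a `tool_calls` entry in the response message rather than plain text. Host, port, model, and the `get_weather` function are placeholders:

```shell
SERVER_URL="http://localhost:42069"   # placeholder -- your server
REQUEST='{
  "model": "Qwen/Qwen3-8B",
  "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Get current weather for a city",
      "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"]
      }
    }
  }]
}'

# With a working parser, the response message should contain "tool_calls"
# instead of a textual description of the tool.
curl -s "${SERVER_URL}/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d "${REQUEST}" || true
```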
The model outputs text like "Using the read_file tool..." instead of actually calling tools.
- Use Qwen3-8B or Qwen2.5-7B-Instruct (avoid Coder variants)
- Set Agent Temperature to `0.0`
- Disable Parallel Tool Calling
- Ensure the server has the `--enable-auto-tool-choice` flag
- Reduce `--max-model-len` (try 8192 or 16384)
- Use a quantized model (AWQ, GPTQ, FP8)
- Choose a smaller model
Access from the Command Palette (`Ctrl+Shift+P` / `Cmd+Shift+P`):
| Command | Description |
|---|---|
| GitHub Copilot LLM Gateway: Test Server Connection | Test connectivity and list available models |
- Issues & Feature Requests: GitHub Issues
- Discussions: GitHub Discussions
MIT License — see LICENSE for details.
This extension is not affiliated with GitHub or Microsoft. GitHub Copilot is a trademark of GitHub, Inc.