When `index_name` is omitted, LiteLLM automatically creates:

- S3 vector bucket (if it doesn't exist)
- Vector index with auto-detected dimensions from your embedding model

**Dimension Auto-Detection**: The vector dimension is automatically detected by making a test embedding request to your specified model. No need to manually specify dimensions!

**Supported Embedding Models**: Works with any LiteLLM-supported embedding model (OpenAI, Cohere, Bedrock, Azure, etc.)

:::

**Example with auto-detection:**

```json
{
  "embedding": {
    "model": "text-embedding-3-small" // Dimension auto-detected as 1536
  },
  "vector_store": {
    "custom_llm_provider": "s3_vectors",
    "vector_bucket_name": "my-embeddings"
  }
}
```
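The auto-detection step described above can be sketched as follows. This is a minimal illustration, not LiteLLM's implementation: `embed_fn` is a hypothetical stand-in for a real embedding request (e.g. via `litellm.embedding`), used here so the example runs offline.

```python
# Sketch: detect the vector index dimension by making one test embedding
# request and measuring the length of the returned vector.
def detect_dimension(embed_fn) -> int:
    test_vector = embed_fn("dimension probe")
    return len(test_vector)

# Fake model standing in for text-embedding-3-small (1536-dim vectors).
fake_openai_small = lambda text: [0.0] * 1536
print(detect_dimension(fake_openai_small))  # 1536
```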
**Example with custom embedding provider:**

```json
{
  "embedding": {
    "model": "cohere/embed-english-v3.0" // Dimension auto-detected as 1024
  },
  "vector_store": {
    "custom_llm_provider": "s3_vectors",
    "vector_bucket_name": "my-embeddings"
  }
}
```
Traffic mirroring allows you to "mimic" production traffic to a secondary (silent) model for evaluation purposes. The silent model's response is gathered in the background and does not affect the latency or result of the primary request.

[**See detailed guide on A/B Testing - Traffic Mirroring here**](./traffic_mirroring.md)
Traffic mirroring allows you to "mimic" production traffic to a secondary (silent) model for evaluation purposes. The silent model's response is gathered in the background and does not affect the latency or result of the primary request.

This is useful for:

- Testing a new model's performance on production prompts before switching.
- Comparing costs and latency between different providers.
- Debugging issues by mirroring traffic to a more verbose model.

## Quick Start

To enable traffic mirroring, add `silent_model` to the `litellm_params` of a deployment.

<Tabs>
<TabItem value="sdk" label="SDK">
```python
from litellm import Router

model_list = [
    {
        "model_name": "gpt-3.5-turbo",
        "litellm_params": {
            "model": "azure/chatgpt-v-2",
            "api_key": "...",
            "silent_model": "gpt-4"  # 👈 Mirror traffic to gpt-4
        },
    },
    {
        "model_name": "gpt-4",
        "litellm_params": {
            "model": "openai/gpt-4",
            "api_key": "..."
        },
    }
]

router = Router(model_list=model_list)

# The request to "gpt-3.5-turbo" will trigger a background call to "gpt-4"
response = await router.acompletion(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "How does traffic mirroring work?"}]
)
```
</TabItem>
<TabItem value="proxy" label="Proxy">

Add `silent_model` to your `config.yaml`:

```yaml
model_list:
  - model_name: primary-model
    litellm_params:
      model: azure/gpt-35-turbo
      api_key: os.environ/AZURE_API_KEY
      silent_model: evaluation-model # 👈 Mirror traffic here
  - model_name: evaluation-model
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
```
</TabItem>
</Tabs>

## How it works

1. **Request Received**: A request is made to a model group (e.g. `primary-model`).
2. **Deployment Picked**: LiteLLM picks a deployment from the group.
3. **Primary Call**: LiteLLM makes the call to the primary deployment.
4. **Mirroring**: If `silent_model` is present, LiteLLM triggers a background call to that model.
   - For **Sync** calls: Uses a shared thread pool.
   - For **Async** calls: Uses `asyncio.create_task`.
5. **Isolation**: The background call uses a `deepcopy` of the original request parameters and sets `metadata["is_silent_experiment"] = True`. It also strips out logging IDs to prevent collisions in usage tracking.
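The async path of the steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not LiteLLM's actual code: `call_model` is a hypothetical stand-in for the real completion call (e.g. `router.acompletion`).

```python
import asyncio
from copy import deepcopy

async def call_model(model: str, params: dict) -> str:
    # Hypothetical stand-in for a real completion call.
    await asyncio.sleep(0)
    return f"{model} response"

async def completion_with_mirror(params: dict) -> str:
    silent_model = params.get("silent_model")
    if silent_model:
        # Isolation: deepcopy so the mirrored call cannot mutate the primary
        # request, and tag it so it can be filtered in logs later.
        mirror_params = deepcopy(params)
        mirror_params.setdefault("metadata", {})["is_silent_experiment"] = True
        # Fire-and-forget: the primary request does not wait for this task.
        asyncio.create_task(call_model(silent_model, mirror_params))
    # The primary call proceeds (and returns) independently of the mirror.
    return await call_model(params["model"], params)

result = asyncio.run(completion_with_mirror(
    {"model": "primary-model", "silent_model": "evaluation-model"}
))
print(result)  # primary-model response
```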
## Key Features
- **Latency Isolation**: The primary request returns as soon as it's ready. The background (silent) call does not block.
- **Unified Logging**: Background calls are processed via the Router, meaning they are automatically logged to your configured observability tools (Langfuse, S3, etc.).
- **Evaluation**: Use the `is_silent_experiment: True` flag in your logs to filter and compare results between the primary and mirrored calls.
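For instance, filtering exported log records on that flag might look like this. The record shape here is hypothetical; adapt the field names to whatever your observability tool actually exports.

```python
# Hypothetical exported log records; only the metadata flag is from the docs.
logs = [
    {"model": "azure/chatgpt-v-2", "metadata": {}, "latency_s": 0.8},
    {"model": "gpt-4", "metadata": {"is_silent_experiment": True}, "latency_s": 1.4},
]

# Split primary vs. mirrored calls for side-by-side comparison.
mirrored = [r for r in logs if r["metadata"].get("is_silent_experiment")]
primary = [r for r in logs if not r["metadata"].get("is_silent_experiment")]
print(len(primary), len(mirrored))  # 1 1
```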