Commit 7bd3f05

feat: implement hybrid search functionality with BM25 and semantic retrieval

1 parent b612262 commit 7bd3f05

12 files changed: +539 −6 lines changed


CLAUDE.md

Lines changed: 70 additions & 0 deletions
@@ -0,0 +1,70 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Commands

```bash
# Install dependencies
make install  # uv pip install -r pyproject.toml

# Run all tests
make test  # PYTHONPATH=src python3 -m unittest -v

# Run a single test module
PYTHONPATH=src python3 -m unittest tests.tests_rag.test_rag_pipeline

# Format code
uv run black .
```

Python >= 3.12 required. Uses `uv` as package manager.

## Architecture

RAGLight is a modular RAG library built around the **Builder pattern** for pipeline composition. The core abstraction is a LangGraph `StateGraph` (`retrieve → generate`) that orchestrates embeddings, vector store, and LLM.

### Data flow

```
FolderSource / GitHubSource
  → DocumentProcessorFactory (PDF / Code / Text / VLM-PDF)
  → EmbeddingsModel.embed_documents()
  → VectorStore.ingest()

Query → VectorStore.similarity_search() → [CrossEncoder rerank] → LLM.generate() → Answer
```

### Key abstractions (all use ABC + strategy pattern)

| Abstraction | Location | Implementations |
|---|---|---|
| `LLM` | `src/raglight/llm/llm.py` | Ollama, LMStudio, Mistral, OpenAI, Gemini |
| `EmbeddingsModel` | `src/raglight/embeddings/embeddings_model.py` | HuggingFace, Ollama, OpenAI, Gemini |
| `VectorStore` | `src/raglight/vectorstore/vector_store.py` | ChromaVS only |
| `DocumentProcessor` | `src/raglight/document_processing/document_processor.py` | PDF, Code, Text, VLM-PDF |

### Extending the library

- **New LLM**: extend `LLM`, implement `load()` + `generate(input: Dict) -> str`, register in `builder.py` `with_llm()`
- **New embeddings**: extend `EmbeddingsModel`, implement `load()` + `embed_documents()` + `embed_query()`, register in `builder.py` `with_embeddings()`
- **New vector store**: extend `VectorStore`, implement abstract methods, register in `builder.py` `with_vector_store()`
- **New document processor**: extend `DocumentProcessor`, implement `process()`, register in `DocumentProcessorFactory.get_processor()`

### Pipeline entry points

- `src/raglight/rag/simple_rag_api.py` → `RAGPipeline` (high-level, recommended for users)
- `src/raglight/rag/builder.py` → `Builder` (fluent API for custom pipelines)
- `src/raglight/rag/rag.py` → `RAG` (core LangGraph state machine)
- `src/raglight/rag/simple_agentic_rag_api.py` → `AgenticRAGPipeline` (agent mode with MCP tools)

### Configuration

All configs are `@dataclass` in `src/raglight/config/`. Provider constants (e.g. `Settings.OLLAMA`, `Settings.CHROMA`) are defined in `src/raglight/config/settings.py`.

## Testing conventions

- Framework: `unittest` with `unittest.mock`
- Tests live in `tests/` mirroring the `src/raglight/` structure
- Shared test constants in `tests/test_config.py`
- Mock pattern: instantiate the class with `preload_model=False` (or equivalent), then inject `MagicMock()` into the relevant attribute before calling methods
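As a hedged sketch of that mock pattern: the `FakeLLM` class and its `model` attribute below are illustrative stand-ins, not classes from this repository.

```python
import unittest
from unittest.mock import MagicMock


class FakeLLM:
    """Stand-in for an LLM wrapper; preload_model=False skips loading weights."""

    def __init__(self, preload_model: bool = True) -> None:
        self.model = self._load() if preload_model else None

    def _load(self):
        raise RuntimeError("would download a real model")

    def generate(self, prompt: str) -> str:
        return self.model.invoke(prompt)


class TestFakeLLM(unittest.TestCase):
    def test_generate_uses_injected_mock(self) -> None:
        llm = FakeLLM(preload_model=False)  # skip the expensive load
        llm.model = MagicMock()             # inject the mock attribute
        llm.model.invoke.return_value = "mocked answer"
        self.assertEqual(llm.generate("hi"), "mocked answer")
        llm.model.invoke.assert_called_once_with("hi")
```

Running this under `python -m unittest` exercises the injected mock without loading any real model.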

README.md

Lines changed: 73 additions & 0 deletions
@@ -43,6 +43,7 @@
- [MCP Integration](#mcp-integration)
- [Use Custom Pipeline](#use-custom-pipeline)
- [Override Default Processors](#override-default-processors)
- [Hybrid Search](#hybrid-search-bm25--semantic--rrf-)
- [Use RAGLight with Docker](#use-raglight-with-docker)

@@ -73,6 +74,7 @@
- 🔌 **MCP Integration**: Add external tool capabilities (e.g. code execution, database access) via MCP servers.
- **Flexible Document Support**: Ingest and index various document types (e.g., PDF, TXT, DOCX, Python, Javascript, ...).
- **Extensible Architecture**: Easily swap vector stores, embedding models, or LLMs to suit your needs.
- 🔍 **Hybrid Search (BM25 + Semantic + RRF)**: Combine keyword-based BM25 retrieval with dense vector search using Reciprocal Rank Fusion for best-of-both-worlds results.

---

@@ -499,6 +501,77 @@
With this setup, all `.pdf` files will be processed by your custom `VlmPDFProcessor`, while other file types keep using the default processors.

### Hybrid Search (BM25 + Semantic + RRF) 🔍

RAGLight supports three retrieval strategies, configurable via the `search_type` parameter:

| Mode | Description |
|---|---|
| `"semantic"` | Dense vector similarity search (default) |
| `"bm25"` | Keyword-based BM25 search |
| `"hybrid"` | BM25 + semantic merged with Reciprocal Rank Fusion (RRF) |

#### With the Builder API

```python
from raglight.rag.builder import Builder
from raglight.config.settings import Settings

rag = (
    Builder()
    .with_embeddings(Settings.HUGGINGFACE, model_name="all-MiniLM-L6-v2")
    .with_vector_store(
        Settings.CHROMA,
        persist_directory="./myDb",
        collection_name="my_collection",
        search_type=Settings.SEARCH_HYBRID,  # "semantic" | "bm25" | "hybrid"
        alpha=0.5,  # weight between semantic and BM25 in RRF
    )
    .with_llm(Settings.OLLAMA, model_name="llama3.1:8b")
    .build_rag(k=5)
)

rag.vector_store.ingest(data_path="./docs")
response = rag.generate("What is Reciprocal Rank Fusion?")
print(response)
```

#### With the high-level RAGPipeline API

```python
from raglight.rag.simple_rag_api import RAGPipeline
from raglight.config.rag_config import RAGConfig
from raglight.config.vector_store_config import VectorStoreConfig
from raglight.config.settings import Settings
from raglight.models.data_source_model import FolderSource

vector_store_config = VectorStoreConfig(
    embedding_model=Settings.DEFAULT_EMBEDDINGS_MODEL,
    provider=Settings.HUGGINGFACE,
    database=Settings.CHROMA,
    persist_directory="./myDb",
    collection_name="my_collection",
    search_type=Settings.SEARCH_HYBRID,  # or SEARCH_SEMANTIC / SEARCH_BM25
    hybrid_alpha=0.5,
)

config = RAGConfig(
    llm="llama3.1:8b",
    provider=Settings.OLLAMA,
    k=5,
    knowledge_base=[FolderSource(path="./docs")],
)

pipeline = RAGPipeline(config, vector_store_config)
pipeline.build()
response = pipeline.generate("Explain the retrieval pipeline")
print(response)
```

> **How RRF works**: each search mode returns its own ranked list of documents. RRF assigns a score of `1 / (k + rank)` to each document per list and sums them — documents appearing high in both lists are promoted, while documents unique to one list are kept but ranked lower. This gives the hybrid mode better recall and precision than either mode alone.

> See the full working example in [examples/hybrid_search_example.py](examples/hybrid_search_example.py).
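The RRF merge described above can be sketched in a few lines of plain Python. This is an illustrative sketch, not RAGLight's internal implementation: `rrf_merge` and its parameters are hypothetical names, and `k=60` is the constant commonly used in the RRF literature.

```python
from collections import defaultdict
from typing import Dict, List


def rrf_merge(bm25_ranked: List[str], semantic_ranked: List[str],
              alpha: float = 0.5, k: int = 60) -> List[str]:
    """Fuse two ranked lists of doc IDs with Reciprocal Rank Fusion.

    Each list contributes 1 / (k + rank) per document; alpha weights the
    semantic list against the BM25 list (0 = BM25 only, 1 = semantic only).
    """
    scores: Dict[str, float] = defaultdict(float)
    for rank, doc_id in enumerate(bm25_ranked, start=1):
        scores[doc_id] += (1 - alpha) / (k + rank)
    for rank, doc_id in enumerate(semantic_ranked, start=1):
        scores[doc_id] += alpha / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


# "b" appears in both lists, so it outranks documents unique to one list.
print(rrf_merge(["a", "b", "c"], ["b", "d"]))  # "b" comes first
```

Because RRF uses only ranks, not raw scores, the BM25 and cosine-similarity scales never need to be normalized against each other.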
---

## Use RAGLight with Docker

examples/hybrid_search_example.py

Lines changed: 110 additions & 0 deletions
@@ -0,0 +1,110 @@
```python
"""
Hybrid Search Example
=====================
Demonstrates the three search modes available in RAGLight:
- "semantic" : vector similarity only (default)
- "bm25"     : keyword-based BM25 search only
- "hybrid"   : BM25 + semantic combined via Reciprocal Rank Fusion (RRF)

Requirements:
- Ollama running locally with llama3 (or any model you prefer)
- rank_bm25 installed (pip install raglight includes it)
"""

import uuid
from raglight.rag.builder import Builder
from raglight.config.settings import Settings
from dotenv import load_dotenv

load_dotenv()
Settings.setup_logging()

persist_directory = "./hybridDb"
model_embeddings = Settings.DEFAULT_EMBEDDINGS_MODEL
model_name = "llama3.1:8b"
collection_name = str(uuid.uuid4())
data_path = "./src/raglight"  # folder to ingest — adjust to your own documents

# ── 1. Semantic search (default behaviour) ──────────────────────────────────
print("\n=== Semantic search ===")
rag_semantic = (
    Builder()
    .with_embeddings(Settings.HUGGINGFACE, model_name=model_embeddings)
    .with_vector_store(
        Settings.CHROMA,
        persist_directory=persist_directory,
        collection_name=collection_name,
        search_type=Settings.SEARCH_SEMANTIC,  # default — can be omitted
    )
    .with_llm(Settings.OLLAMA, model_name=model_name, system_prompt=Settings.DEFAULT_SYSTEM_PROMPT)
    .build_rag(k=5)
)
rag_semantic.vector_store.ingest(data_path=data_path)
response = rag_semantic.generate("How do I create a RAG pipeline with RAGLight?")
print(response)

# ── 2. BM25-only search ──────────────────────────────────────────────────────
print("\n=== BM25 search ===")
rag_bm25 = (
    Builder()
    .with_embeddings(Settings.HUGGINGFACE, model_name=model_embeddings)
    .with_vector_store(
        Settings.CHROMA,
        persist_directory=persist_directory,
        collection_name=collection_name + "_bm25",
        search_type=Settings.SEARCH_BM25,
    )
    .with_llm(Settings.OLLAMA, model_name=model_name, system_prompt=Settings.DEFAULT_SYSTEM_PROMPT)
    .build_rag(k=5)
)
rag_bm25.vector_store.ingest(data_path=data_path)
response = rag_bm25.generate("What classes are available in the vectorstore module?")
print(response)

# ── 3. Hybrid search (BM25 + semantic via RRF) ───────────────────────────────
print("\n=== Hybrid search (RRF) ===")
rag_hybrid = (
    Builder()
    .with_embeddings(Settings.HUGGINGFACE, model_name=model_embeddings)
    .with_vector_store(
        Settings.CHROMA,
        persist_directory=persist_directory,
        collection_name=collection_name + "_hybrid",
        search_type=Settings.SEARCH_HYBRID,
        alpha=0.5,  # weight of semantic vs BM25 in the RRF merge (0=BM25 only, 1=semantic only)
    )
    .with_llm(Settings.OLLAMA, model_name=model_name, system_prompt=Settings.DEFAULT_SYSTEM_PROMPT)
    .build_rag(k=5)
)
rag_hybrid.vector_store.ingest(data_path=data_path)
response = rag_hybrid.generate("Explain the Builder pattern used in RAGLight")
print(response)

# ── 4. Hybrid search via VectorStoreConfig (high-level API) ─────────────────
print("\n=== Hybrid search via RAGPipeline (high-level API) ===")
from raglight.rag.simple_rag_api import RAGPipeline
from raglight.config.rag_config import RAGConfig
from raglight.config.vector_store_config import VectorStoreConfig
from raglight.models.data_source_model import FolderSource

vector_store_config = VectorStoreConfig(
    embedding_model=Settings.DEFAULT_EMBEDDINGS_MODEL,
    provider=Settings.HUGGINGFACE,
    database=Settings.CHROMA,
    persist_directory=persist_directory,
    collection_name=collection_name + "_api",
    search_type=Settings.SEARCH_HYBRID,  # <-- hybrid mode
    hybrid_alpha=0.5,
)

config = RAGConfig(
    llm=model_name,
    provider=Settings.OLLAMA,
    k=5,
    knowledge_base=[FolderSource(path=data_path)],
)

pipeline = RAGPipeline(config, vector_store_config)
pipeline.build()
response = pipeline.generate("How does the ChromaVS vector store work?")
print(response)
```

pyproject.toml

Lines changed: 1 addition & 0 deletions
```diff
@@ -36,6 +36,7 @@ dependencies = [
     "nest-asyncio>=1.6.0",
     "typing-extensions==4.15.0",
     "nltk>=3.9.2",
+    "rank_bm25>=0.2.2",
 ]
 [project.scripts]
 raglight = "raglight.cli.main:app"
```

src/raglight/config/settings.py

Lines changed: 4 additions & 0 deletions
```diff
@@ -185,6 +185,10 @@ def example_function():
 THINKING_PATTERN = r"<think>(.*?)</think>"
 DEFAULT_K = 5

+SEARCH_SEMANTIC = "semantic"
+SEARCH_BM25 = "bm25"
+SEARCH_HYBRID = "hybrid"
+
 DEFAULT_IGNORE_FOLDERS = [
     ".venv",
     "venv",
```

src/raglight/config/vector_store_config.py

Lines changed: 2 additions & 0 deletions
```diff
@@ -17,3 +17,5 @@ class VectorStoreConfig:
     ignore_folders: list = field(
         default_factory=lambda: list(Settings.DEFAULT_IGNORE_FOLDERS)
     )
+    search_type: str = field(default=Settings.SEARCH_SEMANTIC)
+    hybrid_alpha: float = 0.5
```

src/raglight/rag/builder.py

Lines changed: 8 additions & 1 deletion
```diff
@@ -114,7 +114,14 @@ def with_vector_store(self, type: str, **kwargs) -> Builder:
                 "You need to set an embedding model before setting a vector store"
             )
         elif type == Settings.CHROMA:
-            self.vector_store = ChromaVS(embeddings_model=self.embeddings, **kwargs)
+            search_type = kwargs.pop("search_type", Settings.SEARCH_SEMANTIC)
+            alpha = kwargs.pop("alpha", 0.5)
+            self.vector_store = ChromaVS(
+                embeddings_model=self.embeddings,
+                search_type=search_type,
+                alpha=alpha,
+                **kwargs,
+            )
         else:
             raise ValueError(f"Unknown VectorStore type: {type}")
         logging.info("✅ VectorStore created")
```
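The builder pops `search_type` and `alpha` out of `**kwargs` before forwarding the rest. A minimal sketch of why that order matters (`make_store` is a hypothetical stand-in for the `ChromaVS` call, not RAGLight code): if the option were left in `kwargs` and also passed explicitly, Python would raise `TypeError: got multiple values for keyword argument`.

```python
def make_store(**kwargs):
    # Pop the hybrid-search options out of kwargs first, so they are not
    # passed twice when the remaining kwargs are forwarded along.
    search_type = kwargs.pop("search_type", "semantic")
    alpha = kwargs.pop("alpha", 0.5)
    return {"search_type": search_type, "alpha": alpha, **kwargs}


print(make_store(collection_name="docs", search_type="hybrid"))
# {'search_type': 'hybrid', 'alpha': 0.5, 'collection_name': 'docs'}
```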

src/raglight/rag/simple_rag_api.py

Lines changed: 2 additions & 0 deletions
```diff
@@ -58,6 +58,8 @@ def __init__(
             collection_name=collection_name,
             host=vector_store_config.host,
             port=vector_store_config.port,
+            search_type=vector_store_config.search_type,
+            alpha=vector_store_config.hybrid_alpha,
         )
         .with_llm(
             provider,
```
Lines changed: 43 additions & 0 deletions
@@ -0,0 +1,43 @@
```python
from __future__ import annotations
import json
import re
from pathlib import Path
from typing import List, Tuple, Optional

from rank_bm25 import BM25Okapi


class BM25Index:
    """Lightweight BM25 index over a list of text documents."""

    def __init__(self) -> None:
        self.corpus: List[str] = []
        self._bm25: Optional[BM25Okapi] = None

    def _tokenize(self, text: str) -> List[str]:
        return re.findall(r"\w+", text.lower())

    def _rebuild(self) -> None:
        if self.corpus:
            self._bm25 = BM25Okapi([self._tokenize(t) for t in self.corpus])
        else:
            self._bm25 = None

    def add_documents(self, texts: List[str]) -> None:
        self.corpus.extend(texts)
        self._rebuild()

    def search(self, query: str, k: int) -> List[Tuple[int, float]]:
        if not self._bm25 or not self.corpus:
            return []
        tokens = self._tokenize(query)
        scores = self._bm25.get_scores(tokens)
        indexed = sorted(enumerate(scores), key=lambda x: x[1], reverse=True)
        return [(idx, score) for idx, score in indexed[:k]]

    def save(self, path: Path) -> None:
        path.write_text(json.dumps(self.corpus, ensure_ascii=False), encoding="utf-8")

    def load(self, path: Path) -> None:
        self.corpus = json.loads(path.read_text(encoding="utf-8"))
        self._rebuild()
```
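For intuition, the Okapi BM25 score that `BM25Okapi.get_scores` returns can be sketched with the textbook formula. This is a stdlib-only sketch under the usual parameter choices `k1=1.5`, `b=0.75`; the `rank_bm25` library may differ in details such as how it floors negative IDF values.

```python
import math
import re
from typing import List


def tokenize(text: str) -> List[str]:
    return re.findall(r"\w+", text.lower())


def bm25_scores(query: str, docs: List[str], k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score each document against the query with the classic Okapi BM25 formula."""
    corpus = [tokenize(d) for d in docs]
    n = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n  # average document length
    scores = []
    for doc in corpus:
        score = 0.0
        for term in tokenize(query):
            df = sum(1 for d in corpus if term in d)         # document frequency
            idf = math.log((n - df + 0.5) / (df + 0.5) + 1)  # smoothed IDF
            tf = doc.count(term)                             # term frequency
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
        scores.append(score)
    return scores


docs = ["reciprocal rank fusion merges ranked lists", "dense vectors capture semantics"]
scores = bm25_scores("rank fusion", docs)
# The first document contains both query terms, so it scores strictly higher.
```

The length normalization term `(1 - b + b * len(doc) / avgdl)` is what keeps long documents from winning on raw term counts alone.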
