Commit 7bd3f05

feat: implement hybrid search functionality with BM25 and semantic retrieval

1 parent b612262 commit 7bd3f05

12 files changed: +539 −6 lines changed


CLAUDE.md

Lines changed: 70 additions & 0 deletions
@@ -0,0 +1,70 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Commands

```bash
# Install dependencies
make install  # uv pip install -r pyproject.toml

# Run all tests
make test  # PYTHONPATH=src python3 -m unittest -v

# Run a single test module
PYTHONPATH=src python3 -m unittest tests.tests_rag.test_rag_pipeline

# Format code
uv run black .
```

Python >= 3.12 required. Uses `uv` as package manager.

## Architecture

RAGLight is a modular RAG library built around the **Builder pattern** for pipeline composition. The core abstraction is a LangGraph `StateGraph` (`retrieve → generate`) that orchestrates embeddings, vector store, and LLM.

### Data flow

```
FolderSource / GitHubSource
  → DocumentProcessorFactory (PDF / Code / Text / VLM-PDF)
  → EmbeddingsModel.embed_documents()
  → VectorStore.ingest()

Query → VectorStore.similarity_search() → [CrossEncoder rerank] → LLM.generate() → Answer
```

### Key abstractions (all use ABC + strategy pattern)

| Abstraction | Location | Implementations |
|---|---|---|
| `LLM` | `src/raglight/llm/llm.py` | Ollama, LMStudio, Mistral, OpenAI, Gemini |
| `EmbeddingsModel` | `src/raglight/embeddings/embeddings_model.py` | HuggingFace, Ollama, OpenAI, Gemini |
| `VectorStore` | `src/raglight/vectorstore/vector_store.py` | ChromaVS only |
| `DocumentProcessor` | `src/raglight/document_processing/document_processor.py` | PDF, Code, Text, VLM-PDF |

### Extending the library

- **New LLM**: extend `LLM`, implement `load()` + `generate(input: Dict) -> str`, register in `builder.py` `with_llm()`
- **New embeddings**: extend `EmbeddingsModel`, implement `load()` + `embed_documents()` + `embed_query()`, register in `builder.py` `with_embeddings()`
- **New vector store**: extend `VectorStore`, implement abstract methods, register in `builder.py` `with_vector_store()`
- **New document processor**: extend `DocumentProcessor`, implement `process()`, register in `DocumentProcessorFactory.get_processor()`

### Pipeline entry points

- `src/raglight/rag/simple_rag_api.py` → `RAGPipeline` (high-level, recommended for users)
- `src/raglight/rag/builder.py` → `Builder` (fluent API for custom pipelines)
- `src/raglight/rag/rag.py` → `RAG` (core LangGraph state machine)
- `src/raglight/rag/simple_agentic_rag_api.py` → `AgenticRAGPipeline` (agent mode with MCP tools)

### Configuration

All configs are `@dataclass` in `src/raglight/config/`. Provider constants (e.g. `Settings.OLLAMA`, `Settings.CHROMA`) are defined in `src/raglight/config/settings.py`.

## Testing conventions

- Framework: `unittest` with `unittest.mock`
- Tests live in `tests/` mirroring the `src/raglight/` structure
- Shared test constants in `tests/test_config.py`
- Mock pattern: instantiate the class with `preload_model=False` (or equivalent), then inject `MagicMock()` into the relevant attribute before calling methods
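As a hedged sketch of that mock pattern: the `FakeLLM` class and its `model` attribute below are illustrative stand-ins, not classes from this repository.

```python
import unittest
from unittest.mock import MagicMock


class FakeLLM:
    """Stand-in for an LLM wrapper; preload_model=False skips loading weights."""

    def __init__(self, preload_model: bool = True) -> None:
        self.model = self._load() if preload_model else None

    def _load(self):
        raise RuntimeError("would download a real model")

    def generate(self, prompt: str) -> str:
        return self.model.invoke(prompt)


class TestFakeLLM(unittest.TestCase):
    def test_generate_uses_injected_mock(self) -> None:
        llm = FakeLLM(preload_model=False)  # skip the expensive load
        llm.model = MagicMock()             # inject the mock attribute
        llm.model.invoke.return_value = "mocked answer"
        self.assertEqual(llm.generate("hi"), "mocked answer")
        llm.model.invoke.assert_called_once_with("hi")
```

Running this under `python -m unittest` exercises the injected mock without loading any real model.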

README.md

Lines changed: 73 additions & 0 deletions
@@ -43,6 +43,7 @@
- [MCP Integration](#mcp-integration)
- [Use Custom Pipeline](#use-custom-pipeline)
- [Override Default Processors](#override-default-processors)
- [Hybrid Search](#hybrid-search-bm25--semantic--rrf-)
- [Use RAGLight with Docker](#use-raglight-with-docker)

@@ -73,6 +74,7 @@
- 🔌 **MCP Integration**: Add external tool capabilities (e.g. code execution, database access) via MCP servers.
- **Flexible Document Support**: Ingest and index various document types (e.g., PDF, TXT, DOCX, Python, Javascript, ...).
- **Extensible Architecture**: Easily swap vector stores, embedding models, or LLMs to suit your needs.
- 🔍 **Hybrid Search (BM25 + Semantic + RRF)**: Combine keyword-based BM25 retrieval with dense vector search using Reciprocal Rank Fusion for best-of-both-worlds results.

---

@@ -499,6 +501,77 @@
With this setup, all `.pdf` files will be processed by your custom `VlmPDFProcessor`, while other file types keep using the default processors.

### Hybrid Search (BM25 + Semantic + RRF) 🔍

RAGLight supports three retrieval strategies, configurable via the `search_type` parameter:

| Mode | Description |
|---|---|
| `"semantic"` | Dense vector similarity search (default) |
| `"bm25"` | Keyword-based BM25 search |
| `"hybrid"` | BM25 + semantic merged with Reciprocal Rank Fusion (RRF) |

#### With the Builder API

```python
from raglight.rag.builder import Builder
from raglight.config.settings import Settings

rag = (
    Builder()
    .with_embeddings(Settings.HUGGINGFACE, model_name="all-MiniLM-L6-v2")
    .with_vector_store(
        Settings.CHROMA,
        persist_directory="./myDb",
        collection_name="my_collection",
        search_type=Settings.SEARCH_HYBRID,  # "semantic" | "bm25" | "hybrid"
        alpha=0.5,  # weight between semantic and BM25 in RRF
    )
    .with_llm(Settings.OLLAMA, model_name="llama3.1:8b")
    .build_rag(k=5)
)

rag.vector_store.ingest(data_path="./docs")
response = rag.generate("What is Reciprocal Rank Fusion?")
print(response)
```

#### With the high-level RAGPipeline API

```python
from raglight.rag.simple_rag_api import RAGPipeline
from raglight.config.rag_config import RAGConfig
from raglight.config.vector_store_config import VectorStoreConfig
from raglight.config.settings import Settings
from raglight.models.data_source_model import FolderSource

vector_store_config = VectorStoreConfig(
    embedding_model=Settings.DEFAULT_EMBEDDINGS_MODEL,
    provider=Settings.HUGGINGFACE,
    database=Settings.CHROMA,
    persist_directory="./myDb",
    collection_name="my_collection",
    search_type=Settings.SEARCH_HYBRID,  # or SEARCH_SEMANTIC / SEARCH_BM25
    hybrid_alpha=0.5,
)

config = RAGConfig(
    llm="llama3.1:8b",
    provider=Settings.OLLAMA,
    k=5,
    knowledge_base=[FolderSource(path="./docs")],
)

pipeline = RAGPipeline(config, vector_store_config)
pipeline.build()
response = pipeline.generate("Explain the retrieval pipeline")
print(response)
```

> **How RRF works**: each search mode returns its own ranked list of documents. RRF assigns a score of `1 / (k + rank)` to each document per list and sums them — documents appearing high in both lists are promoted, while documents unique to one list are kept but ranked lower. This gives the hybrid mode better recall and precision than either mode alone.

> See the full working example in [examples/hybrid_search_example.py](examples/hybrid_search_example.py).
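The RRF merge described above can be sketched in a few lines of plain Python. This is an illustrative sketch, not RAGLight's internal implementation: `rrf_merge` and its parameters are hypothetical names, and `k=60` is the constant commonly used in the RRF literature.

```python
from collections import defaultdict
from typing import Dict, List


def rrf_merge(bm25_ranked: List[str], semantic_ranked: List[str],
              alpha: float = 0.5, k: int = 60) -> List[str]:
    """Fuse two ranked lists of doc IDs with Reciprocal Rank Fusion.

    Each list contributes 1 / (k + rank) per document; alpha weights the
    semantic list against the BM25 list (0 = BM25 only, 1 = semantic only).
    """
    scores: Dict[str, float] = defaultdict(float)
    for rank, doc_id in enumerate(bm25_ranked, start=1):
        scores[doc_id] += (1 - alpha) / (k + rank)
    for rank, doc_id in enumerate(semantic_ranked, start=1):
        scores[doc_id] += alpha / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


# "b" appears in both lists, so it outranks documents unique to one list.
print(rrf_merge(["a", "b", "c"], ["b", "d"]))  # "b" comes first
```

Because RRF uses only ranks, not raw scores, the BM25 and cosine-similarity scales never need to be normalized against each other.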
---

## Use RAGLight with Docker

examples/hybrid_search_example.py

Lines changed: 110 additions & 0 deletions
@@ -0,0 +1,110 @@
```python
"""
Hybrid Search Example
=====================
Demonstrates the three search modes available in RAGLight:
- "semantic" : vector similarity only (default)
- "bm25"     : keyword-based BM25 search only
- "hybrid"   : BM25 + semantic combined via Reciprocal Rank Fusion (RRF)

Requirements:
- Ollama running locally with llama3 (or any model you prefer)
- rank_bm25 installed (pip install raglight includes it)
"""

import uuid
from raglight.rag.builder import Builder
from raglight.config.settings import Settings
from dotenv import load_dotenv

load_dotenv()
Settings.setup_logging()

persist_directory = "./hybridDb"
model_embeddings = Settings.DEFAULT_EMBEDDINGS_MODEL
model_name = "llama3.1:8b"
collection_name = str(uuid.uuid4())
data_path = "./src/raglight"  # folder to ingest — adjust to your own documents

# ── 1. Semantic search (default behaviour) ──────────────────────────────────
print("\n=== Semantic search ===")
rag_semantic = (
    Builder()
    .with_embeddings(Settings.HUGGINGFACE, model_name=model_embeddings)
    .with_vector_store(
        Settings.CHROMA,
        persist_directory=persist_directory,
        collection_name=collection_name,
        search_type=Settings.SEARCH_SEMANTIC,  # default — can be omitted
    )
    .with_llm(Settings.OLLAMA, model_name=model_name, system_prompt=Settings.DEFAULT_SYSTEM_PROMPT)
    .build_rag(k=5)
)
rag_semantic.vector_store.ingest(data_path=data_path)
response = rag_semantic.generate("How do I create a RAG pipeline with RAGLight?")
print(response)

# ── 2. BM25-only search ──────────────────────────────────────────────────────
print("\n=== BM25 search ===")
rag_bm25 = (
    Builder()
    .with_embeddings(Settings.HUGGINGFACE, model_name=model_embeddings)
    .with_vector_store(
        Settings.CHROMA,
        persist_directory=persist_directory,
        collection_name=collection_name + "_bm25",
        search_type=Settings.SEARCH_BM25,
    )
    .with_llm(Settings.OLLAMA, model_name=model_name, system_prompt=Settings.DEFAULT_SYSTEM_PROMPT)
    .build_rag(k=5)
)
rag_bm25.vector_store.ingest(data_path=data_path)
response = rag_bm25.generate("What classes are available in the vectorstore module?")
print(response)

# ── 3. Hybrid search (BM25 + semantic via RRF) ───────────────────────────────
print("\n=== Hybrid search (RRF) ===")
rag_hybrid = (
    Builder()
    .with_embeddings(Settings.HUGGINGFACE, model_name=model_embeddings)
    .with_vector_store(
        Settings.CHROMA,
        persist_directory=persist_directory,
        collection_name=collection_name + "_hybrid",
        search_type=Settings.SEARCH_HYBRID,
        alpha=0.5,  # weight of semantic vs BM25 in the RRF merge (0=BM25 only, 1=semantic only)
    )
    .with_llm(Settings.OLLAMA, model_name=model_name, system_prompt=Settings.DEFAULT_SYSTEM_PROMPT)
    .build_rag(k=5)
)
rag_hybrid.vector_store.ingest(data_path=data_path)
response = rag_hybrid.generate("Explain the Builder pattern used in RAGLight")
print(response)

# ── 4. Hybrid search via VectorStoreConfig (high-level API) ─────────────────
print("\n=== Hybrid search via RAGPipeline (high-level API) ===")
from raglight.rag.simple_rag_api import RAGPipeline
from raglight.config.rag_config import RAGConfig
from raglight.config.vector_store_config import VectorStoreConfig
from raglight.models.data_source_model import FolderSource

vector_store_config = VectorStoreConfig(
    embedding_model=Settings.DEFAULT_EMBEDDINGS_MODEL,
    provider=Settings.HUGGINGFACE,
    database=Settings.CHROMA,
    persist_directory=persist_directory,
    collection_name=collection_name + "_api",
    search_type=Settings.SEARCH_HYBRID,  # <-- hybrid mode
    hybrid_alpha=0.5,
)

config = RAGConfig(
    llm=model_name,
    provider=Settings.OLLAMA,
    k=5,
    knowledge_base=[FolderSource(path=data_path)],
)

pipeline = RAGPipeline(config, vector_store_config)
pipeline.build()
response = pipeline.generate("How does the ChromaVS vector store work?")
print(response)
```

pyproject.toml

Lines changed: 1 addition & 0 deletions
```diff
@@ -36,6 +36,7 @@ dependencies = [
     "nest-asyncio>=1.6.0",
     "typing-extensions==4.15.0",
     "nltk>=3.9.2",
+    "rank_bm25>=0.2.2",
 ]
 [project.scripts]
 raglight = "raglight.cli.main:app"
```

src/raglight/config/settings.py

Lines changed: 4 additions & 0 deletions
```diff
@@ -185,6 +185,10 @@ def example_function():
 THINKING_PATTERN = r"<think>(.*?)</think>"
 DEFAULT_K = 5

+SEARCH_SEMANTIC = "semantic"
+SEARCH_BM25 = "bm25"
+SEARCH_HYBRID = "hybrid"
+
 DEFAULT_IGNORE_FOLDERS = [
     ".venv",
     "venv",
```

src/raglight/config/vector_store_config.py

Lines changed: 2 additions & 0 deletions
```diff
@@ -17,3 +17,5 @@ class VectorStoreConfig:
     ignore_folders: list = field(
         default_factory=lambda: list(Settings.DEFAULT_IGNORE_FOLDERS)
     )
+    search_type: str = field(default=Settings.SEARCH_SEMANTIC)
+    hybrid_alpha: float = 0.5
```

src/raglight/rag/builder.py

Lines changed: 8 additions & 1 deletion
```diff
@@ -114,7 +114,14 @@ def with_vector_store(self, type: str, **kwargs) -> Builder:
                 "You need to set an embedding model before setting a vector store"
             )
         elif type == Settings.CHROMA:
-            self.vector_store = ChromaVS(embeddings_model=self.embeddings, **kwargs)
+            search_type = kwargs.pop("search_type", Settings.SEARCH_SEMANTIC)
+            alpha = kwargs.pop("alpha", 0.5)
+            self.vector_store = ChromaVS(
+                embeddings_model=self.embeddings,
+                search_type=search_type,
+                alpha=alpha,
+                **kwargs,
+            )
         else:
             raise ValueError(f"Unknown VectorStore type: {type}")
         logging.info("✅ VectorStore created")
```
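The builder pops `search_type` and `alpha` out of `**kwargs` before forwarding the rest. A minimal sketch of why that order matters (`make_store` is a hypothetical stand-in for the `ChromaVS` call, not RAGLight code): if the option were left in `kwargs` and also passed explicitly, Python would raise `TypeError: got multiple values for keyword argument`.

```python
def make_store(**kwargs):
    # Pop the hybrid-search options out of kwargs first, so they are not
    # passed twice when the remaining kwargs are forwarded along.
    search_type = kwargs.pop("search_type", "semantic")
    alpha = kwargs.pop("alpha", 0.5)
    return {"search_type": search_type, "alpha": alpha, **kwargs}


print(make_store(collection_name="docs", search_type="hybrid"))
# {'search_type': 'hybrid', 'alpha': 0.5, 'collection_name': 'docs'}
```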

src/raglight/rag/simple_rag_api.py

Lines changed: 2 additions & 0 deletions
```diff
@@ -58,6 +58,8 @@ def __init__(
             collection_name=collection_name,
             host=vector_store_config.host,
             port=vector_store_config.port,
+            search_type=vector_store_config.search_type,
+            alpha=vector_store_config.hybrid_alpha,
         )
         .with_llm(
             provider,
```
Lines changed: 43 additions & 0 deletions
@@ -0,0 +1,43 @@
```python
from __future__ import annotations
import json
import re
from pathlib import Path
from typing import List, Tuple, Optional

from rank_bm25 import BM25Okapi


class BM25Index:
    """Lightweight BM25 index over a list of text documents."""

    def __init__(self) -> None:
        self.corpus: List[str] = []
        self._bm25: Optional[BM25Okapi] = None

    def _tokenize(self, text: str) -> List[str]:
        return re.findall(r"\w+", text.lower())

    def _rebuild(self) -> None:
        if self.corpus:
            self._bm25 = BM25Okapi([self._tokenize(t) for t in self.corpus])
        else:
            self._bm25 = None

    def add_documents(self, texts: List[str]) -> None:
        self.corpus.extend(texts)
        self._rebuild()

    def search(self, query: str, k: int) -> List[Tuple[int, float]]:
        if not self._bm25 or not self.corpus:
            return []
        tokens = self._tokenize(query)
        scores = self._bm25.get_scores(tokens)
        indexed = sorted(enumerate(scores), key=lambda x: x[1], reverse=True)
        return [(idx, score) for idx, score in indexed[:k]]

    def save(self, path: Path) -> None:
        path.write_text(json.dumps(self.corpus, ensure_ascii=False), encoding="utf-8")

    def load(self, path: Path) -> None:
        self.corpus = json.loads(path.read_text(encoding="utf-8"))
        self._rebuild()
```
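For intuition, the Okapi BM25 score that `BM25Okapi.get_scores` returns can be sketched with the textbook formula. This is a stdlib-only sketch under the usual parameter choices `k1=1.5`, `b=0.75`; the `rank_bm25` library may differ in details such as how it floors negative IDF values.

```python
import math
import re
from typing import List


def tokenize(text: str) -> List[str]:
    return re.findall(r"\w+", text.lower())


def bm25_scores(query: str, docs: List[str], k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score each document against the query with the classic Okapi BM25 formula."""
    corpus = [tokenize(d) for d in docs]
    n = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n  # average document length
    scores = []
    for doc in corpus:
        score = 0.0
        for term in tokenize(query):
            df = sum(1 for d in corpus if term in d)         # document frequency
            idf = math.log((n - df + 0.5) / (df + 0.5) + 1)  # smoothed IDF
            tf = doc.count(term)                             # term frequency
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
        scores.append(score)
    return scores


docs = ["reciprocal rank fusion merges ranked lists", "dense vectors capture semantics"]
scores = bm25_scores("rank fusion", docs)
# The first document contains both query terms, so it scores strictly higher.
```

The length normalization term `(1 - b + b * len(doc) / avgdl)` is what keeps long documents from winning on raw term counts alone.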
