Commit 9745c30

PsiACE and DavdGao authored
feat(rag): add OceanBaseStore as a vector database choice (#1078)
Signed-off-by: Chojan Shang <psiace@apache.org>
Co-authored-by: DavdGao <gaodawei.gdw@alibaba-inc.com>
1 parent 8b5b350 commit 9745c30

7 files changed: +1229 -0 lines changed

Lines changed: 164 additions & 0 deletions
@@ -0,0 +1,164 @@
# OceanBase Vector Store

This example demonstrates how to use **OceanBaseStore** for vector storage and semantic search in AgentScope.
It includes CRUD operations, metadata filtering, document chunking, and distance metric tests.

## Quick Start

Install dependencies (including `pyobvector`):

```bash
pip install -e ".[full]"
```

Start seekdb (a minimal OceanBase-compatible instance):

```bash
docker run -d -p 2881:2881 oceanbase/seekdb
```
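
The container may need a moment before it accepts connections. The sketch below polls the mapped port until a TCP connection succeeds; `wait_for_port` is a hypothetical helper and is not part of AgentScope or `pyobvector`. An open port only shows that something is listening, not that the database has finished initializing.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0) -> bool:
    """Return True once a TCP connection succeeds, False if the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect proves only that the port is open.
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            # Not reachable yet: back off briefly and retry.
            time.sleep(0.5)
    return False
```

Calling `wait_for_port("127.0.0.1", 2881)` before running the example script is one way to avoid an early connection error.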

Run the example script:

```bash
python main.py
```

> **Note:** The script defaults to `127.0.0.1:2881`, user `root`, database `test`.
> If you use a multi-tenant OceanBase account (e.g., `root@test`), override these defaults via environment variables.

## Usage

### Initialize Store

```python
from agentscope.rag import OceanBaseStore

store = OceanBaseStore(
    collection_name="test_collection",
    dimensions=768,
    distance="COSINE",
    uri="127.0.0.1:2881",
    user="root",
    password="",
    db_name="test",
)
```

### Add Documents

```python
from agentscope.rag import Document, DocMetadata
from agentscope.message import TextBlock

doc = Document(
    metadata=DocMetadata(
        content=TextBlock(type="text", text="Your document text"),
        doc_id="doc_1",
        chunk_id=0,
        total_chunks=1,
    ),
    # Placeholder vector: a real embedding must have `dimensions` values
    # (768 for the store created above).
    embedding=[0.1, 0.2, 0.3],
)

# `add` is awaited, so call it from inside an async function.
await store.add([doc])
```

### Search

```python
results = await store.search(
    query_embedding=[0.1, 0.2, 0.3],
    limit=5,
    score_threshold=0.9,
)
```

### Filter Search

```python
client = store.get_client()
table = client.load_table(collection_name="test_collection")

results = await store.search(
    query_embedding=[0.1, 0.2, 0.3],
    limit=5,
    flter=[table.c["doc_id"].like("doc%")],
)
```

> **Note:** The parameter is named `flter` (without the "i") to avoid shadowing
> Python's built-in `filter`, following the underlying library's convention.

### Delete

```python
client = store.get_client()
table = client.load_table(collection_name="test_collection")

await store.delete(where=[table.c["doc_id"] == "doc_1"])
```

## Distance Metrics

| Metric | Description | Best For |
|--------|-------------|----------|
| **COSINE** | Cosine similarity | Text embeddings (recommended) |
| **L2** | Euclidean distance | Spatial data |
| **IP** | Inner product | Recommendation systems |
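
To make the differences between the three metrics concrete, here is a minimal pure-Python sketch of their textbook definitions. It only illustrates the math. It does not show how OceanBase computes or scores results internally, and the function names are chosen for this example.

```python
import math


def inner_product(a, b):
    """Sum of element-wise products; grows with vector magnitude."""
    return sum(x * y for x, y in zip(a, b))


def l2_distance(a, b):
    """Euclidean distance; 0.0 means the vectors are identical."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Inner product of the normalized vectors; ignores magnitude."""
    norm_a = math.sqrt(inner_product(a, a))
    norm_b = math.sqrt(inner_product(b, b))
    return inner_product(a, b) / (norm_a * norm_b)


query = [1.0, 0.0]
same_direction = [2.0, 0.0]  # same direction as query, larger magnitude
orthogonal = [0.0, 1.0]      # perpendicular to query
```

For `query` and `same_direction`, cosine similarity treats the two vectors as a perfect match, while L2 distance and inner product both reflect the difference in magnitude. This is one reason cosine is a common default for text embeddings.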

## Filter Expressions

Build filters using SQLAlchemy expressions and pass them via `flter`:

```python
table = store.get_client().load_table("test_collection")

filters = [
    table.c["doc_id"] == "doc_1",
    table.c["doc_id"].like("prefix%"),
    table.c["chunk_id"] >= 0,
]
```

## Advanced Usage

### Access Underlying Client

```python
client = store.get_client()
stats = client.get_collection_stats(collection_name="test_collection")
```

### Document Metadata

- `content`: Text content (`TextBlock`)
- `doc_id`: Unique document identifier
- `chunk_id`: Chunk position (0-indexed)
- `total_chunks`: Total chunks in document
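
The sketch below shows how these four fields relate when one document is split into several chunks. It uses plain dictionaries and fixed-size character chunks purely for illustration. `chunk_text` is a hypothetical helper, and the actual chunking used by the example script or by AgentScope's readers may differ.

```python
def chunk_text(text: str, doc_id: str, chunk_size: int = 20) -> list[dict]:
    """Split text into fixed-size character chunks carrying the metadata fields."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    pieces = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    return [
        {
            "content": piece,
            "doc_id": doc_id,             # shared by every chunk of the document
            "chunk_id": index,            # 0-indexed position within the document
            "total_chunks": len(pieces),  # same value on every chunk
        }
        for index, piece in enumerate(pieces)
    ]


chunks = chunk_text("OceanBase stores vectors next to relational data.", "doc_1")
```

Each dictionary maps onto the keyword arguments of `DocMetadata`, with `content` wrapped in a `TextBlock` as in the Add Documents example.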

## FAQ

**What embedding dimension should I use?**
Match your embedding model's output dimension (e.g., 768 for BERT-base, 1536 for OpenAI `text-embedding-ada-002`).

**Can I change the distance metric after creation?**
No. Create a new collection with the desired metric.

**How do I clean up test data?**
Drop the collection via the underlying client or remove the seekdb container volume.
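
For the dimension question above, a small guard can catch mismatched vectors before they reach the store. This is a sketch only: `check_dimensions` is a hypothetical helper, and whether `OceanBaseStore` performs a similar check itself is not stated here.

```python
def check_dimensions(embedding, dimensions):
    """Return the embedding unchanged, or raise if its length is wrong."""
    if len(embedding) != dimensions:
        raise ValueError(
            f"embedding has {len(embedding)} values, expected {dimensions}"
        )
    return embedding
```

With a store created using `dimensions=768`, passing a three-value placeholder such as `[0.1, 0.2, 0.3]` through this helper raises `ValueError`.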

## Environment Variables

The script supports the following environment variables to override connection settings:

```bash
export OCEANBASE_URI="127.0.0.1:2881"
export OCEANBASE_USER="root"
export OCEANBASE_PASSWORD=""
export OCEANBASE_DB="test"
```
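
How `main.py` reads these variables is not shown in this README, so the following is only a sketch of one way to do it. `load_connection_settings` is a hypothetical helper. The fallback values mirror the defaults stated in Quick Start, and the dictionary keys mirror the keyword arguments used in Initialize Store.

```python
import os


def load_connection_settings(env=None) -> dict:
    """Build connection keyword arguments, falling back to the documented defaults."""
    env = os.environ if env is None else env
    return {
        "uri": env.get("OCEANBASE_URI", "127.0.0.1:2881"),
        "user": env.get("OCEANBASE_USER", "root"),
        "password": env.get("OCEANBASE_PASSWORD", ""),
        "db_name": env.get("OCEANBASE_DB", "test"),
    }
```

The result could then be unpacked into the constructor, for example `OceanBaseStore(collection_name="test_collection", dimensions=768, distance="COSINE", **load_connection_settings())`.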

## References

- [pyobvector (OceanBase Vector Store client)](https://github.com/oceanbase/pyobvector)
- [AgentScope RAG Tutorial](https://doc.agentscope.io/tutorial/task_rag.html)
