| 1 | +# OceanBase Vector Store |
| 2 | + |
| 3 | +This example demonstrates how to use **OceanBaseStore** for vector storage and semantic search in AgentScope. |
| 4 | +It includes CRUD operations, metadata filtering, document chunking, and distance metric tests. |
| 5 | + |
| 6 | +## Quick Start |
| 7 | + |
| 8 | +Install dependencies (including `pyobvector`): |
| 9 | + |
| 10 | +```bash |
| 11 | +pip install -e ".[full]" |
| 12 | +``` |
| 13 | + |
| 14 | +Start seekdb (a minimal OceanBase-compatible instance): |
| 15 | + |
| 16 | +```bash |
| 17 | +docker run -d -p 2881:2881 oceanbase/seekdb |
| 18 | +``` |
| 19 | + |
| 20 | +Run the example script: |
| 21 | + |
| 22 | +```bash |
| 23 | +python main.py |
| 24 | +``` |
| 25 | + |
| 26 | +> **Note:** The script defaults to `127.0.0.1:2881`, user `root`, database `test`. |
| 27 | +> If you use a multi-tenant OceanBase account (e.g., `root@test`), override these defaults with the variables listed under [Environment Variables](#environment-variables). |
| 28 | + |
| 29 | +## Usage |
| 30 | + |
| 31 | +### Initialize Store |
| 32 | + |
| 33 | +```python |
| 34 | +from agentscope.rag import OceanBaseStore |
| 35 | + |
| 36 | +store = OceanBaseStore( |
| 37 | +    collection_name="test_collection", |
| 38 | +    dimensions=768, |
| 39 | +    distance="COSINE", |
| 40 | +    uri="127.0.0.1:2881", |
| 41 | +    user="root", |
| 42 | +    password="", |
| 43 | +    db_name="test", |
| 44 | +) |
| 45 | +``` |
| 46 | + |
| 47 | +### Add Documents |
| 48 | + |
| 49 | +```python |
| 50 | +from agentscope.rag import Document, DocMetadata |
| 51 | +from agentscope.message import TextBlock |
| 52 | + |
| 53 | +doc = Document( |
| 54 | +    metadata=DocMetadata( |
| 55 | +        content=TextBlock(type="text", text="Your document text"), |
| 56 | +        doc_id="doc_1", |
| 57 | +        chunk_id=0, |
| 58 | +        total_chunks=1, |
| 59 | +    ), |
| 60 | +    embedding=[0.1, 0.2, 0.3],  # placeholder; a real embedding needs `dimensions` (768) values |
| 61 | +) |
| 62 | + |
| 63 | +await store.add([doc]) |
| 64 | +``` |
| 65 | + |
| 66 | +### Search |
| 67 | + |
| 68 | +```python |
| 69 | +results = await store.search( |
| 70 | +    query_embedding=[0.1, 0.2, 0.3],  # placeholder; use a vector with `dimensions` values |
| 71 | +    limit=5, |
| 72 | +    score_threshold=0.9, |
| 73 | +) |
| 74 | +``` |
| 75 | + |
| 76 | +### Filter Search |
| 77 | + |
| 78 | +```python |
| 79 | +client = store.get_client() |
| 80 | +table = client.load_table(collection_name="test_collection") |
| 81 | + |
| 82 | +results = await store.search( |
| 83 | +    query_embedding=[0.1, 0.2, 0.3], |
| 84 | +    limit=5, |
| 85 | +    flter=[table.c["doc_id"].like("doc%")], |
| 86 | +) |
| 87 | +``` |
| 88 | + |
| 89 | +> **Note:** The parameter is named `flter` (without the "i") to avoid shadowing |
| 90 | +> Python's built-in `filter`, following the underlying library's convention. |
| 91 | + |
| 92 | +### Delete |
| 93 | + |
| 94 | +```python |
| 95 | +client = store.get_client() |
| 96 | +table = client.load_table(collection_name="test_collection") |
| 97 | + |
| 98 | +await store.delete(where=[table.c["doc_id"] == "doc_1"]) |
| 99 | +``` |
| 100 | + |
| 101 | +## Distance Metrics |
| 102 | + |
| 103 | +| Metric | Description | Best For | |
| 104 | +|--------|-------------|----------| |
| 105 | +| **COSINE** | Cosine similarity | Text embeddings (recommended) | |
| 106 | +| **L2** | Euclidean distance | Spatial data | |
| 107 | +| **IP** | Inner product | Recommendation systems | |
| 108 | + |
| 109 | +## Filter Expressions |
| 110 | + |
| 111 | +Build filters using SQLAlchemy expressions and pass them via `flter`: |
| 112 | + |
| 113 | +```python |
| 114 | +table = store.get_client().load_table("test_collection") |
| 115 | + |
| 116 | +filters = [ |
| 117 | +    table.c["doc_id"] == "doc_1", |
| 118 | +    table.c["doc_id"].like("prefix%"), |
| 119 | +    table.c["chunk_id"] >= 0, |
| 120 | +] |
| 121 | +``` |
| 122 | + |
| 123 | +## Advanced Usage |
| 124 | + |
| 125 | +### Access Underlying Client |
| 126 | + |
| 127 | +```python |
| 128 | +client = store.get_client() |
| 129 | +stats = client.get_collection_stats(collection_name="test_collection") |
| 130 | +``` |
| 131 | + |
| 132 | +### Document Metadata |
| 133 | + |
| 134 | +- `content`: Text content (TextBlock) |
| 135 | +- `doc_id`: Unique document identifier |
| 136 | +- `chunk_id`: Chunk position (0-indexed) |
| 137 | +- `total_chunks`: Total chunks in document |
| 138 | + |
| 139 | +## FAQ |
| 140 | + |
| 141 | +**What embedding dimension should I use?** |
| 142 | +Match your embedding model's output dimension (e.g., 768 for BERT-base, 1536 for OpenAI `text-embedding-ada-002`). |
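
A mismatch between the store's `dimensions` and the length of an embedding is easy to introduce, for example when switching models. The guard below is a minimal sketch of checking this before calling `store.add`; `check_embedding` is a hypothetical helper, and whether the store performs a similar check itself is not assumed.

```python
def check_embedding(embedding, dimensions):
    """Return the embedding unchanged, or raise if its length is wrong."""
    if len(embedding) != dimensions:
        raise ValueError(
            f"expected {dimensions} values, got {len(embedding)}"
        )
    return embedding


# A vector of the right length passes through unchanged.
ok = check_embedding([0.0] * 768, dimensions=768)

# A short placeholder vector is rejected.
try:
    check_embedding([0.1, 0.2, 0.3], dimensions=768)
except ValueError as error:
    print(f"rejected: {error}")
```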
| 143 | + |
| 144 | +**Can I change the distance metric after creation?** |
| 145 | +No. Create a new collection with the desired metric. |
| 146 | + |
| 147 | +**How do I clean up test data?** |
| 148 | +Drop the collection via the underlying client or remove the seekdb container volume. |
| 149 | + |
| 150 | +## Environment Variables |
| 151 | + |
| 152 | +The script supports the following environment variables to override connection settings: |
| 153 | + |
| 154 | +```bash |
| 155 | +export OCEANBASE_URI="127.0.0.1:2881" |
| 156 | +export OCEANBASE_USER="root" |
| 157 | +export OCEANBASE_PASSWORD="" |
| 158 | +export OCEANBASE_DB="test" |
| 159 | +``` |
| 160 | + |
| 161 | +## References |
| 162 | + |
| 163 | +- [pyobvector (OceanBase vector store SDK)](https://github.com/oceanbase/pyobvector) |
| 164 | +- [AgentScope RAG Tutorial](https://doc.agentscope.io/tutorial/task_rag.html) |