Commit 1727981

add pinecone as backend db (#1577)
1 parent ae0d527 commit 1727981

7 files changed: +302 -25 lines changed

README.md

Lines changed: 16 additions & 7 deletions
@@ -77,7 +77,7 @@ dfs: List[pd.DataFrame] = DeepFace.find(img_path = "img1.jpg", db_path = "C:/my_
7777

7878
<p align="center"><img src="https://raw.githubusercontent.com/serengil/deepface/master/icon/stock-6-v2.jpg" width="95%"></p>
7979

80-
Here, the `find` function relies on a directory-based face datastore and stores embeddings on disk. Alternatively, DeepFace provides a database-backed [`search`](https://sefiks.com/2026/01/01/introducing-brand-new-face-recognition-in-deepface/) functionality where embeddings are explicitly registered and queried. Currently, [postgres](https://sefiks.com/2023/06/22/vector-similarity-search-in-postgresql/), [mongo](https://sefiks.com/2021/01/22/deep-face-recognition-with-mongodb/), [neo4j](https://sefiks.com/2021/04/03/deep-face-recognition-with-neo4j/), [pgvector](https://sefiks.com/2024/07/05/postgres-as-a-vector-database-billion-scale-vector-similarity-search-with-pgvector/) and weaviate are supported as backend databases.
80+
Here, the `find` function relies on a directory-based face datastore and stores embeddings on disk. Alternatively, DeepFace provides a database-backed [`search`](https://sefiks.com/2026/01/01/introducing-brand-new-face-recognition-in-deepface/) functionality where embeddings are explicitly registered and queried. Currently, [postgres](https://sefiks.com/2023/06/22/vector-similarity-search-in-postgresql/), [mongo](https://sefiks.com/2021/01/22/deep-face-recognition-with-mongodb/), [neo4j](https://sefiks.com/2021/04/03/deep-face-recognition-with-neo4j/), [pgvector](https://sefiks.com/2024/07/05/postgres-as-a-vector-database-billion-scale-vector-similarity-search-with-pgvector/), [pinecone](https://sefiks.com/2021/05/19/large-scale-face-recognition-with-pinecone-vector-database/) and weaviate are supported as backend databases.
8181

8282
```python
8383
# register an image into the database
@@ -87,7 +87,7 @@ DeepFace.register(img = "img1.jpg")
8787
dfs: List[pd.DataFrame] = DeepFace.search(img = "target.jpg")
8888
```
8989

90-
If you want to perform [`approximate nearest neighbor`](https://sefiks.com/2023/12/31/a-step-by-step-approximate-nearest-neighbor-example-in-python-from-scratch/) search instead of exact search to achieve faster results on [large-scale databases](https://www.youtube.com/playlist?list=PLsS_1RYmYQQGSJu_Z3OVhXhGmZ86_zuIm), you can build an index beforehand and explicitly enable ANN search. Here, [Faiss](https://sefiks.com/2020/09/17/large-scale-face-recognition-with-facebook-faiss/) is used to index embeddings in postgres and mongo; whereas pgvector, weaviate and neo4j handle indexing internally.
90+
If you want to perform [`approximate nearest neighbor`](https://sefiks.com/2023/12/31/a-step-by-step-approximate-nearest-neighbor-example-in-python-from-scratch/) search instead of exact search to achieve faster results on [large-scale databases](https://www.youtube.com/playlist?list=PLsS_1RYmYQQGSJu_Z3OVhXhGmZ86_zuIm), you can build an index beforehand and explicitly enable ANN search. Here, [Faiss](https://sefiks.com/2020/09/17/large-scale-face-recognition-with-facebook-faiss/) is used to index embeddings in postgres and mongo; whereas vector databases such as pgvector, weaviate, pinecone and neo4j handle indexing internally.
9191

9292
```python
9393
# build index on registered embeddings (for postgres and mongo only)
@@ -316,11 +316,20 @@ cd scripts && ./dockerize.sh
316316
Face verification, facial attribute analysis, vector representation and register & search functions are covered in the API. The API accepts images as file uploads (via form data), or as exact image paths, URLs, or base64-encoded strings (via either JSON or form data).
317317

318318
```shell
319-
$ curl -X POST http://localhost:5005/represent -d '{"model_name":"Facenet", "img":"img1.jpg"}' -H "Content-Type: application/json"
320-
$ curl -X POST http://localhost:5005/verify -d '{"img1":"img1.jpg", "img2":"img3.jpg"}' -H "Content-Type: application/json"
321-
$ curl -X POST http://localhost:5005/analyze -d '{"img": "img2.jpg", "actions": ["age", "gender"]}' -H "Content-Type: application/json"
322-
$ curl -X POST http://localhost:5005/register -d '{"model_name":"Facenet", "img":"img18.jpg"}' -H "Content-Type: application/json"
323-
$ curl -X POST http://localhost:5005/search -d '{"img":"img1.jpg", "model_name":"Facenet"}' -H "Content-Type: application/json"
319+
$ curl -X POST http://localhost:5005/represent \
320+
-d '{"model_name":"Facenet", "img":"img1.jpg"}'
321+
322+
$ curl -X POST http://localhost:5005/verify \
323+
-d '{"img1":"img1.jpg", "img2":"img3.jpg"}'
324+
325+
$ curl -X POST http://localhost:5005/analyze \
326+
-d '{"img": "img2.jpg", "actions": ["age", "gender"]}'
327+
328+
$ curl -X POST http://localhost:5005/register \
329+
-d '{"model_name":"Facenet", "img":"img18.jpg"}'
330+
331+
$ curl -X POST http://localhost:5005/search \
332+
-d '{"img":"img1.jpg", "model_name":"Facenet"}'
324333
```
325334

326335
[`Here`](https://github.com/serengil/deepface/tree/master/deepface/api/postman), you can find a postman project to find out how these methods should be called.

deepface/DeepFace.py

Lines changed: 8 additions & 6 deletions
@@ -760,7 +760,7 @@ def register(
760760
Options: base, raw, Facenet, Facenet2018, VGGFace, VGGFace2, ArcFace (default is base).
761761
anti_spoofing (boolean): Flag to enable anti spoofing (default is False).
762762
database_type (str): Type of database to register identities. Options: 'postgres', 'mongo',
763-
'weaviate', 'neo4j', 'pgvector' (default is 'postgres').
763+
'weaviate', 'neo4j', 'pgvector', 'pinecone' (default is 'postgres').
764764
connection_details (dict or str): Connection details for the database.
765765
connection (Any): Existing database connection object. If provided, this connection
766766
will be used instead of creating a new one.
@@ -772,7 +772,7 @@ def register(
772772
- DEEPFACE_MONGO_URI
773773
- DEEPFACE_WEAVIATE_URI
774774
- DEEPFACE_NEO4J_URI
775-
775+
- DEEPFACE_PINECONE_API_KEY
776776
Returns:
777777
result (dict): A dictionary containing registration results with following keys.
778778
- inserted (int): Number of embeddings successfully registered to the database.
@@ -844,7 +844,7 @@ def search(
844844
search_method (str): Method to use for searching identities. Options: 'exact', 'ann'.
845845
To use ann search, you must run build_index function first to create the index.
846846
database_type (str): Type of database to search identities. Options: 'postgres', 'mongo',
847-
'weaviate', 'neo4j', 'pgvector' (default is 'postgres').
847+
'weaviate', 'neo4j', 'pgvector', 'pinecone' (default is 'postgres').
848848
connection_details (dict or str): Connection details for the database.
849849
connection (Any): Existing database connection object. If provided, this connection
850850
will be used instead of creating a new one.
@@ -856,7 +856,7 @@ def search(
856856
- DEEPFACE_MONGO_URI
857857
- DEEPFACE_WEAVIATE_URI
858858
- DEEPFACE_NEO4J_URI
859-
859+
- DEEPFACE_PINECONE_API_KEY
860860
Returns:
861861
results (List[pd.DataFrame]):
862862
A list of pandas dataframes or a list of dicts. Each dataframe or dict corresponds
@@ -919,7 +919,8 @@ def build_index(
919919
- Use this function after registering all identities to the database.
920920
- This function is resumable, run again whenever new identities are added to the db.
921921
- Vector databases handle indexing internally, so you don't need to use this function
922-
when using a vector database ('weaviate', 'neo4j', 'pgvector') as database_type.
922+
when using a vector database ('weaviate', 'neo4j', 'pgvector', 'pinecone')
923+
as database_type.
923924
924925
Args:
925926
model_name (str): Model for face recognition. Options: VGG-Face, Facenet, Facenet512,
@@ -933,7 +934,7 @@ def build_index(
933934
max_neighbors_per_node (int): Maximum number of neighbors per node in the index
934935
(default is 32).
935936
database_type (str): Type of database to build index. Options: 'postgres', 'mongo',
936-
'weaviate', 'neo4j', 'pgvector' (default is 'postgres').
937+
'weaviate', 'neo4j', 'pgvector', 'pinecone' (default is 'postgres').
937938
connection (Any): Existing database connection object. If provided, this connection
938939
will be used instead of creating a new one.
939940
connection_details (dict or str): Connection details for the database.
@@ -945,6 +946,7 @@ def build_index(
945946
- DEEPFACE_MONGO_URI
946947
- DEEPFACE_WEAVIATE_URI
947948
- DEEPFACE_NEO4J_URI
949+
- DEEPFACE_PINECONE_API_KEY
948950
"""
949951
return datastore.build_index(
950952
model_name=model_name,

deepface/api/src/dependencies/variables.py

Lines changed: 2 additions & 0 deletions
@@ -15,6 +15,8 @@ def __init__(self) -> None:
1515
conection_details = os.getenv("DEEPFACE_WEAVIATE_URI")
1616
elif self.database_type == "neo4j":
1717
conection_details = os.getenv("DEEPFACE_NEO4J_URI")
18+
elif self.database_type == "pinecone":
19+
conection_details = os.getenv("DEEPFACE_PINECONE_API_KEY")
1820
else:
1921
conection_details = None
2022
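
Review note: the new `pinecone` branch follows the same pattern as the other backends, with one environment variable per database type. A minimal sketch of that lookup written as a table, using only the variable names visible in this commit. The helper name `resolve_connection_details` and the `ENV_VAR_BY_DATABASE` dict are hypothetical and not part of DeepFace:

```python
import os
from typing import Dict, Optional

# Environment variable per backend, limited to the names shown in this diff.
# The mapping and the helper below are illustrative, not DeepFace code.
ENV_VAR_BY_DATABASE: Dict[str, str] = {
    "mongo": "DEEPFACE_MONGO_URI",
    "weaviate": "DEEPFACE_WEAVIATE_URI",
    "neo4j": "DEEPFACE_NEO4J_URI",
    "pinecone": "DEEPFACE_PINECONE_API_KEY",
}


def resolve_connection_details(database_type: str) -> Optional[str]:
    """Return the configured value for a backend, or None if unknown or unset."""
    env_var = ENV_VAR_BY_DATABASE.get(database_type)
    if env_var is None:
        return None
    return os.getenv(env_var)
```

With a table like this, adding a backend would be a one-line change instead of another `elif` branch. Note that for Pinecone the value is an API key rather than a URI.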

Lines changed: 252 additions & 0 deletions
@@ -0,0 +1,252 @@
1+
# built-in dependencies
2+
import os
3+
import json
4+
import hashlib
5+
import struct
6+
import math
7+
from typing import Any, Dict, Optional, List, Union
8+
9+
# project dependencies
10+
from deepface.modules.database.types import Database
11+
from deepface.modules.modeling import build_model
12+
from deepface.commons.logger import Logger
13+
14+
logger = Logger()
15+
16+
17+
class PineconeClient(Database):
18+
"""
19+
Pinecone client for storing and retrieving face embeddings and indices.
20+
"""
21+
22+
def __init__(
23+
self,
24+
connection_details: Optional[Union[str, Dict[str, Any]]] = None,
25+
connection: Any = None,
26+
):
27+
try:
28+
from pinecone import Pinecone, ServerlessSpec
29+
except (ModuleNotFoundError, ImportError) as e:
30+
raise ValueError(
31+
"pinecone is an optional dependency. Install with 'pip install pinecone'"
32+
) from e
33+
34+
self.pinecone = Pinecone
35+
self.serverless_spec = ServerlessSpec
36+
37+
if connection is not None:
38+
self.client = connection
39+
else:
40+
self.conn_details = connection_details or os.environ.get("DEEPFACE_PINECONE_API_KEY")
41+
if not isinstance(self.conn_details, str):
42+
raise ValueError(
43+
"Pinecone api key must be provided as a string in connection_details "
44+
"or via DEEPFACE_PINECONE_API_KEY environment variable."
45+
)
46+
47+
self.client = self.pinecone(api_key=self.conn_details)
48+
49+
def initialize_database(self, **kwargs: Any) -> None:
50+
"""
51+
Ensure Pinecone index exists.
52+
"""
53+
model_name = kwargs.get("model_name", "VGG-Face")
54+
detector_backend = kwargs.get("detector_backend", "opencv")
55+
aligned = kwargs.get("aligned", True)
56+
l2_normalized = kwargs.get("l2_normalized", False)
57+
58+
index_name = self.__generate_index_name(
59+
model_name, detector_backend, aligned, l2_normalized
60+
)
61+
62+
if self.client.has_index(index_name):
63+
logger.debug(f"Pinecone index '{index_name}' already exists.")
64+
return
65+
66+
model = build_model(task="facial_recognition", model_name=model_name)
67+
dimensions = model.output_shape
68+
similarity_function = "cosine" if l2_normalized else "euclidean"
69+
70+
self.client.create_index(
71+
name=index_name,
72+
dimension=dimensions,
73+
metric=similarity_function,
74+
spec=self.serverless_spec(
75+
cloud=os.getenv("DEEPFACE_PINECONE_CLOUD", "aws"),
76+
region=os.getenv("DEEPFACE_PINECONE_REGION", "us-east-1"),
77+
),
78+
)
79+
logger.debug(f"Created Pinecone index '{index_name}' with dimension {dimensions}.")
80+
81+
def insert_embeddings(self, embeddings: List[Dict[str, Any]], batch_size: int = 100) -> int:
82+
"""
83+
Insert embeddings into Pinecone database in batches.
84+
"""
85+
if not embeddings:
86+
raise ValueError("No embeddings to insert.")
87+
88+
self.initialize_database(
89+
model_name=embeddings[0]["model_name"],
90+
detector_backend=embeddings[0]["detector_backend"],
91+
aligned=embeddings[0]["aligned"],
92+
l2_normalized=embeddings[0]["l2_normalized"],
93+
)
94+
95+
index_name = self.__generate_index_name(
96+
embeddings[0]["model_name"],
97+
embeddings[0]["detector_backend"],
98+
embeddings[0]["aligned"],
99+
embeddings[0]["l2_normalized"],
100+
)
101+
102+
# connect to the index
103+
index = self.client.Index(index_name)
104+
105+
total = 0
106+
for i in range(0, len(embeddings), batch_size):
107+
batch = embeddings[i : i + batch_size]
108+
vectors = []
109+
for e in batch:
110+
face_json = json.dumps(e["face"].tolist())
111+
face_hash = hashlib.sha256(face_json.encode()).hexdigest()
112+
embedding_bytes = struct.pack(f'{len(e["embedding"])}d', *e["embedding"])
113+
embedding_hash = hashlib.sha256(embedding_bytes).hexdigest()
114+
115+
vectors.append(
116+
{
117+
"id": f"{face_hash}:{embedding_hash}",
118+
"values": e["embedding"],
119+
"metadata": {
120+
"img_name": e["img_name"],
121+
# "face": e["face"].tolist(),
122+
# "face_shape": list(e["face"].shape),
123+
},
124+
}
125+
)
126+
index.upsert(vectors=vectors)
127+
total += len(vectors)
128+
129+
return total
130+
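
Review note: `insert_embeddings` derives each vector id from content (a SHA-256 of the face pixels joined with a SHA-256 of the packed embedding) and sends vectors in slices of `batch_size`. The sketch below isolates those two steps so they can be checked without a Pinecone account. It assumes the face is already a nested list (the diff calls `.tolist()` on an array first), and the names `make_vector_id` and `batched` are hypothetical:

```python
import hashlib
import json
import struct
from typing import Iterator, List, Sequence


def make_vector_id(face: List[List[float]], embedding: Sequence[float]) -> str:
    """Build a content-derived id: sha256(face JSON) + ':' + sha256(packed embedding)."""
    face_hash = hashlib.sha256(json.dumps(face).encode()).hexdigest()
    # Pack the embedding as native-order doubles, as the diff does with struct.pack
    embedding_bytes = struct.pack(f"{len(embedding)}d", *embedding)
    embedding_hash = hashlib.sha256(embedding_bytes).hexdigest()
    return f"{face_hash}:{embedding_hash}"


def batched(items: Sequence, batch_size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start : start + batch_size]
```

Because the id depends only on the face and the embedding, registering the same pair twice should produce the same id, so an upsert would overwrite the existing record rather than create a duplicate. The returned `total` still counts every vector sent, so it may exceed the number of new records in that case.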
131+
def search_by_vector(
132+
self,
133+
vector: List[float],
134+
model_name: str = "VGG-Face",
135+
detector_backend: str = "opencv",
136+
aligned: bool = True,
137+
l2_normalized: bool = False,
138+
limit: int = 10,
139+
) -> List[Dict[str, Any]]:
140+
"""
141+
ANN search using the main vector (embedding).
142+
"""
143+
out: List[Dict[str, Any]] = []
144+
145+
self.initialize_database(
146+
model_name=model_name,
147+
detector_backend=detector_backend,
148+
aligned=aligned,
149+
l2_normalized=l2_normalized,
150+
)
151+
152+
index_name = self.__generate_index_name(
153+
model_name, detector_backend, aligned, l2_normalized
154+
)
155+
156+
index = self.client.Index(index_name)
157+
results = index.query(
158+
vector=vector,
159+
top_k=limit,
160+
include_metadata=True,
161+
include_values=False,
162+
)
163+
164+
if not results.matches:
165+
return out
166+
167+
for res in results.matches:
168+
score = float(res.score)
169+
if l2_normalized:
170+
distance = 1 - score
171+
else:
172+
distance = math.sqrt(max(score, 0.0))
173+
174+
out.append(
175+
{
176+
"id": res.id,
177+
"distance": distance,
178+
"img_name": res.metadata.get("img_name"),
179+
}
180+
)
181+
return out
182+
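
Review note: the loop in `search_by_vector` turns Pinecone's score into a distance. The `math.sqrt` call suggests the euclidean metric is treated as returning a squared distance, and `1 - score` treats the cosine score as a similarity. Both are assumptions read off the diff and should be checked against the Pinecone metric documentation. A standalone sketch of the conversion, where `score_to_distance` and `squared_euclidean` are hypothetical names:

```python
import math
from typing import Sequence


def score_to_distance(score: float, l2_normalized: bool) -> float:
    """Mirror the conversion in the diff for the two index metrics."""
    if l2_normalized:
        # cosine index: score is a similarity, so distance = 1 - similarity
        return 1.0 - score
    # euclidean index: score is assumed to be a squared distance;
    # clamp small negative values before taking the square root
    return math.sqrt(max(score, 0.0))


def squared_euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared L2 distance, used here only to produce a sample score."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


sample_score = squared_euclidean([0.0, 0.0], [3.0, 4.0])  # 25.0
print(score_to_distance(sample_score, l2_normalized=False))  # 5.0
```

The `max(score, 0.0)` clamp guards against a `ValueError` from `math.sqrt` if floating-point error ever yields a slightly negative score.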
183+
def fetch_all_embeddings(
184+
self,
185+
model_name: str,
186+
detector_backend: str,
187+
aligned: bool,
188+
l2_normalized: bool,
189+
batch_size: int = 1000,
190+
) -> List[Dict[str, Any]]:
191+
"""
192+
Fetch all embeddings from Pinecone database in batches.
193+
"""
194+
out: List[Dict[str, Any]] = []
195+
196+
self.initialize_database(
197+
model_name=model_name,
198+
detector_backend=detector_backend,
199+
aligned=aligned,
200+
l2_normalized=l2_normalized,
201+
)
202+
203+
index_name = self.__generate_index_name(
204+
model_name, detector_backend, aligned, l2_normalized
205+
)
206+
207+
index = self.client.Index(index_name)
208+
209+
# Fetch all IDs
210+
ids: List[str] = []
211+
for _id in index.list():
212+
ids.extend(_id)
213+
214+
for i in range(0, len(ids), batch_size):
215+
batch_ids = ids[i : i + batch_size]
216+
fetched = index.fetch(ids=batch_ids)
217+
for _id, v in fetched.get("vectors", {}).items():
218+
md = v.get("metadata") or {}
219+
out.append(
220+
{
221+
"id": _id,
222+
"embedding": v.get("values"),
223+
"img_name": md.get("img_name"),
224+
"face_hash": md.get("face_hash"),
225+
"embedding_hash": md.get("embedding_hash"),
226+
}
227+
)
228+
229+
return out
230+
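
Review note: `fetch_all_embeddings` reads `face_hash` and `embedding_hash` from metadata, but `insert_embeddings` in this same file stores only `img_name` there, so both fields would come back as `None`. Since the id is built as `face_hash:embedding_hash`, the hashes could instead be recovered from the id. The sketch below shows that, along with the page flattening implied by `ids.extend(_id)`. It assumes the listing call yields one list of ids per page, and the helper names are hypothetical:

```python
from typing import Dict, Iterable, List


def flatten_id_pages(pages: Iterable[List[str]]) -> List[str]:
    """Collect ids from an iterator that yields one list of ids per page."""
    ids: List[str] = []
    for page in pages:
        ids.extend(page)
    return ids


def split_vector_id(vector_id: str) -> Dict[str, str]:
    """Recover both hashes from an id of the form '<face_hash>:<embedding_hash>'."""
    face_hash, embedding_hash = vector_id.split(":", 1)
    return {"face_hash": face_hash, "embedding_hash": embedding_hash}
```

Whether to derive the hashes from the id or to write them into metadata at insert time is a design choice for the author; either would make the returned fields non-empty.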
231+
def close(self) -> None:
232+
"""Pinecone client does not require explicit closure"""
233+
return
234+
235+
@staticmethod
236+
def __generate_index_name(
237+
model_name: str,
238+
detector_backend: str,
239+
aligned: bool,
240+
l2_normalized: bool,
241+
) -> str:
242+
"""
243+
Generate Pinecone index name based on parameters.
244+
"""
245+
index_name_attributes = [
246+
"embeddings",
247+
model_name.replace("-", ""),
248+
detector_backend,
249+
"Aligned" if aligned else "Unaligned",
250+
"Norm" if l2_normalized else "Raw",
251+
]
252+
return "-".join(index_name_attributes).lower()
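
Review note: each combination of model, detector, alignment and normalization maps to its own Pinecone index. A standalone copy of the naming logic from `__generate_index_name`, written as a plain function (the name `generate_index_name` is used here only for illustration), makes the resulting names easy to inspect:

```python
def generate_index_name(
    model_name: str,
    detector_backend: str,
    aligned: bool,
    l2_normalized: bool,
) -> str:
    """Join the configuration fields with hyphens and lowercase the result."""
    parts = [
        "embeddings",
        model_name.replace("-", ""),  # drop hyphens so they only separate fields
        detector_backend,
        "Aligned" if aligned else "Unaligned",
        "Norm" if l2_normalized else "Raw",
    ]
    return "-".join(parts).lower()


print(generate_index_name("VGG-Face", "opencv", True, False))
# embeddings-vggface-opencv-aligned-raw
```

Because the normalization flag is part of the name, cosine-metric and euclidean-metric indexes for the same model never collide.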
