fix: Faiss read/write failure on Windows with non-ASCII user paths (reopened)#8323
fix: Faiss read/write failure on Windows with non-ASCII user paths (reopened)#8323irmia2026 wants to merge 2 commits into
Conversation
Bridge Faiss C++ fopen() ANSI codepage limitation through pure ASCII temp files using Python shutil. Also fix dtype=np.int64 for IDs, vector.reshape for search, and remove incorrect normalize_L2 on IndexFlatL2.
There was a problem hiding this comment.
Hey - I've found 2 issues, and left some high level feedback:
- In
_read_indexanddelete, swallowing allRuntimeErrorexceptions without logging or re-raising can hide genuine issues; consider at least logging the exception or narrowing the exception handling to the specific error patterns you expect (e.g., path-encoding failures or missing IDs). - For consistency with
insertandinsert_batch, it would be helpful forsearchto validate the query vector’s dimensionality againstself.dimensionbefore callingindex.search, to fail fast with a clearer error message on mismatched shapes.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- In `_read_index` and `delete`, swallowing all `RuntimeError` exceptions without logging or re-raising can hide genuine issues; consider at least logging the exception or narrowing the exception handling to the specific error patterns you expect (e.g., path-encoding failures or missing IDs).
- For consistency with `insert` and `insert_batch`, it would be helpful for `search` to validate the query vector’s dimensionality against `self.dimension` before calling `index.search`, to fail fast with a clearer error message on mismatched shapes.
## Individual Comments
### Comment 1
<location path="astrbot/core/db/vec_db/faiss_impl/embedding_storage.py" line_range="136-139" />
<code_context>
+ self.index.add_with_ids(vectors, np.array(ids, dtype=np.int64))
await self.save_index()
async def search(self, vector: np.ndarray, k: int) -> tuple:
- """搜索最相似的向量
-
- Args:
- vector (np.ndarray): 查询向量
- k (int): 返回的最相似向量的数量
- Returns:
- tuple: (距离, 索引)
-
- """
+ """搜索向量"""
assert self.index is not None, "FAISS index is not initialized."
- faiss.normalize_L2(vector)
- distances, indices = self.index.search(vector, k)
+ distances, indices = self.index.search(vector.reshape(1, -1), k)
return distances, indices
</code_context>
<issue_to_address>
**issue:** Restrict the accepted shape for `vector` in `search` to avoid unintended reshaping.
Because `search` now blindly applies `vector.reshape(1, -1)`, passing a 2D array with multiple rows will be flattened into one long vector, silently changing semantics and likely breaking alignment with `self.dimension`. It would be safer to enforce that `vector` is 1D (or already shape `(1, d)`), e.g. by checking `vector.ndim` and `vector.shape[-1] == self.dimension` and raising a clear error otherwise.
</issue_to_address>
### Comment 2
<location path="astrbot/core/db/vec_db/faiss_impl/embedding_storage.py" line_range="21" />
<code_context>
+# 本模块通过"纯 ASCII 临时文件桥接"规避此问题。
+
+
+def _safe_temp_dir() -> str:
+ """返回保证纯 ASCII 且可写的临时目录,用于 Faiss I/O 桥接。
+
</code_context>
<issue_to_address>
**issue (complexity):** Consider simplifying the Windows non-ASCII Faiss I/O bridging helpers so they only activate when necessary and use a more minimal temp file/exception-handling approach.
You can keep the new functionality and reduce complexity by:
1. Only bridging when it’s actually needed (Windows + non‑ASCII path).
2. Using a single, simple temp dir strategy.
3. Using simpler temp file APIs without extra UUID logic.
4. Narrowing the impact of broad exception handling.
Below are focused suggestions.
---
### 1. Simplify temp directory + temp file creation
You don’t need multiple candidates and a UUID on top of `mkstemp`’s uniqueness.
```python
def _safe_temp_dir() -> str:
"""Return a writable ASCII-only temp dir for Windows Faiss I/O bridging."""
if os.name == "nt":
# SystemRoot\Temp is usually ASCII & writable
root = os.environ.get("SystemRoot", r"C:\Windows")
temp_dir = os.path.join(root, "Temp")
if temp_dir.isascii() and os.path.isdir(temp_dir) and os.access(temp_dir, os.W_OK):
return temp_dir
# Fallback: tempfile.gettempdir() if ASCII
tmp = tempfile.gettempdir()
if tmp.isascii():
return tmp
# As a last resort, current working dir if ASCII
cwd = os.getcwd()
if cwd.isascii():
return cwd
raise OSError("无法找到可写的纯 ASCII 临时目录用于 Faiss I/O.")
else:
return tempfile.gettempdir()
```
```python
def _make_temp_file(prefix: str) -> str:
"""Create a temp file for Faiss bridging and return its path."""
safe_dir = _safe_temp_dir()
fd, path = tempfile.mkstemp(prefix=f"{prefix}_", suffix=".faiss", dir=safe_dir)
os.close(fd)
return path
```
If you’re comfortable dropping the CWD fallback, you can shrink `_safe_temp_dir` further to just `SystemRoot\Temp` + `tempfile.gettempdir()`.
---
### 2. Add a small helper to decide when bridging is needed
Centralize the “is this path problematic?” logic:
```python
def _needs_bridge(path: str) -> bool:
return os.name == "nt" and not path.isascii()
```
This keeps Windows‑specific handling local and makes the conditions obvious.
---
### 3. Narrow when `_read_index` uses the temp bridge
Avoid falling back for every `RuntimeError`; only do it when bridging is plausible:
```python
@staticmethod
def _read_index(path: str) -> "faiss.Index":
"""读取 Faiss 索引,兼容含非 ASCII 字符的 Windows 路径。"""
try:
return faiss.read_index(path)
except RuntimeError:
# 仅在 Windows + 非 ASCII 路径时尝试桥接
if not _needs_bridge(path):
raise
tmp = _make_temp_file("_faiss_read")
try:
shutil.copy2(path, tmp)
return faiss.read_index(tmp)
finally:
try:
os.remove(tmp)
except OSError:
pass
```
This keeps real index errors visible on non‑Windows or ASCII paths.
---
### 4. Only bridge on write when required
You can keep the atomic write behavior for the non‑ASCII Windows case while using the simple path otherwise:
```python
@staticmethod
def _write_index(index: "faiss.Index", path: str) -> None:
"""保存 Faiss 索引,兼容含非 ASCII 字符的 Windows 路径。"""
dirname = os.path.dirname(path)
if dirname:
os.makedirs(dirname, exist_ok=True)
if not _needs_bridge(path):
# Simple path for non‑Windows or ASCII paths
faiss.write_index(index, path)
return
tmp = _make_temp_file("_faiss_write")
try:
faiss.write_index(index, tmp)
shutil.move(tmp, path)
finally:
try:
os.remove(tmp)
except OSError:
pass
```
Now:
- Non‑Windows platforms keep the straightforward behavior.
- Windows ASCII paths also stay simple.
- The bridging complexity is only paid when it’s actually needed.
---
These changes keep your functionality (Windows non‑ASCII path support, temporary bridging, cleanup) but localize complexity and reduce branching, making the Faiss wrapper easier to reason about.
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
There was a problem hiding this comment.
Code Review
This pull request introduces a workaround for Faiss C++ path encoding issues on Windows by implementing a temporary file bridge using ASCII-only paths for index reading and writing. It also enhances the robustness of the EmbeddingStorage class by enforcing int64 data types for IDs, adding input shape validation for batch insertions, and refining search and delete operations. Feedback was provided suggesting that logging should be added when catching a RuntimeError during ID removal to avoid swallowing potential issues silently.
| try: | ||
| self.index.remove_ids(np.array(ids, dtype=np.int64)) | ||
| except RuntimeError: | ||
| pass |
There was a problem hiding this comment.
Swallowing RuntimeError during remove_ids without any logging can make it difficult to diagnose issues if the operation fails for unexpected reasons (e.g., index corruption or internal Faiss errors). While IndexIDMap generally supports removal, it is better practice to log a warning when an exception is caught.
| try: | |
| self.index.remove_ids(np.array(ids, dtype=np.int64)) | |
| except RuntimeError: | |
| pass | |
| try: | |
| self.index.remove_ids(np.array(ids, dtype=np.int64)) | |
| except RuntimeError: | |
| # Consider logging the error here if a logger is available | |
| pass |
…en needed Addresses bot review feedback on PR AstrBotDevs#8323: - Add _needs_bridge() helper to activate temp file bridge only on Windows + non-ASCII paths (Sourcery AstrBotDevs#5) - _read_index: re-raise RuntimeError when bridging not needed, preventing silent swallowing of genuine Faiss errors (Sourcery AstrBotDevs#1) - _write_index: skip temp file for ASCII/non-Windows paths (Sourcery AstrBotDevs#5) - search(): validate ndim==1 and dimension before reshape, preventing silent semantic corruption on 2D input (Sourcery AstrBotDevs#3, AstrBotDevs#4) - _safe_temp_dir & _make_temp_file: simplify (Sourcery AstrBotDevs#6, AstrBotDevs#7) - Remove redundant CWD fallback (never reached on non-ASCII paths) - Remove redundant UUID prefix (mkstemp O_EXCL guarantees uniqueness) All changes tested: 119/119 pass covering bridge logic, ASCII/non-ASCII paths, concurrent temp file uniqueness, search validation, and exception propagation.
|
Thanks for the update, @irmia2026. The implementation of the ASCII-only temporary file bridge for Faiss I/O on Windows is a robust solution to the reported encoding issues. The added unit tests in |
…en needed Addresses bot review feedback on PR AstrBotDevs#8323: - Add _needs_bridge() helper to activate temp file bridge only on Windows + non-ASCII paths (Sourcery AstrBotDevs#5) - _read_index: re-raise RuntimeError when bridging not needed, preventing silent swallowing of genuine Faiss errors (Sourcery AstrBotDevs#1) - _write_index: skip temp file for ASCII/non-Windows paths (Sourcery AstrBotDevs#5) - search(): validate ndim==1 and dimension before reshape, preventing silent semantic corruption on 2D input (Sourcery AstrBotDevs#3, AstrBotDevs#4) - _safe_temp_dir & _make_temp_file: simplify (Sourcery AstrBotDevs#6, AstrBotDevs#7) - Remove redundant CWD fallback (never reached on non-ASCII paths) - Remove redundant UUID prefix (mkstemp O_EXCL guarantees uniqueness) All changes tested: 119/119 pass covering bridge logic, ASCII/non-ASCII paths, concurrent temp file uniqueness, search validation, and exception propagation.
df7dd28 to
31ff6b1
Compare
|
You're welcome, @irmia2026. |
…en needed Addresses bot review feedback on PR AstrBotDevs#8323: - Add _needs_bridge() helper to activate temp file bridge only on Windows + non-ASCII paths (Sourcery AstrBotDevs#5) - _read_index: re-raise RuntimeError when bridging not needed, preventing silent swallowing of genuine Faiss errors (Sourcery AstrBotDevs#1) - _write_index: skip temp file for ASCII/non-Windows paths (Sourcery AstrBotDevs#5) - search(): validate ndim==1 and dimension before reshape, preventing silent semantic corruption on 2D input (Sourcery AstrBotDevs#3, AstrBotDevs#4) - _safe_temp_dir & _make_temp_file: simplify (Sourcery AstrBotDevs#6, AstrBotDevs#7) - Remove redundant CWD fallback (never reached on non-ASCII paths) - Remove redundant UUID prefix (mkstemp O_EXCL guarantees uniqueness) All changes tested: 119/119 pass covering bridge logic, ASCII/non-ASCII paths, concurrent temp file uniqueness, search validation, and exception propagation.
31ff6b1 to
54607b7
Compare
|
Thanks @sourcery-ai and @gemini-code-assist! All suggestions addressed in 54607b7: |
|
Great to hear that the changes are working as expected, @irmia2026. The ASCII-only temporary file bridge is a solid approach for handling Windows path encoding issues with Faiss. Since all concerns have been addressed, this looks good to go. |
Summary
Fix Faiss index read/write failure on Windows when the path contains non-ASCII characters (Chinese, Japanese, Korean, Cyrillic, etc.).
Root Cause
Faiss C++
fopen()uses ANSI codepage on Windows, but Python passes paths as UTF-8 bytes. Non-ASCII characters get garbled →RuntimeError: could not open.Fix
Bridge Faiss I/O through pure ASCII temp files using Python's
shutil(which uses Windows wide-charCreateFileWAPI):_safe_temp_dir()— guaranteed pure-ASCII writable dir (C:\WINDOWS\Tempon Windows)_make_temp_file()— unique temp files with UUID for thread/coroutine safety_read_index()— direct read first, temp file copy fallback_write_index()— temp file write thenshutil.move(atomic on same drive)Additionally fixes:
insert()/insert_batch(): explicitdtype=np.int64for IDssearch(): usevector.reshape(1, -1)for consistencysearch(): removed incorrectly appliedfaiss.normalize_L2onIndexFlatL2save_index(): guard againstNoneindex andNonepathWhy Reopened
Original PR (#8185) was accidentally closed when the fork repository was temporarily set to private. This resubmits the same fix against the correct upstream
AstrBotDevs/AstrBot.Summary by Sourcery
Fix Faiss index persistence on Windows when file paths contain non-ASCII characters by routing read/write operations through ASCII-only temporary files and updating related index operations for robustness.
Bug Fixes:
Enhancements: