Skip to content

fix: Faiss read/write failure on Windows with non-ASCII user paths (reopened)#8323

Open
irmia2026 wants to merge 2 commits into
AstrBotDevs:masterfrom
irmia2026:fix/faiss-unicode-path
Open

fix: Faiss read/write failure on Windows with non-ASCII user paths (reopened)#8323
irmia2026 wants to merge 2 commits into
AstrBotDevs:masterfrom
irmia2026:fix/faiss-unicode-path

Conversation

@irmia2026
Copy link
Copy Markdown

@irmia2026 irmia2026 commented May 25, 2026

Summary

Fix Faiss index read/write failure on Windows when the path contains non-ASCII characters (Chinese, Japanese, Korean, Cyrillic, etc.).

Root Cause

Faiss C++ fopen() uses ANSI codepage on Windows, but Python passes paths as UTF-8 bytes. Non-ASCII characters get garbled → RuntimeError: could not open.

Fix

Bridge Faiss I/O through pure ASCII temp files using Python's shutil (which uses Windows wide-char CreateFileW API):

  1. _safe_temp_dir() — guaranteed pure-ASCII writable dir (C:\WINDOWS\Temp on Windows)
  2. _make_temp_file() — unique temp files with UUID for thread/coroutine safety
  3. _read_index() — direct read first, temp file copy fallback
  4. _write_index() — temp file write then shutil.move (atomic on same drive)

Additionally fixes:

  • insert()/insert_batch(): explicit dtype=np.int64 for IDs
  • search(): use vector.reshape(1, -1) for consistency
  • search(): removed incorrectly applied faiss.normalize_L2 on IndexFlatL2
  • save_index(): guard against None index and None path

Why Reopened

Original PR (#8185) was accidentally closed when the fork repository was temporarily set to private. This resubmits the same fix against the correct upstream AstrBotDevs/AstrBot.

Summary by Sourcery

Fix Faiss index persistence on Windows when file paths contain non-ASCII characters by routing read/write operations through ASCII-only temporary files and updating related index operations for robustness.

Bug Fixes:

  • Resolve Faiss index read/write failures on Windows user directories with non-ASCII characters by adding an ASCII-only temporary file bridge.
  • Ensure insert and insert_batch use int64 ID arrays to avoid type-related issues with Faiss.
  • Fix search behavior by reshaping query vectors consistently and removing inappropriate L2 normalization for IndexFlatL2.
  • Prevent errors when saving by skipping save_index if the index or target path is not set.
  • Make delete operations more resilient by safely handling Faiss RuntimeError when removing IDs.

Enhancements:

  • Introduce helper utilities to locate a safe ASCII-only temp directory and create unique temp files for Faiss index I/O.
  • Simplify and tighten EmbeddingStorage method docstrings and validation for vector dimensionality.

Bridge Faiss C++ fopen() ANSI codepage limitation through
pure ASCII temp files using Python shutil. Also fix dtype=np.int64
for IDs, vector.reshape for search, and remove incorrect
normalize_L2 on IndexFlatL2.
@dosubot dosubot Bot added size:L This PR changes 100-499 lines, ignoring generated files. area:core The bug / feature is about astrbot's core, backend labels May 25, 2026
Copy link
Copy Markdown
Contributor

@sourcery-ai sourcery-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey - I've found 2 issues, and left some high level feedback:

  • In _read_index and delete, swallowing all RuntimeError exceptions without logging or re-raising can hide genuine issues; consider at least logging the exception or narrowing the exception handling to the specific error patterns you expect (e.g., path-encoding failures or missing IDs).
  • For consistency with insert and insert_batch, it would be helpful for search to validate the query vector’s dimensionality against self.dimension before calling index.search, to fail fast with a clearer error message on mismatched shapes.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- In `_read_index` and `delete`, swallowing all `RuntimeError` exceptions without logging or re-raising can hide genuine issues; consider at least logging the exception or narrowing the exception handling to the specific error patterns you expect (e.g., path-encoding failures or missing IDs).
- For consistency with `insert` and `insert_batch`, it would be helpful for `search` to validate the query vector’s dimensionality against `self.dimension` before calling `index.search`, to fail fast with a clearer error message on mismatched shapes.

## Individual Comments

### Comment 1
<location path="astrbot/core/db/vec_db/faiss_impl/embedding_storage.py" line_range="136-139" />
<code_context>
+        self.index.add_with_ids(vectors, np.array(ids, dtype=np.int64))
         await self.save_index()

     async def search(self, vector: np.ndarray, k: int) -> tuple:
-        """搜索最相似的向量
-
-        Args:
-            vector (np.ndarray): 查询向量
-            k (int): 返回的最相似向量的数量
-        Returns:
-            tuple: (距离, 索引)
-
-        """
+        """搜索向量"""
         assert self.index is not None, "FAISS index is not initialized."
-        faiss.normalize_L2(vector)
-        distances, indices = self.index.search(vector, k)
+        distances, indices = self.index.search(vector.reshape(1, -1), k)
         return distances, indices

</code_context>
<issue_to_address>
**issue:** Restrict the accepted shape for `vector` in `search` to avoid unintended reshaping.

Because `search` now blindly applies `vector.reshape(1, -1)`, passing a 2D array with multiple rows will be flattened into one long vector, silently changing semantics and likely breaking alignment with `self.dimension`. It would be safer to enforce that `vector` is 1D (or already shape `(1, d)`), e.g. by checking `vector.ndim` and `vector.shape[-1] == self.dimension` and raising a clear error otherwise.
</issue_to_address>

### Comment 2
<location path="astrbot/core/db/vec_db/faiss_impl/embedding_storage.py" line_range="21" />
<code_context>
+# 本模块通过"纯 ASCII 临时文件桥接"规避此问题。
+
+
+def _safe_temp_dir() -> str:
+    """返回保证纯 ASCII 且可写的临时目录,用于 Faiss I/O 桥接。
+
</code_context>
<issue_to_address>
**issue (complexity):** Consider simplifying the Windows non-ASCII Faiss I/O bridging helpers so they only activate when necessary and use a more minimal temp file/exception-handling approach.

You can keep the new functionality and reduce complexity by:

1. Only bridging when it’s actually needed (Windows + non‑ASCII path).
2. Using a single, simple temp dir strategy.
3. Using simpler temp file APIs without extra UUID logic.
4. Narrowing the impact of broad exception handling.

Below are focused suggestions.

---

### 1. Simplify temp directory + temp file creation

You don’t need multiple candidates and a UUID on top of `mkstemp`’s uniqueness.

```python
def _safe_temp_dir() -> str:
    """Return a writable ASCII-only temp dir for Windows Faiss I/O bridging."""
    if os.name == "nt":
        # SystemRoot\Temp is usually ASCII & writable
        root = os.environ.get("SystemRoot", r"C:\Windows")
        temp_dir = os.path.join(root, "Temp")
        if temp_dir.isascii() and os.path.isdir(temp_dir) and os.access(temp_dir, os.W_OK):
            return temp_dir

        # Fallback: tempfile.gettempdir() if ASCII
        tmp = tempfile.gettempdir()
        if tmp.isascii():
            return tmp

        # As a last resort, current working dir if ASCII
        cwd = os.getcwd()
        if cwd.isascii():
            return cwd

        raise OSError("无法找到可写的纯 ASCII 临时目录用于 Faiss I/O.")
    else:
        return tempfile.gettempdir()
```

```python
def _make_temp_file(prefix: str) -> str:
    """Create a temp file for Faiss bridging and return its path."""
    safe_dir = _safe_temp_dir()
    fd, path = tempfile.mkstemp(prefix=f"{prefix}_", suffix=".faiss", dir=safe_dir)
    os.close(fd)
    return path
```

If you’re comfortable dropping the CWD fallback, you can shrink `_safe_temp_dir` further to just `SystemRoot\Temp` + `tempfile.gettempdir()`.

---

### 2. Add a small helper to decide when bridging is needed

Centralize the “is this path problematic?” logic:

```python
def _needs_bridge(path: str) -> bool:
    return os.name == "nt" and not path.isascii()
```

This keeps Windows‑specific handling local and makes the conditions obvious.

---

### 3. Narrow when `_read_index` uses the temp bridge

Avoid falling back for every `RuntimeError`; only do it when bridging is plausible:

```python
@staticmethod
def _read_index(path: str) -> "faiss.Index":
    """读取 Faiss 索引,兼容含非 ASCII 字符的 Windows 路径。"""
    try:
        return faiss.read_index(path)
    except RuntimeError:
        # 仅在 Windows + 非 ASCII 路径时尝试桥接
        if not _needs_bridge(path):
            raise

    tmp = _make_temp_file("_faiss_read")
    try:
        shutil.copy2(path, tmp)
        return faiss.read_index(tmp)
    finally:
        try:
            os.remove(tmp)
        except OSError:
            pass
```

This keeps real index errors visible on non‑Windows or ASCII paths.

---

### 4. Only bridge on write when required

You can keep the atomic write behavior for the non‑ASCII Windows case while using the simple path otherwise:

```python
@staticmethod
def _write_index(index: "faiss.Index", path: str) -> None:
    """保存 Faiss 索引,兼容含非 ASCII 字符的 Windows 路径。"""
    dirname = os.path.dirname(path)
    if dirname:
        os.makedirs(dirname, exist_ok=True)

    if not _needs_bridge(path):
        # Simple path for non‑Windows or ASCII paths
        faiss.write_index(index, path)
        return

    tmp = _make_temp_file("_faiss_write")
    try:
        faiss.write_index(index, tmp)
        shutil.move(tmp, path)
    finally:
        try:
            os.remove(tmp)
        except OSError:
            pass
```

Now:

- Non‑Windows platforms keep the straightforward behavior.
- Windows ASCII paths also stay simple.
- The bridging complexity is only paid when it’s actually needed.

---

These changes keep your functionality (Windows non‑ASCII path support, temporary bridging, cleanup) but localize complexity and reduce branching, making the Faiss wrapper easier to reason about.
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

Comment thread astrbot/core/db/vec_db/faiss_impl/embedding_storage.py
Comment thread astrbot/core/db/vec_db/faiss_impl/embedding_storage.py
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a workaround for Faiss C++ path encoding issues on Windows by implementing a temporary file bridge using ASCII-only paths for index reading and writing. It also enhances the robustness of the EmbeddingStorage class by enforcing int64 data types for IDs, adding input shape validation for batch insertions, and refining search and delete operations. Feedback was provided suggesting that logging should be added when catching a RuntimeError during ID removal to avoid swallowing potential issues silently.

Comment on lines +145 to +148
try:
self.index.remove_ids(np.array(ids, dtype=np.int64))
except RuntimeError:
pass
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Swallowing RuntimeError during remove_ids without any logging can make it difficult to diagnose issues if the operation fails for unexpected reasons (e.g., index corruption or internal Faiss errors). While IndexIDMap generally supports removal, it is better practice to log a warning when an exception is caught.

Suggested change
try:
self.index.remove_ids(np.array(ids, dtype=np.int64))
except RuntimeError:
pass
try:
self.index.remove_ids(np.array(ids, dtype=np.int64))
except RuntimeError:
# Consider logging the error here if a logger is available
pass

irmia2026 added a commit to irmia2026/AstrBot that referenced this pull request May 25, 2026
…en needed

Addresses bot review feedback on PR AstrBotDevs#8323:
- Add _needs_bridge() helper to activate temp file bridge only on
  Windows + non-ASCII paths (Sourcery AstrBotDevs#5)
- _read_index: re-raise RuntimeError when bridging not needed,
  preventing silent swallowing of genuine Faiss errors (Sourcery AstrBotDevs#1)
- _write_index: skip temp file for ASCII/non-Windows paths (Sourcery AstrBotDevs#5)
- search(): validate ndim==1 and dimension before reshape,
  preventing silent semantic corruption on 2D input (Sourcery AstrBotDevs#3, AstrBotDevs#4)
- _safe_temp_dir & _make_temp_file: simplify (Sourcery AstrBotDevs#6, AstrBotDevs#7)
  - Remove redundant CWD fallback (never reached on non-ASCII paths)
  - Remove redundant UUID prefix (mkstemp O_EXCL guarantees uniqueness)

All changes tested: 119/119 pass covering bridge logic,
ASCII/non-ASCII paths, concurrent temp file uniqueness,
search validation, and exception propagation.
@gemini-code-assist
Copy link
Copy Markdown
Contributor

Thanks for the update, @irmia2026. The implementation of the ASCII-only temporary file bridge for Faiss I/O on Windows is a robust solution to the reported encoding issues. The added unit tests in test_embedding_storage_fixes.py provide good coverage for these changes. I have no further comments on this implementation.

irmia2026 added a commit to irmia2026/AstrBot that referenced this pull request May 25, 2026
…en needed

Addresses bot review feedback on PR AstrBotDevs#8323:
- Add _needs_bridge() helper to activate temp file bridge only on
  Windows + non-ASCII paths (Sourcery AstrBotDevs#5)
- _read_index: re-raise RuntimeError when bridging not needed,
  preventing silent swallowing of genuine Faiss errors (Sourcery AstrBotDevs#1)
- _write_index: skip temp file for ASCII/non-Windows paths (Sourcery AstrBotDevs#5)
- search(): validate ndim==1 and dimension before reshape,
  preventing silent semantic corruption on 2D input (Sourcery AstrBotDevs#3, AstrBotDevs#4)
- _safe_temp_dir & _make_temp_file: simplify (Sourcery AstrBotDevs#6, AstrBotDevs#7)
  - Remove redundant CWD fallback (never reached on non-ASCII paths)
  - Remove redundant UUID prefix (mkstemp O_EXCL guarantees uniqueness)

All changes tested: 119/119 pass covering bridge logic,
ASCII/non-ASCII paths, concurrent temp file uniqueness,
search validation, and exception propagation.
@irmia2026 irmia2026 force-pushed the fix/faiss-unicode-path branch from df7dd28 to 31ff6b1 Compare May 25, 2026 06:48
@gemini-code-assist
Copy link
Copy Markdown
Contributor

You're welcome, @irmia2026.

…en needed

Addresses bot review feedback on PR AstrBotDevs#8323:
- Add _needs_bridge() helper to activate temp file bridge only on
  Windows + non-ASCII paths (Sourcery AstrBotDevs#5)
- _read_index: re-raise RuntimeError when bridging not needed,
  preventing silent swallowing of genuine Faiss errors (Sourcery AstrBotDevs#1)
- _write_index: skip temp file for ASCII/non-Windows paths (Sourcery AstrBotDevs#5)
- search(): validate ndim==1 and dimension before reshape,
  preventing silent semantic corruption on 2D input (Sourcery AstrBotDevs#3, AstrBotDevs#4)
- _safe_temp_dir & _make_temp_file: simplify (Sourcery AstrBotDevs#6, AstrBotDevs#7)
  - Remove redundant CWD fallback (never reached on non-ASCII paths)
  - Remove redundant UUID prefix (mkstemp O_EXCL guarantees uniqueness)

All changes tested: 119/119 pass covering bridge logic,
ASCII/non-ASCII paths, concurrent temp file uniqueness,
search validation, and exception propagation.
@irmia2026 irmia2026 force-pushed the fix/faiss-unicode-path branch from 31ff6b1 to 54607b7 Compare May 25, 2026 07:04
@irmia2026
Copy link
Copy Markdown
Author

Thanks @sourcery-ai and @gemini-code-assist! All suggestions addressed in 54607b7:

@gemini-code-assist
Copy link
Copy Markdown
Contributor

Great to hear that the changes are working as expected, @irmia2026. The ASCII-only temporary file bridge is a solid approach for handling Windows path encoding issues with Faiss. Since all concerns have been addressed, this looks good to go.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:core The bug / feature is about astrbot's core, backend size:L This PR changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant