fix: prevent session message loss in update_session_metadata#1709
Conversation
Use locked read-modify-write when merging session stats so a stale in-memory copy cannot overwrite messages written by add_message. Adds a regression test for the two-store race introduced with UI session telemetry. Co-authored-by: Mervin Praison <MervinPraison@users.noreply.github.com>
|
@coderabbitai review |
|
/review |
Qodo reviews are paused for this user.Troubleshooting steps vary by plan Learn more → On a Teams plan? Using GitHub Enterprise Server, GitLab Self-Managed, or Bitbucket Data Center? |
✅ Actions performedReview triggered.
|
|
Important Review skippedBot user detected. To trigger a single review, invoke the ⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
📝 WalkthroughWalkthroughThis PR refactors ChangesSession Metadata Atomic Update
Estimated code review effort🎯 2 (Simple) | ⏱️ ~12 minutes Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
@copilot Do a thorough review of this PR. Read ALL existing reviewer comments above from Qodo, Coderabbit, and Gemini first — incorporate their findings. Review areas:
|
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/praisonai-agents/praisonaiagents/session/store.py`:
- Around line 483-531: Add an async variant of this metadata-update routine
(e.g., async def update_session_metadata_async(session_id, **fields)) that
performs the blocking file operations off the event loop using asyncio.to_thread
or loop.run_in_executor (or use aiofiles if preferred), preserves the same
semantics (use self._get_session_path, FileLock for file-level locking,
SessionData.from_dict/to_dict, atomic tempfile + os.replace, and the same error
handling), and protect the in-memory cache using an asyncio.Lock (e.g.,
self._async_lock) instead of the thread-based self._lock when updating
self._cache; keep the sync method unchanged and ensure callers can opt into the
async variant.
- Line 485: The code uses "with FileLock(filepath, self.lock_timeout):" assuming
the context manager always acquires the lock, but FileLock.acquire() can return
False so the critical section may run unlocked; update the call site to
explicitly acquire and verify the lock before proceeding (call lock =
FileLock(filepath, self.lock_timeout); if not lock.acquire(): raise an exception
or return an error) and only enter (or execute the critical section) when
acquisition succeeded; ensure you release the lock in a finally block (or use
lock.release()) to avoid deadlocks and reference the same symbols FileLock,
filepath, self.lock_timeout, and the surrounding method where this context was
used.
In `@src/praisonai-agents/tests/unit/session/test_session_store.py`:
- Around line 286-307: The test test_update_session_metadata_preserves_messages
currently only manipulates DefaultSessionStore instances and must be extended
into a real agentic integration test: create or reuse an Agent instance and call
Agent.start() with a real prompt, ensure the agent invokes an actual LLM (not a
mock) and produces a text response, then verify the session store
(DefaultSessionStore) preserves messages across metadata updates (use
add_user_message, update_session_metadata, invalidate_cache, get_chat_history,
get_session) to assert the agent-generated response and subsequent user messages
remain intact and metadata (model, total_tokens) is set; ensure the test setup
provisions a real LLM client or integration test fixture so the call to
Agent.start() performs a live request rather than a mocked response.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: e0cbc31d-0371-4936-bb96-4272353f3a4e
📒 Files selected for processing (2)
src/praisonai-agents/praisonaiagents/session/store.pysrc/praisonai-agents/tests/unit/session/test_session_store.py
| filepath = self._get_session_path(session_id) | ||
|
|
||
| with FileLock(filepath, self.lock_timeout): | ||
| if os.path.exists(filepath): | ||
| try: | ||
| with open(filepath, "r", encoding="utf-8") as f: | ||
| data = json.load(f) | ||
| session = SessionData.from_dict(data) | ||
| except (json.JSONDecodeError, IOError): | ||
| session = SessionData(session_id=session_id) | ||
| else: | ||
| session = SessionData(session_id=session_id) | ||
|
|
||
| for key, value in fields.items(): | ||
| if value is None: | ||
| continue | ||
| session.metadata[key] = value | ||
| if key in ("agent_id", "agent_name", "user_id"): | ||
| setattr(session, key, value) | ||
| self._cache[session_id] = session | ||
| return self._save_session(session) | ||
|
|
||
| session.updated_at = datetime.now(timezone.utc).isoformat() | ||
|
|
||
| try: | ||
| dir_path = os.path.dirname(filepath) or "." | ||
| os.makedirs(dir_path, exist_ok=True) | ||
| with tempfile.NamedTemporaryFile( | ||
| mode="w", | ||
| encoding="utf-8", | ||
| dir=dir_path, | ||
| delete=False, | ||
| suffix=".tmp", | ||
| ) as f: | ||
| json.dump(session.to_dict(), f, indent=2, ensure_ascii=False) | ||
| temp_path = f.name | ||
|
|
||
| os.replace(temp_path, filepath) | ||
|
|
||
| with self._lock: | ||
| self._cache[session_id] = session | ||
|
|
||
| return True | ||
| except (IOError, OSError) as e: | ||
| logger.error(f"Failed to update session metadata {session_id}: {e}") | ||
| try: | ||
| if "temp_path" in locals(): | ||
| os.remove(temp_path) | ||
| except (IOError, OSError): | ||
| pass | ||
| return False |
There was a problem hiding this comment.
🛠️ Refactor suggestion | 🟠 Major | 🏗️ Heavy lift
Add an async metadata-update variant for core SDK I/O parity.
This method is pure blocking file I/O. In async agent flows, it can block the event loop unless an async variant exists and callers can opt into it.
As per coding guidelines, All I/O operations must have both sync and async variants; never block the event loop with sync I/O in async context; use asyncio primitives for coordination, not threading.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/praisonai-agents/praisonaiagents/session/store.py` around lines 483 -
531, Add an async variant of this metadata-update routine (e.g., async def
update_session_metadata_async(session_id, **fields)) that performs the blocking
file operations off the event loop using asyncio.to_thread or
loop.run_in_executor (or use aiofiles if preferred), preserves the same
semantics (use self._get_session_path, FileLock for file-level locking,
SessionData.from_dict/to_dict, atomic tempfile + os.replace, and the same error
handling), and protect the in-memory cache using an asyncio.Lock (e.g.,
self._async_lock) instead of the thread-based self._lock when updating
self._cache; keep the sync method unchanged and ensure callers can opt into the
async variant.
|
|
||
| filepath = self._get_session_path(session_id) | ||
|
|
||
| with FileLock(filepath, self.lock_timeout): |
There was a problem hiding this comment.
Handle lock-acquisition failure before running the critical section.
At Line 485, with FileLock(...) assumes lock ownership, but FileLock.acquire() can return False and __enter__ does not enforce success. Under contention, this path can still execute unlocked and reintroduce overwrite/data-loss risk.
Suggested fix
- with FileLock(filepath, self.lock_timeout):
+ lock = FileLock(filepath, self.lock_timeout)
+ if not lock.acquire():
+ logger.error(f"Failed to update session metadata {session_id}: could not acquire file lock")
+ return False
+ try:
if os.path.exists(filepath):
try:
with open(filepath, "r", encoding="utf-8") as f:
data = json.load(f)
session = SessionData.from_dict(data)
except (json.JSONDecodeError, IOError):
session = SessionData(session_id=session_id)
else:
session = SessionData(session_id=session_id)
@@
- return False
+ return False
+ finally:
+ lock.release()🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/praisonai-agents/praisonaiagents/session/store.py` at line 485, The code
uses "with FileLock(filepath, self.lock_timeout):" assuming the context manager
always acquires the lock, but FileLock.acquire() can return False so the
critical section may run unlocked; update the call site to explicitly acquire
and verify the lock before proceeding (call lock = FileLock(filepath,
self.lock_timeout); if not lock.acquire(): raise an exception or return an
error) and only enter (or execute the critical section) when acquisition
succeeded; ensure you release the lock in a finally block (or use
lock.release()) to avoid deadlocks and reference the same symbols FileLock,
filepath, self.lock_timeout, and the surrounding method where this context was
used.
| def test_update_session_metadata_preserves_messages(self, temp_store): | ||
| """Metadata updates must not drop messages added by another store instance.""" | ||
| with tempfile.TemporaryDirectory() as tmpdir: | ||
| writer = DefaultSessionStore(session_dir=tmpdir) | ||
| reader = DefaultSessionStore(session_dir=tmpdir) | ||
|
|
||
| writer.add_user_message("session-1", "first") | ||
| reader._load_session("session-1") | ||
| writer.add_user_message("session-1", "second") | ||
|
|
||
| assert reader.update_session_metadata( | ||
| "session-1", model="gpt-4o-mini", total_tokens=42 | ||
| ) | ||
|
|
||
| writer.invalidate_cache("session-1") | ||
| history = writer.get_chat_history("session-1") | ||
| assert len(history) == 2 | ||
| assert history[1]["content"] == "second" | ||
|
|
||
| session = writer.get_session("session-1") | ||
| assert session.metadata.get("model") == "gpt-4o-mini" | ||
| assert session.metadata.get("total_tokens") == 42 |
There was a problem hiding this comment.
Add a real agentic integration test for this regression scenario.
This unit test is valuable, but the feature still lacks the required Agent.start() + real prompt/LLM response coverage, so orchestration-layer regressions may go undetected.
As per coding guidelines, Real agentic tests are MANDATORY for every feature: Agent must call agent.start() with a real prompt, call the LLM, and produce actual text response—not just smoke tests of object construction.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/praisonai-agents/tests/unit/session/test_session_store.py` around lines
286 - 307, The test test_update_session_metadata_preserves_messages currently
only manipulates DefaultSessionStore instances and must be extended into a real
agentic integration test: create or reuse an Agent instance and call
Agent.start() with a real prompt, ensure the agent invokes an actual LLM (not a
mock) and produces a text response, then verify the session store
(DefaultSessionStore) preserves messages across metadata updates (use
add_user_message, update_session_metadata, invalidate_cache, get_chat_history,
get_session) to assert the agent-generated response and subsequent user messages
remain intact and metadata (model, total_tokens) is set; ensure the test setup
provisions a real LLM client or integration test fixture so the call to
Agent.start() performs a live request rather than a mocked response.
Reviewed thoroughly at commit Findings:
Reviewer feedback incorporated:
Optional improvement (non-blocking): factor duplicated atomic write block into a private helper used by both |
|
@claude You are the FINAL architecture reviewer. If the branch is under MervinPraison/PraisonAI (not a fork), you are able to make modifications to this branch and push directly. SCOPE: Focus ONLY on Python packages (praisonaiagents, praisonai). Do NOT modify praisonai-rust or praisonai-ts. Read ALL comments above from Gemini, Qodo, CodeRabbit, and Copilot carefully before responding. Phase 1: Review per AGENTS.md
Phase 2: FIX Valid Issues Phase 3: Final Verdict |
|
Claude finished @MervinPraison's task —— View job Final Architecture Review ✅Phase 1: Architecture Review per AGENTS.md
Phase 2: Address Valid Reviewer Issues
Phase 3: Implementation & Validation
Phase 4: Summary
Summary✅ Critical Fix Applied
|
CodeRabbit identified a critical security bug where FileLock.__enter__() didn't check if acquire() succeeded, potentially allowing unlocked execution. Fixed to properly raise IOError when lock acquisition fails. Co-authored-by: Mervin Praison <MervinPraison@users.noreply.github.com>
Qodo reviews are paused for this user.Troubleshooting steps vary by plan Learn more → On a Teams plan? Using GitHub Enterprise Server, GitLab Self-Managed, or Bitbucket Data Center? |
Greptile SummaryThis PR fixes a session message-loss bug where
Confidence Score: 3/5Safe to merge only if callers of set_agent_info and clear_session do not share a session directory across processes; those methods can still overwrite disk with a stale in-memory copy and drop messages. update_session_metadata is correctly fixed, but set_agent_info and clear_session retain the same unlocked load-modify-save pattern that was causing the original bug. Additionally, FileLock.enter now throws IOError on timeout, which propagates uncaught through add_message — a method documented to return bool — potentially crashing callers under lock contention. src/praisonai-agents/praisonaiagents/session/store.py — specifically the set_agent_info and clear_session methods, and the exception-propagation behaviour of add_message under lock timeout Important Files Changed
Sequence DiagramsequenceDiagram
participant W as Writer (store A)
participant D as Disk (session file)
participant R as Reader (store B)
W->>D: add_message("second") [FileLock, atomic write]
R->>R: _load_session() → hits stale cache (has only "first")
Note over R,D: update_session_metadata (NEW - fixed)
R->>D: FileLock acquired
R->>D: reload from disk (sees both messages)
R->>D: write metadata + both messages atomically
R->>D: FileLock released
R->>R: update cache
Note over R,D: set_agent_info / clear_session (OLD - still vulnerable)
R->>R: _load_session() → cache hit → stale copy (only "first")
R->>D: _save_session(stale) → "second" is lost
|
| def __enter__(self): | ||
| self.acquire() | ||
| if not self.acquire(): | ||
| raise IOError(f"Failed to acquire file lock for {self.filepath} after {self.timeout}s") | ||
| return self |
There was a problem hiding this comment.
IOError from __enter__ is now unhandled in several callers
FileLock.__enter__ now raises IOError on lock-acquisition timeout instead of silently returning without a lock. While the intent is correct, the callers add_message, _load_session, and _save_session all wrap only their I/O steps in a try/except (IOError, OSError) block that sits inside the with FileLock(...) statement. If the lock itself times out the IOError propagates uncaught to the caller. add_message is documented to return bool, but under lock contention it now raises instead. Code that calls add_message in a bare if store.add_message(...) guard will crash rather than degrade gracefully.
Bug and impact
After each assistant turn,
_persist_session_stats()callsupdate_session_metadata()immediately afteradd_message(). The new method loaded a possibly stale in-memory session and saved it without a file lock, which could silently drop the latest chat messages when:This is user-visible data loss in chat/session persistence (PraisonAI UI host integration, May 21).
Root cause
update_session_metadata()used_load_session()+_save_session()instead of the locked read-modify-write pattern already used byadd_message().Fix and validation
update_session_metadata()withFileLock, disk reload, atomic write, and cache update (mirrorsadd_message).test_update_session_metadata_preserves_messagesthat reproduces the two-store stale-cache race.python3 -m pytest tests/unit/session/test_session_store.py::TestDefaultSessionStore::test_update_session_metadata_preserves_messages -q -o addopts=""Summary by CodeRabbit
Bug Fixes