fix lock_hold_tx wlock_ts_ and txm CmdSet's wlock_ts_ #307
Walkthrough
Updated transaction coordination and cache-entry handling: the write-lock timestamp is now updated only when the existing timestamp is zero, AddObjectCommand enforces stricter version and non-decreasing timestamp checks when reusing entries, and ReplayLogCc::Execute now logs InitCcm errors before the existing error handling runs.
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes
Actionable comments posted: 0
🧹 Nitpick comments (1)
tx_service/include/cc/cc_request.h (1)
4726-4745: Good addition of error logging; consider adding more context.
The new `LOG(ERROR)` on `InitCcm` failure in `ReplayLogCc::Execute` is helpful and does not change control flow. For easier debugging of replay issues, you might also log the table name and `node_group_id_` (and possibly `ng_term_`) so operators can correlate errors to specific shards/tables.
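A minimal sketch of what the richer log line could contain, assuming the table name, node group id, term, and an error code are all available at the failure site. The helper `FormatInitCcmError` and its parameter types are hypothetical and are not part of tx_service; the sketch only builds the message text.

```cpp
#include <cstdint>
#include <sstream>
#include <string>

// Hypothetical helper (not from tx_service): builds an error message that
// carries enough context to tie an InitCcm failure to a table and shard.
// The parameter types are assumptions made for illustration only.
inline std::string FormatInitCcmError(const std::string &table_name,
                                      std::uint32_t node_group_id,
                                      std::int64_t ng_term,
                                      int err_code)
{
    std::ostringstream oss;
    oss << "ReplayLogCc::Execute: InitCcm failed"
        << ", table=" << table_name
        << ", node_group_id=" << node_group_id
        << ", ng_term=" << ng_term
        << ", error=" << err_code;
    return oss.str();
}
```

In the real code the resulting string would presumably be streamed into the existing `LOG(ERROR)` call rather than returned.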
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
tx_service/include/cc/cc_request.h (1 hunks)
tx_service/include/command_set.h (1 hunks)
tx_service/src/cc/cc_shard.cpp (1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-10-09T03:56:58.811Z
Learnt from: thweetkomputer
Repo: eloqdata/tx_service PR: 150
File: include/cc/local_cc_shards.h:626-631
Timestamp: 2025-10-09T03:56:58.811Z
Learning: For the LocalCcShards class in include/cc/local_cc_shards.h: Writer locks (unique_lock) should continue using the original meta_data_mux_ (std::shared_mutex) rather than fast_meta_data_mux_ (FastMetaDataMutex) at this stage. Only reader locks may use the FastMetaDataMutex wrapper.
Applied to files:
tx_service/src/cc/cc_shard.cpp
📚 Learning: 2025-12-02T10:43:27.431Z
Learnt from: lokax
Repo: eloqdata/tx_service PR: 254
File: tx_service/src/cc/local_cc_shards.cpp:2949-3188
Timestamp: 2025-12-02T10:43:27.431Z
Learning: In tx_service/src/cc/local_cc_shards.cpp, whenever TryPinNodeGroupData is used, only call Sharder::Instance().UnpinNodeGroupData(node_group) if the recorded term is >= 0 (i.e., pin succeeded). Example: LocalCcShards::PostProcessFlushTaskEntries guards the unpin with `if (term >= 0)`.
Applied to files:
tx_service/src/cc/cc_shard.cpp
🔇 Additional comments (2)
tx_service/include/command_set.h (1)
94-103: Updating entry timestamps on reuse looks correct; verify monotonicity assumptions.
Updating `object_version_`, `lock_ts_`, and `last_vali_ts_` when an existing `CmdSetEntry` is reused fixes the stale-state issue and keeps the command set in sync with the latest lock state. The added assertions that `cce_version`, `lock_ts`, and `last_vali_ts` are non-decreasing are reasonable invariants for version/timestamp fields.
It would be good to double-check that all call sites (including recovery/replay paths and any lock re-acquisition/upgrade flows) never legitimately pass a lower `lock_ts` or `last_vali_ts`; otherwise these debug-only asserts could start firing even though production behavior is fine.
tx_service/src/cc/cc_shard.cpp (1)
775-783: Initialization of `wlock_ts_` is properly handled; the code change is correct.
The `TxLockInfo` constructor (line 133) and `Reset` method (line 154) both correctly initialize `wlock_ts_` to 0. The change ensures that `wlock_ts_` captures the timestamp of the first write lock acquisition and remains unchanged on subsequent acquisitions by the same transaction. This preserves the initial write lock timestamp needed for checkpoint timestamp calculations (as confirmed by usage at lines 535-537).
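The "set only if still zero" behaviour described above can be sketched as follows. This is an illustration under assumptions, not the code from cc_shard.cpp: only the field name `wlock_ts_` and the `Reset` method name come from the review; `TxLockInfoSketch` and `OnWriteLockAcquired` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the lock-info record. Only wlock_ts_ and
// Reset are named in the review; everything else is assumed.
struct TxLockInfoSketch
{
    std::uint64_t wlock_ts_{0};  // 0 means "no write lock recorded yet"

    // Record a write-lock acquisition at time now_ts. Only the first
    // acquisition sticks; later ones leave wlock_ts_ unchanged.
    void OnWriteLockAcquired(std::uint64_t now_ts)
    {
        assert(now_ts != 0);  // 0 is reserved as the "unset" sentinel
        if (wlock_ts_ == 0)
        {
            wlock_ts_ = now_ts;
        }
    }

    // Return to the initial state so the record can be reused.
    void Reset()
    {
        wlock_ts_ = 0;
    }
};
```

Because the oldest write-lock timestamp is the one retained, anything derived from it, such as a checkpoint bound, stays conservative even when the same transaction takes further write locks.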
Actionable comments posted: 0
🧹 Nitpick comments (1)
tx_service/include/command_set.h (1)
95-103: Consider using always-on runtime checks instead of `assert()` for these critical invariants.
The invariants being enforced here (version consistency and non-decreasing timestamps) are important for correctness; this is a bug fix PR after all. However, `assert()` is typically compiled out in release builds when `NDEBUG` is defined. If these invariants are violated in production, the code will silently proceed with potentially incorrect values, which could lead to data corruption in transaction coordination.
Consider using a runtime check that logs and handles the failure, or a macro that remains active in release builds:
🔎 Suggested approach using runtime checks
```diff
 CmdSetEntry &entry = cce_it->second;
 // Update the entry fields since a txn might have many commands that
 // operate on the same cce.
-assert(entry.object_version_ == 0 ||
-       entry.object_version_ == cce_version);
+if (entry.object_version_ != 0 && entry.object_version_ != cce_version)
+{
+    LOG(ERROR) << "AddObjectCommand version mismatch: existing="
+               << entry.object_version_ << " incoming=" << cce_version;
+    // Handle error appropriately - return early, throw, or abort
+}
 entry.object_version_ = cce_version;
-assert(lock_ts >= entry.lock_ts_);
+if (lock_ts < entry.lock_ts_)
+{
+    LOG(ERROR) << "AddObjectCommand lock_ts regressed: existing="
+               << entry.lock_ts_ << " incoming=" << lock_ts;
+}
 entry.lock_ts_ = lock_ts;
-assert(last_vali_ts >= entry.last_vali_ts_);
+if (last_vali_ts < entry.last_vali_ts_)
+{
+    LOG(ERROR) << "AddObjectCommand last_vali_ts regressed: existing="
+               << entry.last_vali_ts_ << " incoming=" << last_vali_ts;
+}
 entry.last_vali_ts_ = last_vali_ts;
```
Alternatively, if crashing is acceptable for invariant violations, use a custom macro like `CHECK()` (from butil/logging.h) that remains active in release builds.
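One way the "log and handle" option could look as a standalone unit is sketched below. It assumes the caller can react to a `false` return (for example by aborting the transaction); that handling policy, the struct `CmdSetEntrySketch`, and the function `UpdateEntryChecked` are all hypothetical, and `std::cerr` stands in for the project's logging macro. Unlike the diff in the review, this sketch refuses the update entirely when an invariant fails instead of logging and then overwriting.

```cpp
#include <cstdint>
#include <iostream>

// Hypothetical mirror of the three fields discussed in the review.
struct CmdSetEntrySketch
{
    std::uint64_t object_version_{0};
    std::uint64_t lock_ts_{0};
    std::uint64_t last_vali_ts_{0};
};

// Validates the invariants with ordinary branches, so the checks stay
// active even when NDEBUG is defined. Returns false and leaves the entry
// untouched if any invariant is violated; otherwise applies the update.
inline bool UpdateEntryChecked(CmdSetEntrySketch &entry,
                               std::uint64_t cce_version,
                               std::uint64_t lock_ts,
                               std::uint64_t last_vali_ts)
{
    if (entry.object_version_ != 0 && entry.object_version_ != cce_version)
    {
        std::cerr << "version mismatch: existing=" << entry.object_version_
                  << " incoming=" << cce_version << '\n';
        return false;
    }
    if (lock_ts < entry.lock_ts_)
    {
        std::cerr << "lock_ts regressed: existing=" << entry.lock_ts_
                  << " incoming=" << lock_ts << '\n';
        return false;
    }
    if (last_vali_ts < entry.last_vali_ts_)
    {
        std::cerr << "last_vali_ts regressed: existing="
                  << entry.last_vali_ts_ << " incoming=" << last_vali_ts
                  << '\n';
        return false;
    }
    entry.object_version_ = cce_version;
    entry.lock_ts_ = lock_ts;
    entry.last_vali_ts_ = last_vali_ts;
    return true;
}
```

Validating all three fields before writing any of them keeps the entry from ending up half-updated when a later check fails.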
Fixes https://github.com/eloqdata/project_tracker/issues/109.
May also address at least one cause of https://github.com/eloqdata/project_tracker/issues/61.