Use native PG17+ logical slot failover; retire spock worker on PG18 #409
ibrarahmad wants to merge 4 commits into main
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the CodeRabbit settings, or use the review commands to manage reviews.
📝 Walkthrough
Adds version-aware logical slot failover and related documentation, GUC docs, slot-creation changes (FAILOVER flag), failover-worker behavior gated by PostgreSQL version (PG17 yield, PG18+ disabled), conflict-handling/reporting changes, new tests (TAP/regress), and sample adjustments for remote-version logic.
🚥 Pre-merge checks
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Up to standards ✅

| Metric | Results |
|---|---|
| Issues | 🟢 |
| Duplication | 0 |
Actionable comments posted: 5
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/configuring.md`:
- Around line 247-250: The fenced code blocks in docs/configuring.md (e.g., the
block containing synchronized_standby_slots = 'physical_slot_name' and the other
similar blocks around those sections) lack language identifiers and trigger
markdownlint MD040; update each triple-backtick fence to include an appropriate
language tag (for example use conf or ini) so they become ```conf or ```ini to
silence MD040 and improve syntax highlighting; ensure you apply the same change
to the other fenced blocks noted in the comment (the ones with
postgres/postgresql config snippets).
- Around line 230-236: The per-GUC notes incorrectly state that
spock.synchronize_slot_names and spock.pg_standby_slot_names apply only to
PostgreSQL 15/16; update the documentation so these GUCs are described as
effective whenever the legacy Spock worker is registered (including PG17 when
the Spock path is used) rather than blanket “15/16 only.” Reference the
registration behavior implemented in src/spock_failover_slots.c (the worker
registration and ClientAuthentication_hook on PG17) and adjust the table text
and the related GUC notes (the per-GUC paragraphs currently marked as
15/16-only) so they say “applies when Spock failover slot worker is active
(legacy Spock path, which may apply on PG17)” or similar wording; mirror this
change in the other occurrences you flagged (around the other GUC notes).
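For reference, a minimal sketch of the legacy Spock worker configuration the comment describes, assuming hypothetical slot names and value syntax (only the two GUCs named in the comment are shown):

```conf
# Legacy Spock failover-slot worker path (PG15/16, and PG17 when the
# Spock path is active). Slot names below are illustrative placeholders.
spock.synchronize_slot_names = 'spk_example_slot'   # logical slots the worker syncs to standbys
spock.pg_standby_slot_names = 'standby1_slot'       # physical slots walsenders wait on
```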
In `@samples/Z0DAN/zodan.sql`:
- Around line 371-386: Query the remote server_version_num over the same DSN
before building remotesql so you branch based on the remote PostgreSQL version
rather than the local pg_settings; use dblink(node_dsn, 'SHOW
server_version_num') to select INTO a variable (e.g., remote_server_version) and
then if remote_server_version >= 170000 set remotesql to the 5-arg
pg_create_logical_replication_slot(...) form (with failover => true) else set
remotesql to the 2-arg form; apply the same change to the other code path that
builds remotesql (the duplicate occurrence that currently checks
pg_settings/local server_version_num) so both places base the decision on the
remote DSN.
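The fix described above can be sketched roughly as follows; this is a hedged illustration, with `node_dsn` and the slot/plugin names as placeholders rather than the actual zodan.sql variables (the 5-argument `pg_create_logical_replication_slot` form with a failover parameter is a PG17+ signature):

```sql
-- Sketch only: probe the REMOTE server's version over the same DSN,
-- then pick the slot-creation form. Requires the dblink extension.
DO $$
DECLARE
    node_dsn              text;  -- placeholder: the subscription's DSN
    remote_server_version int;
    remotesql             text;
BEGIN
    SELECT v::int INTO remote_server_version
    FROM dblink(node_dsn, 'SHOW server_version_num') AS t(v text);

    IF remote_server_version >= 170000 THEN
        -- PG17+: 5-arg form, failover => true
        remotesql := $q$SELECT pg_create_logical_replication_slot(
                        'spk_example_slot', 'spock_output', false, false, true)$q$;
    ELSE
        -- Older servers: 2-arg form
        remotesql := $q$SELECT pg_create_logical_replication_slot(
                        'spk_example_slot', 'spock_output')$q$;
    END IF;

    PERFORM dblink(node_dsn, remotesql);
END $$;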
In `@src/spock_failover_slots.c`:
- Around line 1088-1100: The current guard uses IsSyncingReplicationSlots() (a
process-local flag) which doesn't reflect PostgreSQL's cross-process slot-sync
state; instead read the shared SlotSyncCtx->syncing flag before entering the
sync path (in addition to RecoveryInProgress() and hot_standby_feedback) so the
Spock worker yields when the native slotsync worker is active; update the
condition around IsSyncingReplicationSlots() to query SlotSyncCtx->syncing
(using the same shared-memory access and locking protocol that PostgreSQL uses
to read SlotSyncCtx) rather than relying solely on the process-local
syncing_slots flag.
In `@src/spock_sync.c`:
- Around line 368-386: The code currently uses the compile-time PG_VERSION_NUM
to decide whether to append " FAILOVER" to the slot CREATE query; instead, query
the remote provider's server version via PQserverVersion(repl_conn) (or reuse
any existing remote capability check) and base the PG17+ decision on that
runtime value; specifically, replace the PG_VERSION_NUM branch with a runtime
check using PQserverVersion(repl_conn) and keep the older-versions fallback that
only appends " FAILOVER" when use_failover_slot is true, ensuring
appendStringInfo(&query, " FAILOVER") is executed only when the remote server
reports version >= 170000.
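Concretely, the runtime-gated command the comment asks for would look like this on the wire (slot name is an illustrative placeholder; PG17+ requires the parenthesised option form):

```sql
-- Emitted over the walsender protocol; append the option only when
-- PQserverVersion(conn) >= 170000 (or the remote capability probe succeeds).
CREATE_REPLICATION_SLOT "spk_example_slot" LOGICAL spock_output (FAILOVER);

-- Providers without failover support get the plain form:
CREATE_REPLICATION_SLOT "spk_example_slot" LOGICAL spock_output;
```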
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 13578561-d80d-41c3-a456-83a0b10a81a2
📒 Files selected for processing (6)
- docs/configuring.md
- docs/spock_release_notes.md
- mkdocs.yml
- samples/Z0DAN/zodan.sql
- src/spock_failover_slots.c
- src/spock_sync.c
Actionable comments posted: 3
♻️ Duplicate comments (1)
src/spock_sync.c (1)
368-389: ⚠️ Potential issue | 🔴 Critical — Use the provider's version/capability here, not PG_VERSION_NUM.

This command is sent over repl_conn, but both changed branches still key off the local build version. FAILOVER is a PostgreSQL 17+ replication-slot option, so a PG17+ build talking to a pre-17 provider during mixed-version add-node/upgrade flows will still emit (FAILOVER) and fail slot creation. The caller-side #if PG_VERSION_NUM < 170000 also means remote failover support is no longer probed at all on the PG17+ build path. Base this on PQserverVersion(...) and/or the remote capability probe instead of the subscriber's build version. (postgresql.org)

Suggested direction:

```diff
-#if PG_VERSION_NUM < 170000
-	bool use_failover_slot;
-#endif
+	bool use_failover_slot = false;
 ...
-#if PG_VERSION_NUM < 170000
-	/* 2QPG9.6 and 2QPG11 support failover slots */
-	use_failover_slot =
-		spock_remote_function_exists(origin_conn, "pg_catalog",
-									 "pg_create_logical_replication_slot",
-									 -1,
-									 "failover");
-#endif
+	/* Native PG17+ support, or older providers with explicit failover support. */
+	use_failover_slot =
+		PQserverVersion(origin_conn) >= 170000 ||
+		spock_remote_function_exists(origin_conn, "pg_catalog",
+									 "pg_create_logical_replication_slot",
+									 -1,
+									 "failover");
 ...
-#if PG_VERSION_NUM >= 170000
-	appendStringInfo(&query, " (FAILOVER)");
-#else
-	if (use_failover_slot)
-		appendStringInfo(&query, " (FAILOVER)");
-#endif
+	if (use_failover_slot)
+		appendStringInfo(&query, " (FAILOVER)");
```

Also applies to: 1226-1256
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/spock_sync.c` around lines 368 - 389, The code currently branches on the local build macro PG_VERSION_NUM when deciding to append " (FAILOVER)" to the CREATE_REPLICATION_SLOT query; instead detect remote provider capability via the replication connection (e.g. use PQserverVersion(repl_conn) and/or the existing remote capability probe) and use that to decide whether to append FAILOVER; modify the logic around repl_conn, PG_VERSION_NUM, and use_failover_slot so that on PG17+ builds you still check PQserverVersion(repl_conn) (or the remote capability flag) before appending " (FAILOVER)" to the query, and ensure the older-branch behavior still respects use_failover_slot when the remote reports support.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/logical_slot_failover.md`:
- Around line 43-56: Add language tags to the fenced code blocks containing the
PostgreSQL configuration so markdownlint MD040 is satisfied; locate the blocks
that include settings like synchronized_standby_slots = 'spock_standby_slot' and
the block with sync_replication_slots = on / primary_conninfo =
'host=<primary_host> ...' / primary_slot_name = 'spock_standby_slot' /
hot_standby_feedback = on and change their fences from ``` to ```conf (or
```ini) so the code fences are explicitly marked as config/ini.
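Assembled from the settings quoted in this comment, the two configuration fences would read roughly as below (the `<primary_host>` placeholder and the trailing `...` are left as they appear in the docs):

```conf
# Primary (postgresql.conf)
synchronized_standby_slots = 'spock_standby_slot'

# Standby (postgresql.conf)
sync_replication_slots = on
primary_conninfo = 'host=<primary_host> ...'
primary_slot_name = 'spock_standby_slot'
hot_standby_feedback = on
```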
- Around line 124-128: The docs currently suggest using
pg_stat_replication_slots to check if the slotsync worker is active; instead
update the example to query pg_stat_activity filtering on backend_type = 'slot
sync worker' (and optionally inspect wait_event = 'ReplicationSlotsyncMain' when
idle) so readers can detect the slotsync worker process; locate the section
titled "Check if slotsync worker is active (PG17+)" and replace the SELECT
against pg_stat_replication_slots with a query against pg_stat_activity WHERE
backend_type = 'slot sync worker'.
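The replacement query the comment asks for would be along these lines:

```sql
-- Detect the native slotsync worker process (PG17+).
SELECT pid, backend_type, wait_event
FROM pg_stat_activity
WHERE backend_type = 'slot sync worker';
-- An idle worker typically shows wait_event = 'ReplicationSlotsyncMain'.
```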
In `@tests/tap/t/018_failover_slots.pl`:
- Around line 286-309: Add a dedicated PG17 branch: use the same qport call that
queries pg_stat_activity for application_name 'spock_failover_slots worker'
(same as the existing checks that use $pg_major and qport) and assert the worker
is still registered (count > 0) with a message like "PG17: spock_failover_slots
worker registered but yields to native slotsync"; this ensures PG15/16 keep the
existing running check, PG17 is explicitly validated as registered (per
src/spock_failover_slots.c:1460-1489), and PG18+ remains the is(..., '0') case.
---
Duplicate comments:
In `@src/spock_sync.c`:
- Around line 368-389: The code currently branches on the local build macro
PG_VERSION_NUM when deciding to append " (FAILOVER)" to the
CREATE_REPLICATION_SLOT query; instead detect remote provider capability via the
replication connection (e.g. use PQserverVersion(repl_conn) and/or the existing
remote capability probe) and use that to decide whether to append FAILOVER;
modify the logic around repl_conn, PG_VERSION_NUM, and use_failover_slot so that
on PG17+ builds you still check PQserverVersion(repl_conn) (or the remote
capability flag) before appending " (FAILOVER)" to the query, and ensure the
older-branch behavior still respects use_failover_slot when the remote reports
support.
📒 Files selected for processing (4)
- docs/logical_slot_failover.md
- src/spock_sync.c
- tests/tap/schedule
- tests/tap/t/018_failover_slots.pl
✅ Files skipped from review due to trivial changes (1)
- tests/tap/schedule
🧹 Nitpick comments (1)
tests/tap/t/018_failover_slots.pl (1)
413-420: Consider using pg_ctl start for cleaner cleanup.

Starting postgres directly with & and relying on sleep(10) may be fragile if the server takes longer to initialize. Using pg_ctl start -w would wait for startup completion.

Suggested improvement:

```diff
 # Undo primary GUC change so destroy_cluster can restart n1 cleanly
-system("$pg_bin/postgres -D $primary_dir >> /dev/null 2>&1 &");
-sleep(10);
+system_or_bail("$pg_bin/pg_ctl", 'start', '-D', $primary_dir, '-w');
 system_maybe("$pg_bin/psql", '-h', $host, '-p', $primary_port,
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/tap/t/018_failover_slots.pl` around lines 413 - 420, The test starts Postgres directly with system("$pg_bin/postgres -D $primary_dir >> /dev/null 2>&1 &") and then sleeps, which is fragile; replace that startup with a waiting pg_ctl invocation (e.g., use system_maybe to call pg_ctl start -D $primary_dir -w or equivalent) so the script blocks until the server is ready instead of relying on sleep(10); update the startup call that currently uses postgres and sleep to use pg_ctl start -w and keep the subsequent ALTER SYSTEM RESET and SELECT pg_reload_conf() calls as-is.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@tests/tap/t/018_failover_slots.pl`:
- Around line 413-420: The test starts Postgres directly with
system("$pg_bin/postgres -D $primary_dir >> /dev/null 2>&1 &") and then sleeps,
which is fragile; replace that startup with a waiting pg_ctl invocation (e.g.,
use system_maybe to call pg_ctl start -D $primary_dir -w or equivalent) so the
script blocks until the server is ready instead of relying on sleep(10); update
the startup call that currently uses postgres and sleep to use pg_ctl start -w
and keep the subsequent ALTER SYSTEM RESET and SELECT pg_reload_conf() calls
as-is.
📒 Files selected for processing (9)
- docs/configuring.md
- docs/logical_slot_failover.md
- docs/spock_release_notes.md
- mkdocs.yml
- samples/Z0DAN/zodan.sql
- src/spock_failover_slots.c
- src/spock_sync.c
- tests/tap/schedule
- tests/tap/t/018_failover_slots.pl
✅ Files skipped from review due to trivial changes (3)
- tests/tap/schedule
- mkdocs.yml
- docs/spock_release_notes.md
🚧 Files skipped from review as they are similar to previous changes (2)
- src/spock_failover_slots.c
- samples/Z0DAN/zodan.sql
🧹 Nitpick comments (2)
tests/tap/t/018_failover_slots.pl (2)
318-322: Consider polling for DDL replication instead of fixed sleep.

The 5-second sleep assumes DDL replication completes within that time. For more robust testing:

```diff
 psql_or_bail(1, "CREATE TABLE IF NOT EXISTS failover_test (id int primary key, val text)");
-sleep(5);
+# Wait for table to exist on n2
+wait_until(30, 2, sub {
+    my $exists = scalar_query(2,
+        "SELECT 1 FROM information_schema.tables WHERE table_name='failover_test'");
+    $exists =~ s/\s+//g;
+    return $exists eq '1';
+}) or diag("WARNING: table may not have replicated to n2");
 psql_or_bail(1, "INSERT INTO failover_test VALUES (1, 'before_failover')");
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/tap/t/018_failover_slots.pl` around lines 318 - 322, The fixed sleep should be replaced with polling that verifies DDL replication before proceeding: after creating failover_test (created via psql_or_bail) loop with a short delay and use psql_or_bail (targeting the replica connection) to check for the presence of the table or the inserted row (e.g., SELECT 1 FROM failover_test WHERE id=1) until it succeeds or a timeout is reached; remove sleep(5) and instead break the loop on success or fail the test after the timeout so the test is robust against variable replication lag.
417-424: Consider using pg_isready polling instead of hard-coded sleep.

The cleanup section starts n1 in the background and waits 10 seconds before attempting to reset GUCs. This is fragile on slow systems. Consider:

```diff
-system("$pg_bin/postgres -D $primary_dir >> /dev/null 2>&1 &");
-sleep(10);
+system("$pg_bin/pg_ctl start -D $primary_dir -l $primary_dir/cleanup.log -w");
```

Using pg_ctl start -w waits until the server is ready, or you could poll with pg_isready like you do at line 202.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/tap/t/018_failover_slots.pl` around lines 417 - 424, Replace the fragile background start + sleep sequence (system("$pg_bin/postgres -D $primary_dir >> /dev/null 2>&1 &"); sleep(10);) with a readiness-based start/polling approach: either use pg_ctl start -w pointing at $primary_dir (so the call blocks until server is ready) or loop with pg_isready (like the polling used earlier around line 202) before calling system_maybe to run ALTER SYSTEM RESET synchronized_standby_slots and SELECT pg_reload_conf(); ensure you reference the same $pg_bin, $primary_dir and $primary_port variables when invoking pg_ctl or pg_isready so the reset commands run only after the server is ready.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@tests/tap/t/018_failover_slots.pl`:
- Around line 318-322: The fixed sleep should be replaced with polling that
verifies DDL replication before proceeding: after creating failover_test
(created via psql_or_bail) loop with a short delay and use psql_or_bail
(targeting the replica connection) to check for the presence of the table or the
inserted row (e.g., SELECT 1 FROM failover_test WHERE id=1) until it succeeds or
a timeout is reached; remove sleep(5) and instead break the loop on success or
fail the test after the timeout so the test is robust against variable
replication lag.
- Around line 417-424: Replace the fragile background start + sleep sequence
(system("$pg_bin/postgres -D $primary_dir >> /dev/null 2>&1 &"); sleep(10);)
with a readiness-based start/polling approach: either use pg_ctl start -w
pointing at $primary_dir (so the call blocks until server is ready) or loop with
pg_isready (like the polling used earlier around line 202) before calling
system_maybe to run ALTER SYSTEM RESET synchronized_standby_slots and SELECT
pg_reload_conf(); ensure you reference the same $pg_bin, $primary_dir and
$primary_port variables when invoking pg_ctl or pg_isready so the reset commands
run only after the server is ready.
📒 Files selected for processing (6)
- docs/configuring.md
- docs/logical_slot_failover.md
- samples/Z0DAN/zodan.sql
- src/spock_failover_slots.c
- src/spock_sync.c
- tests/tap/t/018_failover_slots.pl
✅ Files skipped from review due to trivial changes (2)
- docs/configuring.md
- docs/logical_slot_failover.md
🚧 Files skipped from review as they are similar to previous changes (2)
- samples/Z0DAN/zodan.sql
- src/spock_failover_slots.c
susan-pgedge
left a comment
I'll approve, but when we cherry-pick the v5_STABLE branch into main, we'll have merge conflicts to bring the new content into our 5.1 branch.
Formatting issues are addressed in the new content.... and I only reviewed .md files :D
🧹 Nitpick comments (6)
tests/regress/sql/read_only.sql (3)
251-255: Unnecessary \gset on repair_mode() calls.

The \gset commands on lines 252 and 254 appear superfluous. If repair_mode() returns void, \gset has no effect. If it returns a value, the captured variable isn't used. Consider removing \gset for clarity:

🔧 Proposed simplification

```diff
 BEGIN;
-SELECT spock.repair_mode(true) \gset
+SELECT spock.repair_mode(true);
 INSERT INTO ro_repl_test VALUES (99, 'repair data');
-SELECT spock.repair_mode(false) \gset
+SELECT spock.repair_mode(false);
 COMMIT;
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/regress/sql/read_only.sql` around lines 251 - 255, The SELECT calls to spock.repair_mode(true) and spock.repair_mode(false) include unnecessary psql meta-command "\gset" which either does nothing for void results or captures an unused variable; remove the "\gset" from the two SELECT lines (i.e., call SELECT spock.repair_mode(true); and SELECT spock.repair_mode(false); or otherwise invoke repair_mode without using \gset) so the script is clearer and has no redundant \gset usage.
277-278: Consider adding deterministic ordering for PID comparison.

Using LIMIT 1 without ORDER BY relies on implicit row ordering. While this works when there's a single replication connection (typical for this test), adding ORDER BY pid would make the comparison deterministic if the test environment ever has multiple connections.

🔧 Optional improvement for robustness

```diff
-SELECT pid AS repl_pid FROM pg_stat_replication LIMIT 1
+SELECT pid AS repl_pid FROM pg_stat_replication ORDER BY pid LIMIT 1
 \gset
 -- Same walsender PID proves the apply worker stayed alive (gentle wait)
-SELECT :repl_pid = pid AS worker_survived FROM pg_stat_replication LIMIT 1;
+SELECT :repl_pid = pid AS worker_survived FROM pg_stat_replication ORDER BY pid LIMIT 1;
```

Also applies to: 297-298
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/regress/sql/read_only.sql` around lines 277 - 278, The SELECT that captures replication PID (SELECT pid AS repl_pid FROM pg_stat_replication LIMIT 1 \gset) relies on implicit ordering; modify it to use an explicit ORDER BY pid (e.g., SELECT pid AS repl_pid FROM pg_stat_replication ORDER BY pid LIMIT 1 \gset) to ensure deterministic PID selection, and make the same change to the other identical query instance used for comparison.
283-284: Hardcoded sleep may be fragile in slow CI environments.The 2-second sleep for the apply worker to detect the config change could be insufficient under heavy CI load. Consider using a polling loop or increasing the timeout if flakiness occurs. However, this is a common pattern in test suites.
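One way to implement the suggested polling is a plpgsql timeout loop; this is a hedged sketch only, since the real readiness check depends on Spock's catalog views, which are not assumed here (the pg_stat_activity condition is a placeholder):

```sql
-- Generic poll-until-true pattern for psql tests; the condition is a placeholder.
DO $$
DECLARE
    deadline timestamptz := clock_timestamp() + interval '30 seconds';
    ready    boolean := false;
BEGIN
    WHILE clock_timestamp() < deadline LOOP
        -- Replace with the real readiness check, e.g. querying the
        -- apply worker's reported read-only state.
        SELECT count(*) > 0 INTO ready
        FROM pg_stat_activity
        WHERE backend_type = 'logical replication worker';  -- placeholder condition
        EXIT WHEN ready;
        PERFORM pg_sleep(0.5);
    END LOOP;
    IF NOT ready THEN
        RAISE EXCEPTION 'timed out waiting for apply worker state';
    END IF;
END $$;
```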
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/regress/sql/read_only.sql` around lines 283 - 284, Hardcoded SELECT pg_sleep(2) is fragile; replace the fixed sleep with a short polling loop that repeatedly queries the apply worker's state until it reports readonly (or a timeout), or at minimum increase the delay to a safer value; locate the SELECT pg_sleep(2) line in read_only.sql and implement a loop that runs small sleeps and checks the apply worker status (or change pg_sleep(2) to something like pg_sleep(5)) so the test waits deterministically for the worker to enter readonly mode.

src/spock_conflict.c (1)

371-387: Clarify the condition grouping for maintainability.

The condition at line 371 groups SPOCK_CT_UPDATE_EXISTS and SPOCK_CT_DELETE_ORIGIN_DIFFERS together for the same-origin early-return logic, then transforms UPDATE_EXISTS to UPDATE_ORIGIN_DIFFERS at lines 386-387. This works correctly, but the dual-purpose condition could benefit from a brief inline comment explaining that UPDATE_EXISTS is transformed to UPDATE_ORIGIN_DIFFERS when origins differ.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/spock_conflict.c` around lines 371 - 387, The early-return branch that checks conflict_type against SPOCK_CT_UPDATE_EXISTS and SPOCK_CT_DELETE_ORIGIN_DIFFERS and later mutates conflict_type to SPOCK_CT_UPDATE_ORIGIN_DIFFERS is confusing; update the block around conflict_type, local_tuple_origin, replorigin_session_origin, InvalidRepOriginId, local_tuple_xid and GetTopTransactionId() to include a short inline comment explaining that SPOCK_CT_UPDATE_EXISTS is treated like UPDATE_ORIGIN_DIFFERS when origins differ (and thus is converted) so readers understand the dual-purpose condition and the subsequent reassignment to SPOCK_CT_UPDATE_ORIGIN_DIFFERS.

src/spock_apply_heap.c (1)

848-853: Minor: Conflict log format differs from spock_report_conflict() output.

The simplified elog format here differs from the detailed ereport in spock_report_conflict() (which includes tuple details, timestamps, origins, etc.). This is acceptable given the comment's explanation that tuple slot data may be invalidated, but consider adding a brief note in the log message indicating limited details are available due to the error context.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/spock_apply_heap.c` around lines 848 - 853, The conflict log call using elog currently prints a short message; update the message passed to elog (the call around SpockConflictTypeName(SPOCK_CT_UPDATE_EXISTS) and edata->message) to append a brief note that tuple-level details are unavailable in this context (e.g. "detailed tuple info not available due to error context"), mirroring the intent of spock_report_conflict() without trying to access invalidated slot data; keep the same log level (spock_conflict_log_level) and the existing fields (conflict type, schema, relname, edata->message) and only extend the formatted string to include the limited-details note.

docs/conflict_types.md (1)

63-66: Consider documenting replication origin value semantics.

The description of origin comparison could benefit from clarifying what specific origin values mean:

- Origin 0 indicates the row was changed locally (not via replication)
- NULL in spock.resolutions.local_origin (or "unknown" in logs) indicates the local origin is genuinely unavailable (e.g., for pre-existing data after pg_restore)

This distinction is important for users interpreting conflict logs and the spock.resolutions table. Based on learnings about origin value semantics in this codebase.
Verify each finding against the current code and only fix it if needed. In `@docs/conflict_types.md` around lines 63 - 66, Add a short section to the origin comparison paragraph explaining the semantics of replication origin values: state that replorigin_session_origin value 0 means the row was changed locally (not via replication), and that NULL in spock.resolutions.local_origin (or "unknown" in logs) indicates the local origin is genuinely unavailable (e.g., pre-existing data after pg_restore); reference replorigin_session_origin and spock.resolutions.local_origin in the text so readers can map log/table values to these meanings and update the surrounding sentences about "same origin" behavior accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@docs/conflict_types.md`:
- Around line 63-66: Add a short section to the origin comparison paragraph
explaining the semantics of replication origin values: state that
replorigin_session_origin value 0 means the row was changed locally (not via
replication), and that NULL in spock.resolutions.local_origin (or "unknown" in
logs) indicates the local origin is genuinely unavailable (e.g., pre-existing
data after pg_restore); reference replorigin_session_origin and
spock.resolutions.local_origin in the text so readers can map log/table values
to these meanings and update the surrounding sentences about "same origin"
behavior accordingly.
In `@src/spock_apply_heap.c`:
- Around line 848-853: The conflict log call using elog currently prints a short
message; update the message passed to elog (the call around
SpockConflictTypeName(SPOCK_CT_UPDATE_EXISTS) and edata->message) to append a
brief note that tuple-level details are unavailable in this context (e.g.
"detailed tuple info not available due to error context"), mirroring the intent
of spock_report_conflict() without trying to access invalidated slot data; keep
the same log level (spock_conflict_log_level) and the existing fields (conflict
type, schema, relname, edata->message) and only extend the formatted string to
include the limited-details note.
In `@src/spock_conflict.c`:
- Around line 371-387: The early-return branch that checks conflict_type against
SPOCK_CT_UPDATE_EXISTS and SPOCK_CT_DELETE_ORIGIN_DIFFERS and later mutates
conflict_type to SPOCK_CT_UPDATE_ORIGIN_DIFFERS is confusing; update the block
around conflict_type, local_tuple_origin, replorigin_session_origin,
InvalidRepOriginId, local_tuple_xid and GetTopTransactionId() to include a short
inline comment explaining that SPOCK_CT_UPDATE_EXISTS is treated like
UPDATE_ORIGIN_DIFFERS when origins differ (and thus is converted) so readers
understand the dual-purpose condition and the subsequent reassignment to
SPOCK_CT_UPDATE_ORIGIN_DIFFERS.
In `@tests/regress/sql/read_only.sql`:
- Around line 251-255: The SELECT calls to spock.repair_mode(true) and
spock.repair_mode(false) include unnecessary psql meta-command "\gset" which
either does nothing for void results or captures an unused variable; remove the
"\gset" from the two SELECT lines (i.e., call SELECT spock.repair_mode(true);
and SELECT spock.repair_mode(false); or otherwise invoke repair_mode without
using \gset) so the script is clearer and has no redundant \gset usage.
- Around line 277-278: The SELECT that captures replication PID (SELECT pid AS
repl_pid FROM pg_stat_replication LIMIT 1 \gset) relies on implicit ordering;
modify it to use an explicit ORDER BY pid (e.g., SELECT pid AS repl_pid FROM
pg_stat_replication ORDER BY pid LIMIT 1 \gset) to ensure deterministic PID
selection, and make the same change to the other identical query instance used
for comparison.
- Around line 283-284: Hardcoded SELECT pg_sleep(2) is fragile; replace the
fixed sleep with a short polling loop that repeatedly queries the apply worker's
state until it reports readonly (or a timeout), or at minimum increase the delay
to a safer value; locate the SELECT pg_sleep(2) line in read_only.sql and
implement a loop that runs small sleeps and checks the apply worker status (or
change pg_sleep(2) to something like pg_sleep(5)) so the test waits
deterministically for the worker to enter readonly mode.
⛔ Files ignored due to path filters (2)
- tests/regress/expected/conflict_stat.out is excluded by !**/*.out
- tests/regress/expected/read_only.out is excluded by !**/*.out
📒 Files selected for processing (11)
- Makefile
- docs/conflict_types.md
- docs/conflicts.md
- docs/modify/zodan/zodan_readme.md
- docs/modify/zodan/zodan_tutorial.md
- mkdocs.yml
- src/spock_apply_heap.c
- src/spock_conflict.c
- src/spock_failover_slots.c
- tests/regress/sql/conflict_stat.sql
- tests/regress/sql/read_only.sql
✅ Files skipped from review due to trivial changes (3)
- mkdocs.yml
- docs/modify/zodan/zodan_readme.md
- docs/modify/zodan/zodan_tutorial.md
🚧 Files skipped from review as they are similar to previous changes (1)
- src/spock_failover_slots.c
PostgreSQL 17 introduced built-in logical slot synchronization to physical standbys via the slotsync worker (sync_replication_slots) and the FAILOVER flag on logical slots. PostgreSQL 18 completes the feature, with synchronized_standby_slots replacing the need for any third-party slot sync worker.

Mark all spock logical slots with FAILOVER at creation time on PG17+ so the native slotsync worker picks them up automatically. On PG17, spock's failover worker checks IsSyncingReplicationSlots() and yields if the native worker is active, preventing conflicts. On PG18+, the spock_failover_slots background worker is not registered at all; users must set sync_replication_slots = on.

For PG15 and PG16, behavior is unchanged: spock's bgworker syncs slots and the ClientAuthentication_hook holds walsenders back until standbys confirm via spock.pg_standby_slot_names.

ZODAN (zodan.sql) also creates logical slots via dblink; update both slot creation sites to pass failover => true on PG17+ using a runtime server_version_num check.

Add docs/logical_slot_failover.md covering setup for all supported PostgreSQL versions, required postgresql.conf settings, monitoring queries, and a version behaviour matrix. Update configuring.md with the five failover-slot GUCs and a cross-reference. Add the new page to mkdocs.yml navigation.
…ded test case.
PG17+ requires parenthesised options for CREATE_REPLICATION_SLOT;
the bare FAILOVER keyword fails to parse. Fix spock_sync.c and add
TAP test 018_failover_slots to verify logical slot sync to a physical
standby and end-to-end replication after promotion.
Add TAP test 018_failover_slots.pl that:
- builds a 2-node spock cluster (n1 provider, n2 subscriber)
- creates a physical streaming standby of n1 via pg_basebackup
- verifies the logical slot is synced to the standby (PG17+ native
slotsync; PG15/16 spock_failover_slots worker)
- promotes the standby, reconnects n2, and confirms replication
resumes end-to-end
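A monitoring query of the kind this test relies on, run on the physical standby to confirm the slot was synchronized (the failover and synced columns of pg_replication_slots exist on PG17+):

```sql
-- On the standby (PG17+): a failover-enabled logical slot picked up by the
-- native slotsync worker should appear with synced = true.
SELECT slot_name, slot_type, failover, synced
FROM pg_replication_slots
WHERE slot_type = 'logical';
```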
…tions

Use sql_conn/conn (regular SQL connection) instead of repl_conn for PQserverVersion() checks; replication protocol connections return 0. Also address CodeRabbit review comments on docs, zodan.sql, and tests.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>