
Use native PG17+ logical slot failover; retire spock worker on PG18.#409

Open
ibrarahmad wants to merge 4 commits into main from SPOC-74

Conversation

@ibrarahmad
Contributor

PostgreSQL 17 introduced built-in logical slot synchronization to physical standbys via the slotsync worker (sync_replication_slots) and the FAILOVER flag on logical slots. PostgreSQL 18 completes the feature, with synchronized_standby_slots removing the need for any third-party slot-sync worker.

Mark all spock logical slots with FAILOVER at creation time on PG17+ so the native slotsync worker picks them up automatically. On PG17, spock's failover worker checks IsSyncingReplicationSlots() and yields if the native worker is active, preventing conflicts. On PG18+, the spock_failover_slots background worker is not registered at all; users must set sync_replication_slots = on.
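The PG17+ path can be sketched as follows (illustrative SQL; the slot name and the `test_decoding` plugin are placeholders, not spock's actual names):

```sql
-- Provider (PG17+): create a logical slot with the failover flag set.
-- The fifth parameter (failover) was added in PostgreSQL 17.
SELECT pg_create_logical_replication_slot(
    'spk_example_slot',  -- placeholder slot name
    'test_decoding',     -- placeholder plugin; spock uses its own output plugin
    false,               -- temporary
    false,               -- twophase
    true                 -- failover: eligible for native slot synchronization
);

-- Physical standby: let the native slotsync worker mirror failover slots.
-- Requires hot_standby_feedback = on and primary_slot_name to be set.
ALTER SYSTEM SET sync_replication_slots = on;
SELECT pg_reload_conf();
```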

For PG15 and PG16, behavior is unchanged: spock's bgworker syncs slots and the ClientAuthentication_hook holds walsenders back until standbys confirm via spock.pg_standby_slot_names.
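A minimal provider-side sketch of that legacy path (the GUC names come from this PR; the values, and the `name_like:` pattern syntax carried over from pg_failover_slots, are assumptions):

```ini
# postgresql.conf on the provider, PG15/16 legacy path
# Logical slots the spock bgworker mirrors to standbys (pattern illustrative)
spock.synchronize_slot_names = 'name_like:spk_%'
# Physical slot(s) whose confirmation walsenders wait for before advancing
spock.pg_standby_slot_names = 'standby1_slot'
```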

ZODAN (zodan.sql) also creates logical slots via dblink; update both slot creation sites to pass failover => true on PG17+ using a runtime server_version_num check.
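That runtime check might look roughly like this (hypothetical DO block; the actual DSNs, slot names, and plugin in zodan.sql differ):

```sql
DO $$
DECLARE
    remote_ver int;
    remotesql  text;
BEGIN
    -- Ask the remote server (not the local one) for its version
    SELECT v::int INTO remote_ver
    FROM dblink('host=node2 dbname=demo',            -- placeholder DSN
                'SHOW server_version_num') AS t(v text);

    IF remote_ver >= 170000 THEN
        -- 5-argument form: temporary, twophase, failover => true
        remotesql := $q$SELECT pg_create_logical_replication_slot(
                          'spk_demo_slot', 'test_decoding',
                          false, false, true)$q$;
    ELSE
        -- 2-argument form for pre-17 remotes
        remotesql := $q$SELECT pg_create_logical_replication_slot(
                          'spk_demo_slot', 'test_decoding')$q$;
    END IF;

    PERFORM slot_name
    FROM dblink('host=node2 dbname=demo', remotesql)
         AS t(slot_name name, lsn pg_lsn);
END
$$;
```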

Add docs/logical_slot_failover.md covering setup for all supported PostgreSQL versions, required postgresql.conf settings, monitoring queries, and a version behaviour matrix. Update configuring.md with the five failover-slot GUCs and a cross-reference. Add the new page to mkdocs.yml navigation.
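The monitoring queries in that page can lean on catalog columns PostgreSQL 17 itself added to `pg_replication_slots`; for example (the `spk_` slot-name prefix is an assumption):

```sql
-- Provider: which logical slots are marked for failover?
SELECT slot_name, failover
FROM pg_replication_slots
WHERE slot_name LIKE 'spk_%';

-- Standby (PG17+): which slots has the native slotsync worker mirrored?
SELECT slot_name, synced
FROM pg_replication_slots;
```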

@coderabbitai

coderabbitai bot commented Apr 6, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Adds version-aware logical slot failover and related documentation, GUC docs, slot-creation changes (FAILOVER flag), failover-worker behavior gated by PostgreSQL version (PG17 yield, PG18+ disabled), conflict-handling/reporting changes, new tests (TAP/regress) and sample adjustments for remote-version logic.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **Documentation (new/updated)**: `docs/logical_slot_failover.md`, `docs/configuring.md`, `docs/spock_release_notes.md`, `docs/conflict_types.md`, `docs/conflicts.md`, `docs/modify/zodan/zodan_readme.md`, `docs/modify/zodan/zodan_tutorial.md`, `mkdocs.yml` | Added Logical Slot Failover doc and release notes; added conflict-types doc and updated conflicts.md; expanded ZODAN README/tutorial warnings; updated mkdocs nav; new/updated GUC documentation entries and version-specific configuration guidance. |
| **Source: failover worker**: `src/spock_failover_slots.c` | PG version guards added: include `replication/slotsync.h` for PG>=17; startup/registration and hooks suppressed for PG>=18; runtime gating adjusted so PG17 yields to native `sync_replication_slots` when enabled; slot-safety checks adjusted for invalid `catalog_xmin`. |
| **Source: slot creation & sync**: `src/spock_sync.c` | `CREATE_REPLICATION_SLOT` replication-protocol calls append `(FAILOVER)` when connected to PG>=17; pre-17 behavior preserved via compile-time/runtime guards. |
| **Source: apply & conflict handling**: `src/spock_apply_heap.c`, `src/spock_conflict.c` | Wrapped UPDATE apply in `PG_TRY`/`PG_CATCH` to detect unique violations and log `SPOCK_CT_UPDATE_EXISTS`; `spock_report_conflict` changed to avoid saving origin-diff conflicts in `spock.resolutions`. |
| **Samples**: `samples/Z0DAN/zodan.sql` | Remote SQL now queries the remote `server_version_num` (via dblink) and chooses the legacy vs. extended `pg_create_logical_replication_slot` call (adds `failover => true` for PG>=17); added `remote_pgver` lookups with fallback. |
| **Tests: TAP integration**: `tests/tap/t/018_failover_slots.pl`, `tests/tap/schedule` | New TAP test and schedule entry performing a 2-node end-to-end failover/standby slot sync, PG-version assertions (failover flag, worker presence), physical standby bootstrap, promotion, and subscriber reconnection checks. |
| **Tests: regression SQL**: `tests/regress/sql/read_only.sql`, `tests/regress/sql/conflict_stat.sql` | Added `read_only` regression to exercise `spock.readonly` modes and their effects on replication/apply; expanded `conflict_stat` to test UPDATE_EXISTS unique-violation logging and stats. |
| **Test harness / build**: `Makefile` | Updated REGRESS list to include `read_only`. |

Poem

🐇 I nibble logs and chase each LSN,

I tuck a FAILOVER flag in for Seventeen,
I step aside when native sync walks in,
For Eighteen I sleep — the system hums serene,
Hooray — the standby wakes and keeps the scene!

🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 66.67%, below the required 80.00% threshold. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately and concisely summarizes the main change: enabling native PostgreSQL 17+ logical slot failover while retiring the Spock worker on PG18. |
| Description check | ✅ Passed | The description comprehensively explains the PR's purpose, implementation approach across PostgreSQL versions, and specific code/documentation changes, all directly related to the changeset. |




Comment @coderabbitai help to get the list of available commands and usage tips.

@codacy-production

codacy-production bot commented Apr 6, 2026

Up to standards ✅

🟢 Issues: 0 new issues
🟢 Duplication: 0

View in Codacy



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/configuring.md`:
- Around line 247-250: The fenced code blocks in docs/configuring.md (e.g., the
block containing synchronized_standby_slots = 'physical_slot_name' and the other
similar blocks around those sections) lack language identifiers and trigger
markdownlint MD040; update each triple-backtick fence to include an appropriate
language tag (for example use conf or ini) so they become ```conf or ```ini to
silence MD040 and improve syntax highlighting; ensure you apply the same change
to the other fenced blocks noted in the comment (the ones with
postgres/postgresql config snippets).
- Around line 230-236: The per-GUC notes incorrectly state that
spock.synchronize_slot_names and spock.pg_standby_slot_names apply only to
PostgreSQL 15/16; update the documentation so these GUCs are described as
effective whenever the legacy Spock worker is registered (including PG17 when
the Spock path is used) rather than blanket “15/16 only.” Reference the
registration behavior implemented in src/spock_failover_slots.c (the worker
registration and ClientAuthentication_hook on PG17) and adjust the table text
and the related GUC notes (the per-GUC paragraphs currently marked as
15/16-only) so they say “applies when Spock failover slot worker is active
(legacy Spock path, which may apply on PG17)” or similar wording; mirror this
change in the other occurrences you flagged (around the other GUC notes).

In `@samples/Z0DAN/zodan.sql`:
- Around line 371-386: Query the remote server_version_num over the same DSN
before building remotesql so you branch based on the remote PostgreSQL version
rather than the local pg_settings; use dblink(node_dsn, 'SHOW
server_version_num') to select INTO a variable (e.g., remote_server_version) and
then if remote_server_version >= 170000 set remotesql to the 5-arg
pg_create_logical_replication_slot(...) form (with failover => true) else set
remotesql to the 2-arg form; apply the same change to the other code path that
builds remotesql (the duplicate occurrence that currently checks
pg_settings/local server_version_num) so both places base the decision on the
remote DSN.

In `@src/spock_failover_slots.c`:
- Around line 1088-1100: The current guard uses IsSyncingReplicationSlots() (a
process-local flag) which doesn't reflect PostgreSQL's cross-process slot-sync
state; instead read the shared SlotSyncCtx->syncing flag before entering the
sync path (in addition to RecoveryInProgress() and hot_standby_feedback) so the
Spock worker yields when the native slotsync worker is active; update the
condition around IsSyncingReplicationSlots() to query SlotSyncCtx->syncing
(using the same shared-memory access and locking protocol that PostgreSQL uses
to read SlotSyncCtx) rather than relying solely on the process-local
syncing_slots flag.

In `@src/spock_sync.c`:
- Around line 368-386: The code currently uses the compile-time PG_VERSION_NUM
to decide whether to append " FAILOVER" to the slot CREATE query; instead, query
the remote provider's server version via PQserverVersion(repl_conn) (or reuse
any existing remote capability check) and base the PG17+ decision on that
runtime value; specifically, replace the PG_VERSION_NUM branch with a runtime
check using PQserverVersion(repl_conn) and keep the older-versions fallback that
only appends " FAILOVER" when use_failover_slot is true, ensuring
appendStringInfo(&query, " FAILOVER") is executed only when the remote server
reports version >= 170000.
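For context, the two wire-protocol forms at issue look like this (replication-protocol commands as sent over `repl_conn`; slot and plugin names are placeholders):

```sql
-- PG17+ provider accepts an option list:
CREATE_REPLICATION_SLOT "spk_example" LOGICAL "test_decoding" (FAILOVER);

-- A pre-17 provider rejects the FAILOVER option, which is why the
-- subscriber must branch on PQserverVersion(repl_conn) rather than
-- its own compile-time PG_VERSION_NUM:
CREATE_REPLICATION_SLOT "spk_example" LOGICAL "test_decoding";
```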

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 13578561-d80d-41c3-a456-83a0b10a81a2

📥 Commits

Reviewing files that changed from the base of the PR and between f84b853 and 85d8af4.

📒 Files selected for processing (6)
  • docs/configuring.md
  • docs/spock_release_notes.md
  • mkdocs.yml
  • samples/Z0DAN/zodan.sql
  • src/spock_failover_slots.c
  • src/spock_sync.c


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

♻️ Duplicate comments (1)
src/spock_sync.c (1)

368-389: ⚠️ Potential issue | 🔴 Critical

Use the provider's version/capability here, not PG_VERSION_NUM.

This command is sent over repl_conn, but both changed branches still key off the local build version. FAILOVER is a PostgreSQL 17+ replication-slot option, so a PG17+ build talking to a pre-17 provider during mixed-version add-node/upgrade flows will still emit (FAILOVER) and fail slot creation. The caller-side #if PG_VERSION_NUM < 170000 also means remote failover support is no longer probed at all on the PG17+ build path. Base this on PQserverVersion(...) and/or the remote capability probe instead of the subscriber's build version. (postgresql.org)

Suggested direction
-#if PG_VERSION_NUM < 170000
-		bool		use_failover_slot;
-#endif
+		bool		use_failover_slot = false;
 ...
-#if PG_VERSION_NUM < 170000
-		/* 2QPG9.6 and 2QPG11 support failover slots */
-		use_failover_slot =
-			spock_remote_function_exists(origin_conn, "pg_catalog",
-										 "pg_create_logical_replication_slot",
-										 -1,
-										 "failover");
-#endif
+		/* Native PG17+ support, or older providers with explicit failover support. */
+		use_failover_slot =
+			PQserverVersion(origin_conn) >= 170000 ||
+			spock_remote_function_exists(origin_conn, "pg_catalog",
+										 "pg_create_logical_replication_slot",
+										 -1,
+										 "failover");
 ...
-#if PG_VERSION_NUM >= 170000
-	appendStringInfo(&query, " (FAILOVER)");
-#else
-	if (use_failover_slot)
-		appendStringInfo(&query, " (FAILOVER)");
-#endif
+	if (use_failover_slot)
+		appendStringInfo(&query, " (FAILOVER)");

Also applies to: 1226-1256

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/spock_sync.c` around lines 368 - 389, The code currently branches on the
local build macro PG_VERSION_NUM when deciding to append " (FAILOVER)" to the
CREATE_REPLICATION_SLOT query; instead detect remote provider capability via the
replication connection (e.g. use PQserverVersion(repl_conn) and/or the existing
remote capability probe) and use that to decide whether to append FAILOVER;
modify the logic around repl_conn, PG_VERSION_NUM, and use_failover_slot so that
on PG17+ builds you still check PQserverVersion(repl_conn) (or the remote
capability flag) before appending " (FAILOVER)" to the query, and ensure the
older-branch behavior still respects use_failover_slot when the remote reports
support.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/logical_slot_failover.md`:
- Around line 43-56: Add language tags to the fenced code blocks containing the
PostgreSQL configuration so markdownlint MD040 is satisfied; locate the blocks
that include settings like synchronized_standby_slots = 'spock_standby_slot' and
the block with sync_replication_slots = on / primary_conninfo =
'host=<primary_host> ...' / primary_slot_name = 'spock_standby_slot' /
hot_standby_feedback = on and change their fences from ``` to ```conf (or
```ini) so the code fences are explicitly marked as config/ini.
- Around line 124-128: The docs currently suggest using
pg_stat_replication_slots to check if the slotsync worker is active; instead
update the example to query pg_stat_activity filtering on backend_type = 'slot
sync worker' (and optionally inspect wait_event = 'ReplicationSlotsyncMain' when
idle) so readers can detect the slotsync worker process; locate the section
titled "Check if slotsync worker is active (PG17+)" and replace the SELECT
against pg_stat_replication_slots with a query against pg_stat_activity WHERE
backend_type = 'slot sync worker'.

In `@tests/tap/t/018_failover_slots.pl`:
- Around line 286-309: Add a dedicated PG17 branch: use the same qport call that
queries pg_stat_activity for application_name 'spock_failover_slots worker'
(same as the existing checks that use $pg_major and qport) and assert the worker
is still registered (count > 0) with a message like "PG17: spock_failover_slots
worker registered but yields to native slotsync"; this ensures PG15/16 keep the
existing running check, PG17 is explicitly validated as registered (per
src/spock_failover_slots.c:1460-1489), and PG18+ remains the is(..., '0') case.
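The replacement query the docs comment asks for would be along these lines (the exact `backend_type` string should be verified against the target PostgreSQL version):

```sql
-- Detect the native slot sync worker process (PG17+)
SELECT pid, backend_type, wait_event
FROM pg_stat_activity
WHERE backend_type = 'slot sync worker';
```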


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 8c04b428-578f-42ed-9df5-ded247273251

📥 Commits

Reviewing files that changed from the base of the PR and between 85d8af4 and 392661d.

📒 Files selected for processing (4)
  • docs/logical_slot_failover.md
  • src/spock_sync.c
  • tests/tap/schedule
  • tests/tap/t/018_failover_slots.pl
✅ Files skipped from review due to trivial changes (1)
  • tests/tap/schedule


@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
tests/tap/t/018_failover_slots.pl (1)

413-420: Consider using pg_ctl start for cleaner cleanup.

Starting postgres directly with & and relying on sleep(10) may be fragile if the server takes longer to initialize. Using pg_ctl start -w would wait for startup completion.

Suggested improvement
 # Undo primary GUC change so destroy_cluster can restart n1 cleanly
-system("$pg_bin/postgres -D $primary_dir >> /dev/null 2>&1 &");
-sleep(10);
+system_or_bail("$pg_bin/pg_ctl", 'start', '-D', $primary_dir, '-w');
 system_maybe("$pg_bin/psql", '-h', $host, '-p', $primary_port,
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/tap/t/018_failover_slots.pl` around lines 413 - 420, The test starts
Postgres directly with system("$pg_bin/postgres -D $primary_dir >> /dev/null
2>&1 &") and then sleeps, which is fragile; replace that startup with a waiting
pg_ctl invocation (e.g., use system_maybe to call pg_ctl start -D $primary_dir
-w or equivalent) so the script blocks until the server is ready instead of
relying on sleep(10); update the startup call that currently uses postgres and
sleep to use pg_ctl start -w and keep the subsequent ALTER SYSTEM RESET and
SELECT pg_reload_conf() calls as-is.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 1146285c-cce4-4c27-96a4-fbb7cf3bac6c

📥 Commits

Reviewing files that changed from the base of the PR and between 392661d and 7b6ca70.

📒 Files selected for processing (9)
  • docs/configuring.md
  • docs/logical_slot_failover.md
  • docs/spock_release_notes.md
  • mkdocs.yml
  • samples/Z0DAN/zodan.sql
  • src/spock_failover_slots.c
  • src/spock_sync.c
  • tests/tap/schedule
  • tests/tap/t/018_failover_slots.pl
✅ Files skipped from review due to trivial changes (3)
  • tests/tap/schedule
  • mkdocs.yml
  • docs/spock_release_notes.md
🚧 Files skipped from review as they are similar to previous changes (2)
  • src/spock_failover_slots.c
  • samples/Z0DAN/zodan.sql

@ibrarahmad ibrarahmad self-assigned this Apr 7, 2026
@ibrarahmad ibrarahmad added the enhancement New feature or request label Apr 7, 2026

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (2)
tests/tap/t/018_failover_slots.pl (2)

318-322: Consider polling for DDL replication instead of fixed sleep.

The 5-second sleep assumes DDL replication completes within that time. For more robust testing:

 psql_or_bail(1,
     "CREATE TABLE IF NOT EXISTS failover_test (id int primary key, val text)");
-sleep(5);
+# Wait for table to exist on n2
+wait_until(30, 2, sub {
+    my $exists = scalar_query(2,
+        "SELECT 1 FROM information_schema.tables WHERE table_name='failover_test'");
+    $exists =~ s/\s+//g;
+    return $exists eq '1';
+}) or diag("WARNING: table may not have replicated to n2");
 psql_or_bail(1,
     "INSERT INTO failover_test VALUES (1, 'before_failover')");
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/tap/t/018_failover_slots.pl` around lines 318 - 322, The fixed sleep
should be replaced with polling that verifies DDL replication before proceeding:
after creating failover_test (created via psql_or_bail) loop with a short delay
and use psql_or_bail (targeting the replica connection) to check for the
presence of the table or the inserted row (e.g., SELECT 1 FROM failover_test
WHERE id=1) until it succeeds or a timeout is reached; remove sleep(5) and
instead break the loop on success or fail the test after the timeout so the test
is robust against variable replication lag.

417-424: Consider using pg_isready polling instead of hard-coded sleep.

The cleanup section starts n1 in the background and waits 10 seconds before attempting to reset GUCs. This is fragile on slow systems. Consider:

-system("$pg_bin/postgres -D $primary_dir >> /dev/null 2>&1 &");
-sleep(10);
+system("$pg_bin/pg_ctl start -D $primary_dir -l $primary_dir/cleanup.log -w");

Using pg_ctl start -w waits until the server is ready, or you could poll with pg_isready like you do at line 202.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/tap/t/018_failover_slots.pl` around lines 417 - 424, Replace the
fragile background start + sleep sequence (system("$pg_bin/postgres -D
$primary_dir >> /dev/null 2>&1 &"); sleep(10);) with a readiness-based
start/polling approach: either use pg_ctl start -w pointing at $primary_dir (so
the call blocks until server is ready) or loop with pg_isready (like the polling
used earlier around line 202) before calling system_maybe to run ALTER SYSTEM
RESET synchronized_standby_slots and SELECT pg_reload_conf(); ensure you
reference the same $pg_bin, $primary_dir and $primary_port variables when
invoking pg_ctl or pg_isready so the reset commands run only after the server is
ready.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 401c642f-9d79-4970-8e82-5ad31e02a2c9

📥 Commits

Reviewing files that changed from the base of the PR and between 7b6ca70 and 4826fde.

📒 Files selected for processing (6)
  • docs/configuring.md
  • docs/logical_slot_failover.md
  • samples/Z0DAN/zodan.sql
  • src/spock_failover_slots.c
  • src/spock_sync.c
  • tests/tap/t/018_failover_slots.pl
✅ Files skipped from review due to trivial changes (2)
  • docs/configuring.md
  • docs/logical_slot_failover.md
🚧 Files skipped from review as they are similar to previous changes (2)
  • samples/Z0DAN/zodan.sql
  • src/spock_failover_slots.c

Member

@susan-pgedge susan-pgedge left a comment


I'll approve, but when we cherry-pick the v5_STABLE branch into main, we'll have merge conflicts to bring the new content into our 5.1 branch.

Formatting issues are addressed in the new content.... and I only reviewed .md files :D


@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (6)
tests/regress/sql/read_only.sql (3)

251-255: Unnecessary \gset on repair_mode() calls.

The \gset commands on lines 252 and 254 appear superfluous. If repair_mode() returns void, \gset has no effect. If it returns a value, the captured variable isn't used. Consider removing \gset for clarity:

🔧 Proposed simplification
 BEGIN;
-SELECT spock.repair_mode(true) \gset
+SELECT spock.repair_mode(true);
 INSERT INTO ro_repl_test VALUES (99, 'repair data');
-SELECT spock.repair_mode(false) \gset
+SELECT spock.repair_mode(false);
 COMMIT;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/regress/sql/read_only.sql` around lines 251 - 255, The SELECT calls to
spock.repair_mode(true) and spock.repair_mode(false) include unnecessary psql
meta-command "\gset" which either does nothing for void results or captures an
unused variable; remove the "\gset" from the two SELECT lines (i.e., call SELECT
spock.repair_mode(true); and SELECT spock.repair_mode(false); or otherwise
invoke repair_mode without using \gset) so the script is clearer and has no
redundant \gset usage.

277-278: Consider adding deterministic ordering for PID comparison.

Using LIMIT 1 without ORDER BY relies on implicit row ordering. While this works when there's a single replication connection (typical for this test), adding ORDER BY pid would make the comparison deterministic if the test environment ever has multiple connections.

🔧 Optional improvement for robustness
-SELECT pid AS repl_pid FROM pg_stat_replication LIMIT 1
+SELECT pid AS repl_pid FROM pg_stat_replication ORDER BY pid LIMIT 1
 \gset
--- Same walsender PID proves the apply worker stayed alive (gentle wait)
-SELECT :repl_pid = pid AS worker_survived FROM pg_stat_replication LIMIT 1;
+SELECT :repl_pid = pid AS worker_survived FROM pg_stat_replication ORDER BY pid LIMIT 1;

Also applies to: 297-298

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/regress/sql/read_only.sql` around lines 277 - 278, The SELECT that
captures replication PID (SELECT pid AS repl_pid FROM pg_stat_replication LIMIT
1 \gset) relies on implicit ordering; modify it to use an explicit ORDER BY pid
(e.g., SELECT pid AS repl_pid FROM pg_stat_replication ORDER BY pid LIMIT 1
\gset) to ensure deterministic PID selection, and make the same change to the
other identical query instance used for comparison.

283-284: Hardcoded sleep may be fragile in slow CI environments.

The 2-second sleep for the apply worker to detect the config change could be insufficient under heavy CI load. Consider using a polling loop or increasing the timeout if flakiness occurs. However, this is a common pattern in test suites.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/regress/sql/read_only.sql` around lines 283 - 284, Hardcoded SELECT
pg_sleep(2) is fragile; replace the fixed sleep with a short polling loop that
repeatedly queries the apply worker's state until it reports readonly (or a
timeout), or at minimum increase the delay to a safer value; locate the SELECT
pg_sleep(2) line in read_only.sql and implement a loop that runs small sleeps
and checks the apply worker status (or change pg_sleep(2) to something like
pg_sleep(5)) so the test waits deterministically for the worker to enter
readonly mode.
src/spock_conflict.c (1)

371-387: Clarify the condition grouping for maintainability.

The condition at line 371 groups SPOCK_CT_UPDATE_EXISTS and SPOCK_CT_DELETE_ORIGIN_DIFFERS together for the same-origin early-return logic, then transforms UPDATE_EXISTS to UPDATE_ORIGIN_DIFFERS at line 386-387. This works correctly but the dual-purpose condition could benefit from a brief inline comment explaining that UPDATE_EXISTS is transformed to UPDATE_ORIGIN_DIFFERS when origins differ.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/spock_conflict.c` around lines 371 - 387, The early-return branch that
checks conflict_type against SPOCK_CT_UPDATE_EXISTS and
SPOCK_CT_DELETE_ORIGIN_DIFFERS and later mutates conflict_type to
SPOCK_CT_UPDATE_ORIGIN_DIFFERS is confusing; update the block around
conflict_type, local_tuple_origin, replorigin_session_origin,
InvalidRepOriginId, local_tuple_xid and GetTopTransactionId() to include a short
inline comment explaining that SPOCK_CT_UPDATE_EXISTS is treated like
UPDATE_ORIGIN_DIFFERS when origins differ (and thus is converted) so readers
understand the dual-purpose condition and the subsequent reassignment to
SPOCK_CT_UPDATE_ORIGIN_DIFFERS.
src/spock_apply_heap.c (1)

848-853: Minor: Conflict log format differs from spock_report_conflict() output.

The simplified elog format here differs from the detailed ereport in spock_report_conflict() (which includes tuple details, timestamps, origins, etc.). This is acceptable given the comment's explanation that tuple slot data may be invalidated, but consider adding a brief note in the log message indicating limited details are available due to the error context.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/spock_apply_heap.c` around lines 848 - 853, The conflict log call using
elog currently prints a short message; update the message passed to elog (the
call around SpockConflictTypeName(SPOCK_CT_UPDATE_EXISTS) and edata->message) to
append a brief note that tuple-level details are unavailable in this context
(e.g. "detailed tuple info not available due to error context"), mirroring the
intent of spock_report_conflict() without trying to access invalidated slot
data; keep the same log level (spock_conflict_log_level) and the existing fields
(conflict type, schema, relname, edata->message) and only extend the formatted
string to include the limited-details note.
docs/conflict_types.md (1)

63-66: Consider documenting replication origin value semantics.

The description of origin comparison could benefit from clarifying what specific origin values mean:

  • Origin 0 indicates the row was changed locally (not via replication)
  • NULL in spock.resolutions.local_origin (or "unknown" in logs) indicates the local origin is genuinely unavailable (e.g., for pre-existing data after pg_restore)

This distinction is important for users interpreting conflict logs and the spock.resolutions table. Based on learnings about origin value semantics in this codebase.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/conflict_types.md` around lines 63 - 66, Add a short section to the
origin comparison paragraph explaining the semantics of replication origin
values: state that replorigin_session_origin value 0 means the row was changed
locally (not via replication), and that NULL in spock.resolutions.local_origin
(or "unknown" in logs) indicates the local origin is genuinely unavailable
(e.g., pre-existing data after pg_restore); reference replorigin_session_origin
and spock.resolutions.local_origin in the text so readers can map log/table
values to these meanings and update the surrounding sentences about "same
origin" behavior accordingly.
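A hedged sketch of how these origin semantics might surface when reading the resolutions table. Only `spock.resolutions.local_origin` is taken from the comment above; the `id` column and the exact wording of the labels are illustrative:

```sql
-- Sketch only: maps origin values to the meanings described above.
--   local_origin = 0       -> row was last changed locally, not via replication
--   local_origin IS NULL   -> origin genuinely unavailable (e.g. after pg_restore)
SELECT id,
       local_origin,
       CASE
         WHEN local_origin = 0   THEN 'changed locally'
         WHEN local_origin IS NULL THEN 'origin unavailable'
         ELSE 'replicated from origin ' || local_origin
       END AS origin_meaning
FROM spock.resolutions;
```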
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@docs/conflict_types.md`:
- Around line 63-66: Add a short section to the origin comparison paragraph
explaining the semantics of replication origin values: state that
replorigin_session_origin value 0 means the row was changed locally (not via
replication), and that NULL in spock.resolutions.local_origin (or "unknown" in
logs) indicates the local origin is genuinely unavailable (e.g., pre-existing
data after pg_restore); reference replorigin_session_origin and
spock.resolutions.local_origin in the text so readers can map log/table values
to these meanings and update the surrounding sentences about "same origin"
behavior accordingly.

In `@src/spock_apply_heap.c`:
- Around line 848-853: The conflict log call using elog currently prints a short
message; update the message passed to elog (the call around
SpockConflictTypeName(SPOCK_CT_UPDATE_EXISTS) and edata->message) to append a
brief note that tuple-level details are unavailable in this context (e.g.
"detailed tuple info not available due to error context"), mirroring the intent
of spock_report_conflict() without trying to access invalidated slot data; keep
the same log level (spock_conflict_log_level) and the existing fields (conflict
type, schema, relname, edata->message) and only extend the formatted string to
include the limited-details note.

In `@src/spock_conflict.c`:
- Around line 371-387: The early-return branch that checks conflict_type against
SPOCK_CT_UPDATE_EXISTS and SPOCK_CT_DELETE_ORIGIN_DIFFERS and later mutates
conflict_type to SPOCK_CT_UPDATE_ORIGIN_DIFFERS is confusing; update the block
around conflict_type, local_tuple_origin, replorigin_session_origin,
InvalidRepOriginId, local_tuple_xid and GetTopTransactionId() to include a short
inline comment explaining that SPOCK_CT_UPDATE_EXISTS is treated like
UPDATE_ORIGIN_DIFFERS when origins differ (and thus is converted) so readers
understand the dual-purpose condition and the subsequent reassignment to
SPOCK_CT_UPDATE_ORIGIN_DIFFERS.

In `@tests/regress/sql/read_only.sql`:
- Around line 251-255: The SELECT calls to spock.repair_mode(true) and
spock.repair_mode(false) include unnecessary psql meta-command "\gset" which
either does nothing for void results or captures an unused variable; remove the
"\gset" from the two SELECT lines (i.e., call SELECT spock.repair_mode(true);
and SELECT spock.repair_mode(false); or otherwise invoke repair_mode without
using \gset) so the script is clearer and has no redundant \gset usage.
- Around line 277-278: The SELECT that captures replication PID (SELECT pid AS
repl_pid FROM pg_stat_replication LIMIT 1 \gset) relies on implicit ordering;
modify it to use an explicit ORDER BY pid (e.g., SELECT pid AS repl_pid FROM
pg_stat_replication ORDER BY pid LIMIT 1 \gset) to ensure deterministic PID
selection, and make the same change to the other identical query instance used
for comparison.
- Around line 283-284: Hardcoded SELECT pg_sleep(2) is fragile; replace the
fixed sleep with a short polling loop that repeatedly queries the apply worker's
state until it reports readonly (or a timeout), or at minimum increase the delay
to a safer value; locate the SELECT pg_sleep(2) line in read_only.sql and
implement a loop that runs small sleeps and checks the apply worker status (or
change pg_sleep(2) to something like pg_sleep(5)) so the test waits
deterministically for the worker to enter readonly mode.
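One way to implement the suggested polling loop is a server-side `DO` block in place of the fixed `pg_sleep(2)`. The status probe shown here is a placeholder, since the exact view or function exposing the apply worker's readonly state is not named in the review comment:

```sql
DO $$
DECLARE
  waited int := 0;
BEGIN
  -- Poll in 0.5s steps for up to ~30s instead of a single fixed sleep.
  LOOP
    -- Placeholder probe: substitute the real apply-worker readonly check.
    EXIT WHEN EXISTS (SELECT 1 FROM spock.sub_show_status()
                      WHERE status = 'replicating');
    waited := waited + 1;
    IF waited > 60 THEN
      RAISE EXCEPTION 'timed out waiting for apply worker';
    END IF;
    PERFORM pg_sleep(0.5);
  END LOOP;
END $$;
```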

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 0dc23c1c-f482-4b3a-9dd4-131997c5ac2b

📥 Commits

Reviewing files that changed from the base of the PR and between 4826fde and b7f4bc4.

⛔ Files ignored due to path filters (2)
  • tests/regress/expected/conflict_stat.out is excluded by !**/*.out
  • tests/regress/expected/read_only.out is excluded by !**/*.out
📒 Files selected for processing (11)
  • Makefile
  • docs/conflict_types.md
  • docs/conflicts.md
  • docs/modify/zodan/zodan_readme.md
  • docs/modify/zodan/zodan_tutorial.md
  • mkdocs.yml
  • src/spock_apply_heap.c
  • src/spock_conflict.c
  • src/spock_failover_slots.c
  • tests/regress/sql/conflict_stat.sql
  • tests/regress/sql/read_only.sql
✅ Files skipped from review due to trivial changes (3)
  • mkdocs.yml
  • docs/modify/zodan/zodan_readme.md
  • docs/modify/zodan/zodan_tutorial.md
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/spock_failover_slots.c

Ibrar Ahmed and others added 4 commits April 9, 2026 01:44
PostgreSQL 17 introduced built-in logical slot synchronization to
physical standbys via the slotsync worker (sync_replication_slots)
and the FAILOVER flag on logical slots.  PostgreSQL 18 completes
the feature with synchronized_standby_slots replacing the need for
any third-party slot sync worker.

Mark all spock logical slots with FAILOVER at creation time on
PG17+ so the native slotsync worker picks them up automatically.
On PG17, spock's failover worker checks IsSyncingReplicationSlots()
and yields if the native worker is active, preventing conflicts.
On PG18+, the spock_failover_slots background worker is not
registered at all; users must set sync_replication_slots = on.

For PG15 and PG16, behavior is unchanged: spock's bgworker syncs
slots and the ClientAuthentication_hook holds walsenders back until
standbys confirm via spock.pg_standby_slot_names.

ZODAN (zodan.sql) also creates logical slots via dblink; update
both slot creation sites to pass failover => true on PG17+ using
a runtime server_version_num check.

Add docs/logical_slot_failover.md covering setup for all supported
PostgreSQL versions, required postgresql.conf settings, monitoring
queries, and a version behavior matrix.  Update configuring.md
with the five failover-slot GUCs and a cross-reference.  Add the
new page to mkdocs.yml navigation.
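The version-gated slot creation described above could look roughly like this on the ZODAN side. The slot and plugin names are illustrative; only the `failover => true` argument and the `server_version_num` gate come from the description (PostgreSQL 17 added the `failover` parameter to `pg_create_logical_replication_slot`):

```sql
DO $$
BEGIN
  IF current_setting('server_version_num')::int >= 170000 THEN
    -- PG17+: request a failover-enabled logical slot.
    PERFORM pg_create_logical_replication_slot('spk_example_slot', 'spock_output',
                                               temporary => false,
                                               twophase  => false,
                                               failover  => true);
  ELSE
    -- PG15/16: no failover parameter exists; spock's worker syncs the slot.
    PERFORM pg_create_logical_replication_slot('spk_example_slot', 'spock_output');
  END IF;
END $$;
```

For the native slotsync path on PG17+, the PostgreSQL docs additionally require the standby to run with `sync_replication_slots = on` and `hot_standby_feedback = on`, with a `dbname` included in `primary_conninfo`.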
…ded test case.

PG17+ requires parenthesised options for CREATE_REPLICATION_SLOT;
the bare FAILOVER keyword fails to parse.  Fix spock_sync.c and add
TAP test 018_failover_slots to verify logical slot sync to a physical
standby and end-to-end replication after promotion.

Add TAP test 018_failover_slots.pl that:
  - builds a 2-node spock cluster (n1 provider, n2 subscriber)
  - creates a physical streaming standby of n1 via pg_basebackup
  - verifies the logical slot is synced to the standby (PG17+ native
    slotsync; PG15/16 spock_failover_slots worker)
  - promotes the standby, reconnects n2, and confirms replication
    resumes end-to-end
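For reference, the parenthesised form the fix targets looks like this at the walsender replication-protocol level (slot and plugin names illustrative):

```sql
-- PG17+ walsender grammar: options go in parentheses.
CREATE_REPLICATION_SLOT "spk_example_slot" LOGICAL "spock_output" (FAILOVER);

-- The bare keyword form fails to parse:
-- CREATE_REPLICATION_SLOT "spk_example_slot" LOGICAL "spock_output" FAILOVER;  -- ERROR
```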
…tions

Use sql_conn/conn (regular SQL connection) instead of repl_conn for
PQserverVersion() checks — replication protocol connections return 0.
Also address CodeRabbit review comments on docs, zodan.sql, and tests.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Labels

enhancement New feature or request
