Replace wal_sender_timeout-based liveness with TCP keepalive. by ibrarahmad · Pull Request #373 · pgEdge/spock

ibrarahmad · 2026-03-04T06:54:44Z

The apply worker previously relied on wal_sender_timeout as both a server-side disconnect trigger and an indirect keepalive pressure on the subscriber. This caused spurious disconnects in two scenarios: a flood of 'w' messages keeping the subscriber too busy to send 'r' feedback in time, and large transactions whose apply time exceeded wal_sender_timeout.

The workaround was maybe_send_feedback(), which force-sent 'r' after every 10 'w' messages or wal_sender_timeout/2, whichever came first. This was a fragile band-aid that coupled subscriber behavior to a server GUC it cannot control.
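
The rule that workaround applied can be written as a small predicate. This is a reconstruction from the description above, not the removed source: the name `should_force_feedback`, its parameters, and the guard for a zero timeout are assumptions made for illustration.

```c
#include <stdbool.h>

#define FEEDBACK_EVERY_N_MESSAGES 10	/* "every 10 'w' messages" */

/*
 * Hypothetical restatement of the old heuristic: force an 'r' feedback
 * message once 10 'w' messages have been received since the last feedback,
 * or once half of wal_sender_timeout has elapsed, whichever comes first.
 * Treating a timeout of 0 as "no time-based trigger" is an assumption.
 */
bool
should_force_feedback(int w_since_feedback, long ms_since_feedback,
					  long wal_sender_timeout_ms)
{
	if (w_since_feedback >= FEEDBACK_EVERY_N_MESSAGES)
		return true;

	if (wal_sender_timeout_ms > 0 &&
		ms_since_feedback >= wal_sender_timeout_ms / 2)
		return true;

	return false;
}
```

The second branch shows the coupling the description objects to: the subscriber's behavior depends on a timeout value that is configured on the server.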

Replace the entire mechanism with a clean two-layer model:

- TCP keepalive (keepalives_idle=10, keepalives_interval=5, keepalives_count=3) is the primary liveness detector on both sides. A dead network or crashed host is detected in ~25 seconds.
- wal_sender_timeout=0 is set on replication connections so the walsender never disconnects due to missing 'r' feedback. Liveness on the server side is now handled entirely by TCP keepalive.
- spock.apply_idle_timeout (default 300s) is a subscriber-side safety net for a hung-but-connected walsender whose TCP keepalive probes are answered by the kernel but sends no data. Set to 0 to disable.
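
The ~25 second figure follows from the keepalive parameters: the first probe is sent after keepalives_idle seconds of silence, and the peer is declared dead after keepalives_count unanswered probes spaced keepalives_interval seconds apart. A minimal sketch of that arithmetic, assuming this simple model holds; the function name `keepalive_detection_seconds` is hypothetical and is not part of Spock or libpq, and real kernels may differ by about one probe interval.

```c
#include <assert.h>

/*
 * Hypothetical helper: approximate time, in seconds, for TCP keepalive to
 * declare a silent peer dead. The first probe goes out after `idle`
 * seconds; each of the `count` probes waits `interval` seconds for a reply.
 */
int
keepalive_detection_seconds(int idle, int interval, int count)
{
	assert(idle >= 0 && interval >= 0 && count >= 0);
	return idle + interval * count;
}
```

With the values in this PR, `keepalive_detection_seconds(10, 5, 3)` gives 25. The same formula applied to the previous defaults (20, 20, 5) gives 120.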

Fix a bug in last_receive_timestamp handling: it was updated unconditionally after every PQgetCopyData call, including when r==0 (no data available). Each 1-second WL_TIMEOUT spin silently reset the timer, making apply_idle_timeout never fire. Move the update to after the r==0 guard so it reflects actual data receipt only.
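
The effect of the misplaced update can be shown with a small standalone simulation. This is a sketch, not the Spock code: `simulate_idle_loop`, its parameters, and the one-wakeup-per-second model are assumptions for illustration. Only `last_receive_timestamp`, `r`, and the idle timeout come from the description above.

```c
#include <stdbool.h>

/*
 * Simulate an apply loop that wakes once per second and polls for data
 * from a hung walsender. `r` models the PQgetCopyData() result, where 0
 * means "no data available". Returns true if the idle timeout fires
 * within `ticks` wakeups.
 *
 * update_on_empty_read = true  reproduces the bug (timer reset when r == 0)
 * update_on_empty_read = false models the fix (timer reset only when r > 0)
 */
bool
simulate_idle_loop(int ticks, int idle_timeout_s, bool update_on_empty_read)
{
	int			last_receive_timestamp = 0;	/* seconds since start */

	for (int now = 1; now <= ticks; now++)
	{
		int			r = 0;	/* the hung sender never delivers data */

		if (update_on_empty_read)
			last_receive_timestamp = now;	/* bug: reset on every spin */

		if (r > 0)
			last_receive_timestamp = now;	/* fix: reset on real data only */

		if (idle_timeout_s > 0 &&
			now - last_receive_timestamp >= idle_timeout_s)
			return true;	/* the worker would reconnect here */
	}

	return false;
}
```

In the buggy variant the difference `now - last_receive_timestamp` is always zero, so the timeout can never be reached no matter how long the loop runs.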

Remove maybe_send_feedback() as it is no longer needed.

SPOC-419

coderabbitai · 2026-03-04T06:55:00Z

Walkthrough

Adds an apply-worker idle-timeout GUC (header exported), removes the feedback-frequency config and helper logic, adjusts replication connection parameters and TCP keepalive defaults, and refactors apply-worker idle-timer and COPY receive handling to use the new idle-timeout.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Header Export `include/spock.h` | Removed `extern int spock_feedback_frequency;` and added `extern int spock_apply_idle_timeout;`. |
| GUC Configuration & Connection Setup `src/spock.c` | Added `int spock_apply_idle_timeout = 300` and registered `spock.apply_idle_timeout` GUC. Removed `spock.feedback_frequency` GUC. Increased `CONN_PARAM_ARRAY_SIZE` 9→10 and adjusted TCP keepalive defaults (`keepalives_idle` 20→10, `keepalives_interval` 20→5, `keepalives_count` 5→3). |
| Apply Worker Timeout & Feedback `src/spock_apply.c` | Removed static `maybe_send_feedback(...)` and its scheduling/trigger logic. `apply_work()` enforces WL_TIMEOUT only when `spock_apply_idle_timeout > 0`, uses `spock_apply_idle_timeout` (seconds→ms) for the deadline, updates `last_receive_timestamp` only after successful `PQgetCopyData()` (`r > 0`), and no longer forces feedback from the `'w'` message path. |
| Documentation `docs/configuring.md` | Removed documentation for `spock.feedback_frequency`; added guidance about `wal_sender_timeout` and documented the new `spock.apply_idle_timeout` option (default 300, 0 to disable) as an apply-side idle reconnect safety net. |
| Build `Makefile` | Quoted `realpath` expansions and wrapped realpath-derived paths/arguments in quotes for `PG_CPPFLAGS`, `grep`, and `sed` invocations to improve path handling. |

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title accurately reflects the main architectural change in the PR: replacing wal_sender_timeout-based liveness detection with TCP keepalive as the primary mechanism. |
| Description check | ✅ Passed | The description is comprehensive and directly related to the changeset, explaining the problem, solution, and specific fixes including the new spock.apply_idle_timeout parameter and the last_receive_timestamp bug fix. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 80.00%, which is sufficient. The required threshold is 80.00%. |

coderabbitai

🧹 Nitpick comments (1)

src/spock.c (1)
345-356: Consider making TCP keepalive parameters configurable.

The hardcoded values (idle=10s, interval=5s, count=3) result in ~25s dead connection detection. While reasonable for most deployments, high-latency or unreliable network environments may experience false-positive disconnects.

Consider exposing these as GUCs (e.g., spock.keepalives_idle, spock.keepalives_interval, spock.keepalives_count) to allow tuning without code changes.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/spock.c` around lines 345 - 356, Replace the hardcoded TCP keepalive
literals in the keys/vals block with configurable GUC-backed values: define GUCs
(e.g., spock.keepalives_idle, spock.keepalives_interval, spock.keepalives_count)
as int variables (suggest names spock_keepalives_idle,
spock_keepalives_interval, spock_keepalives_count) during module initialization
(e.g., in _PG_init or the existing GUC registration area), register them with
DefineCustomIntVariable, and then use those variables' stringified values when
populating vals[] for the keys "keepalives_idle", "keepalives_interval", and
"keepalives_count" in the code that sets keys[i]/vals[i]; keep the default
values 10/5/3 and ensure bounds checking on the GUCs (positive integers) when
registering.
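
The suggestion above could look roughly like the following standalone sketch. It omits PostgreSQL headers so that it compiles on its own. The three `spock_keepalives_*` variables are the hypothetical names proposed in the comment (in the extension they would be registered as GUCs, for example with DefineCustomIntVariable), and `fill_keepalive_params` with its buffer layout is an assumption for illustration, not the code in src/spock.c.

```c
#include <stdio.h>

#define KEEPALIVE_PARAM_COUNT 3
#define KEEPALIVE_VALUE_LEN 12	/* fits any 32-bit int plus the NUL */

/* Hypothetical stand-ins for GUC-backed variables; defaults from this PR. */
int			spock_keepalives_idle = 10;
int			spock_keepalives_interval = 5;
int			spock_keepalives_count = 3;

/*
 * Append the keepalive settings to parallel keys/vals arrays, starting at
 * index i. The values are rendered as strings into caller-provided
 * storage `bufs`, which must outlive the arrays. Returns the next free
 * index.
 */
int
fill_keepalive_params(const char **keys, const char **vals, int i,
					  char bufs[KEEPALIVE_PARAM_COUNT][KEEPALIVE_VALUE_LEN])
{
	snprintf(bufs[0], KEEPALIVE_VALUE_LEN, "%d", spock_keepalives_idle);
	snprintf(bufs[1], KEEPALIVE_VALUE_LEN, "%d", spock_keepalives_interval);
	snprintf(bufs[2], KEEPALIVE_VALUE_LEN, "%d", spock_keepalives_count);

	keys[i] = "keepalives_idle";
	vals[i++] = bufs[0];
	keys[i] = "keepalives_interval";
	vals[i++] = bufs[1];
	keys[i] = "keepalives_count";
	vals[i++] = bufs[2];

	return i;
}
```

Keeping the defaults at 10/5/3 preserves the behavior of this PR while letting high-latency deployments raise the values.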

ℹ️ Review info

Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 69aec04d-1d81-4de7-913f-14b9f06f8761

📥 Commits

Reviewing files that changed from the base of the PR and between 50764f4 and 1b0262a.

📒 Files selected for processing (3)

include/spock.h
src/spock.c
src/spock_apply.c

danolivo · 2026-03-04T09:53:21Z

src/spock_apply.c

```diff
+			 * kernel ACKs them, but no data is being sent.
 			 */
-			if (rc & WL_TIMEOUT)
+			if (rc & WL_TIMEOUT && spock_apply_idle_timeout > 0)
```


It seems that if the walsender simply has no data to send for a long time, the subscriber will restart. Am I wrong?

It would be better to modify the walsender slightly: skip keepalive messages while it is busy and rely on TCP status, but send keepalive messages when no data arrives from the WAL. In that case we would not need a subscriber-side GUC at all.

ibrarahmad · 2026-04-02 (reply)

It does a slightly different job. We rely on TCP keepalive for liveness; the idle timeout is only a guard. If the walsender is stuck at the application level while the kernel keeps answering TCP keepalive probes, there would otherwise be no way to restart it.

mason-sharp · 2026-03-24T20:52:26Z

@ibrarahmad Needs rebase

mason-sharp

Added a comment about having a non-zero wal_sender_timeout.

Also, needs a rebase.

Also, could use that test file.

src/spock.c

coderabbitai

♻️ Duplicate comments (1)

src/spock_apply.c (1)
2852-2863: ⚠️ Potential issue | 🟠 Major

This can reconnect an idle-but-healthy upstream.

last_receive_timestamp only moves when CopyData arrives. Because the replication connection now runs with wal_sender_timeout = 0, PostgreSQL's walsender returns early from WalSndKeepaliveIfNecessary() and won't send protocol keepalives, so a publisher that is simply caught up and idle can still hit this reconnect path after spock.apply_idle_timeout. As written, this safety net can't distinguish “hung” from “idle”; it needs a sender heartbeat that survives wal_sender_timeout = 0, or this GUC should default to 0. (doxygen.postgresql.org)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/spock_apply.c` around lines 2852 - 2863, The idle-timeout check using
last_receive_timestamp can disconnect a healthy but caught-up publisher because
CopyData isn't updated by walsender keepalives when wal_sender_timeout = 0;
update the logic so spock.apply_idle_timeout does not trigger in that case:
either change the GUC spock_apply_idle_timeout default to 0, or add a guard in
the timeout branch (the block that sets MySpockWorker->worker_status =
SPOCK_WORKER_STATUS_STOPPED and elog(ERROR, ...)) to skip reconnect when the
upstream is only sending WAL keepalives (i.e., detect/update a separate
last_keepalive_timestamp on WalSndKeepaliveIfNecessary/keepalive handling and
use that for idle detection), or check wal_sender_timeout and treat
spock_apply_idle_timeout as disabled when wal_sender_timeout == 0; update
references to last_receive_timestamp, CopyData handling, and the reconnect path
accordingly.
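
One of the options listed above, tracking sender keepalives separately from data, can be sketched as a pure function. This assumes the sender actually emits protocol keepalives, which according to the comment requires a non-zero wal_sender_timeout. The name `idle_timeout_expired` and the `last_keepalive_ms` timestamp are hypothetical and do not appear in src/spock_apply.c.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical idle check: report expiry only if neither data nor a sender
 * keepalive has been seen for idle_timeout_ms. All times are milliseconds
 * on the same clock. An idle_timeout_ms of 0 disables the check.
 */
bool
idle_timeout_expired(int64_t now_ms, int64_t last_data_ms,
					 int64_t last_keepalive_ms, int64_t idle_timeout_ms)
{
	int64_t		last_activity_ms;

	if (idle_timeout_ms <= 0)
		return false;

	/* Any traffic from the sender counts as a sign of life. */
	last_activity_ms = (last_data_ms > last_keepalive_ms)
		? last_data_ms
		: last_keepalive_ms;

	return now_ms - last_activity_ms >= idle_timeout_ms;
}
```

Under this model a caught-up publisher that keeps sending keepalives never trips the timeout, while a sender that goes completely silent still does.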

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 5f499ec7-1af5-4d57-bcd9-56189b254fe8

📥 Commits

Reviewing files that changed from the base of the PR and between 1b0262a and e4c970d.

📒 Files selected for processing (3)

include/spock.h
src/spock.c
src/spock_apply.c

✅ Files skipped from review due to trivial changes (1)

include/spock.h

🚧 Files skipped from review as they are similar to previous changes (1)

src/spock.c

mason-sharp · 2026-03-31T18:59:31Z

Did a rebase + one additional commit

codacy-production · 2026-03-31T18:59:55Z

Up to standards ✅

🟢 Issues 0 issues

Results:
0 new issues

🟢 Metrics 0 duplication

Metric Results

Duplication 0

coderabbitai

🧹 Nitpick comments (2)

docs/configuring.md (2)
194-209: Documentation looks good; consider adding a language specifier to the code block.

The guidance to set wal_sender_timeout to a conservative value like 5min is appropriate and aligns well with the new TCP keepalive-based liveness detection. The explanation of why PostgreSQL's default 60s is problematic for busy apply workers is clear and helpful.
📝 Optional: Add language specifier to code block

The code block at line 199 should specify a language for better rendering:

````diff
-```
+```conf
 wal_sender_timeout = '5min'
````

This addresses the markdownlint warning and improves documentation quality.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/configuring.md` around lines 194 - 209, Add a language specifier to the
fenced code block showing the postgresql.conf setting so the snippet renders
with proper highlighting; update the block that contains "wal_sender_timeout =
'5min'" to use a conf (or ini) code fence (e.g., ```conf) and ensure the snippet
still shows the postgresql.conf context and the wal_sender_timeout setting.

210-220: Add language specifier to code block for better rendering.

The documentation accurately describes spock.apply_idle_timeout with the correct default value of 300 seconds. The explanation of the safety net mechanism and timer reset behavior is clear.

Add a language specifier to the code example for improved rendering:
📝 Code block improvement

````diff
-```
+```conf
 spock.apply_idle_timeout = 300
 ```
````

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/configuring.md` around lines 210 - 220, Update the fenced code block
that shows the spock.apply_idle_timeout example to include a language specifier
by changing the opening fence to ```conf so the snippet
`spock.apply_idle_timeout = 300` is rendered with the "conf" language; locate
the code block near the `spock.apply_idle_timeout` description and replace the
current triple backtick opener with ```conf (keep the existing content and
closing backticks unchanged).

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: ed326f14-e345-446b-9fff-bff123ef2a88

📥 Commits

Reviewing files that changed from the base of the PR and between ed05b18925556944bdfd17a423ec7e2b3d29cd9e and e43ab707b9c82ab025b1de254a6c81693489f6d9.

📒 Files selected for processing (2)

docs/configuring.md
src/spock.c

🚧 Files skipped from review as they are similar to previous changes (1)

src/spock.c

ibrarahmad requested a review from mason-sharp March 4, 2026 06:55

coderabbitai bot reviewed Mar 4, 2026

danolivo reviewed Mar 4, 2026

mason-sharp reviewed Mar 29, 2026

coderabbitai bot reviewed Mar 30, 2026

mason-sharp force-pushed the FEED_BACK branch from e4c970d to ed05b18 March 31, 2026 18:58

coderabbitai bot reviewed Apr 2, 2026

Ibrar Ahmed and others added 6 commits April 3, 2026 09:27

Remove feedback_frequency GUC, not used

de48404

Remove wal_sender_timeout=0 from code.

c0d4423

Also document conservative wal_sender_timeout and apply_idle_timeout settings.

Quote paths in Makefile to handle spaces in directory names.

20eaf46

Remove no longer used maybe_send_feedback()

894ac52

Adjust recently added quoting

518a9d8

mason-sharp force-pushed the FEED_BACK branch from ff0c0fe to 518a9d8 April 3, 2026 16:50

mason-sharp approved these changes Apr 3, 2026

mason-sharp merged commit 8938d6f into main Apr 3, 2026
10 checks passed

mason-sharp deleted the FEED_BACK branch April 3, 2026 19:31
