[SS-187] Add support for replication from a physical postgres replica#37020
Conversation
5ffe092 to
39d0ca8
Compare
f3ddf99 to
2c96a04
Compare
| } | ||
| } | ||
|
|
||
| async fn ensure_physical_replica( |
There was a problem hiding this comment.
This check is modeled after how we check the "timeline_id" and will proactively error out the source if the replica status changes. My biggest concern would be if there's an unexpected transient moment where the primary returns true for pg_is_in_recovery, which would make the failure permanent.
That said, I'm pretty confident that a primary server should never return true for is_physical_replica -- even during crash recovery when it would be true, it doesn't accept connections.
All that said, this is a relatively risky point and I could be convinced to downgrade to a transient error, or to just log an error/warning to derisk the case where we unexpectedly get a true. I could also branch based on expected_is_physical_replica and be more restrictive only when it was recorded as a physical replica initially (the new case).
There was a problem hiding this comment.
Having done some research, this does not appear possible, so I have no objection leaving it as a definite error.
martykulma
left a comment
There was a problem hiding this comment.
Looking good! A couple of questions for inline. It doesn't look like there is test coverage for promoting the standby, is that possible to add?
| (Some(is_physical_replica), Some(is_physical_replica_other)) => { | ||
| is_physical_replica == is_physical_replica_other | ||
| } | ||
| (None, Some(_)) => true, |
There was a problem hiding this comment.
Should this allow the case of (None, Some(true))? I read that as we went from upstream is not a replica (because it was not possible) to replica.
There was a problem hiding this comment.
Oh, interesting point. I think starting out with an error seems reasonable and more defensive, so will do that - good catch.
|
|
||
| Err(PostgresError::Generic(anyhow::anyhow!( | ||
| "pg_current_wal_lsn() mysteriously has no value" | ||
| "WAL LSN mysteriously has no value" |
There was a problem hiding this comment.
If is_physical_standby path produces a None, should this call get_is_in_recovery(..) to determine if a failover has occurred and report a more descriptive error?
There was a problem hiding this comment.
Good catch, I'll add that! I actually missed that pg_last_wal_replay_lsn is nullable. I assumed it would error like pg_current_wal_lsn.
|
Added testing for standby promotion. I added periodic checking (with the schema checking code) to produce definite errors more reliably on promotion. |
| (None, Some(is_physical_replica_other)) => !is_physical_replica_other, | ||
| // New values must always have is_physical_replica | ||
| (_, None) => false, | ||
| }, |
There was a problem hiding this comment.
per shindig: ALTER SOURCE won't run pg_is_in_recovery(), so a PG source created at an older version would fail to alter after upgrade to this version.
need to allow None->None:
(Some, Some) => cur == new // all the things are upgraded
(None, _) => true // upgrade case
(Some, None) => false // should never happen
if we can capture in an upgrade test, that would be great!
There was a problem hiding this comment.
Made this change, did a slightly more verbose version I thought was clearer.
Unfortunately, I wasn't able to get a test reproduction because we don't support changes to any of the sibling fields of the source details, so on ALTER SOURCE we skip the deeper compatibility check because the before/after objects are equal (and there's an equality check upstream of this check). I left an ALTER SOURCE check in the cross-version pg cdc test anyway, but it didn't cause any issues.
Instead, I added a unit test.
martykulma
left a comment
There was a problem hiding this comment.
CI still in flight, but I don't see any issues in the PR - great job!
🐟
Motivation
We would like to be able to set up replication against a physical Postgres replica. This fails hard today because you can't read pg_current_wal_lsn from a standby.
Relevant issue: https://linear.app/materializeinc/issue/SS-187/replication-from-postgres-replica
This is adapted from #36923.
Description
Tradeoffs to call out
hot_standby_feedback=onon the replica risks snapshots timing out (potentially quickly) and increases the risk of the replication slot being terminated.hot_standby_feedback=onthis risks bloating the disk of the primary. I think setting max_slot_wal_keep_size to a reasonably high level mitigates the scariest versions of this. This style of tradeoff seems somewhat similar to existing trade-offs around max_slot_wall_keep_size settings.Verification