Protocol Version Simulation
Date: 2026-03-26
Status: design proposal
Purpose: define how the simulator should model WAL V1, WAL V1.5 (Phase 13), and WAL V2 on the same scenario set
Why This Exists
The simulator is more valuable if the same scenario can answer:
- how WAL V1 behaves
- how WAL V1.5 behaves
- how WAL V2 should behave
That turns the simulator into:
- a regression tool for V1/V1.5
- a justification tool for V2
- a comparison framework across protocol generations
Principle
Do not fork three separate simulators.
Instead:
- keep one simulator core
- add protocol-version behavior modes
- run the same named scenario under different modes
Proposed Versions
ProtocolV1
Intent:
- represent pre-Phase-13 behavior
Behavior shape:
- WAL is streamed optimistically
- lagging replica is degraded/excluded quickly
- no real short-gap catch-up contract
- no retention-backed recovery window
- replica usually falls toward rebuild rather than incremental recovery
What scenarios should expose:
- short outage still causes unnecessary degrade/rebuild
- transient jitter may be over-penalized
- no graceful rejoin path
ProtocolV15
Intent:
- represent Phase-13 WAL V1.5 behavior
Behavior shape:
- reconnect handshake exists
- WAL catch-up exists
- primary may retain WAL longer for lagging replica
- recovery still depends heavily on address stability and control-plane timing
- catch-up may still tail-chase or stall operationally
What scenarios should expose:
- transient disconnects may recover
- a restart with a new receiver address may still fail to recover in practice
- tail-chasing / retention pressure remain structural risks
ProtocolV2
Intent:
- represent the target design
Behavior shape:
- explicit recovery reservation
- explicit catch-up vs rebuild boundary
- lineage-first promotion
- version-correct recovery sources
- explicit abort/rebuild path on non-convergence or lost recoverability
What scenarios should show:
- short gap recovers cleanly
- impossible catch-up fails cleanly
- rebuild is explicit, not accidental
Behavior Axes To Toggle
The simulator does not need completely different code paths. It needs protocol-version-sensitive policy on these axes:
1. Lagging replica treatment
V1:
- degrade quickly
- no meaningful WAL catch-up window
V1.5:
- allow WAL catch-up while history remains available
V2:
- allow catch-up only with explicit recoverability / reservation
2. WAL retention / recoverability
V1:
- little or no retention for lagging-replica recovery
V1.5:
- retention-based recovery window
- but no strong reservation contract
V2:
- recoverability check plus reservation
3. Restart / address stability
V1:
- generally poor rejoin path
V1.5:
- reconnect may work only if replica address is stable
V2:
- address/identity assumptions should be explicit in the model
4. Tail-chasing behavior
V1:
- usually degrades rather than catches up
V1.5:
- catch-up may be attempted but may never converge
V2:
- non-convergence should explicitly abort/escalate
5. Promotion policy
V1:
- weaker lineage reasoning
V1.5:
- improved epoch/LSN handling
V2:
- lineage-first promotion is a first-class rule
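One way to keep these axes out of scattered conditionals is a per-version policy table. The sketch below is illustrative only: the `AxisPolicy` struct, its field names, and `PolicyFor` are hypothetical, and the boolean values are a coarse reading of the axes above rather than settled behavior.

```go
package main

// ProtocolVersion mirrors the version enum proposed in this document.
type ProtocolVersion string

const (
	ProtocolV1  ProtocolVersion = "v1"
	ProtocolV15 ProtocolVersion = "v1_5"
	ProtocolV2  ProtocolVersion = "v2"
)

// AxisPolicy is a hypothetical, coarse encoding of the five behavior axes.
type AxisPolicy struct {
	CatchupAllowed        bool // axis 1: a lagging replica may attempt WAL catch-up
	RetentionWindow       bool // axis 2: the primary retains WAL for a lagging replica
	RequiresReservation   bool // axis 2: catch-up needs an explicit reservation
	ExplicitIdentityModel bool // axis 3: address/identity assumptions are modeled
	AbortOnNonConvergence bool // axis 4: tail-chasing is detected and escalated
	LineageFirstPromotion bool // axis 5: promotion is decided by lineage first
}

// policies holds one row per protocol generation (assumed values).
var policies = map[ProtocolVersion]AxisPolicy{
	ProtocolV1:  {}, // everything off: degrade quickly, no recovery window
	ProtocolV15: {CatchupAllowed: true, RetentionWindow: true},
	ProtocolV2: {
		CatchupAllowed:        true,
		RetentionWindow:       true,
		RequiresReservation:   true,
		ExplicitIdentityModel: true,
		AbortOnNonConvergence: true,
		LineageFirstPromotion: true,
	},
}

// PolicyFor returns the policy row for p; ok is false for an unknown version.
func PolicyFor(p ProtocolVersion) (AxisPolicy, bool) {
	pol, ok := policies[p]
	return pol, ok
}
```

A table like this makes the difference between generations visible in one place. Real policies would likely need more than booleans, for example retention limits or timeouts.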
Recommended Simulator API
Add a version enum, for example:
    type ProtocolVersion string

    const (
        ProtocolV1  ProtocolVersion = "v1"
        ProtocolV15 ProtocolVersion = "v1_5"
        ProtocolV2  ProtocolVersion = "v2"
    )
Attach it to the simulator or cluster:
    type Cluster struct {
        Protocol ProtocolVersion // selects the protocol-version behavior mode
        // ...
    }
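With the version attached to the cluster, running one named scenario under every mode could look roughly like the following. This is a minimal sketch: `Scenario`, its `Run` signature, the string outcome label, and `RunAcrossVersions` are hypothetical names, and the `Cluster` here carries only the `Protocol` field.

```go
package main

// ProtocolVersion mirrors the version enum proposed in this document.
type ProtocolVersion string

const (
	ProtocolV1  ProtocolVersion = "v1"
	ProtocolV15 ProtocolVersion = "v1_5"
	ProtocolV2  ProtocolVersion = "v2"
)

// Cluster carries the protocol mode; all other simulator state is omitted.
type Cluster struct {
	Protocol ProtocolVersion
}

// Scenario is a hypothetical named scenario that drives a cluster and
// returns a short outcome label.
type Scenario struct {
	Name string
	Run  func(c *Cluster) string
}

// RunAcrossVersions runs the same scenario once per protocol mode, each
// time on a fresh cluster, and collects the outcome per version.
func RunAcrossVersions(s Scenario) map[ProtocolVersion]string {
	results := make(map[ProtocolVersion]string)
	for _, p := range []ProtocolVersion{ProtocolV1, ProtocolV15, ProtocolV2} {
		c := &Cluster{Protocol: p}
		results[p] = s.Run(c)
	}
	return results
}
```

Building a fresh cluster per mode keeps the runs independent, so differences in the results can be attributed to the protocol mode alone.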
Policy Hooks
Rather than branching everywhere, centralize the differences in a few hooks:
- CanAttemptCatchup(...)
- CatchupConvergencePolicy(...)
- RecoverabilityPolicy(...)
- RestartRejoinPolicy(...)
- PromotionPolicy(...)
That keeps the simulator readable.
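As an illustration of one such hook, `CanAttemptCatchup` might reduce to a single switch over the version. The parameter list is left open in this proposal, so the `CatchupRequest` struct and its two fields are assumptions made for the sketch.

```go
package main

// ProtocolVersion mirrors the version enum proposed in this document.
type ProtocolVersion string

const (
	ProtocolV1  ProtocolVersion = "v1"
	ProtocolV15 ProtocolVersion = "v1_5"
	ProtocolV2  ProtocolVersion = "v2"
)

// CatchupRequest is a hypothetical input describing the replica's situation.
type CatchupRequest struct {
	HistoryAvailable bool // the primary still holds the WAL the replica is missing
	Reserved         bool // an explicit recovery reservation is held (V2 concept)
}

// CanAttemptCatchup centralizes the "lagging replica treatment" axis.
func CanAttemptCatchup(p ProtocolVersion, r CatchupRequest) bool {
	switch p {
	case ProtocolV1:
		// V1: degrade quickly, no meaningful catch-up window.
		return false
	case ProtocolV15:
		// V1.5: catch-up is allowed while history remains available.
		return r.HistoryAvailable
	case ProtocolV2:
		// V2: catch-up needs both recoverable history and a reservation.
		return r.HistoryAvailable && r.Reserved
	default:
		// Unknown versions are treated conservatively.
		return false
	}
}
```

Callers then ask the hook instead of checking the version themselves, which keeps version branching out of the simulator core.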
Example Scenario Comparisons
Scenario: brief disconnect
V1:
- likely degrade / no efficient catch-up
V1.5:
- catch-up may succeed if address/history remain stable
V2:
- explicit recoverability + reservation
- catch-up only if the missing window is still recoverable
- otherwise explicit rebuild
Scenario: replica restart with new receiver port
V1:
- poor recovery path
V1.5:
- background reconnect fails if it retries stale address
V2:
- identity/address model must make this explicit
- direct reconnect is not assumed
- use explicit reassignment plus catch-up if recoverable, otherwise rebuild cleanly
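The restart scenario could be modeled by a `RestartRejoinPolicy` hook along the following lines. This is a sketch under assumptions: the `RejoinAction` values and the two boolean inputs are hypothetical, and V1's "poor recovery path" is modeled here simply as a rebuild.

```go
package main

// ProtocolVersion mirrors the version enum proposed in this document.
type ProtocolVersion string

const (
	ProtocolV1  ProtocolVersion = "v1"
	ProtocolV15 ProtocolVersion = "v1_5"
	ProtocolV2  ProtocolVersion = "v2"
)

// RejoinAction is a hypothetical label for what happens after a restart.
type RejoinAction string

const (
	RejoinRebuild          RejoinAction = "rebuild"
	RejoinReconnectCatchup RejoinAction = "reconnect_catchup"
	RejoinStalled          RejoinAction = "stalled" // retries a stale address
	RejoinReassignCatchup  RejoinAction = "reassign_catchup"
)

// RestartRejoinPolicy decides the rejoin path for a restarted replica.
// addressChanged: the replica came back on a new receiver address.
// recoverable: the missing WAL window can still be served.
func RestartRejoinPolicy(p ProtocolVersion, addressChanged, recoverable bool) RejoinAction {
	switch p {
	case ProtocolV15:
		if addressChanged {
			// Background reconnect keeps retrying the stale address.
			return RejoinStalled
		}
		if recoverable {
			return RejoinReconnectCatchup
		}
		return RejoinRebuild
	case ProtocolV2:
		if !recoverable {
			return RejoinRebuild // explicit rebuild, not accidental
		}
		if addressChanged {
			return RejoinReassignCatchup // direct reconnect is not assumed
		}
		return RejoinReconnectCatchup
	default:
		// V1 (and unknown versions): assumed to fall back to rebuild.
		return RejoinRebuild
	}
}
```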
Scenario: primary writes faster than catch-up
V1:
- replica degrades
V1.5:
- may tail-chase indefinitely or pin WAL too long
V2:
- explicit non-convergence detection -> abort / rebuild
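The tail-chasing case can be made concrete with a convergence hook. The sketch below assumes the simulator samples the replica's WAL gap periodically. The `Decision` type, the `gaps` slice, and the `maxStalled` threshold are hypothetical, and "the gap failed to shrink for `maxStalled` consecutive samples" is only one possible definition of non-convergence.

```go
package main

// ProtocolVersion mirrors the version enum proposed in this document.
type ProtocolVersion string

const (
	ProtocolV1  ProtocolVersion = "v1"
	ProtocolV15 ProtocolVersion = "v1_5"
	ProtocolV2  ProtocolVersion = "v2"
)

// Decision is a hypothetical result of the convergence check.
type Decision string

const (
	Degrade         Decision = "degrade"
	ContinueCatchup Decision = "continue_catchup"
	AbortAndRebuild Decision = "abort_and_rebuild"
)

// CatchupConvergencePolicy inspects successive gap samples (in WAL entries
// or bytes, oldest first) and decides what the protocol does next.
func CatchupConvergencePolicy(p ProtocolVersion, gaps []uint64, maxStalled int) Decision {
	switch p {
	case ProtocolV15:
		// V1.5 has no convergence check, so it may tail-chase indefinitely.
		return ContinueCatchup
	case ProtocolV2:
		stalled := 0 // consecutive samples in which the gap did not shrink
		for i := 1; i < len(gaps); i++ {
			if gaps[i] < gaps[i-1] {
				stalled = 0
				continue
			}
			stalled++
			if stalled >= maxStalled {
				return AbortAndRebuild
			}
		}
		return ContinueCatchup
	default:
		// V1 (and unknown versions): degrade rather than catch up.
		return Degrade
	}
}
```

For a growing gap such as `[]uint64{100, 120, 130, 150}` with `maxStalled` set to 3, the V2 branch aborts, while the V1.5 branch keeps chasing.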
What To Measure
For each scenario, compare:
- does committed data remain safe?
- does uncommitted data stay out of committed lineage?
- does recovery complete or stall?
- does the protocol choose catch-up or rebuild?
- is the outcome explicit or accidental?
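These questions map naturally onto a small result record that each run fills in. The following is a sketch only: the `Outcome` struct, its field names, and the `Diff` helper are hypothetical and not part of any existing simulator API.

```go
package main

// Outcome is a hypothetical per-run record answering the questions above.
type Outcome struct {
	CommittedSafe       bool   // committed data remained safe
	UncommittedExcluded bool   // uncommitted data stayed out of committed lineage
	RecoveryCompleted   bool   // recovery completed rather than stalled
	Path                string // "catchup", "rebuild", or "" if neither was chosen
	Explicit            bool   // the outcome was an explicit protocol decision
}

// Diff returns the names of the fields on which a and b disagree, in
// declaration order, so two protocol modes can be compared per scenario.
func Diff(a, b Outcome) []string {
	var d []string
	if a.CommittedSafe != b.CommittedSafe {
		d = append(d, "CommittedSafe")
	}
	if a.UncommittedExcluded != b.UncommittedExcluded {
		d = append(d, "UncommittedExcluded")
	}
	if a.RecoveryCompleted != b.RecoveryCompleted {
		d = append(d, "RecoveryCompleted")
	}
	if a.Path != b.Path {
		d = append(d, "Path")
	}
	if a.Explicit != b.Explicit {
		d = append(d, "Explicit")
	}
	return d
}
```

Comparing the V1.5 and V2 records for the same scenario then yields a short list of the fields where the two generations differ.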
Immediate Next Step
Start with a minimal versioned policy layer:
- add ProtocolVersion
- implement one or two version-sensitive hooks: CanAttemptCatchup, CatchupConvergencePolicy
- run existing scenarios under: ProtocolV1, ProtocolV15, ProtocolV2
That is enough to begin proving:
- V1 breaks
- V1.5 improves but still strains
- V2 handles the same scenario more cleanly
Bottom Line
The same scenario set should become a comparison harness across protocol generations.
That is one of the strongest uses of the simulator:
- not only "does V2 work?"
- but "why is V2 better than V1 and V1.5?"