# Protocol Version Simulation

Date: 2026-03-26
Status: design proposal
Purpose: define how the simulator should model WAL V1, WAL V1.5 (Phase 13), and WAL V2 on the same scenario set

## Why This Exists

The simulator is more valuable if one scenario can answer three questions:

1. how WAL V1 behaves
2. how WAL V1.5 behaves
3. how WAL V2 should behave

That turns the simulator into:
- a regression tool for V1/V1.5
- a justification tool for V2
- a comparison framework across protocol generations

## Principle

Do not fork three separate simulators.

Instead:
- keep one simulator core
- add protocol-version behavior modes
- run the same named scenario under different modes

## Proposed Versions

### `ProtocolV1`

Intent:
- represent pre-Phase-13 behavior

Behavior shape:
- WAL is streamed optimistically
- lagging replica is degraded/excluded quickly
- no real short-gap catch-up contract
- no retention-backed recovery window
- replica usually falls toward rebuild rather than incremental recovery

What scenarios should expose:
- short outage still causes unnecessary degrade/rebuild
- transient jitter may be over-penalized
- poor graceful rejoin story

### `ProtocolV15`

Intent:
- represent Phase-13 WAL V1.5 behavior

Behavior shape:
- reconnect handshake exists
- WAL catch-up exists
- primary may retain WAL longer for lagging replica
- recovery still depends heavily on address stability and control-plane timing
- catch-up may still tail-chase or stall operationally

What scenarios should expose:
- transient disconnects may recover
- restart with new receiver address may still fail practical recovery
- tail-chasing / retention pressure remain structural risks

### `ProtocolV2`

Intent:
- represent the target design

Behavior shape:
- explicit recovery reservation
- explicit catch-up vs rebuild boundary
- lineage-first promotion
- version-correct recovery sources
- explicit abort/rebuild path on non-convergence or lost recoverability

What scenarios should show:
- short gap recovers cleanly
- impossible catch-up fails cleanly
- rebuild is explicit, not accidental

## Behavior Axes To Toggle

The simulator does not need completely different code paths.
It needs protocol-version-sensitive policy on these axes:

### 1. Lagging replica treatment

`V1`:
- degrade quickly
- no meaningful WAL catch-up window

`V1.5`:
- allow WAL catch-up while history remains available

`V2`:
- allow catch-up only with explicit recoverability / reservation
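
A minimal sketch of how this axis could collapse into the `CanAttemptCatchup` hook. The `LagState` type and its two fields are assumptions made for illustration; the real simulator state will carry more detail.

```go
package main

// ProtocolVersion mirrors the enum proposed under "Recommended Simulator API".
type ProtocolVersion string

const (
	ProtocolV1  ProtocolVersion = "v1"
	ProtocolV15 ProtocolVersion = "v1_5"
	ProtocolV2  ProtocolVersion = "v2"
)

// LagState is a hypothetical snapshot of a lagging replica as seen by the primary.
type LagState struct {
	HistoryAvailable bool // the WAL range the replica is missing is still retained
	ReservationHeld  bool // an explicit recovery reservation covers that range
}

// CanAttemptCatchup reports whether the simulator should try WAL catch-up
// for a lagging replica instead of degrading it.
func CanAttemptCatchup(v ProtocolVersion, s LagState) bool {
	switch v {
	case ProtocolV1:
		// V1: degrade quickly, no meaningful catch-up window.
		return false
	case ProtocolV15:
		// V1.5: catch-up is allowed while history remains available.
		return s.HistoryAvailable
	case ProtocolV2:
		// V2: history alone is not enough; a reservation must back it.
		return s.HistoryAvailable && s.ReservationHeld
	default:
		// Unknown modes take the most conservative path.
		return false
	}
}
```

With `LagState{HistoryAvailable: true}` and no reservation, only `ProtocolV15` attempts catch-up. That input is the one that separates the three modes.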

### 2. WAL retention / recoverability

`V1`:
- little or no retention for lagging-replica recovery

`V1.5`:
- retention-based recovery window
- but no strong reservation contract

`V2`:
- recoverability check plus reservation

### 3. Restart / address stability

`V1`:
- generally poor rejoin path

`V1.5`:
- reconnect may work only if replica address is stable

`V2`:
- address/identity assumptions should be explicit in the model
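
One possible shape for the `RestartRejoinPolicy` hook, assuming the simulator can tell whether the replica's address changed across the restart and whether its missing WAL is still recoverable. The `RestartEvent` type and the outcome names are hypothetical. The sketch also assumes V2 always goes through explicit reassignment rather than direct reconnect.

```go
package main

// ProtocolVersion mirrors the enum proposed under "Recommended Simulator API".
type ProtocolVersion string

const (
	ProtocolV1  ProtocolVersion = "v1"
	ProtocolV15 ProtocolVersion = "v1_5"
	ProtocolV2  ProtocolVersion = "v2"
)

// RejoinOutcome names what the simulator does with a restarted replica.
type RejoinOutcome string

const (
	RejoinRebuild   RejoinOutcome = "rebuild"
	RejoinReconnect RejoinOutcome = "reconnect"
	RejoinStuck     RejoinOutcome = "stuck_on_stale_address"
	RejoinReassign  RejoinOutcome = "reassign_then_catchup"
)

// RestartEvent is a hypothetical description of a replica restart.
type RestartEvent struct {
	AddressChanged bool // the replica came back on a new receiver address
	Recoverable    bool // the WAL it is missing is still available
}

// RestartRejoinPolicy decides how a restarted replica rejoins.
func RestartRejoinPolicy(v ProtocolVersion, e RestartEvent) RejoinOutcome {
	switch v {
	case ProtocolV15:
		// V1.5 reconnects to the last known address, so a changed
		// address leaves the primary retrying a dead endpoint.
		if e.AddressChanged {
			return RejoinStuck
		}
		return RejoinReconnect
	case ProtocolV2:
		// V2 does not assume direct reconnect: identity is reassigned
		// explicitly, then catch-up runs only if it is still possible.
		if e.Recoverable {
			return RejoinReassign
		}
		return RejoinRebuild
	default:
		// V1 (and unknown modes): no usable rejoin path.
		return RejoinRebuild
	}
}
```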

### 4. Tail-chasing behavior

`V1`:
- usually degrades rather than catches up

`V1.5`:
- catch-up may be attempted but may never converge

`V2`:
- non-convergence should explicitly abort/escalate
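
Tail-chasing is easy to reproduce with a tick-based gap model. The sketch below is an illustration under assumed units (LSNs per tick) and an assumed abort rule (no progress for `noProgressLimit` consecutive ticks); neither is taken from the V2 design itself.

```go
package main

// ProtocolVersion mirrors the enum proposed under "Recommended Simulator API".
type ProtocolVersion string

const (
	ProtocolV1  ProtocolVersion = "v1"
	ProtocolV15 ProtocolVersion = "v1_5"
	ProtocolV2  ProtocolVersion = "v2"
)

// CatchupResult names how a catch-up attempt ended.
type CatchupResult string

const (
	CatchupDegraded   CatchupResult = "degraded"
	CatchupConverged  CatchupResult = "converged"
	CatchupUnfinished CatchupResult = "not_converged_within_horizon"
	CatchupAborted    CatchupResult = "aborted_to_rebuild"
)

// SimulateCatchup advances a replica's LSN gap one tick at a time.
// Each tick the primary adds writeRate and the replica replays
// catchupRate. horizon bounds the simulation; noProgressLimit is the
// hypothetical V2 threshold of consecutive ticks without a shrinking gap.
func SimulateCatchup(v ProtocolVersion, gap, writeRate, catchupRate, horizon, noProgressLimit int) CatchupResult {
	if v == ProtocolV1 {
		// V1 degrades instead of attempting catch-up.
		return CatchupDegraded
	}
	stalled := 0
	for tick := 0; tick < horizon; tick++ {
		prev := gap
		gap += writeRate - catchupRate
		if gap <= 0 {
			return CatchupConverged
		}
		if gap >= prev {
			stalled++
		} else {
			stalled = 0
		}
		// Only V2 turns sustained non-convergence into an explicit abort.
		if v == ProtocolV2 && stalled >= noProgressLimit {
			return CatchupAborted
		}
	}
	// V1.5 ends up here when it tail-chases for the whole horizon.
	return CatchupUnfinished
}
```

With `writeRate == catchupRate`, `ProtocolV15` runs to the end of the horizon without converging, while `ProtocolV2` aborts after `noProgressLimit` ticks.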

### 5. Promotion policy

`V1`:
- weaker lineage reasoning

`V1.5`:
- improved epoch/LSN handling

`V2`:
- lineage-first promotion is a first-class rule
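
A rough sketch of how the three promotion rules could differ. The concrete orderings are assumptions chosen to make the contrast visible: V1 as "highest LSN wins", V1.5 as "highest epoch, then LSN", and V2 as "filter to the committed lineage first". The `Candidate` type and its `OnCommittedLineage` flag are hypothetical, and a real lineage check would be derived rather than given.

```go
package main

// ProtocolVersion mirrors the enum proposed under "Recommended Simulator API".
type ProtocolVersion string

const (
	ProtocolV1  ProtocolVersion = "v1"
	ProtocolV15 ProtocolVersion = "v1_5"
	ProtocolV2  ProtocolVersion = "v2"
)

// Candidate is a hypothetical promotion candidate.
type Candidate struct {
	ID                 string
	Epoch              uint64
	LSN                uint64
	OnCommittedLineage bool // history descends from the committed lineage
}

// ahead reports whether a should be preferred over b.
// V1 compares LSN only; every other mode compares epoch, then LSN.
func ahead(v ProtocolVersion, a, b Candidate) bool {
	if v != ProtocolV1 && a.Epoch != b.Epoch {
		return a.Epoch > b.Epoch
	}
	return a.LSN > b.LSN
}

// PromotionPolicy picks the replica to promote, or reports false if
// no candidate is eligible under the given protocol version.
func PromotionPolicy(v ProtocolVersion, cands []Candidate) (Candidate, bool) {
	var best Candidate
	found := false
	for _, c := range cands {
		// Lineage-first: V2 rejects off-lineage candidates before ranking.
		if v == ProtocolV2 && !c.OnCommittedLineage {
			continue
		}
		if !found || ahead(v, c, best) {
			best, found = c, true
		}
	}
	return best, found
}
```

A scenario with one replica holding an uncommitted tail (highest LSN) and another on a diverged newer epoch separates all three modes: each picks a different candidate.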

## Recommended Simulator API

Add a version enum, for example:

```go
type ProtocolVersion string

const (
    ProtocolV1  ProtocolVersion = "v1"
    ProtocolV15 ProtocolVersion = "v1_5"
    ProtocolV2  ProtocolVersion = "v2"
)
```

Attach it to the simulator or cluster:

```go
type Cluster struct {
    Protocol ProtocolVersion
    // ... other cluster state
}
```

## Policy Hooks

Rather than branching everywhere, centralize the differences in a few hooks:

1. `CanAttemptCatchup(...)`
2. `CatchupConvergencePolicy(...)`
3. `RecoverabilityPolicy(...)`
4. `RestartRejoinPolicy(...)`
5. `PromotionPolicy(...)`

That keeps the version differences in one place and the simulator core readable.
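
One way to centralize the differences is a single per-version table of knobs that the hooks consult. This is a sketch only: the `Policy` struct, its field names, and `PolicyFor` are assumptions derived from the five behavior axes, not an agreed design.

```go
package main

import "fmt"

// ProtocolVersion mirrors the enum proposed above.
type ProtocolVersion string

const (
	ProtocolV1  ProtocolVersion = "v1"
	ProtocolV15 ProtocolVersion = "v1_5"
	ProtocolV2  ProtocolVersion = "v2"
)

// Policy collects the version-sensitive knobs the hooks would read.
// Field names are hypothetical; each maps to one behavior axis.
type Policy struct {
	AllowCatchup             bool // axis 1: lagging replica treatment
	RequireReservation       bool // axis 2: recoverability contract
	RejoinNeedsStableAddress bool // axis 3: restart / address stability
	AbortOnNonConvergence    bool // axis 4: tail-chasing behavior
	LineageFirstPromotion    bool // axis 5: promotion policy
}

// PolicyFor returns the knob set for a protocol version. It is the
// only place in this sketch that switches on the version.
func PolicyFor(v ProtocolVersion) (Policy, error) {
	switch v {
	case ProtocolV1:
		// All knobs off: no catch-up, so the other knobs never apply.
		return Policy{}, nil
	case ProtocolV15:
		return Policy{
			AllowCatchup:             true,
			RejoinNeedsStableAddress: true,
		}, nil
	case ProtocolV2:
		return Policy{
			AllowCatchup:          true,
			RequireReservation:    true,
			AbortOnNonConvergence: true,
			LineageFirstPromotion: true,
		}, nil
	default:
		return Policy{}, fmt.Errorf("unknown protocol version %q", v)
	}
}
```

A hook such as `CanAttemptCatchup` would then read `AllowCatchup` and `RequireReservation` from the cluster's `Policy` instead of switching on `Cluster.Protocol` directly.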

## Example Scenario Comparisons

### Scenario: brief disconnect

`V1`:
- likely degrade / no efficient catch-up

`V1.5`:
- catch-up may succeed if the replica address stays stable and the missing WAL history is still retained

`V2`:
- explicit recoverability + reservation
- catch-up only if the missing window is still recoverable
- otherwise explicit rebuild

### Scenario: replica restart with new receiver port

`V1`:
- poor recovery path

`V1.5`:
- background reconnect fails if it keeps retrying the stale address

`V2`:
- identity/address model must make this explicit
- direct reconnect is not assumed
- use explicit reassignment plus catch-up if recoverable, otherwise rebuild cleanly

### Scenario: primary writes faster than catch-up

`V1`:
- replica degrades

`V1.5`:
- may tail-chase indefinitely or pin WAL too long

`V2`:
- explicit non-convergence detection -> abort / rebuild

## What To Measure

For each scenario, compare:

1. does committed data remain safe?
2. does uncommitted data stay out of committed lineage?
3. does recovery complete or stall?
4. does the protocol choose catch-up or rebuild?
5. is the outcome explicit or accidental?
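
The five questions map naturally onto a per-run record. The sketch below is one possible shape; the `Outcome` type, its field names, and the `Acceptable` rule are assumptions for illustration.

```go
package main

// RecoveryResult answers "does recovery complete or stall?".
type RecoveryResult string

const (
	RecoveryCompleted RecoveryResult = "completed"
	RecoveryStalled   RecoveryResult = "stalled"
)

// RecoveryPath answers "does the protocol choose catch-up or rebuild?".
type RecoveryPath string

const (
	PathCatchup RecoveryPath = "catch_up"
	PathRebuild RecoveryPath = "rebuild"
)

// Outcome is a hypothetical record of one scenario run under one
// protocol version, with one field per question in the list above.
type Outcome struct {
	CommittedDataSafe   bool           // 1. committed data remained safe
	UncommittedExcluded bool           // 2. uncommitted data stayed out of committed lineage
	Recovery            RecoveryResult // 3. recovery completed or stalled
	Path                RecoveryPath   // 4. catch-up or rebuild
	Explicit            bool           // 5. outcome was chosen, not stumbled into
}

// Safe covers the two correctness questions only.
func (o Outcome) Safe() bool {
	return o.CommittedDataSafe && o.UncommittedExcluded
}

// Acceptable is an assumed bar for a clean run: safe, finished, and
// explicit. Either recovery path can be acceptable.
func (o Outcome) Acceptable() bool {
	return o.Safe() && o.Recovery == RecoveryCompleted && o.Explicit
}
```

Keeping `Safe` separate from `Acceptable` lets a report distinguish a run that was correct but operationally poor (for example, a safe but accidental rebuild) from one that violated safety.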

## Immediate Next Step

Start with a minimal versioned policy layer:

1. add `ProtocolVersion`
2. implement one or two version-sensitive hooks:
   - `CanAttemptCatchup`
   - `CatchupConvergencePolicy`
3. run existing scenarios under:
   - `ProtocolV1`
   - `ProtocolV15`
   - `ProtocolV2`

That is enough to begin showing that:
- V1 breaks
- V1.5 improves but still strains
- V2 handles the same scenario more cleanly
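
Step 3 can be sketched as a small runner that executes one named scenario under each mode and reports the results side by side. The `Scenario` type, `RunAcrossVersions`, and `Report` are hypothetical names; `Run` stands in for whatever drives the simulator core.

```go
package main

import (
	"fmt"
	"strings"
)

// ProtocolVersion mirrors the enum proposed under "Recommended Simulator API".
type ProtocolVersion string

const (
	ProtocolV1  ProtocolVersion = "v1"
	ProtocolV15 ProtocolVersion = "v1_5"
	ProtocolV2  ProtocolVersion = "v2"
)

// AllVersions fixes the order in which modes are run and reported.
var AllVersions = []ProtocolVersion{ProtocolV1, ProtocolV15, ProtocolV2}

// Scenario is a hypothetical named scenario. Run executes it under one
// protocol version and returns a short outcome label.
type Scenario struct {
	Name string
	Run  func(v ProtocolVersion) string
}

// RunAcrossVersions runs the same scenario once per protocol version.
func RunAcrossVersions(s Scenario) map[ProtocolVersion]string {
	results := make(map[ProtocolVersion]string, len(AllVersions))
	for _, v := range AllVersions {
		results[v] = s.Run(v)
	}
	return results
}

// Report renders the results in a stable, version-ordered layout.
func Report(s Scenario, results map[ProtocolVersion]string) string {
	var b strings.Builder
	fmt.Fprintf(&b, "scenario: %s\n", s.Name)
	for _, v := range AllVersions {
		fmt.Fprintf(&b, "  %-5s %s\n", v, results[v])
	}
	return b.String()
}
```

Because the scenario body is shared and only the version argument changes, any difference in the report comes from the policy layer rather than from scenario setup.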

## Bottom Line

The same scenario set should become a comparison harness across protocol generations.

That is one of the strongest uses of the simulator:
- not only "does V2 work?"
- but "why is V2 better than V1 and V1.5?"