
feat: add V2 protocol simulator and enginev2 sender/session prototype

Adds sw-block/ directory with:

- distsim: protocol correctness simulator (96 tests)
  - cluster model with epoch fencing, barrier semantics, commit modes
  - endpoint identity, control-plane flow, candidate eligibility
  - timeout events, timer races, same-tick ordering
  - session ownership tracking with ID-based stale fencing

- enginev2: standalone V2 sender/session implementation (63 tests)
  - per-replica Sender with identity-preserving reconciliation
  - RecoverySession with FSM phase transitions and session ID
  - execution APIs: BeginConnect, RecordHandshake, BeginCatchUp,
    RecordCatchUpProgress, CompleteSessionByID — all sender-authority-gated
  - recovery outcome branching: zero-gap, catch-up, needs-rebuild
  - assignment-intent orchestration with epoch fencing

- design docs: acceptance criteria, open questions, first-slice spec,
  protocol development process

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Branch: feature/sw-block
Author: pingqiu, 2 days ago
Commit: edec7098e8
  1. sw-block/.gocache_v2/README (+4)
  2. sw-block/.gocache_v2/trim.txt (+1)
  3. sw-block/.private/README.md (+27)
  4. sw-block/.private/phase/README.md (+36)
  5. sw-block/.private/phase/phase-01-decisions.md (+97)
  6. sw-block/.private/phase/phase-01-log.md (+67)
  7. sw-block/.private/phase/phase-01-v2-scenarios.md (+11)
  8. sw-block/.private/phase/phase-01.md (+164)
  9. sw-block/.private/phase/phase-02-decisions.md (+51)
  10. sw-block/.private/phase/phase-02-log.md (+93)
  11. sw-block/.private/phase/phase-02.md (+191)
  12. sw-block/.private/phase/phase-03-decisions.md (+97)
  13. sw-block/.private/phase/phase-03-log.md (+36)
  14. sw-block/.private/phase/phase-03.md (+193)
  15. sw-block/.private/phase/phase-04-decisions.md (+97)
  16. sw-block/.private/phase/phase-04-log.md (+46)
  17. sw-block/.private/phase/phase-04.md (+153)
  18. sw-block/.private/phase/phase-04a-decisions.md (+49)
  19. sw-block/.private/phase/phase-04a-log.md (+22)
  20. sw-block/.private/phase/phase-04a.md (+113)
  21. sw-block/README.md (+18)
  22. sw-block/design/README.md (+26)
  23. sw-block/design/protocol-development-process.md (+288)
  24. sw-block/design/protocol-version-simulation.md (+252)
  25. sw-block/design/v1-v15-v2-comparison.md (+314)
  26. sw-block/design/v1-v15-v2-simulator-goals.md (+281)
  27. sw-block/design/v2-acceptance-criteria.md (+280)
  28. sw-block/design/v2-dist-fsm.md (+234)
  29. sw-block/design/v2-first-slice-sender-ownership.md (+159)
  30. sw-block/design/v2-first-slice-session-ownership.md (+193)
  31. sw-block/design/v2-open-questions.md (+161)
  32. sw-block/design/v2-prototype-roadmap-and-gates.md (+239)
  33. sw-block/design/v2-scenario-sources-from-v1.md (+249)
  34. sw-block/design/v2_scenarios.md (+638)
  35. sw-block/design/wal-replication-v2-orchestrator.md (+359)
  36. sw-block/design/wal-replication-v2-state-machine.md (+632)
  37. sw-block/design/wal-replication-v2.md (+401)
  38. sw-block/design/wal-v1-to-v2-mapping.md (+349)
  39. sw-block/design/wal-v2-tiny-prototype.md (+277)
  40. sw-block/private/README.md (+14)
  41. sw-block/prototype/README.md (+23)
  42. sw-block/prototype/distsim/cluster.go (+1120)
  43. sw-block/prototype/distsim/cluster_test.go (+1004)
  44. sw-block/prototype/distsim/distsim.test.exe (BIN)
  45. sw-block/prototype/distsim/eventsim.go (+266)
  46. sw-block/prototype/distsim/phase02_advanced_test.go (+213)
  47. sw-block/prototype/distsim/phase02_candidate_test.go (+445)
  48. sw-block/prototype/distsim/phase02_network_test.go (+371)
  49. sw-block/prototype/distsim/phase02_test.go (+359)
  50. sw-block/prototype/distsim/phase02_v1_failures_test.go (+434)
  51. sw-block/prototype/distsim/phase03_p2_race_test.go (+287)
  52. sw-block/prototype/distsim/phase03_race_test.go (+281)
  53. sw-block/prototype/distsim/phase03_timeout_test.go (+333)
  54. sw-block/prototype/distsim/phase04a_ownership_test.go (+243)
  55. sw-block/prototype/distsim/protocol.go (+102)
  56. sw-block/prototype/distsim/protocol_test.go (+84)
  57. sw-block/prototype/distsim/random.go (+256)
  58. sw-block/prototype/distsim/random_test.go (+43)
  59. sw-block/prototype/distsim/reference.go (+95)
  60. sw-block/prototype/distsim/reference_test.go (+66)
  61. sw-block/prototype/distsim/simulator.go (+581)
  62. sw-block/prototype/distsim/simulator_test.go (+285)
  63. sw-block/prototype/distsim/storage.go (+129)
  64. sw-block/prototype/enginev2/assignment.go (+64)
  65. sw-block/prototype/enginev2/execution_test.go (+420)
  66. sw-block/prototype/enginev2/go.mod (+3)
  67. sw-block/prototype/enginev2/outcome.go (+39)
  68. sw-block/prototype/enginev2/p2_test.go (+482)
  69. sw-block/prototype/enginev2/sender.go (+347)
  70. sw-block/prototype/enginev2/sender_group.go (+119)
  71. sw-block/prototype/enginev2/sender_group_test.go (+203)
  72. sw-block/prototype/enginev2/sender_test.go (+407)
  73. sw-block/prototype/enginev2/session.go (+151)
  74. sw-block/prototype/fsmv2/apply.go (+162)
  75. sw-block/prototype/fsmv2/events.go (+37)
  76. sw-block/prototype/fsmv2/fsm.go (+73)
  77. sw-block/prototype/fsmv2/fsm_test.go (+95)
  78. sw-block/prototype/fsmv2/fsmv2.test.exe (BIN)
  79. sw-block/prototype/run-tests.ps1 (+37)
  80. sw-block/prototype/volumefsm/events.go (+148)
  81. sw-block/prototype/volumefsm/format.go (+38)
  82. sw-block/prototype/volumefsm/model.go (+142)
  83. sw-block/prototype/volumefsm/model_test.go (+421)
  84. sw-block/prototype/volumefsm/recovery.go (+70)
  85. sw-block/prototype/volumefsm/scenario.go (+61)
  86. sw-block/prototype/volumefsm/volumefsm.test.exe (BIN)
  87. sw-block/test/README.md (+17)
  88. sw-block/test/test_db.md (+1675)
  89. sw-block/test/test_db_v2.md (+105)
  90. sw-block/test/v2_selected.md (+115)

sw-block/.gocache_v2/README
@@ -0,0 +1,4 @@
This directory holds cached build artifacts from the Go build system.
Run "go clean -cache" if the directory is getting too large.
Run "go clean -fuzzcache" to delete the fuzz cache.
See go.dev to learn more about Go.

sw-block/.gocache_v2/trim.txt
@@ -0,0 +1 @@
1774577367

sw-block/.private/README.md
@@ -0,0 +1,27 @@
# .private
Private working area for `sw-block`.
Use this for:
- phase development notes
- roadmap/progress tracking
- draft handoff notes
- temporary design comparisons
- prototype scratch work not ready for `design/` or `prototype/`
Recommended layout:
- `.private/phase/`: phase-by-phase development notes
- `.private/roadmap/`: short-term and medium-term execution notes
- `.private/handoff/`: notes for `sw`, `qa`, or future sessions
Phase protocol:
- each phase should normally have:
  - `phase-xx.md`
  - `phase-xx-log.md`
  - `phase-xx-decisions.md`
- details are defined in `.private/phase/README.md`
Promotion rules:
- stable vision/design docs go to `../design/`
- real prototype code stays in `../prototype/`
- `.private/` is for working material, not source of truth

sw-block/.private/phase/README.md
@@ -0,0 +1,36 @@
# Phase Dev
Use this directory for private phase development notes.
## Phase Protocol
Each phase should use this file set:
- `phase-01.md`
  - plan
  - scope
  - progress
  - active tasks
  - exit criteria
- `phase-01-log.md`
  - dated development log
  - experiments
  - test runs
  - failures and findings
- `phase-01-decisions.md`
  - key algorithm decisions
  - tradeoffs
  - rejected alternatives
Suggested naming pattern:
- `phase-01.md`
- `phase-01-log.md`
- `phase-01-decisions.md`
- `phase-02.md`
- `phase-02-log.md`
- `phase-02-decisions.md`
Rule of use:
1. if it is what we are doing -> `phase-xx.md`
2. if it is what happened -> `phase-xx-log.md`
3. if it is why we chose something -> `phase-xx-decisions.md`

sw-block/.private/phase/phase-01-decisions.md
@@ -0,0 +1,97 @@
# Phase 01 Decisions
Date: 2026-03-26
Status: active
## Purpose
Capture the key design decisions made during Phase 01 simulator work.
## Initial Decisions
### 1. `design/` vs `.private/phase/`
Decision:
- `sw-block/design/` holds shared design truth
- `sw-block/.private/phase/` holds execution planning and progress
Reason:
- design backlog and execution checklist should not be mixed
### 2. Scenario source of truth
Decision:
- `sw-block/design/v2_scenarios.md` is the scenario backlog and coverage matrix
Reason:
- all contributors need one visible scenario list
### 3. Phase 01 priority
Decision:
- first close:
  - `S19`
  - `S20`
Reason:
- they are the biggest remaining distributed lineage/partition scenarios
### 4. Current simulator scope
Decision:
- use the simulator as a V2 design-validation tool, not a product/perf harness
Reason:
- current goal is correctness and protocol coverage, not productization
### 5. Phase execution format
Decision:
- keep phase execution in three files:
- `phase-xx.md`
- `phase-xx-log.md`
- `phase-xx-decisions.md`
Reason:
- separates plan, evidence, and reasoning
- reduces drift between roadmap and findings
### 6. Design backlog vs execution plan
Decision:
- `sw-block/design/v2_scenarios.md` remains the source of truth for scenario backlog and coverage
- `.private/phase/phase-01.md` is the execution layer for `sw`
Reason:
- design truth should be stable and shareable
- execution tasks should be easier to edit without polluting design docs
### 7. Immediate Phase 01 priorities
Decision:
- prioritize:
  - `S19` chain of custody across multiple promotions
  - `S20` live partition with competing writes
Reason:
- these are the biggest remaining distributed-lineage gaps after current simulator milestone
### 8. Coverage status should be conservative
Decision:
- mark scenarios as `partial` unless the test actually exercises the core protocol obligation rather than just a simplified happy path
Reason:
- avoids overstating simulator coverage
- keeps the backlog honest for follow-up strengthening
### 9. Protocol-version comparison belongs in the simulator
Decision:
- compare `V1`, `V1.5`, and `V2` using the same scenario set where possible
Reason:
- this is the clearest way to show:
  - where V1 breaks
  - where V1.5 improves but still strains
  - why V2 is architecturally cleaner

sw-block/.private/phase/phase-01-log.md
@@ -0,0 +1,67 @@
# Phase 01 Log
Date: 2026-03-26
Status: active
## Log Protocol
Use dated entries like:
## 2026-03-26
- work completed
- tests run
- failures found
- seeds/traces worth keeping
- follow-up items
## Initial State
- Phase 01 created from the earlier `phase-01-v2-scenarios.md` working note
- scenario source of truth remains:
- `sw-block/design/v2_scenarios.md`
- current active asks for `sw`:
- `S19`
- `S20`
## 2026-03-26
- created Phase 01 file set:
- `phase-01.md`
- `phase-01-log.md`
- `phase-01-decisions.md`
- promoted scenario execution checklist into `phase-01.md`
- kept `sw-block/design/v2_scenarios.md` as the shared backlog and coverage matrix
- current simulator milestone:
- `fsmv2` passing
- `volumefsm` passing
- `distsim` passing
- randomized `distsim` seeds passing
- event/interleaving simulator work present in `sw-block/prototype/distsim/simulator.go`
- current immediate development priority for `sw`:
- implement `S19`
- implement `S20`
- `sw` added Phase 01 P0/P1 scenario tests in `distsim`:
- `S19`
- `S20`
- `S5`
- `S6`
- `S18`
- stronger `S12`
- review result:
- `S19` looks solid
- stronger `S12` now looks solid
- `S20`, `S5`, `S6`, `S18` are better classified as `partial` than fully closed
- updated `v2_scenarios.md` coverage matrix to reflect actual status
- next development focus:
- P2 scenarios
- stronger versions of current partial scenarios
- added protocol-version comparison design:
- `sw-block/design/protocol-version-simulation.md`
- added minimal protocol policy prototype in `distsim`:
  - `ProtocolV1`
  - `ProtocolV15`
  - `ProtocolV2`
- focused on:
  - catch-up policy
  - tail-chasing outcome policy
  - restart/rejoin policy

sw-block/.private/phase/phase-01-v2-scenarios.md
@@ -0,0 +1,11 @@
# Deprecated
This file is deprecated.
Use instead:
- `phase-01.md`
- `phase-01-log.md`
- `phase-01-decisions.md`
The scenario source of truth remains:
- `sw-block/design/v2_scenarios.md`

sw-block/.private/phase/phase-01.md
@@ -0,0 +1,164 @@
# Phase 01
Date: 2026-03-26
Status: completed
Purpose: drive V2 simulator development by closing the scenario backlog in `sw-block/design/v2_scenarios.md`
## Goal
Make the V2 simulator cover the important protocol scenarios as explicitly as possible.
This phase is about:
- simulator fidelity
- scenario coverage
- invariant quality
This phase is not about:
- product integration
- SPDK
- raw allocator
- production transport
## Source Of Truth
Design/source-of-truth:
- `sw-block/design/v2_scenarios.md`
Prototype code:
- `sw-block/prototype/fsmv2/`
- `sw-block/prototype/volumefsm/`
- `sw-block/prototype/distsim/`
## Assigned Tasks For `sw`
### P0
1. `S19` chain of custody across multiple promotions
- add fixed test(s)
- verify committed data from `A -> B -> C`
- update coverage matrix
2. `S20` live partition with competing writes
- add fixed test(s)
- stale side must not advance committed lineage
- update coverage matrix
### P1
3. `S5` flapping replica stays recoverable
- repeated disconnect/reconnect
- no unnecessary rebuild while recovery remains possible
4. `S6` tail-chasing under load
- primary keeps writing while replica catches up
- explicit outcome:
- converge and promote
- or abort to rebuild
5. `S18` primary restart without failover
- same-lineage restart behavior
- no stale session assumptions
6. stronger `S12`
- more than one promotion candidate
- choose valid lineage, not merely highest apparent LSN
### P2
7. protocol-version comparison support
- model:
- `V1`
- `V1.5`
- `V2`
- use the same scenario set to show:
- V1 breaks
- V1.5 improves but still strains
- V2 handles recovery more explicitly
8. richer Smart WAL scenarios
- time-varying `ExtentReferenced` availability
- recoverable then unrecoverable transitions
9. delayed/drop network scenarios beyond simple disconnect
10. multi-node reservation expiry / rebuild timeout cases
## Invariants To Preserve
After every scenario or random run, preserve:
1. committed data is durable per policy
2. uncommitted data is not revived as committed
3. stale epoch traffic does not mutate current lineage
4. recovered/promoted node matches reference state at target `LSN`
5. committed prefix remains contiguous
## Required Updates Per Task
For each completed scenario:
1. add or update test(s)
2. update `sw-block/design/v2_scenarios.md`
- package
- test name
- status
3. note any missing simulator capability
## Current Progress
Already in place before this phase:
- `fsmv2` local FSM prototype
- `volumefsm` orchestrator prototype
- `distsim` distributed simulator
- randomized `distsim` runs
- first event/interleaving simulator work in `distsim/simulator.go`
Open focus:
- `S19` covered in `distsim`
- `S20` partially covered in `distsim`
- `S5` partially covered in `distsim`
- `S6` partially covered in `distsim`
- `S18` partially covered in `distsim`
- stronger `S12` covered in `distsim`
- protocol-version comparison design added in:
- `sw-block/design/protocol-version-simulation.md`
- remaining focus is now P2 plus stronger versions of partial scenarios
## Phase Status
### P0
- `S19` chain of custody across multiple promotions: done
- `S20` live partition with competing writes: partial
### P1
- `S5` flapping replica stays recoverable: partial
- `S6` tail-chasing under load: partial
- `S18` primary restart without failover: partial
- stronger `S12`: done
### P2
- active next step:
- protocol-version comparison support
- stronger versions of current partial scenarios
## Exit Criteria
Phase 01 is done when:
1. `S19` and `S20` are covered
2. `S5`, `S6`, `S18`, and stronger `S12` are at least partially covered
3. coverage matrix in `v2_scenarios.md` is current
4. random simulation still passes after added scenarios
## Completion Note
Phase 01 completed with:
- `S19` covered
- stronger `S12` covered
- `S20`, `S5`, `S6`, `S18` strengthened but correctly left as `partial`
Next execution phase:
- `sw-block/.private/phase/phase-02.md`

sw-block/.private/phase/phase-02-decisions.md
@@ -0,0 +1,51 @@
# Phase 02 Decisions
Date: 2026-03-26
Status: active
## Decision 1: Extend `distsim` Instead Of Forking A New Protocol Simulator
Reason:
- current `distsim` already has:
- node/storage model
- coordinator/epoch model
- reference oracle
- randomized runs
- the missing layer is protocol-state fidelity, not a new simulation foundation
Implication:
- add lightweight per-node replication state and protocol decisions to `distsim`
- do not build a separate fourth simulator yet
## Decision 2: Keep Coverage Status Conservative
Reason:
- `S20`, `S6`, and `S18` currently prove important safety properties
- but they do not yet fully assert message-level or explicit state-transition behavior
Implication:
- leave them `partial` until the model can assert protocol behavior directly
## Decision 3: Use Versioned Scenario Comparison To Justify V2
Reason:
- the simulator should not only say "V2 works"
- it should show:
- where `V1` fails
- where `V1.5` improves but still strains
- why `V2` is worth the complexity
Implication:
- Phase 02 includes explicit `V1` / `V1.5` / `V2` scenario comparison work
## Decision 4: V2 Must Not Be Described As "Always Catch-Up"
Reason:
- that wording is too optimistic and hides the real V2 design rule
- V2 is better because it makes recoverability explicit, not because it retries forever
Implication:
- describe V2 as:
- catch-up if explicitly recoverable
- otherwise explicit rebuild
- keep this wording consistent in tests and docs

sw-block/.private/phase/phase-02-log.md
@@ -0,0 +1,93 @@
# Phase 02 Log
Date: 2026-03-26
Status: active
## 2026-03-26
- Phase 02 created to move `distsim` from final-state safety validation toward explicit protocol-state simulation.
- Initial focus:
- close `S20`, `S6`, and `S18` at protocol level
- compare `V1`, `V1.5`, and `V2` on the same scenarios
- Known model gap at phase start:
- current `distsim` is strong at final-state safety invariants
- current `distsim` is weaker at mid-flow protocol assertions and message-level rejection reasons
- Phase 02 progress now in place:
- delivery accept/reject tracking
- protocol-level stale-epoch rejection assertions
- explicit non-convergent catch-up state transition assertions
- initial version-comparison tests for disconnect, tail-chasing, and restart/rejoin policy
- Next simulator target:
- reproduce real `V1.5` address-instability and control-plane-recovery failures as named scenarios
- Immediate coding asks for `sw`:
- changed-address restart failure in `V1.5`
- same-address transient outage comparison across `V1` / `V1.5` / `V2`
- slow control-plane reassignment scenario derived from `CP13-8 T4b`
- Local housekeeping done:
- corrected V2 wording from "always catch-up" to "catch-up if explicitly recoverable; otherwise rebuild"
- added explicit brief-disconnect and changed-address restart policy helpers
- verified `distsim` test suite still passes with the Windows-safe runner
- Scenario status update:
- `S20` now covered via protocol-level stale-traffic rejection + committed-prefix stability
- `S6` now covered via explicit `CatchingUp -> NeedsRebuild` assertions
- `S18` now covered via explicit stale `MsgBarrierAck` rejection + prefix stability
- Next asks for `sw` after this closure:
- changed-address restart scenario tied directly to `CP13-8 T4b`
- same-address transient outage comparison across `V1` / `V1.5` / `V2`
- slow control-plane reassignment scenario
- Smart WAL recoverable -> unrecoverable transition scenarios
- Additional closure completed:
- `S5` now covered with both:
- repeated recoverable flapping
- budget-exceeded escalation to `NeedsRebuild`
- Smart WAL transitions now exercised with:
- recoverable -> unrecoverable during active recovery
- mixed `WALInline` + `ExtentReferenced` success
- time-varying payload availability
- Updated next asks for `sw`:
- changed-address restart scenario tied directly to `CP13-8 T4b`
- same-address transient outage comparison across `V1` / `V1.5` / `V2`
- slow control-plane reassignment scenario
- delayed/drop network beyond simple disconnect
- multi-node reservation expiry / rebuild timeout cases
- Additional Phase 02 coverage delivered:
- delayed stale messages after promote/failover
- delayed stale barrier ack rejection
- selective write-drop with barrier delivery under `sync_all`
- multi-node mixed reservation expiry outcome
- multi-node `NeedsRebuild` / snapshot rebuild recovery
- partial rebuild timeout / retry completion
- Remaining asks are now narrower:
- changed-address restart scenario tied directly to `CP13-8 T4b`
- same-address transient outage comparison across `V1` / `V1.5` / `V2`
- slow control-plane reassignment scenario
- stronger coordinator candidate-selection scenarios
- Additional closure after review:
- safe default promotion selector now refuses `NeedsRebuild` candidates
- explicit desperate-promotion API separated from safe selection
- changed-address and slow-control-plane comparison tests now prove actual data divergence / healing, not only policy shape
- New next-step assignment:
- strengthen model depth around endpoint identity and control-plane reassignment
- replace abstract repair helpers with more explicit event flow where practical
- reduce direct recovery state injection in comparison tests
- extend candidate selection from ranking into validity rules
## 2026-03-27
- Phase 02 core simulator hardening is effectively complete.
- Delivered since the previous checkpoint:
- endpoint identity / endpoint-version modeling
- stale-endpoint rejection in delivery path
- heartbeat -> coordinator detect -> assignment-update control-plane flow
- recovery-session trigger API for `V1.5` and `V2`
- explicit candidate eligibility checks:
  - running
  - epoch alignment
  - state eligibility
  - committed-prefix sufficiency
- safe default promotion now rejects candidates without the committed prefix
- Current `distsim` status at latest review:
- 73 tests passing
- Manager bookkeeping decision:
- keep Phase 02 active only for doc maintenance / wrap-up
- treat further simulator depth as likely Phase 03 work, not unbounded Phase 02 scope creep

sw-block/.private/phase/phase-02.md
@@ -0,0 +1,191 @@
# Phase 02
Date: 2026-03-27
Status: active
Purpose: extend the V2 simulator from final-state safety checking into protocol-state simulation that can reproduce `V1`, `V1.5`, and `V2` behavior on the same scenarios
## Goal
Make the simulator model enough node-local replication state and message-level behavior to:
1. reproduce `V1` / `V1.5` failure modes
2. show why those failures are structural
3. close the current `partial` V2 scenarios with stronger protocol assertions
This phase is about:
- protocol-version comparison
- per-node replication state
- message-level fencing / accept / reject behavior
- explicit catch-up abort / rebuild transitions
This phase is not about:
- product integration
- production transport
- SPDK
- raw allocator
## Source Of Truth
Design/source-of-truth:
- `sw-block/design/v2_scenarios.md`
- `sw-block/design/protocol-version-simulation.md`
- `sw-block/design/v1-v15-v2-simulator-goals.md`
Prototype code:
- `sw-block/prototype/distsim/`
## Assigned Tasks For `sw`
### P0
1. Add per-node replication state to `distsim`
- minimum states:
- `InSync`
- `Lagging`
- `CatchingUp`
- `NeedsRebuild`
- `Rebuilding`
- keep state lightweight; do not clone full `fsmv2` into `distsim`
2. Add message-level protocol decisions
- stale-epoch write / ship / barrier traffic must be explicitly rejected
- record whether a message was:
- accepted
- rejected by epoch
- rejected by state
3. Add explicit catch-up abort / rebuild entry
- non-convergent catch-up must move to explicit modeled failure:
- `NeedsRebuild`
- or equivalent abort outcome
### P1
4. Re-close `S20` at protocol level
- stale-side writes must go through protocol delivery path
- prove stale-side traffic cannot advance committed lineage
5. Re-close `S6` at protocol level
- assert explicit abort/escalation on non-convergence
- not only final-state safety
6. Re-close `S18` at protocol level
- assert committed-prefix behavior around delayed old ack / restart races
- not only final-state oracle checks
### P2
7. Expand protocol-version comparison
- run selected scenarios under:
- `V1`
- `V1.5`
- `V2`
- at minimum:
- brief disconnect
- restart with changed address
- tail-chasing
8. Add V1.5-derived failure scenarios
- replica restart with changed receiver address
- same-address transient outage
- slow control-plane recovery vs fast local reconnect
9. Prepare richer recovery modeling
- time-varying recoverability
- reservation loss during active catch-up
- rebuild timeout / retry in mixed-state cluster
## Invariants To Preserve
After every scenario or random run, preserve:
1. committed data is durable per policy
2. uncommitted data is not revived as committed
3. stale epoch traffic does not mutate current lineage
4. recovered/promoted node matches reference state at target `LSN`
5. committed prefix remains contiguous
6. protocol-state transitions are explicit, not inferred from final data only
## Required Updates Per Task
For each completed task:
1. add or update test(s)
2. update `sw-block/design/v2_scenarios.md`
- package
- test name
- status
- source if new scenario was derived from V1/V1.5 behavior
3. add a short note to:
- `sw-block/.private/phase/phase-02-log.md`
4. if a design choice changed, record it in:
- `sw-block/.private/phase/phase-02-decisions.md`
## Current Progress
Already in place before this phase:
- `distsim` final-state safety invariants
- randomized simulation
- event/interleaving simulator work
- initial `ProtocolVersion` / policy scaffold
- `S19` covered
- stronger `S12` covered
Known partials to close in this phase:
- none in the current named backlog slice
Delivered in this phase so far:
- delivery accept/reject tracking added
- protocol-level rejection assertions added
- explicit `CatchingUp -> NeedsRebuild` state transition tested
- selected protocol-version comparison tests added
- `S20`, `S6`, and `S18` moved from `partial` to `covered`
- Smart WAL transition scenarios added
- `S5` moved from `partial` to `covered`
- endpoint identity / endpoint-version modeling added
- explicit heartbeat -> detect -> assignment-update control-plane flow added for changed-address restart
- explicit recovery-session triggers added for `V1.5` and `V2`
- promotion selection now uses explicit eligibility, including committed-prefix gating
- safe and desperate promotion paths are separated
- full `distsim` suite at latest review: 73 tests passing
Remaining focus for `sw`:
- Phase 02 core scope is now largely delivered
- remaining work should be treated as future-strengthening, not baseline closure
- if more simulator depth is needed next, it should likely start as Phase 03:
- timeout semantics
- timer races
- richer event/interleaving behavior
- stronger endpoint/control-plane realism beyond the current abstract model
## Immediate Next Tasks For `sw`
1. Add a documented compare artifact for new scenarios
- for each new `V1` / `V1.5` / `V2` comparison:
- record scenario name
- what fails in `V1`
- what improves in `V1.5`
- what is explicit in `V2`
- keep `sw-block/design/v1-v15-v2-comparison.md` updated
2. Keep the coverage matrix honest
- do not mark a scenario `covered` unless the test asserts protocol behavior directly
- final-state oracle checks alone are not enough
3. Prepare Phase 03 proposal instead of broadening ad hoc
- if more depth is needed, define it cleanly first:
- timers / timeout events
- event ordering races
- richer endpoint lifecycle
- recovery-session uniqueness across competing triggers
## Exit Criteria
Phase 02 is done when:
1. `S5`, `S6`, `S18`, and `S20` are covered at protocol level
2. `distsim` can reproduce at least one `V1` failure, one `V1.5` failure, and the corresponding `V2` behavior on the same named scenario
3. protocol-level rejection/accept behavior is asserted in tests, not only inferred from final-state oracle checks
4. coverage matrix in `v2_scenarios.md` is current
5. changed-address and reconnect scenarios are modeled through explicit endpoint / control-plane behavior rather than helper-only abstraction
6. promotion selection uses explicit eligibility, including committed-prefix safety

sw-block/.private/phase/phase-03-decisions.md
@@ -0,0 +1,97 @@
# Phase 03 Decisions
Date: 2026-03-27
Status: initial
## Why Phase 03 Exists
Phase 02 already covered the main protocol-state story:
- V1 / V1.5 / V2 comparison
- stale traffic rejection
- catch-up vs rebuild
- changed-address restart control-plane flow
- committed-prefix-safe promotion eligibility
The next simulator problems are different:
- timer semantics
- timeout races
- event ordering under contention
That deserves a separate phase so the model boundary stays clear.
## Initial Boundary
### `distsim`
Keep for:
- protocol correctness
- reference-state validation
- recoverability logic
- promotion / lineage rules
### `eventsim`
Grow for:
- explicit event queue behavior
- timeout events
- equal-time scheduling choices
- race exploration
## Working Rule
Do not move all scenarios into `eventsim`.
Only move or duplicate scenarios when:
- timer or event ordering is the real bug surface
- `distsim` abstraction hides the important behavior
## Accepted Phase 03 Decisions
### Same-tick rule
Within one tick:
- data/message delivery is evaluated before timeout firing
Meaning:
- if an ack arrives in the same tick as a timeout deadline, the ack wins and may cancel the timeout
This is now an explicit simulator rule, not accidental behavior.
### Timeout authority
Not every timeout that reaches its deadline still has authority to mutate state.
So we now distinguish:
- `FiredTimeouts`
- timeout had authority and changed the model
- `IgnoredTimeouts`
- timeout reached deadline but was stale and ignored
This keeps replay/debug output honest.
### Late barrier ack rule
Once a barrier instance times out:
- it is marked expired
- late ack for that barrier instance is rejected
That prevents a stale ack from reviving old durability state.
### Review gate rule for timer work
Timer/race work is easy to get subtly wrong while still having green tests.
So timer-related work is not accepted until:
- code path is reviewed
- tests assert the real protocol obligation
- stale and authoritative timer behavior are clearly distinguished

sw-block/.private/phase/phase-03-log.md
@@ -0,0 +1,36 @@
# Phase 03 Log
Date: 2026-03-27
Status: active
## 2026-03-27
- Phase 03 created after Phase 02 core scope was effectively delivered.
- Reason for new phase:
- remaining simulator work is about timer semantics and race behavior, not basic protocol-state coverage
- Initial target:
- define `distsim` vs `eventsim` split more clearly
- add explicit timeout semantics
- add timer-race scenarios without bloating `distsim` ad hoc
- P0 delivered:
- timeout model added for barrier / catch-up / reservation
- timeout-backed scenarios added
- same-tick ordering rule defined as data-before-timers
- First review result:
- timeout semantics accepted only after making cancellation model-driven
- late barrier ack after timeout required explicit rejection
- P0 hardening delivered:
- recovery timeout cancellation moved into model logic
- stale late barrier ack rejected via expired-barrier tracking
- stale vs authoritative timeout distinction added:
- `FiredTimeouts`
- `IgnoredTimeouts`
- P1 delivered and reviewed:
- promotion vs stale timeout race
- rebuild completion vs epoch bump race
- trace builder moved into reusable code
- Current suite state at latest accepted review:
- 86 `distsim` tests passing
- Manager decision:
- Phase 03 P0/P1 are accepted
- next work should move to deliberate P2 selection rather than broadening the phase ad hoc

193
sw-block/.private/phase/phase-03.md

@ -0,0 +1,193 @@
# Phase 03
Date: 2026-03-27
Status: active
Purpose: define the next simulator tier after Phase 02, focused on timeout semantics, timer races, and a cleaner split between protocol simulation and event/interleaving simulation
## Goal
Phase 03 exists to cover behavior that current `distsim` still abstracts away:
1. timeout semantics
2. timer races
3. event ordering under competing triggers
4. clearer separation between:
- protocol / lineage simulation
- event / race simulation
This phase should not reopen already-closed Phase 02 protocol scope unless a clear bug is found.
## Why A New Phase
Phase 02 already delivered:
- protocol-state assertions
- V1 / V1.5 / V2 comparison scenarios
- endpoint identity modeling
- control-plane assignment-update flow
- committed-prefix-aware promotion eligibility
What remains is different in character:
- timers
- delayed events racing with each other
- timeout-triggered state changes
- more explicit event scheduling
That deserves a new phase boundary.
## Source Of Truth
Design/source-of-truth:
- `sw-block/design/v2_scenarios.md`
- `sw-block/design/v2-dist-fsm.md`
- `sw-block/design/v2-scenario-sources-from-v1.md`
- `sw-block/design/v1-v15-v2-comparison.md`
Current prototype base:
- `sw-block/prototype/distsim/`
- `sw-block/prototype/distsim/simulator.go`
## Scope
### In scope
1. timeout semantics
- barrier timeout
- catch-up timeout
- reservation expiry timeout
- rebuild timeout
2. timer races
- delayed ack vs timeout
- timeout vs promotion
- reconnect vs timeout
- catch-up completion vs expiry
- rebuild completion vs epoch bump
3. simulator split clarification
- `distsim` keeps:
- protocol correctness
- lineage
- recoverability
- reference-state checking
- `eventsim` grows into:
- event scheduling
- timer firing
- same-time interleavings
- race exploration
### Out of scope
- production integration
- real transport
- real disk timings
- SPDK
- raw allocator
## Assigned Tasks For `sw`
### P0
1. Write a concrete `eventsim` scope note in code/docs
- define what stays in `distsim`
- define what moves to `eventsim`
- avoid overlap and duplicated semantics
2. Add minimal timeout event model
- first-class timeout event type(s)
- at minimum:
- barrier timeout
- catch-up timeout
- reservation expiry
3. Add timeout-backed scenarios
- stale delayed ack vs timeout
- catch-up timeout before convergence
- reservation expiry during active recovery
### P1
4. Add race-focused tests
- promotion vs delayed stale ack
- rebuild completion vs epoch bump
- reconnect success vs timeout firing
5. Keep traces debuggable
- failing runs must dump:
- seed
- event order
- timer events
- node states
- committed prefix
### P2
6. Decide whether selected `distsim` scenarios should also exist in `eventsim`
- only when timer/event ordering is the real point
- do not duplicate every scenario blindly
## Current Progress
Delivered in this phase so far:
- `eventsim` scope note added in code
- explicit timeout model added:
- barrier timeout
- catch-up timeout
- reservation timeout
- timeout-backed scenarios added and reviewed
- same-tick rule made explicit:
- data before timers
- recovery timeout cancellation is now model-driven, not test-driven
- stale barrier ack after timeout is explicitly rejected
- stale timeouts are separated from authoritative timeouts:
- `FiredTimeouts`
- `IgnoredTimeouts`
- race-focused scenarios added and reviewed:
- promotion vs stale catch-up timeout
- promotion vs stale barrier timeout
- rebuild completion vs epoch bump
- epoch bump vs stale catch-up timeout
- reusable trace builder added for replay/debug support
- current `distsim` suite at latest review:
- 86 tests passing
Remaining focus for `sw`:
- Phase 03 P0 and P1 are effectively complete
- Phase 03 P2 is also effectively complete after review
- any further simulator work should now be narrow and evidence-driven
- recommended next simulator additions only:
- control-plane latency parameter
- sustained-write convergence / tail-chasing load test
- one multi-promotion lineage extension
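The same-tick data-before-timers rule recorded above can be sketched as a deterministic event ordering (all names are illustrative, not the simulator's real scheduler):

```go
package main

import "sort"

// Event is a hypothetical scheduled simulator event.
type Event struct {
	Tick    int
	IsTimer bool // timer events yield to data events in the same tick
	Seq     int  // arrival order, used as a stable tiebreaker
}

// OrderTick sorts events so that, within a tick, data events are delivered
// before timer events (the "data before timers" rule), with arrival order
// as the final tiebreaker.
func OrderTick(events []Event) {
	sort.SliceStable(events, func(i, j int) bool {
		a, b := events[i], events[j]
		if a.Tick != b.Tick {
			return a.Tick < b.Tick
		}
		if a.IsTimer != b.IsTimer {
			return !a.IsTimer // data first
		}
		return a.Seq < b.Seq
	})
}
```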
## Invariants To Preserve
1. committed data remains durable per policy
2. uncommitted data is never revived as committed
3. stale epoch traffic never mutates current lineage
4. committed prefix remains contiguous
5. timeout-triggered transitions are explicit and explainable
6. races do not silently bypass fencing or rebuild boundaries
## Required Updates Per Task
For each completed task:
1. add or update tests
2. update `sw-block/design/v2_scenarios.md` if scenario coverage changed
3. add a short note to:
- `sw-block/.private/phase/phase-03-log.md`
4. if the simulator boundary changed, record it in:
- `sw-block/.private/phase/phase-03-decisions.md`
## Exit Criteria
Phase 03 is done when:
1. timeout semantics exist as explicit simulator behavior
2. at least three important timer-race scenarios are modeled and tested
3. `distsim` vs `eventsim` responsibilities are clearly separated
4. failure traces from race/timeout scenarios are replayable enough to debug

97
sw-block/.private/phase/phase-04-decisions.md

@ -0,0 +1,97 @@
# Phase 04 Decisions
Date: 2026-03-27
Status: initial
## First Slice Decision
The first standalone V2 implementation slice is:
- per-replica sender ownership
- one active recovery session per replica per epoch
## Why Not Start In V1
V1/V1.5 remain:
- the production line
- the maintenance/fix line
It should not be the place where V2 architecture is first implemented.
## Why This Slice
This slice:
- directly addresses the clearest V1.5 structural pain
- maps cleanly to the V2-boundary tests
- is narrow enough to implement without dragging in the entire future architecture
## Accepted P0 Refinements
### Sender epoch coherence
Sender-owned epoch is real state, not decoration.
So:
- reconcile/update paths must refresh sender epoch
- stale active session must be invalidated on epoch advance
### Session lifecycle
The first slice should not use a totally loose lifecycle shell.
So:
- session phase changes now follow an explicit transition map
- invalid jumps are rejected
### Session attach rule
Attaching a session at the wrong epoch is invalid.
So:
- `AttachSession(epoch, kind)` must reject epoch mismatch with the owning sender
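Taken together, the P0 refinements might look like this in Go (a sketch with illustrative phase names and a toy transition map, not the actual `enginev2` code):

```go
package main

import "errors"

// Phase is a hypothetical recovery-session phase.
type Phase string

const (
	PhaseIdle      Phase = "idle"
	PhaseConnect   Phase = "connect"
	PhaseHandshake Phase = "handshake"
	PhaseCatchUp   Phase = "catchup"
	PhaseDone      Phase = "done"
)

// transitions is an illustrative explicit transition map; invalid jumps
// are rejected rather than silently applied.
var transitions = map[Phase][]Phase{
	PhaseIdle:      {PhaseConnect},
	PhaseConnect:   {PhaseHandshake},
	PhaseHandshake: {PhaseCatchUp, PhaseDone}, // zero-gap fast path
	PhaseCatchUp:   {PhaseDone},
}

type Session struct{ phase Phase }

// Advance rejects any jump not present in the transition map.
func (s *Session) Advance(next Phase) error {
	for _, p := range transitions[s.phase] {
		if p == next {
			s.phase = next
			return nil
		}
	}
	return errors.New("invalid phase transition")
}

type Sender struct {
	Epoch   uint64
	session *Session
}

// AttachSession rejects attaching a session at the wrong epoch.
func (s *Sender) AttachSession(epoch uint64) error {
	if epoch != s.Epoch {
		return errors.New("epoch mismatch")
	}
	s.session = &Session{phase: PhaseIdle}
	return nil
}

// AdvanceEpoch refreshes the sender-owned epoch and invalidates any
// stale active session.
func (s *Sender) AdvanceEpoch(epoch uint64) {
	if epoch > s.Epoch {
		s.Epoch = epoch
		s.session = nil // stale active session loses authority
	}
}
```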
## Accepted P1 Refinements
### Session identity fencing
The standalone V2 slice must reject stale completion by explicit session identity.
So:
- `RecoverySession` has stable unique identity
- sender completion must be by session ID, not by "current pointer"
- stale session results are rejected at the sender authority boundary
### Ownership vs execution
Ownership creation is not the same as execution start.
So:
- `AttachSession()` and `SupersedeSession()` establish ownership only
- `BeginConnect()` is the first execution-state mutation
### Completion authority
An ID match alone is not enough to complete recovery.
So:
- completion must require a valid completion-ready phase
- normal completion requires converged catch-up
- zero-gap fast completion is allowed explicitly from handshake
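Combining identity fencing with completion authority, the completion path might be sketched like this (field and phase names are illustrative assumptions, not the real `enginev2` API):

```go
package main

import "errors"

// RecoverySession carries a stable unique identity plus the state needed
// to decide whether it is at a valid completion point.
type RecoverySession struct {
	ID        uint64
	Phase     string // "handshake", "catchup", ...
	ZeroGap   bool   // set at handshake when no entries are missing
	Converged bool   // set when catch-up has fully converged
}

type Sender struct{ active *RecoverySession }

// CompleteSessionByID enforces both fences: the session-ID match rejects
// stale completions, and the phase check rejects completions that are not
// at a valid completion-ready point.
func (s *Sender) CompleteSessionByID(id uint64) error {
	sess := s.active
	if sess == nil || sess.ID != id {
		return errors.New("stale session: completion rejected")
	}
	switch {
	case sess.Phase == "handshake" && sess.ZeroGap:
		// explicit zero-gap fast completion from handshake
	case sess.Phase == "catchup" && sess.Converged:
		// normal completion requires converged catch-up
	default:
		return errors.New("not at a valid completion point")
	}
	s.active = nil
	return nil
}
```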
## P2 Direction
The next prototype step is not broader simulation.
It is:
- recovery outcome branching
- assignment-intent orchestration
- prototype-level end-to-end recovery flow

46
sw-block/.private/phase/phase-04-log.md

@ -0,0 +1,46 @@
# Phase 04 Log
Date: 2026-03-27
Status: active
## 2026-03-27
- Phase 04 created to start the first standalone V2 implementation slice.
- Decision:
- do not begin in `weed/storage/blockvol/`
- begin under `sw-block/`
- first slice chosen:
- per-replica sender ownership
- explicit recovery-session ownership
- Initial slice delivered under `sw-block/prototype/enginev2/`:
- sender
- recovery session
- sender group
- First review found:
- sender/session epoch coherence gap
- session lifecycle was shell-only, not enforcing real transitions
- attach-session epoch mismatch was not rejected
- Follow-up delivered and accepted:
- reconcile updates preserved sender epoch
- epoch bump invalidates stale session
- session transition map enforced
- attach-session rejects epoch mismatch
- enginev2 tests increased to 26 passing
- Phase 04a created to close the ownership-validation gap:
- explicit session identity in `distsim`
- bridge tests into `enginev2`
- Phase 04a closed the ownership problem well enough:
- stale completion rejected by session ID
- endpoint invalidation includes `CtrlAddr`
- boundary doc aligned with real simulator/prototype evidence
- Phase 04 P1 delivered and accepted:
- sender-owned execution APIs added
- all execution APIs fence on `sessionID`
- completion now requires valid completion point
- attach/supersede now establish ownership only
- handshake range validation added
- enginev2 tests increased to 46 passing
- Next phase focus narrowed to P2:
- recovery outcome branching
- assignment-intent orchestration
- prototype end-to-end recovery flow

153
sw-block/.private/phase/phase-04.md

@ -0,0 +1,153 @@
# Phase 04
Date: 2026-03-27
Status: active
Purpose: start the first standalone V2 implementation slice under `sw-block/`, centered on per-replica sender ownership and explicit recovery-session ownership
## Goal
Build the first real V2 implementation slice without destabilizing V1.
This slice should prove:
1. per-replica sender identity
2. explicit one-session-per-replica recovery ownership
3. endpoint/assignment-driven recovery updates
4. clean handoff between normal sender and recovery session
## Why This Phase Exists
The simulator and design work are now strong enough to support a narrow implementation slice.
We should not start with:
- Smart WAL
- new storage engine
- frontend integration
We should start with the ownership problem that most clearly separates V2 from V1.5.
## Source Of Truth
Design:
- `sw-block/design/v2-first-slice-session-ownership.md`
- `sw-block/design/v2-acceptance-criteria.md`
- `sw-block/design/v2-open-questions.md`
Simulator reference:
- `sw-block/prototype/distsim/`
## Scope
### In scope
1. per-replica sender owner object
2. explicit recovery session object
3. session lifecycle rules
4. endpoint update handling
5. basic tests for sender/session ownership
### Out of scope
- Smart WAL in production code
- real block backend redesign
- V1 integration
- frontend publication
## Assigned Tasks For `sw`
### P0
1. create standalone V2 implementation area under `sw-block/`
- recommended:
- `sw-block/prototype/enginev2/`
2. define sender/session types
- sender owner per replica
- recovery session per replica per epoch
3. implement basic lifecycle
- create sender
- attach session
- supersede stale session
- close session on success / invalidation
### P1
4. implement endpoint update handling
- changed-address update must refresh the right sender owner
5. implement epoch invalidation
- stale session must stop after epoch bump
6. add tests matching the slice acceptance
### P2
7. add recovery outcome branching
- distinguish:
- zero-gap fast completion
- positive-gap catch-up completion
- unrecoverable gap / `NeedsRebuild`
8. add assignment-intent driven orchestration
- move beyond raw reconcile-only tests
- make sender-group react to explicit recovery intent
9. add prototype-level end-to-end flow tests
- assignment/update
- session creation
- execution
- completion / invalidation
- rebuild escalation
## Current Progress
Delivered in this phase so far:
- standalone V2 area created under:
- `sw-block/prototype/enginev2/`
- core types added:
- `Sender`
- `RecoverySession`
- `SenderGroup`
- sender/session lifecycle shell implemented
- per-replica ownership implemented
- endpoint-change invalidation implemented
- sender epoch coherence implemented
- session epoch attach validation implemented
- session phase transitions now enforce a real transition map
- session identity fencing implemented
- stale completion rejected by session ID
- execution APIs implemented:
- `BeginConnect`
- `RecordHandshake`
- `BeginCatchUp`
- `RecordCatchUpProgress`
- `CompleteSessionByID`
- completion authority tightened:
- catch-up must converge
- zero-gap handshake fast path allowed
- attach/supersede now establish ownership only
- sender-group orchestration tests added
- current `enginev2` test state at latest review:
- 46 tests passing
Next focus for `sw`:
- continue Phase 04 beyond execution gating:
- recovery outcome branching
- sender-group orchestration from assignment intent
- prototype-level end-to-end recovery flow
- do not integrate into V1 production tree yet
## Exit Criteria
Phase 04 is done when:
1. standalone V2 sender/session slice exists under `sw-block/`
2. sender ownership is per replica, not set-global
3. one active recovery session per replica per epoch is enforced
4. endpoint update and epoch invalidation are tested
5. sender-owned execution flow is validated
6. recovery outcome branching exists at prototype level

49
sw-block/.private/phase/phase-04a-decisions.md

@ -0,0 +1,49 @@
# Phase 04a Decisions
Date: 2026-03-27
Status: initial
## Core Decision
The next must-fix validation problem is:
- sender/session ownership semantics
This outranks:
- more timing realism
- more WAL detail
- broader scenario growth
## Why
V2's core claim over V1.5 is not only:
- better recovery policy
It is also:
- stable per-replica sender identity
- one active recovery owner
- stale work cannot mutate current state
If those ownership rules are not validated, the simulator can overstate confidence.
## Validation Rule
For this phase, a scenario is only complete when it is expressed at two levels:
1. simulator ownership model (`distsim`)
2. standalone implementation slice (`enginev2`)
Real `weed/` adversarial tests remain the system-level gate.
## Scope Discipline
Do not expand this phase into:
- generic simulator feature growth
- Smart WAL design growth
- V1 integration work
Keep it focused on the ownership model.

22
sw-block/.private/phase/phase-04a-log.md

@ -0,0 +1,22 @@
# Phase 04a Log
Date: 2026-03-27
Status: active
## 2026-03-27
- Phase 04a created as a narrow validation phase.
- Reason:
- the biggest remaining V2 validation gap is ownership semantics
- not general scenario count
- not more timer realism
- not more WAL detail
- Scope chosen:
- sender identity
- recovery session identity
- supersede / invalidate rules
- stale completion rejection
- `distsim` to `enginev2` bridge tests
- This phase is intentionally separate from broad Phase 04 implementation growth.
- Goal:
- gain confidence that V2 is validated as owned session/sender protocol state, not only as policy

113
sw-block/.private/phase/phase-04a.md

@ -0,0 +1,113 @@
# Phase 04a
Date: 2026-03-27
Status: active
Purpose: close the critical V2 ownership-validation gap by making sender/session ownership explicit in both simulation and the standalone `enginev2` slice
## Goal
Validate the core V2 claim more deeply:
1. one stable sender identity per replica
2. one active recovery session per replica
3. endpoint change, epoch bump, and supersede rules invalidate stale work
4. stale late results from old sessions cannot mutate current state
This phase is not about adding broad new simulator surface.
It is about proving the ownership model that is supposed to make V2 better than V1.5.
## Why This Phase Exists
Current simulation is already strong on:
- quorum / commit rules
- stale epoch rejection
- catch-up vs rebuild
- timeout / race ordering
- changed-address recovery at the policy level
The remaining critical risk is narrower:
- the simulator still validates V2 strongly as policy
- but not yet strongly enough as owned sender/session protocol state
That is the highest-value validation gap to close before placing too much trust in V2.
## Source Of Truth
Design:
- `sw-block/design/v2-first-slice-session-ownership.md`
- `sw-block/design/v2-acceptance-criteria.md`
- `sw-block/design/v2-open-questions.md`
- `sw-block/design/protocol-development-process.md`
Simulator / prototype:
- `sw-block/prototype/distsim/`
- `sw-block/prototype/enginev2/`
Historical / review context:
- `learn/projects/sw-block/phases/phase-13-v2-boundary-tests.md`
- `sw-block/design/v2-scenario-sources-from-v1.md`
## Scope
### In scope
1. explicit sender/session identity validation in `distsim`
2. explicit stale-session invalidation rules
3. bridge tests from `distsim` scenarios to `enginev2` sender/session invariants
4. doc cleanup so V2-boundary tests point to real simulator and `enginev2` coverage
### Out of scope
- Smart WAL expansion
- broad new timing realism
- TCP / disk realism
- V1 production integration
- new backend/storage engine work
## Critical Questions To Close
1. can an old session completion mutate state after a new session supersedes it?
2. does endpoint change invalidate or supersede the active session cleanly?
3. does epoch bump remove all authority from prior sessions?
4. can duplicate recovery triggers create overlapping active sessions?
## Assigned Tasks For `sw`
### P0
1. add explicit session identity to `distsim`
- model session ID or equivalent ownership token
- make stale session results rejectable by identity, not just by coarse state
2. add ownership scenarios to `distsim`
- endpoint change during active catch-up
- epoch bump during active catch-up
- stale late completion from old session
- duplicate recovery trigger while a session is already active
3. add bridge tests in `enginev2`
- same-address reconnect preserves sender identity
- endpoint bump supersedes or invalidates active session
- epoch bump rejects stale completion
- only one active session per sender
### P1
4. tighten `learn/projects/sw-block/phases/phase-13-v2-boundary-tests.md`
- point to actual `distsim` scenarios
- point to actual `enginev2` bridge tests
- state what remains real-engine-only
5. only add simulator mechanics if a bridge test exposes a real ownership gap
## Exit Criteria
Phase 04a is done when:
1. `distsim` explicitly validates sender/session ownership invariants
2. `enginev2` has bridge tests for the same invariants
3. stale session work is shown unable to mutate current sender state
4. V2-boundary doc no longer has stale simulator references
5. we can say with confidence that V2 ownership semantics, not just V2 policy, are validated at prototype level

18
sw-block/README.md

@ -0,0 +1,18 @@
# sw-block
Private WAL V2 and standalone block-service workspace.
Purpose:
- keep WAL V2 design/prototype work isolated from WAL V1 production code in `weed/storage/blockvol`
- allow private design notes and experiments to evolve without polluting V1 delivery paths
- keep the future standalone `sw-block` product structure clean enough to split into a separate repo later if needed
Suggested layout:
- `design/`: shared V2 design docs
- `prototype/`: code prototypes and experiments
- `.private/`: private notes, phase development, roadmap, and non-public working material
Repository direction:
- current state: `sw-block/` is an isolated workspace inside `seaweedfs`
- likely future state: `sw-block` becomes a standalone sibling repo/product
- design and prototype structure should therefore stay product-oriented and not depend on SeaweedFS-specific paths

26
sw-block/design/README.md

@ -0,0 +1,26 @@
# V2 Design
Current WAL V2 design set:
- `wal-replication-v2.md`
- `wal-replication-v2-state-machine.md`
- `wal-replication-v2-orchestrator.md`
- `wal-v2-tiny-prototype.md`
- `wal-v1-to-v2-mapping.md`
- `v2-dist-fsm.md`
- `v2_scenarios.md`
- `v1-v15-v2-comparison.md`
- `v2-scenario-sources-from-v1.md`
- `protocol-development-process.md`
- `v2-acceptance-criteria.md`
- `v2-open-questions.md`
- `v2-first-slice-session-ownership.md`
- `v2-prototype-roadmap-and-gates.md`
These documents are the working design home for the V2 line.
The original project-level copies under `learn/projects/sw-block/design/` remain as shared references for now.
Execution note:
- active development tracking for the simulator and implementation phases lives under:
- `../.private/phase/phase-01.md`
- `../.private/phase/phase-02.md`
- `../.private/phase/phase-03.md`
- `../.private/phase/phase-04.md`

288
sw-block/design/protocol-development-process.md

@ -0,0 +1,288 @@
# Protocol Development Process
Date: 2026-03-27
## Purpose
This document defines how `sw-block` protocol work should be developed.
The process is meant to work for:
- V2
- future V3
- or a later block algorithm that is not WAL-based
The point is to make protocol work systematic rather than reactive.
## Core Philosophy
### 1. Design before implementation
Do not start with production code and hope the protocol becomes clear later.
Start with:
1. system contract
2. invariants
3. state model
4. scenario backlog
Only then move to implementation.
### 2. Real failures are inputs, not just bugs
When V1 or V1.5 fails in real testing, treat that as:
- a design requirement
- a scenario source
- a simulator input
Do not patch and forget.
### 3. Simulator is part of the protocol, not a side tool
The simulator exists to answer:
- what should happen
- what must never happen
- which old designs fail
- why the new design is better
It is not a replacement for real testing.
It is the design-validation layer before production implementation.
### 4. Passing tests are not enough
Green tests are necessary, not sufficient.
We also require:
- explicit invariants
- explicit scenario intent
- clear state transitions
- review of assumptions and abstraction boundaries
### 5. Keep hot-path and recovery-path reasoning separate
Healthy steady-state behavior and degraded recovery behavior are different problems.
Both must be designed explicitly.
## Development Ladder
Every major protocol feature should move through these steps:
1. **Problem statement**
- what real bug, limit, or product goal is driving the work
2. **Contract**
- what the protocol guarantees
- what it does not guarantee
3. **State model**
- node state
- coordinator state
- recovery state
- role / epoch / lineage rules
4. **Scenario backlog**
- named scenarios
- source:
- real failure
- design obligation
- adversarial distributed case
5. **Prototype / simulator**
- reduced but explicit model
- invariant checks
- V1 / V1.5 / V2 comparison where relevant
6. **Implementation**
- production code only after the protocol shape is clear enough
7. **Real validation**
- unit
- component
- integration
- real hardware where needed
8. **Feedback loop**
- turn new failures back into scenario/design inputs
## Required Artifacts
For protocol work to be considered real progress, we usually want:
### Design
- design doc
- scenario doc
- comparison doc when replacing an older approach
### Prototype
- simulator or prototype code
- tests that assert protocol behavior
### Implementation
- production patch
- production tests
- docs updated to match the actual algorithm
### Review
- implementation gate
- design/protocol gate
## Two-Gate Rule
We use two acceptance gates.
### Gate 1: implementation
Owned by the coding side.
Questions:
- does it build?
- do tests pass?
- does it behave as intended in code?
### Gate 2: protocol/design
Owned by the design/review side.
Questions:
- is the logic actually sound?
- do tests prove the intended thing?
- are assumptions explicit?
- is the abstraction boundary honest?
A task is not accepted until both gates pass.
## Layering Rule
Keep simulation layers separate.
### `distsim`
Use for:
- protocol correctness
- state transitions
- fencing
- recoverability
- promotion / lineage
- reference-state checking
### `eventsim`
Use for:
- timeout behavior
- timer races
- event ordering
- same-tick / delayed event interactions
Do not duplicate scenarios blindly across both layers.
## Test Selection Rule
Do not choose simulator inputs only from failing tests.
Review all relevant tests and classify them by:
- protocol significance
- simulator value
- implementation specificity
Good simulator candidates often come from:
- barrier truth
- catch-up vs rebuild
- stale message rejection
- failover / promotion safety
- changed-address restart
- mode semantics
Keep real-only tests for:
- wire format
- OS timing
- exact WAL file behavior
- frontend transport specifics
## Version Comparison Rule
When designing a successor protocol:
- keep the old version visible
- reproduce the old failure or limitation
- show the improved behavior in the new version
For `sw-block`, that means:
- `V1`
- `V1.5`
- `V2`
should be compared explicitly where possible.
## Documentation Rule
The docs must track three different things:
### `learn/projects/sw-block/`
Use for:
- project history
- V1/V1.5 algorithm records
- phase records
- real test history
### `sw-block/design/`
Use for:
- active design truth
- V2 and later protocol docs
- scenario backlog
- comparison docs
### `sw-block/.private/phase/`
Use for:
- active execution plan
- log
- decisions
## What Good Progress Looks Like
A good protocol iteration usually has this pattern:
1. real failure or design pressure identified
2. scenario named and written down
3. simulator reproduces the bad case
4. new protocol handles it explicitly
5. implementation follows
6. real tests validate it
If one of those steps is missing, confidence is weaker.
## Bottom Line
The process is:
1. design the contract
2. model the state
3. define the scenarios
4. simulate the protocol
5. implement carefully
6. validate in real tests
7. feed failures back into design
That is the process we should keep using for V2 and any later protocol line.

252
sw-block/design/protocol-version-simulation.md

@ -0,0 +1,252 @@
# Protocol Version Simulation
Date: 2026-03-26
Status: design proposal
Purpose: define how the simulator should model WAL V1, WAL V1.5 (Phase 13), and WAL V2 on the same scenario set
## Why This Exists
The simulator is more valuable if the same scenario can answer:
1. how WAL V1 behaves
2. how WAL V1.5 behaves
3. how WAL V2 should behave
That turns the simulator into:
- a regression tool for V1/V1.5
- a justification tool for V2
- a comparison framework across protocol generations
## Principle
Do not fork three separate simulators.
Instead:
- keep one simulator core
- add protocol-version behavior modes
- run the same named scenario under different modes
## Proposed Versions
### `ProtocolV1`
Intent:
- represent pre-Phase-13 behavior
Behavior shape:
- WAL is streamed optimistically
- lagging replica is degraded/excluded quickly
- no real short-gap catch-up contract
- no retention-backed recovery window
- replica usually falls toward rebuild rather than incremental recovery
What scenarios should expose:
- short outage still causes unnecessary degrade/rebuild
- transient jitter may be over-penalized
- poor graceful rejoin story
### `ProtocolV15`
Intent:
- represent Phase-13 WAL V1.5 behavior
Behavior shape:
- reconnect handshake exists
- WAL catch-up exists
- primary may retain WAL longer for lagging replica
- recovery still depends heavily on address stability and control-plane timing
- catch-up may still tail-chase or stall operationally
What scenarios should expose:
- transient disconnects may recover
- restart with new receiver address may still fail practical recovery
- tail-chasing / retention pressure remain structural risks
### `ProtocolV2`
Intent:
- represent the target design
Behavior shape:
- explicit recovery reservation
- explicit catch-up vs rebuild boundary
- lineage-first promotion
- version-correct recovery sources
- explicit abort/rebuild path on non-convergence or lost recoverability
What scenarios should show:
- short gap recovers cleanly
- impossible catch-up fails cleanly
- rebuild is explicit, not accidental
## Behavior Axes To Toggle
The simulator does not need completely different code paths.
It needs protocol-version-sensitive policy on these axes:
### 1. Lagging replica treatment
`V1`:
- degrade quickly
- no meaningful WAL catch-up window
`V1.5`:
- allow WAL catch-up while history remains available
`V2`:
- allow catch-up only with explicit recoverability / reservation
### 2. WAL retention / recoverability
`V1`:
- little or no retention for lagging-replica recovery
`V1.5`:
- retention-based recovery window
- but no strong reservation contract
`V2`:
- recoverability check plus reservation
### 3. Restart / address stability
`V1`:
- generally poor rejoin path
`V1.5`:
- reconnect may work only if replica address is stable
`V2`:
- address/identity assumptions should be explicit in the model
### 4. Tail-chasing behavior
`V1`:
- usually degrades rather than catches up
`V1.5`:
- catch-up may be attempted but may never converge
`V2`:
- non-convergence should explicitly abort/escalate
### 5. Promotion policy
`V1`:
- weaker lineage reasoning
`V1.5`:
- improved epoch/LSN handling
`V2`:
- lineage-first promotion is a first-class rule
## Recommended Simulator API
Add a version enum, for example:
```go
type ProtocolVersion string

const (
	ProtocolV1  ProtocolVersion = "v1"
	ProtocolV15 ProtocolVersion = "v1_5"
	ProtocolV2  ProtocolVersion = "v2"
)
```
Attach it to the simulator or cluster:
```go
type Cluster struct {
	Protocol ProtocolVersion
	// ... existing cluster fields
}
```
## Policy Hooks
Rather than branching everywhere, centralize the differences in a few hooks:
1. `CanAttemptCatchup(...)`
2. `CatchupConvergencePolicy(...)`
3. `RecoverabilityPolicy(...)`
4. `RestartRejoinPolicy(...)`
5. `PromotionPolicy(...)`
That keeps the simulator readable.
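One possible shape for that policy layer, with a `ProtocolV1`-style implementation to illustrate the degrade-quickly behavior (interface and parameter shapes are assumptions for illustration, not the simulator's real signatures):

```go
package main

// VersionPolicy centralizes version-sensitive behavior behind a few hooks
// so the simulator core does not branch on protocol version everywhere.
type VersionPolicy interface {
	CanAttemptCatchup(gap uint64, historyAvailable bool) bool
	CatchupConvergencePolicy(lagTrend int) (abort bool)
	RecoverabilityPolicy(gap uint64, retained uint64) bool
	RestartRejoinPolicy(sameAddress bool) bool
	PromotionPolicy(candidates []string) (winner string, ok bool)
}

// v1Policy sketches pre-Phase-13 behavior: no meaningful catch-up window,
// so a lagging replica falls toward degrade/rebuild.
type v1Policy struct{}

func (v1Policy) CanAttemptCatchup(gap uint64, historyAvailable bool) bool { return false }
func (v1Policy) CatchupConvergencePolicy(lagTrend int) bool               { return true }
func (v1Policy) RecoverabilityPolicy(gap, retained uint64) bool           { return false }
func (v1Policy) RestartRejoinPolicy(sameAddress bool) bool                { return false }
func (v1Policy) PromotionPolicy(c []string) (string, bool) {
	if len(c) == 0 {
		return "", false
	}
	return c[0], true // weaker lineage reasoning: first candidate wins
}
```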
## Example Scenario Comparisons
### Scenario: brief disconnect
`V1`:
- likely degrade / no efficient catch-up
`V1.5`:
- catch-up may succeed if address/history remain stable
`V2`:
- explicit recoverability + reservation
- catch-up only if the missing window is still recoverable
- otherwise explicit rebuild
### Scenario: replica restart with new receiver port
`V1`:
- poor recovery path
`V1.5`:
- background reconnect fails if it retries stale address
`V2`:
- identity/address model must make this explicit
- direct reconnect is not assumed
- use explicit reassignment plus catch-up if recoverable, otherwise rebuild cleanly
### Scenario: primary writes faster than catch-up
`V1`:
- replica degrades
`V1.5`:
- may tail-chase indefinitely or pin WAL too long
`V2`:
- explicit non-convergence detection -> abort / rebuild
## What To Measure
For each scenario, compare:
1. does committed data remain safe?
2. does uncommitted data stay out of committed lineage?
3. does recovery complete or stall?
4. does protocol choose catch-up or rebuild?
5. is the outcome explicit or accidental?
## Immediate Next Step
Start with a minimal versioned policy layer:
1. add `ProtocolVersion`
2. implement one or two version-sensitive hooks:
- `CanAttemptCatchup`
- `CatchupConvergencePolicy`
3. run existing scenarios under:
- `ProtocolV1`
- `ProtocolV15`
- `ProtocolV2`
That is enough to begin proving:
- V1 breaks
- V1.5 improves but still strains
- V2 handles the same scenario more cleanly
## Bottom Line
The same scenario set should become a comparison harness across protocol generations.
That is one of the strongest uses of the simulator:
- not only "does V2 work?"
- but "why is V2 better than V1 and V1.5?"

314
sw-block/design/v1-v15-v2-comparison.md

@ -0,0 +1,314 @@
# V1, V1.5, and V2 Comparison
Date: 2026-03-27
## Purpose
This document compares:
- `V1`: original replicated WAL shipping model
- `V1.5`: Phase 13 catch-up-first improvements on top of V1
- `V2`: explicit FSM / orchestrator / recoverability-driven design under `sw-block/`
It is a design comparison, not a marketing document.
## 1. One-line summary
- `V1` is simple but weak on short-gap recovery.
- `V1.5` materially improves recovery, but still relies on assumptions and incremental control-plane fixes.
- `V2` is structurally cleaner, more explicit, and easier to validate, but is not yet a production engine.
## 2. Steady-State Hot Path
In the healthy case, all three versions can look similar:
1. primary appends ordered WAL
2. primary ships entries to replicas
3. replicas apply in order
4. durability barrier determines when client-visible commit completes
### V1
- simplest replication path
- lagging replica typically degrades quickly
- little explicit recovery structure
### V1.5
- same basic hot path as V1
- WAL retention and reconnect/catch-up improve short outage handling
- extra logic exists, but much of it is off the hot path
### V2
- can keep a similar hot path if implemented carefully
- extra complexity is mainly in:
- recovery planner
- replica state machine
- coordinator/orchestrator
- recoverability checks
### Performance expectation
In a normal healthy cluster:
- `V2` should not be much heavier than `V1.5`
- most V2 complexity sits in failure/recovery/control paths
- there is no proof yet that V2 has better steady-state throughput or latency
## 3. Recovery Behavior
### V1
Recovery is weakly structured:
- lagging replica tends to degrade
- short outage often becomes rebuild or long degraded state
- little explicit catch-up boundary
### V1.5
Recovery is improved:
- short outage can recover by retained-WAL catch-up
- background reconnect closes the `sync_all` dead-loop
- catch-up-first is preferred before rebuild
But the model is still partly implicit:
- reconnect depends on endpoint stability unless control plane refreshes assignment
- recoverability boundary is not as explicit as V2
- tail-chasing and retention pressure still need policy care
### V2
Recovery is explicit by design:
- `InSync`
- `Lagging`
- `CatchingUp`
- `NeedsRebuild`
- `Rebuilding`
And explicit decisions exist for:
- catch-up vs rebuild
- stale-epoch rejection
- promotion candidate choice
- recoverable vs unrecoverable gap
## 4. Real V1.5 Lessons
The main V2 requirements come from real V1.5 behavior.
### 4.1 Changed-address restart
Observed in `CP13-8 T4b`:
- replica restarted
- endpoint changed
- primary shipper held stale address
- direct reconnect could not succeed until control plane refreshed assignment
V1.5 fix:
- saved address used only as hint
- heartbeat-reported address becomes source of truth
- master refreshes primary assignment
Lesson for V2:
- endpoint is not identity
- reassignment must be explicit
### 4.2 Reconnect race
Observed in Phase 13 review:
- barrier path and background reconnect path could both trigger reconnect
V1.5 fix:
- `reconnectMu` serializes reconnect / catch-up
Lesson for V2:
- one active recovery session per replica should be a protocol rule, not just a local mutex trick
### 4.3 Tail-chasing
Even with retained WAL:
- primary may write faster than a lagging replica can recover
- catch-up may not converge
Lesson for V2:
- explicit abort / `NeedsRebuild`
- do not pretend catch-up will always work
### 4.4 Control-plane recovery latency
V1.5 can be correct but still operationally slow if recovery waits on slower management cycles.
Lesson for V2:
- keep authority in coordinator
- but make recovery decisions explicit and fast when possible
## 5. V2 Structural Improvements
V2 is better primarily because it is easier to reason about and validate.
### 5.1 Better state model
Instead of implicit recovery behavior, V2 has:
- per-replica FSM
- volume/orchestrator model
- distributed simulator with scenario coverage
### 5.2 Better validation
V2 has:
- named scenario backlog
- protocol-state assertions
- randomized simulation
- V1/V1.5/V2 comparison tests
This is a major difference from V1/V1.5, where many fixes were discovered through implementation and hardware testing first.
### 5.3 Better correctness boundaries
V2 makes these explicit:
- recoverable gap vs rebuild
- stale traffic rejection
- promotion lineage safety
- reservation or payload availability transitions
## 6. Stability Comparison
### Current judgment
- `V1`: least stable under failure/recovery stress
- `V1.5`: meaningfully better and now functionally validated by real tests
- `V2`: best protocol structure and best simulator confidence
### Important limit
`V2` is not yet proven more stable in production because:
- it is not a production engine yet
- confidence comes from simulator/design work, not real block workload deployment
So the accurate statement is:
- `V2` is more stable **architecturally**
- `V1.5` is more stable **operationally today** because it is implemented and tested on real hardware
## 7. Performance Comparison
### What is likely true
`V2` should perform better than rebuild-heavy recovery approaches when:
- outage is short
- gap is recoverable
- catch-up avoids full rebuild
It should also behave better under:
- flapping replicas
- stale delayed messages
- mixed-state replica sets
### What is not yet proven
We do not yet know whether `V2`, compared with `V1.5`, has:
- better steady-state throughput
- lower p99 latency
- lower CPU overhead
- lower memory overhead
That requires real implementation and benchmarking.
## 8. Smart WAL Fit
### Why Smart WAL is awkward in V1/V1.5
V1/V1.5 do not naturally model:
- payload classes
- recoverability reservations
- historical payload resolution
- explicit recoverable/unrecoverable transition
So Smart WAL would be harder to add cleanly there.
### Why Smart WAL fits V2 better
V2 already has the right conceptual slots:
- `RecoveryClass`
- `WALInline`
- `ExtentReferenced`
- recoverability planner
- catch-up vs rebuild decision point
- simulator for payload-availability transitions
### Important rule
Smart WAL must not mean:
- “read current extent for old LSN”
That is incorrect.
Historical correctness requires:
- WAL inline payload
- or pinned snapshot/versioned extent state
- not current live extent contents
## 9. What Is Proven Today
### Proven
- `V1.5` significantly improves V1 recovery behavior
- real `CP13-8` testing validated the V1.5 data path and `sync_all` behavior
- the V2 simulator covers:
- stale traffic rejection
- tail-chasing
- flapping replicas
- multi-promotion lineage
- changed-address restart comparison
- same-address transient outage comparison
- Smart WAL availability transitions
### Not yet proven
- V2 production implementation quality
- V2 steady-state performance advantage
- V2 real hardware recovery performance
## 10. Bottom Line
If choosing based on current evidence:
- use `V1.5` as the production line today
- use `V2` as the better long-term architecture
If choosing based on protocol quality:
- `V2` is clearly better structured
- `V1.5` is still more ad hoc, even after successful fixes
If choosing based on current real-world proof:
- `V1.5` has the stronger operational evidence today
- `V2` has the stronger design and simulation evidence today

281
sw-block/design/v1-v15-v2-simulator-goals.md

@ -0,0 +1,281 @@
# V1 / V1.5 / V2 Simulator Goals
Date: 2026-03-26
Status: working design note
Purpose: define how the simulator should be used against WAL V1, Phase-13 V1.5, and WAL V2
## Why This Exists
The simulator is not only for validating V2.
It should also be used to:
1. break WAL V1
2. stress WAL V1.5 / Phase 13
3. justify why WAL V2 is needed
This note defines what failures we want the simulator to find in each protocol generation.
## What The Simulator Can And Cannot Do
### What it is good at
The simulator is good at:
1. finding concrete counterexamples
2. exposing bad protocol assumptions
3. checking commit / failover / fencing invariants
4. checking historical data correctness at target `LSN`
### What it is not
The simulator is not a full proof unless promoted to formal model checking.
So the right claim is:
- "no issue found under these modeled runs"
not:
- "protocol proven correct in all implementations"
## Protocol Targets
### WAL V1
Core shape:
- primary ships WAL out
- lagging replica degrades quickly
- no real recoverability contract
- no strong short-gap catch-up window
Primary risk:
- a briefly lagging replica gets downgraded too early and forced into rebuild
### WAL V1.5 / Phase 13
Core shape:
- primary retains WAL longer for lagging replicas
- reconnect / catch-up exists
- rebuild fallback exists
- primary may wait before releasing WAL
Primary risks:
- WAL pinning
- tail chasing
- slow availability recovery
- recoverability assumptions that do not hold long enough
### WAL V2
Core shape:
- explicit state machine
- explicit recoverability / reservation
- catch-up vs rebuild boundary is formalized
- eventual support for `WALInline` vs `ExtentReferenced`
Primary goal:
- no committed data loss
- no false recovery
- cheaper and clearer short-gap recovery
## What To Find In WAL V1
The simulator should try to find scenarios where V1 fails operationally or structurally.
### V1-F1. Short Disconnect Still Forces Rebuild
Sequence:
1. replica disconnects briefly
2. primary continues writing
3. replica returns quickly
Expected ideal behavior:
- short-gap catch-up
What V1 may do:
- downgrade replica too early
- no usable catch-up path
- rebuild required unnecessarily
### V1-F2. Jitter Causes Avoidable Degrade
Sequence:
1. replica is alive but sees delayed/reordered delivery
2. primary interprets this as lag/failure
Failure signal:
- unnecessary downgrade or exclusion
### V1-F3. Repeated Brief Flaps Cause Thrash
Sequence:
1. repeated short disconnect/reconnect
2. primary repeatedly degrades replica
Failure signal:
- poor availability
- excessive rebuild churn
### V1-F4. No Efficient Path Back To Healthy State
Sequence:
1. replica becomes degraded
2. network recovers
Failure signal:
- control plane or protocol provides no clean short recovery path
## What To Find In WAL V1.5 / Phase 13
The simulator should stress whether retention-based catch-up is actually enough.
### V15-F1. Tail Chasing Under Ongoing Writes
Sequence:
1. replica reconnects behind
2. primary keeps writing
3. catch-up tries to close the gap
Failure signal:
- replica never converges
- stays forever behind
- no clean escalation path
### V15-F2. WAL Pinning Harms System Progress
Sequence:
1. replica lags
2. primary retains WAL to help recovery
3. lag persists
Failure signal:
- WAL window remains pinned too long
- reclaim stalls
- system availability or throughput suffers
### V15-F3. Catch-Up Window Expires Mid-Recovery
Sequence:
1. catch-up begins
2. primary continues advancing
3. required recoverability disappears before completion
Failure signal:
- protocol still claims success
- or lacks a clean abort-to-rebuild path
### V15-F4. Restart Recovery Too Slow
Sequence:
1. replica restarts
2. primary blocks writes correctly under `sync_all`
3. service recovery takes too long
Failure signal:
- correctness preserved
- but availability recovery is operationally unacceptable
### V15-F5. Multiple Lagging Replicas Poison Progress
Sequence:
1. more than one replica lags
2. retention and recovery obligations interact
Failure signal:
- one slow replica or mixed states poison the entire volume behavior
## What WAL V2 Should Survive
V2 should not merely avoid V1/V1.5 failures.
It should make them explicit and manageable.
### V2-S1. Short Gap Recovers Cheaply
Expected:
- brief disconnect -> catch-up -> promote
- no rebuild
### V2-S2. Impossible Catch-Up Fails Cleanly
Expected:
- not fully recoverable -> `NeedsRebuild`
- no pretend success
### V2-S3. Reservation Loss Forces Correct Abort
Expected:
- once recoverability is lost, catch-up aborts
- rebuild path takes over
### V2-S4. Promotion Is Lineage-First
Expected:
- new primary chosen from valid lineage
- not simply highest apparent `LSN`
### V2-S5. Historical Data Correctness Is Preserved
Expected:
- no rebuild from current extent pretending to be old state
- correct snapshot/base + replay behavior
## Simulation Strategy By Version
### For V1
Use simulator to:
- break it
- demonstrate avoidable rebuilds and downgrade behavior
The simulator is mainly a diagnostic and justification tool here.
### For V1.5
Use simulator to:
- stress retention-based catch-up
- find operational limits
- expose where retention alone is not enough
The simulator is a stress and tradeoff tool here.
### For V2
Use simulator to:
- validate named protocol scenarios
- validate random/adversarial runs
- confirm state + data correctness under failover/recovery
The simulator is a design-validation tool here.
## Practical Outcome
If the simulator finds:
### On V1
- short outages still lead to rebuild
Then conclusion:
- V1 lacks a real short-gap recovery story
### On V1.5
- retention helps but can still tail-chase or pin WAL too long
Then conclusion:
- V1.5 is a useful bridge, but not the final architecture
### On V2
- catch-up/rebuild boundary is explicit and safe
Then conclusion:
- V2 solves the protocol problem more cleanly
## Bottom Line
Use the simulator differently for each generation:
1. WAL V1: find where it breaks
2. WAL V1.5: find where it strains
3. WAL V2: validate that it behaves correctly and more cleanly
That is how the simulator justifies the architectural move from V1 to V2.

280
sw-block/design/v2-acceptance-criteria.md

@ -0,0 +1,280 @@
# V2 Acceptance Criteria
Date: 2026-03-27
## Purpose
This document defines the minimum protocol-validation bar for V2.
It is not the full scenario backlog.
It is the smaller acceptance set that should be true before we claim:
- the V2 protocol shape is validated enough to guide implementation
## Scope
This acceptance set is about:
- protocol correctness
- recovery correctness
- lineage / fencing correctness
- data correctness at target `LSN`
This acceptance set is not yet about:
- production performance
- frontend integration
- wire protocol
- disk implementation details
## Acceptance Rule
A V2 acceptance item should satisfy all of:
1. named scenario
2. explicit expected behavior
3. simulator coverage
4. clear invariant or pass condition
5. mapped reason why it matters
## Acceptance Set
### A1. Committed Data Survives Failover
Must prove:
- acknowledged data is not lost after primary failure and promotion
Evidence:
- `S1`
- distributed simulator pass
Pass condition:
- promoted node matches reference state at committed `LSN`
### A2. Uncommitted Data Is Not Revived
Must prove:
- non-acknowledged writes do not become committed after failover
Evidence:
- `S2`
Pass condition:
- committed prefix remains at the previous valid boundary
### A3. Stale Epoch Traffic Is Fenced
Must prove:
- old primary / stale sender traffic cannot mutate current lineage
Evidence:
- `S3`
- stale write / stale barrier / stale delayed ack scenarios
Pass condition:
- stale traffic is rejected
- committed prefix does not change
### A4. Short-Gap Catch-Up Works
Must prove:
- brief outage with recoverable gap returns via catch-up, not rebuild
Evidence:
- `S4`
- same-address transient outage comparison
Pass condition:
- recovered replica returns to `InSync`
- final state matches reference
### A5. Non-Convergent Catch-Up Escalates Explicitly
Must prove:
- tail-chasing or failed catch-up does not pretend success
Evidence:
- `S6`
Pass condition:
- explicit `CatchingUp -> NeedsRebuild`
### A6. Recoverability Boundary Is Explicit
Must prove:
- recoverable vs unrecoverable gap is decided explicitly
Evidence:
- `S7`
- Smart WAL availability transition scenarios
Pass condition:
- recovery aborts when reservation/payload availability is lost
- rebuild becomes the explicit fallback
### A7. Historical Data Correctness Holds
Must prove:
- recovered data for target `LSN` is historically correct
- current extent cannot fake old history
Evidence:
- `S8`
- `S9`
Pass condition:
- snapshot + tail rebuild matches reference state
- current-extent reconstruction of old `LSN` fails correctness
### A8. Durability Mode Semantics Are Correct
Must prove:
- `best_effort`, `sync_all`, and `sync_quorum` behave as intended under mixed replica states
Evidence:
- `S10`
- `S11`
- timeout-backed quorum/all race tests
Pass condition:
- `sync_all` remains strict
- `sync_quorum` commits only with true durable quorum
- invalid `sync_quorum` topology assumptions are rejected
### A9. Promotion Uses Safe Candidate Eligibility
Must prove:
- promotion requires:
- running
- epoch alignment
- state eligibility
- committed-prefix sufficiency
Evidence:
- stronger `S12`
- candidate eligibility tests
Pass condition:
- unsafe candidates are rejected by default
- desperate promotion, if any, is explicit and separate
### A10. Changed-Address Restart Is Explicitly Recoverable
Must prove:
- endpoint is not identity
- changed-address restart does not rely on stale endpoint reuse
Evidence:
- V1 / V1.5 / V2 changed-address comparison
- endpoint-version / assignment-update simulator flow
Pass condition:
- stale endpoint is rejected
- control-plane update refreshes primary view
- recovery proceeds only after explicit update
### A11. Timeout Semantics Are Explicit
Must prove:
- barrier, catch-up, and reservation timeouts are first-class protocol behavior
Evidence:
- Phase 03 P0 timeout tests
Pass condition:
- timeout effects are explicit
- stale timeouts do not regress recovered state
- late barrier ack after timeout is rejected
### A12. Timer Races Are Stable
Must prove:
- timer/event ordering does not silently break protocol guarantees
Evidence:
- Phase 03 P1/P2 race tests
Pass condition:
- same-tick ordering is explicit
- promotion / epoch bump / timeout interactions preserve invariants
- traces are debuggable
## Compare Requirement
Where meaningful, V2 acceptance should include comparison against:
- `V1`
- `V1.5`
Especially for:
- changed-address restart
- same-address transient outage
- tail-chasing
- slow control-plane recovery
## Required Evidence
Before calling V2 protocol validation “good enough”, we want:
1. scenario coverage in `v2_scenarios.md`
2. selected simulator tests in `distsim`
3. timing/race tests in `eventsim`
4. V1 / V1.5 / V2 comparison where relevant
5. review sign-off that the tests prove the right thing
## What This Does Not Prove
Even if all acceptance items pass, this still does not prove:
- production implementation quality
- wire protocol correctness
- real performance
- disk-level behavior
Those require later implementation and real-system validation.
## Bottom Line
If A1 through A12 are satisfied, V2 is validated enough at the protocol/design level to justify:
1. implementation slicing
2. Smart WAL design refinement
3. later real-engine integration

234
sw-block/design/v2-dist-fsm.md

@ -0,0 +1,234 @@
# WAL V2 Distributed Simulator
Date: 2026-03-26
Status: design proposal
Purpose: define the next prototype layer above `ReplicaFSM` and `VolumeModel` so WAL V2 can be validated as a distributed state machine rather than only a local state machine
## Why This Exists
The current V2 prototype already has:
- `ReplicaFSM`
- `VolumeModel`
- `RecoveryPlanner`
- scenario tracing
That is enough to reason about local recovery logic and volume-level admission.
It is not enough to prove the distributed safety claim.
The real system question is:
- when time moves forward, nodes start/stop/disconnect/reconnect, and the coordinator changes epoch,
- do all acknowledged writes remain recoverable according to the configured durability policy?
That requires a distributed simulator.
## Core Idea
Model the system as:
1. node-local state machines
2. a coordinator state machine
3. a time-driven message simulator
4. a reference data model used as the correctness oracle
## Layers
### 1. `NodeModel`
Each node has:
- role
- epoch seen
- local WAL state
- head
- tail
- `receivedLSN`
- `flushedLSN`
- checkpoint/snapshot state
- `cpLSN`
- local extent state
- local connectivity state
- local `ReplicaFSM` for each remote relationship as needed
### 2. `CoordinatorModel`
The coordinator owns:
- current epoch
- primary assignment
- membership
- durability policy
- rebuild assignments
- promotion decisions
### 3. `Network/Time Simulator`
The simulator owns:
- logical time ticks
- message delivery queues
- delay, drop, and disconnect events
- node start/stop/restart
### 4. `Reference Model`
The reference model is the correctness oracle.
It applies the committed write history to an idealized block map.
At any target `LSN = X`, it can answer:
- what value should each block contain at `X`?
## Data Correctness Model
### Synthetic 4K writes
For simulation, each 4K write should be represented as:
- block ID
- value
A simple deterministic choice is:
- `value = LSN`
Example:
- `LSN 10`: write block 7 = 10
- `LSN 11`: write block 2 = 11
- `LSN 12`: write block 7 = 12
This makes correctness checks trivial.
### Why this matters
This catches the exact extent-recovery trap:
1. `LSN 10`: block 7 = 10
2. `LSN 12`: block 7 = 12
If recovery claims to rebuild state at `LSN 10` using current extent and returns block 7 = 12, the simulator detects the bug immediately.
## Golden Invariant
For any node declared recovered to target `LSN = T`:
- node extent state must equal the reference model's state at `T`
Not:
- equal to current latest state
- equal to any valid-looking value
Exactly:
- the reference state at target `LSN`
## Recovery Correctness Rules
### WAL replay correctness
For `(startLSN, endLSN]` replay to be valid:
- every record in the interval must exist
- every payload must be the correct historical version for its LSN
- no replay gaps are allowed
- no stale-epoch records are allowed
### Extent/snapshot correctness
Extent-based recovery is valid only if the data source is version-correct.
Allowed examples:
- immutable snapshot at `cpLSN`
- pinned copy-on-write generation
- pinned payload object referenced by a recovery record
Not allowed:
- current live extent used as if it were historical state at old `cpLSN`
## Suggested Prototype Package
Prototype location:
- `sw-block/prototype/distsim/`
Suggested files:
- `types.go`
- `node.go`
- `coordinator.go`
- `network.go`
- `reference.go`
- `scenario.go`
- `sim_test.go`
## Minimal First Milestone
Do not try to simulate the whole product first.
First milestone:
1. one primary
2. one replica
3. time ticks
4. synthetic 4K writes with deterministic values
5. canonical reference model
6. simple recovery check:
- WAL replay recovers correct value
- current extent alone does not recover old `LSN`
- snapshot/base image at `cpLSN` does recover correct value
If that milestone is solid, then add:
- failover
- quorum
- multi-replica
- coordinator promotion rules
## Test Cases To Add Early
### 1. WAL replay preserves historical values
- write block 7 = 10
- write block 7 = 12
- replay only to `LSN 10`
- expect block 7 = 10
### 2. Current extent cannot reconstruct old `LSN`
- same write sequence
- try rebuilding `LSN 10` from latest extent
- expect mismatch/error
### 3. Snapshot at `cpLSN` works
- snapshot at `LSN 10`
- later overwrite block 7 at `LSN 12`
- rebuild from snapshot `LSN 10`
- expect block 7 = 10
### 4. Reservation expiration invalidates recovery
- recovery window initially valid
- time advances
- reservation expires
- recovery must abort rather than return partial or wrong state
## Relationship To Existing Prototype
This simulator should reuse existing prototype concepts where possible:
- `fsmv2` for node-local recovery lifecycle
- `volumefsm` ideas for mode semantics and admission
- `RecoveryPlanner` for recoverability decisions
The simulator is the next proof layer:
- not just whether transitions are legal
- but whether data remains correct under those transitions
## Bottom Line
WAL V2 correctness is not only a state problem.
It is also a data-version problem.
The distributed simulator should therefore prove two things together:
1. state-machine safety
2. data correctness at target `LSN`
That is the right next prototype layer if the goal is to prove:
- quorum commit safety
- no committed data loss
- no incorrect recovery from later extent state

159
sw-block/design/v2-first-slice-sender-ownership.md

@ -0,0 +1,159 @@
# V2 First Slice: Per-Replica Sender/Session Ownership
Date: 2026-03-27
Status: implementation-ready
Depends-on: Q1 (recovery session), Q6 (orchestrator scope), Q7 (first slice)
## Problem
`SetReplicaAddrs()` replaces the entire `ShipperGroup` atomically. This causes:
1. **State loss on topology change.** All shippers are destroyed and recreated.
Recovery state (`replicaFlushedLSN`, `lastContactTime`, catch-up progress) is lost.
After a changed-address restart, the new shipper starts from scratch.
2. **No per-replica identity.** Shippers are identified by array index. The master
cannot target a specific replica for rebuild/catch-up — it must re-issue the
entire address set.
3. **Background reconnect races.** A reconnect cycle may be in progress when
`SetReplicaAddrs` replaces the group. The in-progress reconnect's connection
objects become orphaned.
## Design
### Per-replica sender identity
`ShipperGroup` changes from `[]*WALShipper` to `map[string]*WALShipper`, keyed by
the replica's canonical data address. Each shipper stores its own `ReplicaID`.
```go
type WALShipper struct {
	ReplicaID string // canonical data address — identity across reconnects
	// ... existing fields
}

type ShipperGroup struct {
	mu       sync.RWMutex
	shippers map[string]*WALShipper // keyed by ReplicaID
}
```
### ReconcileReplicas replaces SetReplicaAddrs
Instead of replacing the entire group, `ReconcileReplicas` diffs old vs new:
```
ReconcileReplicas(newAddrs []ReplicaAddr):
    for each existing shipper:
        if NOT in newAddrs → Stop and remove
    for each newAddr:
        if matching shipper exists → keep (preserve state)
        if no match → create new shipper
```
This preserves `replicaFlushedLSN`, `lastContactTime`, catch-up progress, and
background reconnect goroutines for replicas that stay in the set.
`SetReplicaAddrs` becomes a wrapper:
```go
func (v *BlockVol) SetReplicaAddrs(addrs []ReplicaAddr) {
	if v.shipperGroup == nil {
		v.shipperGroup = NewShipperGroup(nil)
	}
	v.shipperGroup.ReconcileReplicas(addrs, v.makeShipperFactory())
}
```
### Changed-address restart flow
1. Replica restarts on new port. Heartbeat reports new address.
2. Master detects endpoint change (address differs, same volume).
3. Master sends assignment update to primary with new replica address.
4. Primary's `ReconcileReplicas` receives `[oldAddr1, newAddr2]`.
5. Old shipper for the changed replica is stopped (old address gone from set).
6. New shipper created with new address — but this is a fresh shipper.
7. New shipper bootstraps: Disconnected → Connecting → CatchingUp → InSync.
The improvement over V1.5: the **other** replicas in the set are NOT disturbed.
Only the changed replica gets a fresh shipper. Recovery state for stable replicas
is preserved.
### Recovery session
Each WALShipper already contains the recovery state machine:
- `state` (Disconnected → Connecting → CatchingUp → InSync → Degraded → NeedsRebuild)
- `replicaFlushedLSN` (authoritative progress)
- `lastContactTime` (retention budget)
- `catchupFailures` (escalation counter)
- Background reconnect goroutine
No separate `RecoverySession` object is needed. The WALShipper IS the per-replica
recovery session. The state machine already tracks the session lifecycle.
What changes: the session is no longer destroyed on topology change (unless the
replica itself is removed from the set).
### Coordinator vs primary responsibilities
| Responsibility | Owner |
|---------------|-------|
| Endpoint truth (canonical address) | Coordinator (master) |
| Assignment updates (add/remove replicas) | Coordinator |
| Epoch authority | Coordinator |
| Session creation trigger | Coordinator (via assignment) |
| Session execution (reconnect, catch-up, barrier) | Primary (via WALShipper) |
| Timeout enforcement | Primary |
| Ordered receive/apply | Replica |
| Barrier ack | Replica |
| Heartbeat reporting | Replica |
### Migration from current code
| Current | V2 |
|---------|-----|
| `ShipperGroup.shippers []*WALShipper` | `ShipperGroup.shippers map[string]*WALShipper` |
| `SetReplicaAddrs()` creates all new | `ReconcileReplicas()` diffs and preserves |
| `StopAll()` in demote | `StopAll()` unchanged (stops all) |
| `ShipAll(entry)` iterates slice | `ShipAll(entry)` iterates map values |
| `BarrierAll(lsn)` parallel slice | `BarrierAll(lsn)` parallel map values |
| `MinReplicaFlushedLSN()` iterates slice | Same, iterates map values |
| `ShipperStates()` iterates slice | Same, iterates map values |
| No per-shipper identity | `WALShipper.ReplicaID` = canonical data addr |
### Files changed
| File | Change |
|------|--------|
| `wal_shipper.go` | Add `ReplicaID` field, pass in constructor |
| `shipper_group.go` | `map[string]*WALShipper`, `ReconcileReplicas`, update iterators |
| `blockvol.go` | `SetReplicaAddrs` calls `ReconcileReplicas`, shipper factory |
| `promotion.go` | No change (StopAll unchanged) |
| `dist_group_commit.go` | No change (uses ShipperGroup API) |
| `block_heartbeat.go` | No change (uses ShipperStates) |
### Acceptance bar
The following existing tests must continue to pass:
- All CP13-1 through CP13-7 protocol tests (sync_all_protocol_test.go)
- All adversarial tests (sync_all_adversarial_test.go)
- All baseline tests (sync_all_bug_test.go)
- All rebuild tests (rebuild_v1_test.go)
The following CP13-8 tests validate the V2 improvement:
- `TestCP13_SyncAll_ReplicaRestart_Rejoin` — changed-address recovery
- `TestAdversarial_ReconnectUsesHandshakeNotBootstrap` — V2 reconnect protocol
- `TestAdversarial_CatchupMultipleDisconnects` — state preservation across reconnects
New tests to add:
- `TestReconcileReplicas_PreservesExistingShipper` — stable replica keeps state
- `TestReconcileReplicas_RemovesStaleShipper` — removed replica stopped
- `TestReconcileReplicas_AddsNewShipper` — new replica bootstraps
- `TestReconcileReplicas_MixedUpdate` — one kept, one removed, one added
## Non-goals for this slice
- Smart WAL payload classes
- Recovery reservation protocol
- Full coordinator orchestration
- New transport layer

193
sw-block/design/v2-first-slice-session-ownership.md

@ -0,0 +1,193 @@
# V2 First Slice: Per-Replica Sender and Recovery Session Ownership
Date: 2026-03-27
## Purpose
This document defines the first real V2 implementation slice.
The slice is intentionally narrow:
- per-replica sender ownership
- explicit recovery session ownership
- clear coordinator vs primary responsibility
This is the first step toward a standalone V2 block engine under `sw-block/`.
## Why This Slice First
It directly addresses the clearest V1.5 structural limits:
- sender identity loss when replica sets are refreshed
- changed-address restart recovery complexity
- repeated reconnect cycles without stable per-replica ownership
- adversarial Phase 13 boundary tests that V1.5 cannot cleanly satisfy
It also avoids jumping too early into:
- Smart WAL
- new backend storage layout
- full production transport redesign
## Core Decision
Use:
- **one sender owner per replica**
- **at most one active recovery session per replica per epoch**
Healthy replicas may only need their steady sender object.
Degraded / reconnecting replicas gain an explicit recovery session owned by the primary.
## Ownership Split
### Coordinator
Owns:
- replica identity / endpoint truth
- assignment updates
- epoch authority
- session creation / destruction intent
Does not own:
- byte-by-byte catch-up execution
- local sender loop scheduling
### Primary
Owns:
- per-replica sender objects
- per-replica recovery session execution
- reconnect / catch-up progress
- timeout enforcement for active session
- transition from:
- normal sender
- to recovery session
- back to normal sender
### Replica
Owns:
- receive/apply path
- barrier ack
- heartbeat/reporting
Replica remains passive from the recovery-orchestration point of view.
## Data Model
### Sender Owner
Per replica, maintain a stable sender owner with:
- replica logical ID
- current endpoint
- current epoch view
- steady-state health/status
- optional active recovery session reference
### Recovery Session
Per replica, per epoch:
- `ReplicaID`
- `Epoch`
- `EndpointVersion` or equivalent endpoint truth
- `State`
- `connecting`
- `catching_up`
- `in_sync`
- `needs_rebuild`
- `StartLSN`
- `TargetLSN`
- timeout / deadline metadata
## Session Rules
1. only one active session per replica per epoch
2. new assignment for same replica:
- supersedes old session only if epoch/session generation is newer
3. stale session must not continue after:
- epoch bump
- endpoint truth change
- explicit coordinator replacement
## Minimal State Transitions
### Healthy path
1. replica sender exists
2. sender ships normally
3. replica remains `InSync`
### Recovery path
1. sender detects or is told replica is not healthy
2. coordinator provides valid assignment/endpoint truth
3. primary creates recovery session
4. session connects
5. session catches up if recoverable
6. on success:
- session closes
- steady sender resumes normal state
### Rebuild path
1. session determines catch-up is not sufficient
2. session transitions to `needs_rebuild`
3. higher layer rebuild flow takes over
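The branch between the recovery path and the rebuild path can be sketched as a gap classifier. Names here (`classifyGap`, `retainedFrom`) are illustrative, not prototype APIs; the assumption is that the primary retains WAL from `retainedFrom` onward:

```go
package main

import "fmt"

// Outcome is the explicit result of examining a replica's gap.
type Outcome string

const (
	ZeroGap      Outcome = "zero_gap"      // nothing missing: close session immediately
	CatchUp      Outcome = "catch_up"      // gap lies inside the retained window
	NeedsRebuild Outcome = "needs_rebuild" // gap extends past retained WAL: hand off to rebuild
)

// classifyGap decides which path a recovery session takes, given the
// replica's flushed position, the primary's tail, and the oldest retained LSN.
func classifyGap(replicaFlushed, primaryTail, retainedFrom uint64) Outcome {
	switch {
	case replicaFlushed >= primaryTail:
		return ZeroGap
	case replicaFlushed+1 >= retainedFrom:
		return CatchUp
	default:
		return NeedsRebuild
	}
}

func main() {
	fmt.Println(classifyGap(120, 120, 80)) // zero_gap
	fmt.Println(classifyGap(100, 120, 80)) // catch_up
	fmt.Println(classifyGap(50, 120, 80))  // needs_rebuild
}
```

The point is that `needs_rebuild` is a first-class outcome, not a failure mode discovered after catch-up stalls.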
## What This Slice Does Not Include
Not in the first slice:
- Smart WAL payload classes in production
- snapshot pinning / GC logic
- new on-disk engine
- frontend publication changes
- full production event scheduler
## Proposed V2 Workspace Target
Do this under `sw-block/`, not `weed/storage/blockvol/`.
Suggested area:
- `sw-block/prototype/enginev2/`
Suggested first files:
- `sw-block/prototype/enginev2/session.go`
- `sw-block/prototype/enginev2/sender.go`
- `sw-block/prototype/enginev2/group.go`
- `sw-block/prototype/enginev2/session_test.go`
The first code does not need full storage I/O.
It should prove ownership and transition shape first.
## Acceptance For This Slice
The slice is good enough when:
1. sender identity is stable per replica
2. changed-address reassignment updates the right sender owner
3. multiple reconnect cycles do not lose recovery ownership
4. stale session does not survive epoch bump
5. the 4 Phase 13 V2-boundary tests have a clear path to become satisfiable
## Relationship To Existing Simulator
This slice should align with:
- `v2-acceptance-criteria.md`
- `v2-open-questions.md`
- `v1-v15-v2-comparison.md`
- `distsim` / `eventsim` behavior
The simulator remains the design oracle.
The first implementation slice should not contradict it.

161
sw-block/design/v2-open-questions.md

@@ -0,0 +1,161 @@
# V2 Open Questions
Date: 2026-03-27
## Purpose
This document records what is still algorithmically open in V2.
These are not bugs.
They are design questions that should be closed deliberately before or during implementation slicing.
## 1. Recovery Session Ownership
Open question:
- what is the exact ownership model for one active recovery session per replica?
Need to decide:
- session identity fields
- supersede vs reject vs join behavior
- how epoch/session invalidates old recovery work
Why it matters:
- V1.5 needed local reconnect serialization
- V2 should make this a protocol rule
## 2. Promotion Threshold Strictness
Open question:
- must a promotion candidate always have `FlushedLSN >= CommittedLSN`, or is there any narrower safe exception?
Current prototype:
- uses committed-prefix sufficiency as the safety gate
Why it matters:
- determines how strict real failover behavior should be
## 3. Recovery Reservation Shape
Open question:
- what exactly is reserved during catch-up?
Need to decide:
- WAL range only?
- payload pins?
- snapshot pin?
- expiry semantics?
Why it matters:
- recoverability must be explicit, not hopeful
## 4. Smart WAL Payload Classes
Open question:
- which payload classes are allowed in V2 first?
Current model has:
- `WALInline`
- `ExtentReferenced`
Need to decide:
- whether first real implementation includes both
- whether `ExtentReferenced` requires pinned snapshot/versioned extent only
## 5. Smart WAL Garbage Collection Boundary
Open question:
- when can a referenced payload stop being recoverable?
Need to decide:
- GC interaction
- timeout interaction
- recovery session pinning
Why it matters:
- this is the line between catch-up and rebuild
## 6. Exact Orchestrator Scope
Open question:
- how much of the final V2 control logic belongs in:
- local node state
- coordinator
- transport/session manager
Why it matters:
- avoid V1-style scattered state ownership
## 7. First Real Implementation Slice
Open question:
- what is the first production slice of V2?
Candidates:
1. per-replica sender/session ownership
2. explicit recovery-session management
3. catch-up/rebuild decision plumbing
Recommended default:
- per-replica sender/session ownership
## 8. Steady-State Overhead Budget
Open question:
- what overhead is acceptable in the normal healthy case?
Need to decide:
- metadata checks on hot path
- extra state bookkeeping
- what stays off the hot path
Why it matters:
- V2 should be structurally better without becoming needlessly heavy
## 9. Smart WAL First-Phase Goal
Open question:
- is the first Smart WAL goal:
- lower recovery cost
- lower steady-state WAL volume
- or simply proving the historical correctness model?
Recommended answer:
- first prove correctness model, then optimize
## 10. End Condition For Simulator Work
Open question:
- when do we stop adding simulator depth and start implementation?
Suggested answer:
- once acceptance criteria are satisfied
- and the first implementation slice is clear
- and remaining simulator additions are no longer changing core protocol decisions

239
sw-block/design/v2-prototype-roadmap-and-gates.md

@@ -0,0 +1,239 @@
# V2 Prototype Roadmap And Gates
Date: 2026-03-27
Status: active
Purpose: define the remaining prototype roadmap, the validation gates between stages, and the decision point between real V2 engine work and possible V2.5 redesign
## Current Position
V2 design/FSM/simulator work is sufficiently closed for serious prototyping, but not frozen against later `V2.5` adjustments.
Current state:
- design proof: high
- execution proof: medium
- data/recovery proof: low
- prototype end-to-end proof: low
Rough prototype progress:
- `25%` to `35%`
This is an early executable prototype, not an engine-ready one.
## Roadmap Goal
Answer this question with prototype evidence:
- can V2 become a real engine path?
- or should it become `V2.5` before real implementation begins?
## Step 1: Execution Authority Closure
Purpose:
- finish the sender / recovery-session authority model so stale work is unambiguously rejected
Scope:
1. ownership-only `AttachSession()` / `SupersedeSession()`
2. execution begins only through execution APIs
3. stale handshake / progress / completion fenced by `sessionID`
4. endpoint bump / epoch bump invalidate execution authority
5. sender-group preserve-or-kill behavior is explicit
Done when:
1. all execution APIs are sender-gated and reject stale `sessionID`
2. session creation is separated from execution start
3. phase ordering is enforced
4. endpoint bump / epoch bump invalidate execution authority correctly
5. mixed add/remove/update reconciliation preserves or kills state exactly as intended
Main files:
- `sw-block/prototype/enginev2/`
- `sw-block/prototype/distsim/`
- `learn/projects/sw-block/phases/phase-13-v2-boundary-tests.md`
Key gate:
- old recovery work cannot mutate current sender state at any execution stage
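The sessionID fencing in items 1-4 might look like this in miniature. `RecordHandshake` and `SupersedeSession` are API names from this roadmap; the struct body and field names are simplified stand-ins, not the enginev2 implementation:

```go
package main

import (
	"errors"
	"fmt"
)

var errStaleSession = errors.New("stale session id")

// Sender sketch: every execution API checks the caller's sessionID against
// the currently attached session before mutating any state.
type Sender struct {
	activeSessionID uint64
	handshakeDone   bool
}

// RecordHandshake is sender-gated: a stale sessionID is rejected outright,
// so old recovery work cannot mutate current sender state.
func (s *Sender) RecordHandshake(sessionID uint64) error {
	if sessionID != s.activeSessionID {
		return errStaleSession
	}
	s.handshakeDone = true
	return nil
}

// SupersedeSession rotates the id; an endpoint bump or epoch bump would call
// this, invalidating the previous session's execution authority.
func (s *Sender) SupersedeSession(newID uint64) {
	s.activeSessionID = newID
	s.handshakeDone = false
}

func main() {
	s := &Sender{activeSessionID: 1}
	fmt.Println(s.RecordHandshake(1)) // <nil>
	s.SupersedeSession(2)
	fmt.Println(s.RecordHandshake(1)) // stale session id
}
```

Because the gate is on every execution API, the ordering of supersede versus in-flight completion stops mattering: the stale side always loses.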
## Step 2: Orchestrated Recovery Prototype
Purpose:
- move from good local sender APIs to an actual prototype recovery flow driven by assignment/update intent
Scope:
1. assignment/update intent creates or supersedes recovery attempts
2. reconnect / reassignment / catch-up / rebuild decision path
3. sender-group becomes orchestration entry point
4. explicit outcome branching:
- zero-gap fast completion
- positive-gap catch-up
- unrecoverable gap -> `NeedsRebuild`
Done when:
1. the prototype expresses a realistic recovery flow from topology/control intent
2. sender-group drives recovery creation, not only unit helpers
3. recovery outcomes are explicit and testable
4. orchestrator responsibility is clear enough to narrow `v2-open-questions.md` item 6
Key gate:
- recovery control is no longer scattered across helper calls; it has one clear orchestration path
## Step 3: Minimal Historical Data Prototype
Purpose:
- prove the recovery model against real data-history assumptions, not only control logic
Scope:
1. minimal WAL/history model, not full engine
2. enough to exercise:
- catch-up range
- retained prefix/window
- rebuild fallback
- historical correctness at target LSN
3. enough reservation/recoverability state to make recovery explicit
Done when:
1. the prototype can prove why a gap is recoverable or unrecoverable
2. catch-up and rebuild decisions are backed by minimal data/history state
3. `v2-open-questions.md` items 3, 4, 5 are closed or sharply narrowed
4. prototype evidence strengthens acceptance criteria `A5`, `A6`, and `A7`
Key gate:
- the prototype must explain why recovery is allowed, not just that policy says it is
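One way to make the "explain why" gate concrete is a decision function that returns its reason alongside the verdict. A sketch under the assumption of a single retained-window lower bound (`retainedFrom` is a hypothetical name):

```go
package main

import "fmt"

// explainRecoverable decides whether a gap is recoverable and says why,
// derived from retained history rather than a bare policy flag.
// retainedFrom is the oldest LSN still replayable from retained WAL.
func explainRecoverable(gapStart, retainedFrom uint64) (bool, string) {
	if gapStart >= retainedFrom {
		return true, fmt.Sprintf(
			"gap start %d lies inside retained window [%d, ...]", gapStart, retainedFrom)
	}
	return false, fmt.Sprintf(
		"gap start %d precedes retained window start %d; rebuild required", gapStart, retainedFrom)
}

func main() {
	ok, why := explainRecoverable(90, 80)
	fmt.Println(ok, "-", why)
	ok, why = explainRecoverable(40, 80)
	fmt.Println(ok, "-", why)
}
```

Carrying the reason makes the catch-up versus rebuild boundary auditable in tests and logs, which is exactly what this gate asks for.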
## Step 4: Prototype Scenario Closure
Purpose:
- make the prototype itself demonstrate the V2 story end-to-end
Scope:
1. map key V2 scenarios onto the prototype
2. express the 4 V2-boundary cases against prototype behavior
3. add one small end-to-end harness inside `sw-block/prototype/`
4. align prototype evidence with acceptance criteria
Done when:
1. prototype behavior can be reviewed scenario-by-scenario
2. key V1/V1.5 failures have prototype equivalents
3. prototype outcomes match intended V2 design claims
4. remaining gaps are clearly real-engine gaps, not protocol/prototype ambiguity
Key gate:
- a reviewer can trace acceptance criteria -> scenario -> prototype behavior without hand-waving
## Gates
### Gate 1: Design Closed Enough
Status:
- mostly passed
Meaning:
1. acceptance criteria exist
2. core simulator exists
3. ownership gap from V1.5 is understood
### Gate 2: Execution Authority Closed
Passes after Step 1.
Meaning:
- stale execution results cannot mutate current authority
### Gate 3: Orchestrated Recovery Closed
Passes after Step 2.
Meaning:
- recovery flow is controlled by one coherent orchestration model
### Gate 4: Historical Data Model Closed
Passes after Step 3.
Meaning:
- catch-up vs rebuild is backed by executable data-history logic
### Gate 5: Prototype Convincing
Passes after Step 4.
Meaning:
- enough evidence exists to choose:
- real V2 engine path
- or `V2.5` redesign
## Decision Gate After Step 4
### Path A: Real V2 Engine Planning
Choose this if:
1. prototype control logic is coherent
2. recovery boundary is explicit
3. boundary cases are convincing
4. no major structural flaw remains
Outputs:
1. real engine slicing plan
2. migration/integration plan into future standalone `sw-block`
3. explicit non-goals for first production version
### Path B: V2.5 Redesign
Choose this if the prototype reveals:
1. ownership/orchestration still too fragile
2. recovery boundary still too implicit
3. historical correctness model too costly or too unclear
4. too much complexity leaks into the hot path
Output:
- write `V2.5` as a design/prototype correction before engine work
## What Not To Do Yet
1. no Smart WAL expansion beyond what Step 3 minimally needs
2. no backend/storage-engine redesign
3. no V1 production integration
4. no frontend/wire protocol work
5. no performance optimization as a primary goal
## Practical Summary
Current sequence:
1. finish execution authority
2. build orchestrated recovery
3. add minimal historical-data proof
4. close key scenarios against the prototype
5. decide:
- V2 engine
- or `V2.5`

249
sw-block/design/v2-scenario-sources-from-v1.md

@@ -0,0 +1,249 @@
# V2 Scenario Sources From V1 and V1.5
Date: 2026-03-27
## Purpose
This document distills V1 / V1.5 real-test material into V2 scenario inputs.
Sources:
- `learn/projects/sw-block/phases/phase13_test.md`
- `learn/projects/sw-block/phases/phase-13-v2-boundary-tests.md`
This is not the active scenario backlog.
Use:
- `v2_scenarios.md` for the active V2 scenario set
- this file for historical source and rationale
## How To Use This File
For each item below:
1. keep the real V1/V1.5 test as implementation evidence
2. create or maintain a V2 simulator scenario for the protocol core
3. define the expected V2 behavior explicitly
## Source Buckets
### 1. Core protocol behavior
These are the highest-value simulator inputs.
- barrier durability truth
- reconnect + catch-up
- non-convergent catch-up -> rebuild
- rebuild fallback
- failover / promotion safety
- WAL retention / tail-chasing
- durability mode semantics
Recommended V2 treatment:
- `sim_core`
### 2. Supporting invariants
These matter, but usually as reduced simulator checks.
- canonical address handling
- replica role/epoch gating
- committed-prefix rules
- rebuild publication cleanup
- assignment refresh behavior
Recommended V2 treatment:
- `sim_reduced`
### 3. Real-only implementation behavior
These should usually stay in real-engine tests.
- actual wire encoding / decode bugs
- real disk / `fdatasync` timing
- NVMe / iSCSI frontend behavior
- Go concurrency artifacts tied to concrete implementation
Recommended V2 treatment:
- `real_only`
### 4. V2 boundary items
These are especially important.
They should remain visible as:
- current V1/V1.5 limitation
- explicit V2 acceptance target
Recommended V2 treatment:
- `v2_boundary`
## Distilled Scenario Inputs
### A. Barrier truth uses durable replica progress
Real source:
- Phase 13 barrier / `replicaFlushedLSN` tests
Why it matters:
- commit must follow durable replica progress, not send progress
V2 target:
- barrier completion counted only from explicit durable progress state
### B. Same-address transient outage
Real source:
- Phase 13 reconnect / catch-up tests
- `CP13-8` short outage recovery
Why it matters:
- proves cheap short-gap recovery path
V2 target:
- explicit recoverability check
- catch-up if recoverable
- rebuild otherwise
### C. Changed-address restart
Real source:
- `CP13-8 T4b`
- changed-address refresh fixes
Why it matters:
- endpoint is not identity
- stale endpoint must not remain authoritative
V2 target:
- heartbeat/control-plane learns new endpoint
- reassignment updates sender target
- recovery session starts only after endpoint truth is updated
### D. Non-convergent catch-up / tail-chasing
Real source:
- Phase 13 retention + catch-up + rebuild fallback line
Why it matters:
- “catch-up exists” is not enough
- must know when to stop and rebuild
V2 target:
- explicit `CatchingUp -> NeedsRebuild`
- no fake success
### E. Slow control-plane recovery
Real source:
- `CP13-8 T4b` hardware behavior before fix
Why it matters:
- safety can be correct while availability recovery is poor
V2 target:
- explicit fast recovery path when possible
- explicit fallback when only control-plane repair can help
### F. Stale message / delayed ack fencing
Real source:
- Phase 13 epoch/fencing tests
- V2 scenario work already mirrors this
Why it matters:
- old lineage must not mutate committed prefix
V2 target:
- stale message rejection is explicit and testable
### G. Promotion candidate safety
Real source:
- failover / promotion gating tests
- V2 candidate-selection work
Why it matters:
- wrong promotion loses committed lineage
V2 target:
- candidate must satisfy:
- running
- epoch aligned
- state eligible
- committed-prefix sufficient
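The four conditions can be folded into one eligibility predicate. A sketch with illustrative field and state names, not the distsim types:

```go
package main

import "fmt"

// Candidate carries the promotion-relevant view of one replica.
type Candidate struct {
	Running    bool
	Epoch      uint64
	State      string // e.g. "in_sync", "promotion_hold", "rebuilding"
	FlushedLSN uint64
}

// eligible applies the four conditions above: running, epoch aligned,
// state eligible, and committed-prefix sufficient.
func eligible(c Candidate, currentEpoch, committedLSN uint64) bool {
	return c.Running &&
		c.Epoch == currentEpoch &&
		(c.State == "in_sync" || c.State == "promotion_hold") &&
		c.FlushedLSN >= committedLSN // committed-prefix sufficiency
}

func main() {
	good := Candidate{Running: true, Epoch: 5, State: "in_sync", FlushedLSN: 200}
	behind := Candidate{Running: true, Epoch: 5, State: "in_sync", FlushedLSN: 150}
	fmt.Println(eligible(good, 5, 180))   // true
	fmt.Println(eligible(behind, 5, 180)) // false: would lose committed lineage
}
```

The last clause is what scenario G protects: a candidate missing part of the committed prefix is never promotable, however high its apparent progress elsewhere.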
### H. Rebuild boundary after failed catch-up
Real source:
- Phase 13 rebuild fallback behavior
Why it matters:
- rebuild is required when retained WAL cannot safely close the gap
V2 target:
- rebuild is explicit fallback, not ad hoc recovery
## Immediate Feed Into `v2_scenarios.md`
These are the most important V1/V1.5-derived V2 scenarios:
1. same-address transient outage
2. changed-address restart
3. non-convergent catch-up / tail-chasing
4. stale delayed message / barrier ack rejection
5. committed-prefix-safe promotion
6. control-plane-latency recovery shape
## What Should Not Be Copied Blindly
Do not clone every real-engine test into the simulator.
Do not use the simulator for:
- exact OS timing
- exact socket/wire bugs
- exact block frontend behavior
- implementation-specific lock races
Instead:
- extract the protocol invariant
- model the reduced scenario if the protocol value is high
## Bottom Line
V1 / V1.5 tests should feed V2 in two ways:
1. as historical evidence of what failed or mattered in real life
2. as scenario seeds for the V2 simulator and acceptance backlog

638
sw-block/design/v2_scenarios.md

@@ -0,0 +1,638 @@
# WAL V2 Scenarios
Date: 2026-03-26
Status: working scenario backlog
Purpose: define the scenario set that proves why WAL V2 exists, what it must do better than WAL V1, and what it should handle better than rebuild-heavy systems
Execution note:
- active implementation planning for these scenarios lives under `../.private/phase/`
- `design/` is the design/source-of-truth view
- `.private/phase/` is the execution/checklist view for `sw`
## Why This File Exists
V2 should not grow by adding random simulations.
Each new scenario should prove one of these claims:
1. committed data is never lost
2. uncommitted data is never falsely revived
3. epoch and promotion lineage are safe
4. short-gap recovery is cheaper and cleaner than rebuild
5. catch-up vs rebuild boundary is explicit and correct
6. historical data correctness is preserved
## Scenario Sources
The backlog draws scenarios from three sources:
1. **V1 / V1.5 real failures**
- real bugs and real-hardware gaps observed during Phase 12 / Phase 13
- these are the highest-value scenarios because they came from actual system behavior
2. **V2 design obligations**
- scenarios required by the intended V2 protocol shape
- examples:
- reservations
- lineage-first promotion
- explicit catch-up vs rebuild boundary
3. **Distributed-systems adversarial cases**
- scenarios not yet seen in production, but known to be dangerous
- examples:
- zombie primary
- partitions
- message reordering
- multi-promotion lineage chains
This file is the shared backlog for anyone extending:
- `sw-block/prototype/fsmv2/`
- `sw-block/prototype/volumefsm/`
- `sw-block/prototype/distsim/`
For active development sequencing, see:
- `sw-block/.private/phase/phase-01.md`
- `sw-block/.private/phase/phase-02.md`
- `sw-block/design/v2-scenario-sources-from-v1.md`
Current simulator note:
- current `distsim` coverage already includes:
- changed-address restart comparison across `V1` / `V1.5` / `V2`
- same-address transient outage comparison
- slow control-plane recovery comparison
- stale-endpoint rejection
- committed-prefix-aware promotion eligibility
## V2 Goals
Compared with WAL V1, V2 should improve:
1. state clarity
2. recovery boundary clarity
3. fencing and promotion correctness
4. testability of distributed behavior
5. proof of data correctness at a target `LSN`
Compared with rebuild-heavy systems, V2 should improve:
1. short-gap recovery cost
2. explicit progress semantics
3. catch-up vs rebuild decision quality
## Scenario Format
Each scenario should eventually define:
1. setup
2. event sequence
3. expected commit/ack behavior
4. expected promotion/fencing behavior
5. expected final data state at target `LSN`
Where possible, use synthetic 4K writes with:
- `value = LSN`
That makes correctness assertions trivial.
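With the value = LSN convention, the reference model reduces to a map from block to last writer LSN, and the correctness assertion becomes a map comparison. A sketch with illustrative types:

```go
package main

import "fmt"

// refModel maps block id -> last writer LSN. Under the value = LSN
// convention this is both the reference state and the expected block content.
type refModel map[int]uint64

// apply records a synthetic write of value = lsn to a block.
func (m refModel) apply(block int, lsn uint64) { m[block] = lsn }

// matches reports whether a replica's state equals the reference model.
func matches(replica, reference refModel) bool {
	if len(replica) != len(reference) {
		return false
	}
	for b, lsn := range reference {
		if replica[b] != lsn {
			return false
		}
	}
	return true
}

func main() {
	ref := refModel{}
	ref.apply(10, 10) // block 10 written at LSN 10
	ref.apply(12, 12) // block 12 written at LSN 12
	replica := refModel{10: 10, 12: 12}
	fmt.Println(matches(replica, ref)) // true
}
```

Every scenario's final-state check then collapses to one `matches` call against the reference replayed up to the target LSN.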
## Priority 1: Commit Safety
These scenarios prove the most important distributed claim:
- if the system ACKed a write under the configured policy, that write is not lost
### S1. ACK Then Primary Crash
Goal:
- prove a quorum-acknowledged write survives failover
Sequence:
1. primary commits a write
2. replicas durable-ACK enough nodes for policy
3. primary crashes immediately
4. coordinator promotes a valid replica
Expect:
- promoted node contains the committed `LSN`
- final state matches reference model at committed `LSN`
### S2. Non-Quorum Write Then Primary Crash
Goal:
- prove uncommitted data is not revived after failover
Sequence:
1. primary accepts a write locally
2. quorum durability is not reached
3. primary crashes
4. coordinator promotes another node
Expect:
- promoted node does not expose the uncommitted write
- committed `LSN` stays at previous value
### S3. Zombie Old Primary Is Fenced
Goal:
- prove old-epoch traffic cannot corrupt new lineage
Sequence:
1. primary loses lease
2. coordinator bumps epoch and promotes new primary
3. old primary continues trying to send writes / barriers
Expect:
- all old-epoch traffic is rejected
- no stale write becomes committed under the new epoch
## Priority 2: Short-Gap Recovery
These scenarios justify V2 over rebuild-heavy designs.
### S4. Brief Disconnect, WAL Catch-Up Only
Goal:
- prove a short outage recovers via WAL catch-up, not rebuild
Sequence:
1. replica disconnects briefly
2. primary continues writing
3. gap stays inside recoverable window
4. replica reconnects and catches up
Expect:
- `CatchingUp -> PromotionHold -> InSync`
- no rebuild required
- final state matches reference at target `LSN`
### S5. Flapping Replica Stays Recoverable
Goal:
- prove transient disconnects do not force unnecessary rebuild
Sequence:
1. replica disconnects and reconnects repeatedly
2. gaps stay within reserved recoverable windows
Expect:
- replica may move between `Lagging`, `CatchingUp`, and `PromotionHold`
- replica does not enter `NeedsRebuild` unless recoverability is actually lost
### S6. Tail-Chasing Under Load
Goal:
- prove behavior when primary writes faster than catch-up rate
Sequence:
1. replica reconnects behind
2. primary continues writing quickly
3. catch-up target may be reached or may fall behind again
Expect:
- explicit result:
- converge and promote
- or abort to `NeedsRebuild`
- never silently pretend the replica is current
## Priority 3: Catch-Up vs Rebuild Boundary
These scenarios justify the V2 recoverability model.
### S7. Recovery Initially Possible, Then Reservation Expires
Goal:
- prove `check -> reserve -> recover` is enforced
Sequence:
1. primary grants a recoverability reservation
2. catch-up starts
3. reservation expires or is revoked before completion
Expect:
- catch-up aborts
- replica transitions to `NeedsRebuild`
- no partial recovery is treated as success
### S8. Current Extent Cannot Recover Old LSN
Goal:
- prove the historical correctness trap
Sequence:
1. write block `B = 10` at `LSN 10`
2. later write block `B = 12` at `LSN 12`
3. attempt to recover state at `LSN 10` from current extent
Expect:
- mismatch detected
- scenario must fail correctness check
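The trap can be shown in a few lines: the current extent holds only the latest value per block, so reading it while targeting an older LSN returns the wrong value. An illustrative sketch under the value = LSN convention:

```go
package main

import "fmt"

// readAtCurrent returns what the current extent reports for a block. Note it
// takes no history parameter at all — which is exactly the trap: the answer
// is the same regardless of the target LSN the caller wants to recover.
func readAtCurrent(extent map[int]uint64, block int) uint64 {
	return extent[block]
}

func main() {
	const block = 7 // hypothetical block id "B"
	extent := map[int]uint64{}
	extent[block] = 10 // write at LSN 10 (value = LSN)
	extent[block] = 12 // later write at LSN 12 overwrites in place
	targetLSN := uint64(10)
	got := readAtCurrent(extent, block)
	fmt.Println(got == targetLSN) // false: the extent returns 12, not the state at LSN 10
}
```

A correct recovery source must be versioned (retained WAL, pinned snapshot, or versioned extent); the correctness check above is what forces the scenario to fail when it is not.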
### S9. Snapshot + Tail Rebuild Works
Goal:
- prove correct long-gap reconstruction
Sequence:
1. take snapshot at `cpLSN`
2. later writes extend head
3. lagging replica rebuilds from snapshot
4. replay trailing WAL tail
Expect:
- final state matches reference at target `LSN`
## Priority 4: Quorum and Mixed Replica States
These scenarios justify V2 mode clarity.
### S10. Mixed States Under `sync_quorum`
Goal:
- prove `sync_quorum` remains available with mixed replica states
Sequence:
1. one replica `InSync`
2. one replica `CatchingUp`
3. one replica `Rebuilding`
Expect:
- writes may continue if durable quorum exists
- ACK gating follows quorum rules exactly
### S11. Mixed States Under `sync_all`
Goal:
- prove `sync_all` remains strict
Sequence:
1. same mixed-state setup as above
Expect:
- writes/acks block or fail according to `sync_all`
- no silent downgrade to quorum or best effort
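The two modes differ only in the ack threshold. A sketch of the gating rule, assuming durability is counted across the replica set (the threshold shapes are illustrative, not the distsim policy):

```go
package main

import "fmt"

// canAck decides whether a write may be acknowledged, given how many members
// of the replica set are durably caught up. Mixed states like CatchingUp or
// Rebuilding simply do not count toward durable.
func canAck(mode string, durable, total int) bool {
	switch mode {
	case "sync_all":
		return durable == total // strict: every replica must be durable
	case "sync_quorum":
		return durable >= total/2+1 // majority of the replica set
	default:
		return false // unknown mode: never downgrade silently
	}
}

func main() {
	// one InSync (durable), one CatchingUp, one Rebuilding => durable = 1 of 3
	fmt.Println(canAck("sync_quorum", 1, 3)) // false: majority needs 2
	fmt.Println(canAck("sync_quorum", 2, 3)) // true
	fmt.Println(canAck("sync_all", 2, 3))    // false: no silent downgrade
}
```

Keeping the mode explicit in the gate is what S11 demands: `sync_all` blocking is a correct outcome, never a trigger for quietly falling back to quorum.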
### S12. Promotion Chooses Best Valid Lineage
Goal:
- prove promotion is correctness-first, not “highest apparent LSN wins”
Sequence:
1. candidate nodes have different:
- flushed LSN
- rebuild state
- epoch lineage
2. coordinator chooses a new primary
Expect:
- only a valid-lineage node is promotable
- stale or inconsistent node is rejected
## Priority 5: Smart WAL / Recovery Classes
These scenarios justify V2’s future adaptive write path.
### S13. `WALInline` Window Is Recoverable
Goal:
- prove inline WAL payload replay works directly
Sequence:
1. missing range consists of `WALInline` records
2. planner grants reservation
Expect:
- catch-up allowed
- final state correct
### S14. `ExtentReferenced` Payload Still Resolvable
Goal:
- prove direct-extent records can still support catch-up when pinned
Sequence:
1. missing range includes `ExtentReferenced` records
2. payload objects / generations are still resolvable
3. reservation pins those dependencies
Expect:
- catch-up allowed
- final state correct
### S15. `ExtentReferenced` Payload Lost
Goal:
- prove metadata alone is not enough
Sequence:
1. missing range includes `ExtentReferenced` records
2. metadata still exists
3. payload object / version is no longer resolvable
Expect:
- planner returns `NeedsRebuild`
- catch-up is forbidden
## Priority 6: Restart and Rebuild Robustness
These scenarios justify operational resilience.
### S16. Replica Restarts During Catch-Up
Goal:
- prove restart does not corrupt catch-up state
Sequence:
1. replica is catching up
2. replica restarts
3. reconnect and recover again
Expect:
- no false promotion
- resume or restart recovery cleanly
### S17. Replica Restarts During Rebuild
Goal:
- prove rebuild interruption is safe
Sequence:
1. replica is rebuilding from snapshot
2. replica restarts mid-copy
Expect:
- rebuild aborts or restarts safely
- no partial base image is treated as valid
### S18. Primary Restarts Without Failover
Goal:
- prove restart with same lineage is handled explicitly
Sequence:
1. primary stops and restarts
2. coordinator either preserves or changes epoch depending on policy
Expect:
- replicas react consistently
- no stale assumptions about previous sender sessions
### S19. Chain Of Custody Across Multiple Promotions
Goal:
- prove committed data survives more than one failover lineage step
Sequence:
1. primary `A` commits writes
2. fail over to `B`
3. `B` commits additional writes
4. fail over to `C`
Expect:
- `C` contains all writes committed by `A` and `B`
- no committed data disappears across multiple promotions
- final state matches reference model at committed `LSN`
### S20. Network Partition With Concurrent Write Attempts
Goal:
- prove epoch fencing prevents split-brain writes during partition
Sequence:
1. cluster partitions into two live sides
2. old primary side continues trying to write
3. coordinator promotes a new primary on the surviving side
4. both sides attempt to send control/data traffic
Expect:
- only the current-epoch side can advance committed state
- stale-side writes are rejected or ignored
- no conflicting committed lineage appears
## Suggested Implementation Order
Implement in this order:
1. `S1` ACK then primary crash
2. `S2` non-quorum write then primary crash
3. `S3` zombie old primary fenced
4. `S4` brief disconnect with WAL catch-up
5. `S7` reservation expiry aborts catch-up
6. `S10` mixed-state quorum policy
7. `S9` long-lag rebuild from snapshot + tail
8. `S13-S15` Smart WAL recoverability
## Coverage Matrix
Status values:
- `covered`
- `partial`
- `not_started`
- `needs_richer_model`
| Scenario | Package | Test / Artifact | Status | Notes |
|---|---|---|---|---|
| `S1` ACK then primary crash | `distsim` | `TestQuorumCommitSurvivesPrimaryFailover` | `covered` | quorum commit survives failover |
| `S2` non-quorum write then primary crash | `distsim` | `TestUncommittedWriteNotPreservedAfterPrimaryLoss` | `covered` | no false revival |
| `S3` zombie old primary fenced | `distsim` | `TestZombieOldPrimaryWritesAreFenced` | `covered` | stale epoch traffic ignored |
| `S4` brief disconnect, WAL catch-up only | `distsim` | `TestReplicaCatchupFromPrimaryWAL` | `covered` | short-gap recovery |
| `S5` flapping replica stays recoverable | `distsim` | `TestS5_FlappingReplica_NoUnnecessaryRebuild`, `TestS5_FlappingWithStateTracking`, `TestS5_FlappingExceedsBudget_EscalatesToNeedsRebuild` | `covered` | both recoverable flapping and explicit budget-exceeded escalation are now asserted |
| `S6` tail-chasing under load | `distsim` | `TestS6_TailChasing_ConvergesOrAborts`, `TestS6_TailChasing_NonConvergent_Aborts`, `TestS6_TailChasing_NonConvergent_EscalatesToNeedsRebuild`, `TestP02_S6_NonConvergent_ExplicitStateTransition` | `covered` | explicit non-convergent `CatchingUp -> NeedsRebuild` path now asserted |
| `S7` reservation expiry aborts catch-up | `fsmv2`, `volumefsm`, `distsim` | `TestFSMReservationLostNeedsRebuild`, `TestModelReservationLostDuringCatchupAfterRebuild`, `TestReservationExpiryAbortsCatchup` | `covered` | present at 3 layers |
| `S8` current extent cannot recover old LSN | `distsim` | `TestCurrentExtentCannotRecoverOldLSN` | `covered` | historical correctness trap |
| `S9` snapshot + tail rebuild works | `distsim` | `TestReplicaRebuildFromSnapshotAndTail`, `TestSnapshotPlusTrailingReplayReachesTargetLSN` | `covered` | long-gap reconstruction |
| `S10` mixed states under `sync_quorum` | `volumefsm`, `distsim` | `TestModelSyncQuorumWithThreeReplicasMixedStates`, `TestSyncQuorumWithMixedReplicaStates` | `covered` | quorum stays available |
| `S11` mixed states under `sync_all` | `distsim` | `TestSyncAllBlocksWithMixedReplicaStates` | `covered` | strict sync_all behavior |
| `S12` promotion chooses best valid lineage | `distsim` | `TestPromotionUsesValidLineageNode`, `TestS12_PromotionChoosesBestLineage_NotHighestLSN`, `TestS12_PromotionRejectsRebuildingCandidate` | `covered` | lineage-first promotion now exercised beyond simple LSN comparison |
| `S13` `WALInline` window recoverable | `distsim` | `TestWALInlineRecordsAreRecoverable` | `covered` | inline payload recoverability |
| `S14` `ExtentReferenced` payload resolvable | `distsim` | `TestExtentReferencedResolvableRecordsAreRecoverable`, `TestMixedClassRecovery_FullSuccess` | `covered` | recoverable direct-extent and mixed-class recovery case |
| `S15` `ExtentReferenced` payload lost | `distsim` | `TestExtentReferencedUnresolvableForcesRebuild`, `TestRecoverableThenUnrecoverable`, `TestTimeVaryingAvailability` | `covered` | metadata alone not enough; active recovery can transition from recoverable to unrecoverable |
| `S16` replica restarts during catch-up | `distsim` | `TestReplicaRestartDuringCatchupRestartsSafely` | `covered` | safe recovery restart |
| `S17` replica restarts during rebuild | `distsim` | `TestReplicaRestartDuringRebuildRestartsSafely` | `covered` | rebuild interruption safe |
| `S18` primary restarts without failover | `distsim` | `TestS18_PrimaryRestart_SameLineage`, `TestS18_PrimaryRestart_ReplicasRejectOldEpoch`, `TestS18_PrimaryRestart_DelayedOldAck_DoesNotAdvancePrefix`, `TestS18_PrimaryRestart_InFlightBarrierDropped`, `TestP02_S18_DelayedAck_ExplicitRejection` | `covered` | delayed stale ack rejection and committed-prefix stability are now asserted directly |
| `S19` chain of custody across promotions | `distsim` | `TestS19_ChainOfCustody_MultiplePromotions`, `TestS19_ChainOfCustody_ThreePromotions` | `covered` | multi-promotion lineage continuity covered |
| `S20` live partition with competing writes | `distsim` | `TestS20_LivePartition_StaleWritesNotCommitted`, `TestS20_LivePartition_HealRecovers`, `TestS20_StalePartition_ProtocolRejectsStaleWrites`, `TestP02_S20_StaleTraffic_CommittedPrefixUnchanged` | `covered` | stale-side protocol traffic is explicitly rejected and committed prefix remains unchanged |
## Ownership Notes
When adding a scenario:
1. add or extend the relevant prototype test:
- `fsmv2`
- `volumefsm`
- `distsim`
2. update this file with:
- status
- package location
3. keep correctness checks tied to:
- committed `LSN`
- reference model state
## Current Coverage Snapshot
Already covered in some form:
- quorum commit survives primary failover
- uncommitted write not preserved after primary loss
- zombie old primary fenced by epoch
- lagging replica catch-up from primary WAL
- reservation expiry aborts catch-up in distributed sim
- `sync_quorum` continues with one lagging replica
- `sync_all` blocks with one lagging replica
- `sync_quorum` with mixed replica states
- `sync_all` with mixed replica states
- rebuild from snapshot + tail
- promotion uses valid lineage node
- flapping recoverable vs budget-exceeded rebuild path
- tail-chasing explicit escalation to rebuild
- restart during catch-up recovers safely
- restart during rebuild recovers safely
- primary restart delayed stale ack rejection
- `WALInline` recoverability
- `ExtentReferenced` resolvable vs unresolvable boundary
- mixed-class Smart WAL recovery and time-varying payload availability
- delayed stale messages and selective drop behavior
- multi-node reservation expiry and rebuild-timeout behavior
- current extent cannot reconstruct old `LSN`
Still important to add:
- explicit coordinator-driven candidate selection among competing valid/invalid lineages
- control-plane latency scenarios derived from `CP13-8 T4b`
- explicit V1 / V1.5 / V2 comparison scenarios for:
- changed-address restart
- same-address transient outage
- slow reassignment recovery
## V1.5 Lessons To Add Or Strengthen
These come directly from WAL V1.5 / Phase 13 behavior and should be treated as high-priority scenario drivers.
### L1. Replica Restart With New Receiver Port
Observed:
- replica VS restarts
- receiver comes back on a new random port
- primary background reconnect retries old address and fails
Implication:
- direct reconnect only works if replica address is stable
Backlog impact:
- strengthen `S18`
- add a restart/address-change sub-scenario under `S20` or a future network/control-plane recovery scenario
### L2. Slow Control-Plane Reassignment Dominates Recovery
Observed:
- sync correctness preserved
- write availability recovery waits for heartbeat/reassignment cycle
Implication:
- "recoverable in theory" is not enough
- recovery latency is part of protocol quality
Backlog impact:
- `S5` is now covered at current simulator level
- strengthen `S18`
- add long-running restart/rejoin timing scenarios
### L3. Background Reconnect Helps Only Same-Address Recovery
Observed:
- background reconnect is useful for transient network failure
- not sufficient for process restart with address change
Implication:
- scenarios must distinguish:
- transient disconnect
- process restart
- address change
Backlog impact:
- keep `S4` as transient disconnect
- strengthen `S18` with restart/address-stability cases
### L4. Tail-Chasing And Retention Pressure Are Structural Risks
Observed:
- Phase 13 reasoning repeatedly exposed:
- lagging replica may pin WAL
- catch-up may not converge while primary keeps advancing
Implication:
- V2 must explicitly model convergence, abort, and rebuild boundaries
Backlog impact:
- strengthen `S6`
- add multi-node retention / timeout variants
### L5. Current Extent Is Not Historical State
Observed:
- using current extent to reconstruct old `LSN` can return later values
Implication:
- V2 must require version-correct base images or resolvable historical payloads
Backlog impact:
- already covered by `S8`
- should remain a permanent regression scenario
## Randomized Simulation
In addition to fixed scenarios, V2 should keep a randomized simulator suite.
Purpose:
1. discover paths that were not explicitly written as named scenarios
2. stress promotion, restart, and recovery ordering
3. check invariants after each random step
Current prototype:
- `sw-block/prototype/distsim/random.go`
- `sw-block/prototype/distsim/random_test.go`
Current invariants checked:
1. current committed `LSN` remains a committed prefix
2. promotable nodes match reference state at committed `LSN`
3. current primary, if valid/running, matches reference state at committed `LSN`
This does not replace named scenarios.
It complements them.
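The committed-prefix invariant (item 1) can be sketched as a per-step check. This is a simplified model, not the prototype's actual checker: `refLog`/`nodeLog` are illustrative names, and the committed prefix is modeled as the first `committed` entries.

```go
package main

// committedPrefixIntact is a minimal sketch of invariant 1: every entry up
// to the committed point must still match the reference model after each
// random simulation step.
func committedPrefixIntact(refLog, nodeLog []string, committed int) bool {
	if committed > len(refLog) || committed > len(nodeLog) {
		return false // committed entries were lost
	}
	for i := 0; i < committed; i++ {
		if refLog[i] != nodeLog[i] {
			return false // divergence inside the committed prefix
		}
	}
	return true
}
```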
## Scenario Summary
When reviewing or adding scenarios, always record the source:
1. from real V1/V1.5 behavior
2. from explicit V2 design obligation
3. from adversarial distributed-systems reasoning
The best scenarios are the ones that come from real failures first, then are generalized into V2 requirements.
## Development Phases
Execution detail is tracked in:
- `sw-block/.private/phase/phase-01.md`
- `sw-block/.private/phase/phase-02.md`
High-level phase order:
1. close explicit scenario backlog
- `S19`
- `S20`
2. strengthen missing lifecycle scenarios
- `S5`
- `S6`
- `S18`
- stronger `S12`
3. extend protocol-state simulation and version comparison
- `V1`
- `V1.5`
- `V2`
- stronger closure of current `partial` scenarios
4. strengthen random/adversarial simulation
5. add timeout-based scenarios only when the execution path is modeled

sw-block/design/wal-replication-v2-orchestrator.md
# WAL Replication V2 Orchestrator
Date: 2026-03-26
Status: design proposal
Purpose: define the volume-level orchestration model that sits above the per-replica WAL V2 FSM
## Why This Document Exists
`ReplicaFSM` alone is not enough.
It can describe one replica relative to the current primary, but it cannot by itself model:
- primary head continuing to advance
- multiple replicas in different states
- durability mode semantics
- primary lease loss and epoch change
- primary failover and replica promotion
- fencing of old recovery sessions
So WAL V2 needs a second layer:
- per-replica `ReplicaFSM`
- volume-level `Orchestrator`
## Scope
This document defines the volume-level logic only.
It does not define:
- exact network protocol
- exact master RPCs
- exact storage backend internals
It assumes the per-replica state machine from:
- `wal-replication-v2-state-machine.md`
## Core Model
The orchestrator owns:
1. current primary lineage
- `epoch`
- lease/authority state
2. volume durability mode
- `best_effort`
- `sync_all`
- `sync_quorum`
3. moving primary progress
- `headLSN`
- checkpoint/snapshot anchors
4. replica set
- one `ReplicaFSM` per replica
- per-replica role in the current volume topology
5. volume-level admission decision
- can writes proceed?
- can sync requests complete?
- must promotion/failover occur?
## Two FSM Layers
### Layer A: `ReplicaFSM`
Owns per-replica state such as:
- `Bootstrapping`
- `InSync`
- `Lagging`
- `CatchingUp`
- `PromotionHold`
- `NeedsRebuild`
- `Rebuilding`
- `CatchUpAfterRebuild`
- `Failed`
### Layer B: `VolumeOrchestrator`
Owns system-wide state such as:
- current `epoch`
- current primary identity
- durability mode
- set of required replicas
- current `headLSN`
- whether writes or promotions are allowed
The orchestrator does not replace `ReplicaFSM`.
It drives it.
## Volume State
The orchestrator should track at least:
```go
type VolumeMode string

type PrimaryState string

const (
	PrimaryServing  PrimaryState = "Serving"
	PrimaryDraining PrimaryState = "Draining"
	PrimaryLost    PrimaryState = "Lost"
)

type VolumeModel struct {
	Epoch              uint64
	PrimaryID          string
	PrimaryState       PrimaryState
	Mode               VolumeMode
	HeadLSN            uint64
	CheckpointLSN      uint64
	RequiredReplicaIDs []string
	Replicas           map[string]*ReplicaFSM
}
```
This is a model shape, not a required production struct.
## Orchestrator Responsibilities
### 1. Advance primary head
When primary commits a new write:
- increment `headLSN`
- enqueue/send to replica sender loops
- evaluate whether the current mode still allows ACK
### 2. Evaluate sync eligibility
The orchestrator computes volume-level durability from replica states.
Derived rule:
- only replicas for which `ReplicaFSM.IsSyncEligible()` is true count
### 3. Drive recovery entry
When a replica disconnects or falls behind:
- feed disconnect/lag events into that replica FSM
- decide whether to try catch-up or rebuild
- acquire recovery reservation if required
### 4. Handle primary authority changes
When lease is lost or a new primary is chosen:
- increment epoch
- abort stale recovery sessions
- reevaluate all replica relationships from the new primary's perspective
### 5. Drive promotion / failover
When current primary is lost:
- choose promotion candidate
- assign new epoch
- move old primary to stale/lost
- convert the promoted replica into the new serving primary
- reclassify remaining replicas relative to the new primary
## Required Volume-Level Events
The orchestrator should be able to simulate at least these events.
### Write/progress events
- `WriteCommitted(lsn)`
- `CheckpointAdvanced(lsn)`
- `BarrierCompleted(replicaID, flushedLSN)`
### Replica health events
- `ReplicaDisconnected(replicaID)`
- `ReplicaReconnect(replicaID, flushedLSN)`
- `ReplicaReservationLost(replicaID)`
- `ReplicaCatchupTimeout(replicaID)`
- `ReplicaRebuildTooSlow(replicaID)`
### Topology/control events
- `PrimaryLeaseLost()`
- `EpochChanged(newEpoch)`
- `PromoteReplica(replicaID)`
- `ReplicaAssigned(replicaID)`
- `ReplicaRemoved(replicaID)`
## Mode Semantics
### `best_effort`
Rules:
- ACK after primary local durability
- replicas may be `Lagging`, `CatchingUp`, `NeedsRebuild`, or `Rebuilding`
- background recovery continues
Volume implication:
- primary can keep serving while replicas recover
### `sync_all`
Rules:
- ACK only when all required replicas are `InSync` and durable through target LSN
- bounded retry only
- no silent downgrade
Volume implication:
- one lagging required replica can block sync completion
- orchestrator may fail requests, but must not silently reinterpret policy
### `sync_quorum`
Rules:
- ACK when quorum of required nodes are durable through target LSN
- lagging replicas may recover in background as long as quorum remains
Volume implication:
- orchestrator must count eligible replicas, not just healthy sockets
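The three mode rules above reduce to one admission function over per-replica eligibility. A sketch, with illustrative names (`eligible` stands in for the `ReplicaFSM.IsSyncEligible()` results, not a real orchestrator API):

```go
package main

type Mode string

const (
	BestEffort Mode = "best_effort"
	SyncAll    Mode = "sync_all"
	SyncQuorum Mode = "sync_quorum"
)

// canAckSync models the volume-level ACK decision. eligible maps
// replicaID -> sync eligibility; quorum is only consulted for sync_quorum.
func canAckSync(mode Mode, required []string, eligible map[string]bool, quorum int) bool {
	switch mode {
	case BestEffort:
		return true // ACK after primary local durability only
	case SyncAll:
		for _, id := range required {
			if !eligible[id] {
				return false // one lagging required replica blocks sync
			}
		}
		return true
	case SyncQuorum:
		n := 0
		for _, id := range required {
			if eligible[id] {
				n++ // count eligible replicas, not healthy sockets
			}
		}
		return n >= quorum
	}
	return false
}
```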
## Primary-Head Simulation Rules
The orchestrator must explicitly model that the primary keeps moving.
### Rule 1: head moves independently of replica recovery
A replica entering `CatchingUp` does not freeze `headLSN`.
### Rule 2: each recovery attempt uses explicit targets
For a replica in recovery, orchestrator chooses:
- `catchupTargetLSN = H0`
- or `snapshotCpLSN = C` and replay target `H0`
### Rule 3: promotion is explicit
A replica is not restored to `InSync` just because it reaches `H0`.
It must still pass:
- barrier confirmation
- `PromotionHold`
## Failover / Promotion Model
The orchestrator must be able to simulate:
1. old primary loses lease
2. old primary is fenced by epoch change
3. one replica is promoted
4. promoted replica becomes new primary under a higher epoch
5. all old recovery sessions from the old primary are invalidated
6. remaining replicas are reevaluated relative to the new primary's head and retained history
Important consequence:
- failover is not a `ReplicaFSM` transition only
- it is a volume-level re-rooting of all replica relationships
## Suggested Promotion Rules
Promotion candidate should prefer:
1. highest valid durable progress
2. current epoch-consistent history
3. healthiest replica among tied candidates
After promotion:
- `PrimaryID` changes
- `Epoch` increments
- all replica reservations from the previous primary are void
- all non-primary replicas must renegotiate recovery against the new primary
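A sketch of this preference order. The `Candidate` fields are illustrative; `EpochConsistent` stands in for "current epoch-consistent history":

```go
package main

// Candidate is an illustrative shape for a promotion candidate.
type Candidate struct {
	ID              string
	FlushedLSN      uint64
	EpochConsistent bool
	Healthy         bool
}

// pickPromotionCandidate prefers the highest valid durable progress, then
// health among tied candidates; epoch-inconsistent lineages are never
// promotable.
func pickPromotionCandidate(cands []Candidate) (string, bool) {
	best := -1
	for i, c := range cands {
		if !c.EpochConsistent {
			continue // invalid lineage is never promotable
		}
		if best == -1 {
			best = i
			continue
		}
		if c.FlushedLSN > cands[best].FlushedLSN ||
			(c.FlushedLSN == cands[best].FlushedLSN && c.Healthy && !cands[best].Healthy) {
			best = i
		}
	}
	if best == -1 {
		return "", false
	}
	return cands[best].ID, true
}
```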
## Multi-Replica Examples
### Example 1: `sync_all`
- replica A = `InSync`
- replica B = `Lagging`
- replica C = `InSync`
If A, B, and C are the required replicas in RF=3 `sync_all`:
- writes needing sync durability fail or wait
- even though two replicas are still healthy
### Example 2: `sync_quorum`
- replica A = `InSync`
- replica B = `CatchingUp`
- replica C = `InSync`
If quorum is 2:
- volume can continue serving sync requests
- B recovers in background
### Example 3: failover
- old primary lost
- replica A promoted
- replica B was previously `CatchingUp` under old epoch
After promotion:
- B's old session is aborted
- B re-enters evaluation against A's history
## What The Tiny Prototype Should Simulate
The V2 prototype should be able to drive at least these scenarios:
1. steady state keep-up
- primary head advances
- all required replicas remain `InSync`
2. short outage
- one replica disconnects
- primary keeps writing
- reconnect succeeds within recoverable window
- replica returns via `PromotionHold`
3. long outage
- one replica disconnects too long
- recoverability expires
- replica goes `NeedsRebuild`
- rebuild and trailing replay complete
4. tail chasing
- replica catch-up speed is below primary ingest speed
- orchestrator chooses fail, throttle, or rebuild path depending on mode
5. failover
- primary lease lost
- new epoch assigned
- replica promoted
- old recovery sessions fenced
6. mixed-state quorum
- different replicas in different states
- orchestrator computes correct `sync_all` / `sync_quorum` result
## Relationship To WAL V1
WAL V1 already contains pieces of this logic, but they are scattered across:
- shipper state
- barrier code
- retention code
- assignment/promotion code
- rebuild code
- heartbeat/master logic
V2 should separate these into:
- per-replica recovery FSM
- volume-level orchestrator
## Bottom Line
The next step after `ReplicaFSM` is not `Smart WAL`.
The next step is the volume-level orchestrator model.
Why:
- primary keeps moving
- durability mode is volume-scoped
- failover/promotion is volume-scoped
- replica recovery must be evaluated in the context of the whole volume
So V2 needs:
- `ReplicaFSM` for one replica
- `VolumeOrchestrator` for the moving multi-replica system

sw-block/design/wal-replication-v2-state-machine.md
# WAL Replication V2 State Machine
Date: 2026-03-26
Status: design proposal
Purpose: define the V2 replication state machine for a moving-head primary where replicas may transition between keep-up, catch-up, and reconstruction while the primary continues accepting writes
## Why This Document Exists
The hard part of V2 is not the existence of three modes:
- keep-up
- catch-up
- reconstruction
The hard part is that the primary head continues advancing while replicas move between those modes.
So V2 must be specified as a real state machine:
- state definitions
- state-owned LSN anchors
- allowed transitions
- retention obligations
- abort rules
This document treats edge cases as state-transition cases.
## Scope
This is a protocol/state-machine design.
It does not yet define:
- exact RPC payloads
- exact snapshot storage format
- exact implementation package boundaries
Those can follow after the state model is stable.
## Core Terms
### `headLSN`
The primary's current highest WAL LSN.
### `replicaFlushedLSN`
The highest LSN durably persisted on the replica.
### `cpLSN`
A checkpoint/snapshot base point. A snapshot at `cpLSN` represents the block state exactly at that LSN.
### `promotionBarrierLSN`
The LSN a replica must durably reach before it can re-enter `InSync`.
### `Recovery Feasibility`
Whether `(startLSN, endLSN]` can be reconstructed completely, in order, under the current epoch.
This is not a static fact. It changes over time as WAL is reclaimed, payload generations are garbage-collected, or snapshots are released.
### `Recovery Reservation`
A bounded primary-side reservation proving a recovery window is recoverable and pinning all dependencies needed to finish the current catch-up or rebuild-tail replay.
A transition into recovery is valid only after the reservation is granted.
## State Set
Replica may be in one of these states:
1. `Bootstrapping`
2. `InSync`
3. `Lagging`
4. `CatchingUp`
5. `PromotionHold`
6. `NeedsRebuild`
7. `Rebuilding`
8. `CatchUpAfterRebuild`
9. `Failed`
Only `InSync` replicas count for sync durability.
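The state set can be sketched as a Go type; the eligibility method mirrors the `ReplicaFSM.IsSyncEligible()` rule used by the orchestrator doc (the identifiers below are the state names from this document, the type shape itself is illustrative):

```go
package main

// ReplicaState enumerates the V2 replica states listed above.
type ReplicaState string

const (
	Bootstrapping       ReplicaState = "Bootstrapping"
	InSync              ReplicaState = "InSync"
	Lagging             ReplicaState = "Lagging"
	CatchingUp          ReplicaState = "CatchingUp"
	PromotionHold       ReplicaState = "PromotionHold"
	NeedsRebuild        ReplicaState = "NeedsRebuild"
	Rebuilding          ReplicaState = "Rebuilding"
	CatchUpAfterRebuild ReplicaState = "CatchUpAfterRebuild"
	Failed              ReplicaState = "Failed"
)

// IsSyncEligible encodes the core rule: only InSync replicas count for
// sync durability.
func (s ReplicaState) IsSyncEligible() bool { return s == InSync }
```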
## State Semantics
### 1. `Bootstrapping`
Replica has not yet earned sync eligibility and does not yet have trusted reconnect progress.
Properties:
- fresh replica identity or newly assigned replica
- may receive initial baseline/live stream
- not yet eligible for `sync_all`
Counts for:
- `sync_all`: no
- `sync_quorum`: no
- `best_effort`: background/bootstrap only
Owned anchors:
- current assignment epoch
### 2. `InSync`
Replica is eligible for sync durability.
Properties:
- receiving live ordered stream
- `replicaFlushedLSN` is near the primary head
- normal barrier protocol is valid
Counts for:
- `sync_all`: yes
- `sync_quorum`: yes
- `best_effort`: yes, but not required for ACK
Owned anchors:
- `replicaFlushedLSN`
### 3. `Lagging`
Replica has fallen out of the normal live-stream envelope but recovery path is not yet chosen.
Properties:
- primary no longer treats it as sync-eligible
- replica may still be recoverable from WAL or extent-backed recovery records
- or may require rebuild
Counts for:
- `sync_all`: no
- `sync_quorum`: no
- `best_effort`: background recovery only
Owned anchors:
- last known `replicaFlushedLSN`
### 4. `CatchingUp`
Replica is replaying from its own durable point toward a chosen target.
Properties:
- short-gap recovery mode
- primary must reserve and pin the required recovery window
- primary head continues to move
Counts for:
- `sync_all`: no
- `sync_quorum`: no
- `best_effort`: background recovery only
Owned anchors:
- `catchupStartLSN = replicaFlushedLSN`
- `catchupTargetLSN`
- `promotionBarrierLSN`
- `recoveryReservationID`
- `reservationExpiry`
### 5. `PromotionHold`
Replica has reached the chosen promotion point but must demonstrate short stability before re-entering `InSync`.
Properties:
- prevents immediate flapping back into sync eligibility
- replica has already reached `promotionBarrierLSN`
- promotion requires stable barriers or elapsed hold time
Counts for:
- `sync_all`: no
- `sync_quorum`: no
- `best_effort`: stabilization only
Owned anchors:
- `promotionBarrierLSN`
- `promotionHoldUntil` or equivalent hold criterion
### 6. `NeedsRebuild`
Replica cannot recover from retained recovery records alone.
Properties:
- catch-up window is insufficient or no longer provable
- replica must not count toward sync durability
- replica no longer pins old catch-up history
Counts for:
- `sync_all`: no
- `sync_quorum`: no
- `best_effort`: background repair candidate only
Owned anchors:
- last known `replicaFlushedLSN`
### 7. `Rebuilding`
Replica is fetching and installing a checkpoint/snapshot base image.
Properties:
- primary must preserve the chosen snapshot/base
- primary must preserve the required WAL or recovery tail after `cpLSN`
Counts for:
- `sync_all`: no
- `sync_quorum`: no
- `best_effort`: background rebuild only
Owned anchors:
- `snapshotID`
- `snapshotCpLSN`
- `tailReplayStartLSN = snapshotCpLSN + 1`
- `recoveryReservationID`
- `reservationExpiry`
### 8. `CatchUpAfterRebuild`
Replica has installed the base image and is replaying trailing history after it.
Properties:
- semantically similar to `CatchingUp`
- base point is the checkpoint/snapshot, not the replica's own prior state
Counts for:
- `sync_all`: no
- `sync_quorum`: no
- `best_effort`: background recovery only
Owned anchors:
- `snapshotCpLSN`
- `catchupTargetLSN`
- `promotionBarrierLSN`
- `recoveryReservationID`
- `reservationExpiry`
### 9. `Failed`
Replica recovery failed in a way that needs operator/control-plane action beyond normal retry.
Properties:
- terminal or semi-terminal fault state
- may require delete/recreate/manual intervention
Counts for:
- `sync_all`: no
- `sync_quorum`: no
- `best_effort`: no direct role
## Transition Rules
### `Bootstrapping -> InSync`
Trigger:
- initial bootstrap completes
- barrier confirms durable progress under the current epoch
Action:
- establish trusted `replicaFlushedLSN`
- grant sync eligibility for the first time
### `InSync -> Lagging`
Trigger:
- disconnect
- barrier timeout
- barrier fsync failure
- stream error
Action:
- remove sync eligibility immediately
### `Lagging -> CatchingUp`
Trigger:
- reconnect succeeds
- primary grants a recovery reservation proving `(replicaFlushedLSN, catchupTargetLSN]` is recoverable for a bounded window
Action:
- choose `catchupTargetLSN`
- pin required recovery dependencies for the reservation lifetime
### `Lagging -> NeedsRebuild`
Trigger:
- required recovery window is not recoverable
- impossible progress reported
- epoch mismatch invalidates direct catch-up
- background janitor determines the replica is outside recoverable budget
Action:
- stop treating replica as a catch-up candidate
### `CatchingUp -> PromotionHold`
Trigger:
- replica replays to `catchupTargetLSN`
- barrier confirms `promotionBarrierLSN`
Action:
- start promotion debounce window
### `PromotionHold -> InSync`
Trigger:
- promotion hold criteria satisfied
- stable barrier successes
- or elapsed hold time
Action:
- restore sync eligibility
- clear promotion anchors
### `PromotionHold -> Lagging`
Trigger:
- disconnect
- failed barrier
- failed live stream health check
Action:
- cancel promotion attempt
- remove sync eligibility
### `CatchingUp -> NeedsRebuild`
Trigger:
- catch-up cannot converge
- recovery reservation is lost
- catch-up timeout policy exceeded
- epoch changes
Action:
- abandon WAL-only catch-up
- move to reconstruction path
### `NeedsRebuild -> Rebuilding`
Trigger:
- control plane or primary chooses reconstruction base
- snapshot/base image transfer starts
- primary grants a rebuild reservation
Action:
- bind replica to `snapshotID` and `snapshotCpLSN`
### `Rebuilding -> CatchUpAfterRebuild`
Trigger:
- snapshot/base image installed successfully
- trailing recovery reservation is still valid
Action:
- replay trailing history after `snapshotCpLSN`
### `Rebuilding -> NeedsRebuild`
Trigger:
- rebuild copy fails
- rebuild reservation is lost
- rebuild WAL-tail budget is exceeded
- epoch changes
Action:
- abort current rebuild session
- remain excluded from sync durability
### `CatchUpAfterRebuild -> PromotionHold`
Trigger:
- trailing replay reaches target
- barrier confirms durable replay through `promotionBarrierLSN`
Action:
- start promotion debounce
### `CatchUpAfterRebuild -> NeedsRebuild`
Trigger:
- reservation is lost
- replay cannot converge
- epoch changes
Action:
- abandon current attempt
- require a fresh rebuild plan
### Any state -> `Failed`
Trigger examples:
- unrecoverable protocol inconsistency
- repeated rebuild failure beyond retry policy
- snapshot corruption
- local replica storage failure
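The edges above can be collapsed into a lookup table, which the simulator can use to reject illegal transitions. States are plain strings here to keep the sketch self-contained; `Failed` is reachable from any state:

```go
package main

// legalNext encodes the allowed transitions defined in this section.
var legalNext = map[string][]string{
	"Bootstrapping":       {"InSync"},
	"InSync":              {"Lagging"},
	"Lagging":             {"CatchingUp", "NeedsRebuild"},
	"CatchingUp":          {"PromotionHold", "NeedsRebuild"},
	"PromotionHold":       {"InSync", "Lagging"},
	"NeedsRebuild":        {"Rebuilding"},
	"Rebuilding":          {"CatchUpAfterRebuild", "NeedsRebuild"},
	"CatchUpAfterRebuild": {"PromotionHold", "NeedsRebuild"},
}

// canTransition reports whether from -> to is a legal edge.
func canTransition(from, to string) bool {
	if to == "Failed" {
		return true // any state may fault out
	}
	for _, s := range legalNext[from] {
		if s == to {
			return true
		}
	}
	return false
}
```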
## Retention Obligations By State
The key V2 rule is:
- recoverability is not a static fact
- it is a bounded promise the primary must honor once it admits a replica into recovery
### `InSync`
Primary must retain:
- recent WAL under normal retention policy
Primary does not need:
- snapshot pin purely for this replica
### `Lagging`
Primary must retain:
- enough recent information to evaluate recoverability or intentionally declare `NeedsRebuild`
This state should be short-lived.
### `CatchingUp`
Primary must retain for the reservation lifetime:
- recovery metadata for `(catchupStartLSN, promotionBarrierLSN]`
- every payload referenced by that recovery window
- current epoch lineage for the session
### `PromotionHold`
Primary must retain:
- whatever live-stream and barrier state is required to validate promotion
This state should be brief and must not pin long-lived history.
### `NeedsRebuild`
Primary retains:
- no special old recovery window for this replica
This state explicitly releases the old catch-up hold.
### `Rebuilding`
Primary must retain for the reservation lifetime:
- chosen `snapshotID`
- any base-image dependencies
- trailing history after `snapshotCpLSN`
### `CatchUpAfterRebuild`
Primary must retain for the reservation lifetime:
- recovery metadata for `(snapshotCpLSN, promotionBarrierLSN]`
- every payload referenced by that trailing window
## Moving-Head Rules
The primary head continues advancing during:
- `CatchingUp`
- `Rebuilding`
- `CatchUpAfterRebuild`
Therefore transitions must never use current head at finish time as an implicit target.
Instead, each transition must select explicit targets.
### Catch-up target
When catch-up starts, choose:
- `catchupTargetLSN = H0`
Replica first chases to `H0`, not to an infinite moving head.
Then:
- either enter `PromotionHold` and promote
- or begin another bounded cycle
- or abort to rebuild
### Rebuild target
When rebuild starts, choose:
- `snapshotCpLSN = C`
- trailing replay target `H0`
Replica installs the snapshot at `C`, then replays `(C, H0]`, then enters `PromotionHold`.
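The rebuild sequence can be sketched as below. The callback shapes are illustrative; success here only makes the replica a `PromotionHold` candidate, never `InSync` directly:

```go
package main

// rebuildToHold sketches: install the snapshot at C, then replay the
// trailing window (C, H0]. A failed install means Rebuilding -> NeedsRebuild.
func rebuildToHold(
	installSnapshot func() (cpLSN uint64, ok bool),
	replayTail func(from, to uint64) bool,
	h0 uint64,
) bool {
	c, ok := installSnapshot()
	if !ok {
		return false // Rebuilding -> NeedsRebuild
	}
	return replayTail(c, h0) // replay (C, H0], then PromotionHold
}
```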
## Tail-Chasing Rule
Replica may fail to converge if:
- catch-up speed < primary ingest speed
V2 must define bounded behavior:
1. bounded catch-up window
2. bounded catch-up time
3. policy after failure to converge:
- for `sync_all`: bounded retry, then fail requests
- for `best_effort`: keep serving and continue background recovery or escalate to rebuild
No silent downgrade of `sync_all` is allowed.
## Recovery Feasibility
The primary must not admit a replica into catch-up based on a best-effort guess.
It must prove the requested recovery window is recoverable and then reserve it.
Recommended abstraction:
- `CheckRecoveryFeasibility(startLSN, endLSN) -> fully recoverable | needs rebuild`
- `ReserveRecoveryWindow(startLSN, endLSN) -> reservation`
Only a successful reservation may drive:
- `Lagging -> CatchingUp`
- `NeedsRebuild -> Rebuilding`
- `Rebuilding -> CatchUpAfterRebuild`
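A sketch of the admission flow: a recovery transition is driven only by a granted reservation. The callbacks stand in for the recommended `CheckRecoveryFeasibility` / `ReserveRecoveryWindow` abstractions; the `Reservation` shape is illustrative:

```go
package main

// Reservation is an illustrative shape for a granted recovery window.
type Reservation struct {
	ID       uint64
	StartLSN uint64
	EndLSN   uint64
}

// admitCatchUp returns a reservation only if the window is provably
// recoverable AND the reservation is granted; otherwise the caller must
// route the replica to NeedsRebuild.
func admitCatchUp(
	feasible func(start, end uint64) bool,
	reserve func(start, end uint64) (Reservation, bool),
	start, end uint64,
) (Reservation, bool) {
	if !feasible(start, end) {
		return Reservation{}, false
	}
	return reserve(start, end) // may still fail under concurrent reclamation
}
```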
## Recovery Classes
V2 must support more than one local record type without leaking that detail into replica state.
### `WALInline`
Properties:
- payload lives directly in WAL
- recoverable while WAL is retained
### `ExtentReferenced`
Properties:
- recovery metadata points at payload outside WAL
- payload must be resolved from extent/snapshot generation state
The FSM does not care how payload is stored.
It only cares whether the requested window is fully recoverable for the lifetime of the reservation.
The engine-level rule is:
- every record in `(startLSN, endLSN]` must be payload-resolvable
- the resolved version must correspond to that record's historical state
- the payload must stay pinned until the reservation ends
If any required payload is not resolvable:
- the window is not recoverable
- the replica must go to `NeedsRebuild`
## Snapshot Rule
Rebuild must use a real checkpoint/snapshot base image.
Valid:
- immutable snapshot at `cpLSN`
- copy-on-write checkpoint image
- frozen base image with exact `cpLSN`
Invalid:
- current extent treated as historical `cpLSN`
## Epoch / Fencing Rule
Every transition is epoch-bound.
If epoch changes during:
- `Bootstrapping`
- `Lagging`
- `CatchingUp`
- `PromotionHold`
- `Rebuilding`
- `CatchUpAfterRebuild`
Then:
- abort current transition
- discard old sender assumptions
- restart negotiation under the new epoch
This prevents stale-primary recovery traffic from being accepted.
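A minimal sketch of the fencing mechanics: each session records the epoch it was admitted under, and an epoch change fences every older session. `FencedSession` is illustrative, not the prototype's `RecoverySession`:

```go
package main

// FencedSession is an epoch-bound recovery session.
type FencedSession struct {
	ID    uint64
	Epoch uint64
}

// fenceStale keeps only sessions admitted under the current epoch; all
// others must be aborted and renegotiated under the new epoch.
func fenceStale(sessions []FencedSession, currentEpoch uint64) []FencedSession {
	var live []FencedSession
	for _, s := range sessions {
		if s.Epoch == currentEpoch {
			live = append(live, s)
		}
	}
	return live
}
```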
## Multi-Replica Volume Rules
Different replicas may be in different states simultaneously.
Example:
- replica A = `InSync`
- replica B = `CatchingUp`
- replica C = `Rebuilding`
Volume-level durability policy is computed per mode.
### `sync_all`
- all required replicas must be `InSync`
### `sync_quorum`
- enough replicas must be `InSync`
### `best_effort`
- primary local durability only
- replicas recover in background
## Illegal or Suspicious Conditions
These should force rejection or abort:
1. replica reports `replicaFlushedLSN > headLSN`
2. replica progress belongs to wrong epoch
3. requested recovery window is not recoverable
4. recovery reservation cannot be granted
5. snapshot base does not match claimed `cpLSN`
6. replay stream shows impossible gap/ordering after reconstruction
## Design Guidance
V2 should be implemented so that:
1. state owns recovery semantics
2. anchors make transitions explicit
3. retention obligations are derived from state
4. catch-up admission requires reservation, not guesswork
5. mode semantics are derived from `InSync` eligibility
This is better than burying recovery behavior across many ad hoc code paths.
## Bottom Line
V2 is fundamentally a state machine problem.
The correct abstraction is not:
- some edge cases around WAL replay
It is:
- replicas move through explicit states while the primary head continues advancing and recovery windows must be provable and reserved
So V2 must be designed around:
- state definitions
- anchor LSNs
- transition rules
- retention obligations
- recoverability checks
- recovery reservations
- abort conditions

sw-block/design/wal-replication-v2.md
# WAL Replication V2
Date: 2026-03-26
Status: design proposal
Purpose: redesign WAL-based block replication around explicit short-gap catch-up and long-gap reconstruction
## Goal
Provide a replication architecture that:
- keeps the primary write path fast
- supports correct synchronous durability semantics
- supports short-gap reconnect catch-up using WAL
- avoids paying unbounded WAL retention tax for long-lag replicas
- uses reconstruction from a real checkpoint/snapshot base for larger lag
This design replaces a "WAL does everything" mindset with a 3-tier recovery model.
## Core Principle
WAL is excellent for:
- recent ordered delta
- local crash recovery
- short-gap replica catch-up
WAL is not the right long-range recovery mechanism for lagging block replicas.
Long-gap recovery should use:
- a real checkpoint/snapshot base image
- plus WAL tail replay after that base point
## Correctness Boundary
Never reconstruct old state from current extent alone.
Example:
1. `LSN 100`: block `A = foo`
2. `LSN 120`: block `A = bar`
If a replica needs state at `LSN 100`, current extent contains `bar`, not `foo`.
Therefore:
- current extent is latest state
- not historical state
So long-gap recovery must use a base image that is known to represent a real checkpoint/snapshot `cpLSN`.
## 3-Tier Replication Model
### Tier A: Keep-up
Replica is close enough to the primary that normal ordered streaming keeps it current.
Properties:
- normal steady-state mode
- no special recovery path
- replica stays `InSync`
### Tier B: Lagging Catch-up
Replica fell behind, but the primary still has enough recoverable history covering the missing range.
Properties:
- reconnect handshake determines the replica durable point
- primary proves and reserves a bounded recovery window
- primary replays missing history
- replica returns to `InSync` only after replay, barrier confirmation, and promotion hold
### Tier C: Reconstruction
Replica is too far behind for direct replay.
Properties:
- replica must rebuild from a real checkpoint/snapshot base
- after base image install, primary replays trailing history after `cpLSN`
- replica only re-enters `InSync` after durable catch-up completes
## Architecture
### Primary Artifacts
The primary owns three forms of state:
1. `Active WAL`
- recent ordered metadata/delta stream
- bounded by retention policy
2. `Checkpoint Snapshot`
- immutable point-in-time base image at `cpLSN`
- used for long-gap reconstruction
3. `Current Extent`
- latest live block state
- not a substitute for historical checkpoint state
### Replica Artifacts
Replica maintains:
1. local WAL or equivalent recovery log
2. replica `receivedLSN`
3. replica `flushedLSN`
4. local extent state
## Sender Model
Do not ship recovery data inline from foreground write goroutines.
Per replica, use:
- one ordered send queue
- one sender loop
The sender loop owns:
- live stream shipping
- reconnect handling
- short-gap catch-up
- reconstruction tail replay
This guarantees:
- strict LSN order per replica
- clean transport state ownership
- no inline shipping races in the primary write path
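A sketch of the per-replica sender model: one ordered queue drained by a single loop, so foreground writers only enqueue. `Record` and the callbacks are illustrative shapes, not the prototype's types:

```go
package main

// Record is an illustrative replicated-write record.
type Record struct {
	LSN uint64
}

// senderLoop preserves strict LSN order because it is the only goroutine
// touching the transport for this replica. On send failure it hands the
// failed record to recovery handling and exits; the sender owns reconnect.
func senderLoop(queue <-chan Record, send func(Record) error, onError func(Record)) {
	for rec := range queue {
		if err := send(rec); err != nil {
			onError(rec)
			return
		}
	}
}
```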
## Write Path
Primary write path:
1. allocate monotonic `LSN`
2. append recovery metadata to local WAL or journal
3. enqueue the record to each replica sender queue
4. return according to durability mode semantics
Flusher later:
- flushes dirty data to extent
- manages checkpoints
- manages bounded retention of WAL and other recovery dependencies
## Recovery Classes
V2 supports more than one local record type.
### `WALInline`
Properties:
- payload lives directly in WAL
- recoverable while WAL is retained
### `ExtentReferenced`
Properties:
- journal entry contains metadata only
- payload is resolved from extent/snapshot generation state
- direct-extent writes and future smart-WAL paths fall into this class
Replica state does not encode these classes.
Instead, the primary must answer a stricter question for reconnect:
- is `(startLSN, endLSN]` fully recoverable under the current epoch, and can it be reserved for the duration of recovery?
## Replica Progress Model
Each replica reports progress explicitly.
### `receivedLSN`
- highest LSN received and appended locally
- not yet a durability guarantee
### `flushedLSN`
- highest LSN durably persisted on the replica
- authoritative sync durability signal
Only `flushedLSN` counts for:
- `sync_all`
- `sync_quorum`
## Replica States
Replica state is defined by `wal-replication-v2-state-machine.md`.
Important highlights:
- `Bootstrapping`
- `InSync`
- `Lagging`
- `CatchingUp`
- `PromotionHold`
- `NeedsRebuild`
- `Rebuilding`
- `CatchUpAfterRebuild`
- `Failed`
Only `InSync` replicas count toward sync durability.
## Protocol
### 1. Normal Streaming
Primary sender loop:
- sends ordered replicated write records
Replica:
1. validates ordering
2. appends locally
3. advances `receivedLSN`
### 2. Barrier / Sync
Primary sends:
- `BarrierReq{LSN, Epoch}`
Replica:
1. wait until `receivedLSN >= LSN`
2. flush durable local state
3. set `flushedLSN = LSN`
4. reply `BarrierResp{Status, FlushedLSN}`
Primary uses this to evaluate mode policy.
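The replica-side barrier handling can be sketched as below. The `replica` struct and field names are assumptions for this note; the real flush would fsync durable state before advancing `flushedLSN`.

```go
type BarrierReq struct{ LSN, Epoch uint64 }

type BarrierResp struct {
	OK         bool
	FlushedLSN uint64
}

// replica is a minimal stand-in for the receiver's local state.
type replica struct {
	epoch       uint64
	receivedLSN uint64
	flushedLSN  uint64
}

// handleBarrier: fence by epoch, require receipt, flush, then report
// durable progress. A not-yet-received barrier is retried later rather
// than answered with a guess.
func (r *replica) handleBarrier(req BarrierReq) BarrierResp {
	if req.Epoch != r.epoch {
		return BarrierResp{OK: false, FlushedLSN: r.flushedLSN} // stale primary
	}
	if r.receivedLSN < req.LSN {
		return BarrierResp{OK: false, FlushedLSN: r.flushedLSN} // wait for receipt
	}
	r.flushedLSN = r.receivedLSN // flush durable local state (fsync elided)
	return BarrierResp{OK: true, FlushedLSN: r.flushedLSN}
}
```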
### 3. Reconnect Handshake
On reconnect, primary obtains:
- current epoch
- primary head
- replica durable `flushedLSN`
Then primary evaluates recovery feasibility.
Possible outcomes:
1. replica already caught up
- state -> `PromotionHold` or `InSync` depending on policy
2. bounded catch-up possible
- reserve recovery window
- state -> `CatchingUp`
3. direct replay not possible
- state -> `NeedsRebuild`
## Recovery Feasibility and Reservation
The key V2 rule is:
- `fully recoverable` is not enough
- the primary must also reserve the recovery window
Recommended engine-side flow:
1. `CheckRecoveryFeasibility(startLSN, endLSN)`
2. if feasible, `ReserveRecoveryWindow(startLSN, endLSN)`
3. only then start `CatchingUp` or `CatchUpAfterRebuild`
A recovery reservation pins:
- recovery metadata
- referenced payload generations
- required snapshots/base images
- current epoch lineage for the session
If the reservation is lost during recovery:
- abort the current attempt
- fall back to `NeedsRebuild`
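The feasibility-then-reserve flow can be sketched as follows; `RecoveryEngine`, its fields, and `outcome` are illustrative stand-ins, not the engine API.

```go
// Reservation pins a recovery window (Start, End] until recovery
// completes or the reservation is lost.
type Reservation struct {
	Start, End uint64
	Lost       bool
}

// RecoveryEngine is a stand-in for primary-side recoverability state.
type RecoveryEngine struct {
	retainedFloor uint64 // oldest LSN still fully replayable
	head          uint64 // current primary head
}

func (e *RecoveryEngine) CheckRecoveryFeasibility(start, end uint64) bool {
	return start >= e.retainedFloor && end <= e.head
}

// ReserveRecoveryWindow only grants a reservation over a feasible range;
// in a real engine this would pin metadata and payload generations.
func (e *RecoveryEngine) ReserveRecoveryWindow(start, end uint64) (*Reservation, bool) {
	if !e.CheckRecoveryFeasibility(start, end) {
		return nil, false
	}
	return &Reservation{Start: start, End: end}, true
}

// outcome shows the abort rule: no (or lost) reservation means rebuild.
func outcome(res *Reservation) string {
	if res == nil || res.Lost {
		return "NeedsRebuild"
	}
	return "CatchingUp"
}
```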
## Tier B: Lagging Catch-up Algorithm
When a replica is behind but within a recoverable retained window:
1. choose a bounded target `H0`
2. reserve `(ReplicaFlushedLSN, H0]`
3. replay the missing range
4. barrier confirms durable `flushedLSN >= H0`
5. enter `PromotionHold`
6. only then restore `InSync`
### Tail-chasing problem
If the primary is writing faster than the replica can catch up, the replica may never converge.
To handle this:
1. define a bounded catch-up window
2. if catch-up rate is slower than ingest rate for too long:
- either temporarily throttle primary admission for strict `sync_all`
- or fail `sync_all` requests and let control-plane policy react
- or abort to rebuild
3. do not let a replica remain in unbounded perpetual `CatchingUp`
### Important rule
For `sync_all`, the data path must not silently downgrade to `best_effort`.
Correct behavior:
- bounded retry
- then fail
Any mode change must be explicit policy, not silent transport behavior.
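The bounded-window escalation rule can be captured in one small policy function; the rate/tick thresholds here are illustrative policy knobs, not prescribed values.

```go
// catchupDecision encodes the tail-chasing guard: if the replica's
// catch-up rate stays below the ingest rate past a bounded window,
// escalate explicitly instead of chasing the head forever.
func catchupDecision(replicaRate, ingestRate float64, behindTicks, maxBehindTicks int) string {
	if replicaRate >= ingestRate {
		return "continue" // converging
	}
	if behindTicks < maxBehindTicks {
		return "continue" // still inside the bounded catch-up window
	}
	// Past the budget: explicit policy, never a silent downgrade.
	// Options: throttle admission, fail sync_all, or abort to rebuild.
	return "escalate"
}
```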
## Tier C: Reconstruction Algorithm
When a replica is too far behind for direct replay:
1. mark replica `NeedsRebuild`
2. choose a real checkpoint/snapshot base at `cpLSN`
3. create a rebuild reservation
4. replica enters `Rebuilding`
5. replica pulls immutable checkpoint/snapshot image
6. replica installs that base image and sets base progress to `cpLSN`
7. primary replays trailing history `(cpLSN, H0]`
8. barrier confirms durable replay
9. replica enters `PromotionHold`
10. replica returns to `InSync`
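The ten steps above collapse into a fixed phase order on the replica FSM side. A sketch of that ordering as data (the state names follow the doc's FSM; the helper is illustrative):

```go
// rebuildPhases lists the replica-visible FSM phases of Tier C
// reconstruction, in the only legal order.
var rebuildPhases = []string{
	"NeedsRebuild",        // steps 1-3: marked, base chosen, reservation created
	"Rebuilding",          // steps 4-6: pull and install base image at cpLSN
	"CatchUpAfterRebuild", // steps 7-8: replay (cpLSN, H0], barrier-confirmed
	"PromotionHold",       // step 9: debounce before rejoining the sync set
	"InSync",              // step 10: durable proof complete
}

// nextRebuildPhase returns the successor phase, or false at the end.
func nextRebuildPhase(cur string) (string, bool) {
	for i, p := range rebuildPhases {
		if p == cur && i+1 < len(rebuildPhases) {
			return rebuildPhases[i+1], true
		}
	}
	return "", false
}
```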
### Why snapshot/base image must be real
If the replica needs state at `cpLSN`, the base image must represent exactly that checkpoint.
Invalid:
- current extent copied at some later time and treated as historical `cpLSN`
Valid:
- immutable snapshot
- copy-on-write checkpoint image
- frozen base image
## Retention and Budget
V2 retention is bounded.
### WAL / recovery metadata retention
Primary keeps only a bounded recent recovery window:
- `max_retained_wal_bytes`
- optionally `max_retained_wal_time`
### Recovery reservation budget
Reservations are also bounded:
- timeout
- bytes pinned
- snapshot dependency lifetime
If a catch-up or rebuild session exceeds its reservation budget:
- primary aborts the session
- replica falls back to `NeedsRebuild`
- a newer rebuild plan may be chosen later
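A minimal sketch of the budget check that triggers that abort; the budget dimensions mirror the knobs above, with illustrative types.

```go
// ReservationBudget bounds a catch-up or rebuild session.
type ReservationBudget struct {
	TimeoutTicks   uint64
	MaxPinnedBytes uint64
}

// SessionUsage is the session's consumption so far.
type SessionUsage struct {
	ElapsedTicks uint64
	PinnedBytes  uint64
}

// Exceeded aborts the session if ANY budget dimension is blown;
// the replica then falls back to NeedsRebuild.
func (b ReservationBudget) Exceeded(u SessionUsage) bool {
	return u.ElapsedTicks > b.TimeoutTicks || u.PinnedBytes > b.MaxPinnedBytes
}
```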
## Sync Modes
### `best_effort`
- ACK after primary local durability
- replicas may lag
- background catch-up or rebuild allowed
### `sync_all`
- ACK only when all required replicas are `InSync` and durably at target LSN
- bounded retry only
- no silent downgrade
### `sync_quorum`
- ACK when enough replicas are `InSync` and durably at target LSN
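The three modes reduce to one ack predicate over per-replica durable progress. A sketch, assuming the caller already holds each replica's FSM state and `flushedLSN` (types and names are illustrative):

```go
// replicaView is the primary's view of one replica at ack time.
type replicaView struct {
	inSync     bool
	flushedLSN uint64
}

// ackAllowed counts only replicas that are InSync AND durably at the
// target LSN, then applies the mode policy. best_effort acks on
// primary-local durability alone.
func ackAllowed(mode string, target uint64, replicas []replicaView, quorum int) bool {
	durable := 0
	for _, r := range replicas {
		if r.inSync && r.flushedLSN >= target {
			durable++
		}
	}
	switch mode {
	case "best_effort":
		return true
	case "sync_all":
		return durable == len(replicas)
	case "sync_quorum":
		return durable >= quorum
	}
	return false
}
```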
## Why This Direction
V2 separates three different concerns cleanly:
1. fast steady-state replication
2. short-gap replay
3. long-gap reconstruction
This avoids forcing WAL alone to solve all recovery cases.
## Implementation Order
Recommended order:
1. pure FSM
2. ordered sender loop
3. bounded direct replay
4. checkpoint/snapshot reconstruction
5. smarter local write path and recovery classes
6. policy and control-plane integration
## Phase 13 current direction
Current Phase 13 / WAL V1 work is still:
- fixing the correctness of WAL-centered sync replication
- focused mainly on bounded WAL replay and rebuild fallback
That is the right bridge.
V2 should follow after WAL V1 closes.
## Bottom Line
V2 is not "more WAL features."
It is:
- explicit recovery feasibility
- explicit recovery reservations
- ordered sender loops
- short-gap replay for recent lag
- checkpoint/snapshot reconstruction for long lag
- promotion back to `InSync` only after durable proof

349
sw-block/design/wal-v1-to-v2-mapping.md

@@ -0,0 +1,349 @@
# WAL V1 To V2 Mapping
Date: 2026-03-26
Status: working note
Purpose: map the current WAL V1 scattered state across `sw-block` into the proposed WAL V2 FSM vocabulary
## Why This Note Exists
Current WAL V1 correctness logic is spread across:
- `wal_shipper.go`
- `replica_apply.go`
- `dist_group_commit.go`
- `blockvol.go`
- `promotion.go`
- `rebuild.go`
- heartbeat/master reporting
This note does not propose immediate code changes.
It exists to answer two questions:
1. what state already exists in WAL V1 today?
2. how does that state map into the cleaner WAL V2 FSM model?
## Current V1 State Owners
### 1. Shipper state
Primary-side per-replica transport and recovery state lives mainly in:
- `weed/storage/blockvol/wal_shipper.go`
Current V1 shipper states:
- `ReplicaDisconnected`
- `ReplicaConnecting`
- `ReplicaCatchingUp`
- `ReplicaInSync`
- `ReplicaDegraded`
- `ReplicaNeedsRebuild`
Other shipper-owned flags/anchors:
- `replicaFlushedLSN`
- `hasFlushedProgress`
- `catchupFailures`
- `lastContactTime`
### 2. Replica receiver progress
Replica-side receive/apply progress lives mainly in:
- `weed/storage/blockvol/replica_apply.go`
Current V1 replica progress:
- `receivedLSN`
- `flushedLSN`
- duplicate/gap handling in `applyEntry()`
### 3. Volume-level durability policy
Volume-level sync semantics live mainly in:
- `weed/storage/blockvol/dist_group_commit.go`
Current V1 policy uses:
- local WAL sync result
- per-shipper barrier results
- `DurabilityBestEffort`
- `DurabilitySyncAll`
- `DurabilitySyncQuorum`
### 4. Volume-level retention/checkpoint state
Primary-side local checkpoint and WAL retention state lives mainly in:
- `weed/storage/blockvol/blockvol.go`
- `weed/storage/blockvol/flusher.go`
Current V1 anchors:
- `nextLSN`
- `CheckpointLSN()`
- WAL retained range
- retention-floor callbacks from `ShipperGroup`
### 5. Role/assignment state
Master-driven volume role state lives mainly in:
- `weed/storage/blockvol/promotion.go`
- `weed/storage/blockvol/blockvol.go`
- `weed/server/volume_server_block.go`
Current V1 roles:
- `RolePrimary`
- `RoleReplica`
- `RoleStale`
- `RoleRebuilding`
- `RoleDraining`
### 6. Rebuild state
Existing V1 rebuild transport/process lives mainly in:
- `weed/storage/blockvol/rebuild.go`
Current V1 rebuild phases:
- WAL catch-up attempt
- full extent copy
- trailing WAL catch-up
- rejoin via assignment + fresh shipper bootstrap
### 7. Heartbeat/master-visible replication state
Master-visible state lives mainly in:
- `weed/storage/blockvol/block_heartbeat.go`
- `weed/storage/blockvol/blockvol.go`
- server-side registry/master handling
Current V1 visible fields include:
- `ReplicaDegraded`
- `ReplicaShipperStates []ReplicaShipperStatus`
- role/epoch/checkpoint/head state
## V1 To V2 Mapping
### Shipper state mapping
| WAL V1 shipper state | Proposed WAL V2 FSM state | Notes |
| --- | --- | --- |
| `ReplicaDisconnected` | `Bootstrapping` or `Lagging` | Fresh shipper with no durable progress maps to `Bootstrapping`; previously-synced disconnected replica maps to `Lagging`. |
| `ReplicaConnecting` | transitional part of `Lagging -> CatchingUp` | V2 should model this as an event/session phase, not a durable steady state. |
| `ReplicaCatchingUp` | `CatchingUp` | Direct mapping for short-gap replay. |
| `ReplicaInSync` | `InSync` | Direct mapping. |
| `ReplicaDegraded` | `Lagging` | V1 transport failure state becomes the cleaner V2 recovery-needed state. |
| `ReplicaNeedsRebuild` | `NeedsRebuild` | Direct mapping. |
Main V1 cleanup opportunity:
- V1 mixes transport/session detail (`Connecting`) with recovery lifecycle state.
- V2 should keep the long-lived FSM smaller and push connection mechanics into sender-loop/session logic.
### Replica receiver progress mapping
| WAL V1 field | WAL V2 concept | Notes |
| --- | --- | --- |
| `receivedLSN` | `receivedLSN` | Keep as transport/apply progress only. |
| `flushedLSN` | `replicaFlushedLSN` | Keep as authoritative durability anchor. |
| duplicate/gap rules | replay validity rules | These become part of the V2 replay contract, not ad hoc receiver behavior. |
Main V1 cleanup opportunity:
- V1 receiver progress is already conceptually sound.
- V2 should keep it but drive it from explicit FSM transitions and replay reservations.
### Volume durability policy mapping
| WAL V1 behavior | WAL V2 concept | Notes |
| --- | --- | --- |
| `BarrierAll` against current shippers | promotion and sync gate | V2 should keep barrier-based durability truth. |
| `sync_all` requires all barriers | `InSync` eligibility gate | Same rule, but V2 eligibility should come from FSM state rather than scattered checks. |
| `best_effort` ignores barrier failures | background recovery mode | Same high-level policy. |
| `sync_quorum` counts successful barriers | quorum over `InSync` replicas | Same direction, but should be derived from explicit FSM state. |
Main V1 cleanup opportunity:
- durability mode logic should depend on `IsSyncEligible()`-style state, not raw shipper state enums spread across code.
### Retention/checkpoint mapping
| WAL V1 concept | WAL V2 concept | Notes |
| --- | --- | --- |
| `CheckpointLSN()` | checkpoint/base anchor | Keep, but V2 also adds explicit `cpLSN` snapshot semantics. |
| retention floor from recoverable replicas | recoverability budget | Keep the idea, but V2 turns this into explicit reservation management. |
| timeout-based `NeedsRebuild` | janitor-driven `Lagging -> NeedsRebuild` | Keep as background control logic, not hot-path mutation. |
Main V1 cleanup opportunity:
- V1 retains data because replicas might need it.
- V2 should reserve specific recovery windows, not rely only on ambient retention conditions.
### Role/assignment mapping
| WAL V1 role state | WAL V2 meaning | Notes |
| --- | --- | --- |
| `RolePrimary` | primary ownership / epoch authority | Not a replica FSM state; remains volume/control-plane state. |
| `RoleReplica` | replica service role | Orthogonal to replication FSM state. A replica volume may be `RoleReplica` while its sender-facing state is `Bootstrapping`, `Lagging`, or `InSync`. |
| `RoleStale` | pre-rebuild/non-serving | Closest to `NeedsRebuild` preparation on the volume role side. |
| `RoleRebuilding` | rebuild session role | Maps to volume-wide orchestration around V2 `Rebuilding`. |
| `RoleDraining` | assignment/failover coordination | Outside replica FSM; remains a volume transition role. |
Main V1 cleanup opportunity:
- role state and replication FSM state are different dimensions.
- V1 sometimes implicitly blends them.
- V2 should keep them separate:
- control-plane role FSM
- per-replica replication FSM
### Rebuild flow mapping
| WAL V1 rebuild phase | WAL V2 FSM phase | Notes |
| --- | --- | --- |
| WAL catch-up pre-pass | `Lagging -> CatchingUp` if feasible | Same idea, but V2 requires recoverability proof and reservation. |
| full extent copy | `NeedsRebuild -> Rebuilding` | Same high-level phase. |
| trailing WAL catch-up | `CatchUpAfterRebuild` | Direct conceptual mapping. |
| fresh shipper bootstrap after reassignment | `Bootstrapping` then promotion | V1 does this through assignment refresh; V2 may eventually do it with cleaner local transitions. |
Main V1 cleanup opportunity:
- V1 rebuild success is currently rejoined indirectly through control-plane reassignment.
- V2 should eventually make rebuild completion and promotion explicit FSM transitions.
### Heartbeat/master state mapping
| WAL V1 visible state | WAL V2 meaning | Notes |
| --- | --- | --- |
| `ReplicaShipperStatus{DataAddr, State, FlushedLSN}` | control-plane view of per-replica FSM | Good starting shape. |
| `ReplicaDegraded` | derived summary only | Too coarse for V2 decision-making; keep only as convenience/compat field. |
| role/epoch/head/checkpoint | role FSM + replication anchors | Continue reporting; V2 may need richer recovery reservation visibility later. |
Main V1 cleanup opportunity:
- master-facing replication state should be per replica, not summarized as one degraded bit.
## Current V1 Event Sources vs V2 Events
### V1 event source: `Barrier()` outcome
Current effects:
- mark `InSync`
- update `replicaFlushedLSN`
- mark degraded on error
V2 event mapping:
- `BarrierSuccess`
- `BarrierFailure`
- `PromotionHealthy`
### V1 event source: reconnect handshake
Current effects:
- `Connecting`
- choose `InSync`, `CatchingUp`, or `NeedsRebuild`
V2 event mapping:
- `ReconnectObserved`
- `RecoveryFeasible`
- `RecoveryReservationGranted`
- `ReconnectNeedsRebuild`
### V1 event source: retention budget evaluation
Current effects:
- stale replica becomes `NeedsRebuild`
V2 event mapping:
- `RecoverabilityExpired`
- `BackgroundJanitorNeedsRebuild`
### V1 event source: rebuild assignment and `StartRebuild`
Current effects:
- role becomes `RoleRebuilding`
- run baseline + trailing catch-up
- rejoin later via reassignment
V2 event mapping:
- `StartRebuild`
- `RebuildBaseApplied`
- `RebuildReservationLost`
- `RebuildCompleteReadyForPromotion`
## Main Gaps Between V1 And V2
### 1. V1 has shipper state, but not a pure FSM
Current V1 state is embedded in:
- transport logic
- barrier logic
- retention logic
- rebuild orchestration
V2 goal:
- one pure FSM that owns state and anchors
- transport/session code only executes actions
### 2. V1 does not model reservation explicitly
Current V1 asks, roughly:
- is WAL still retained?
V2 must ask:
- is `(startLSN, endLSN]` fully recoverable?
- can the primary reserve that window until recovery completes?
### 3. V1 has no explicit promotion debounce state
Current V1 goes effectively:
- caught up -> `InSync`
V2 adds:
- `PromotionHold`
### 4. V1 rebuild completion is control-plane indirect
Current V1:
- old `NeedsRebuild` shipper stays stuck
- master reassigns
- fresh shipper bootstraps
V2 likely wants:
- cleaner local FSM transitions, even if control plane still participates
### 5. V1 does not yet encode recovery classes
Current V1 is mostly WAL-centric.
V2 should support:
- `WALInline`
- `ExtentReferenced`
without leaking storage details into replica state.
## What Should Stay From V1
These V1 ideas are solid and should be preserved:
1. `replicaFlushedLSN` as sync truth
2. barrier-driven durability confirmation
3. explicit `NeedsRebuild`
4. per-replica status reporting to master
5. retention budgets eventually forcing rebuild
6. rebuild as a separate path from normal catch-up
## What Should Move In V2
These are the main redesign items:
1. move scattered shipper/recovery state into one pure FSM
2. separate transport/session phases from durable FSM state
3. add `Bootstrapping` and `PromotionHold`
4. add recoverability proof and reservation as first-class concepts
5. make replay/rebuild admission depend on reservation, not just present-time checks
6. cleanly separate:
- control-plane role FSM
- per-replica replication FSM
## Bottom Line
WAL V1 already contains most of the important primitives:
- durable progress
- barrier truth
- catch-up
- rebuild detection
- master-visible per-replica state
What V2 changes is not the existence of these ideas.
It changes their organization:
- from scattered transport/rebuild logic
- to one explicit, testable FSM with recovery reservations and cleaner state boundaries

277
sw-block/design/wal-v2-tiny-prototype.md

@@ -0,0 +1,277 @@
# WAL V2 Tiny Prototype
Date: 2026-03-26
Status: design/prototyping plan
Purpose: validate the core V2 replication logic before committing to a broader redesign
## Goal
Build a small, non-production prototype that proves the core V2 ideas:
1. `ExtentBackend` abstraction
2. 3-tier replication FSM
3. async ordered sender loop
4. barrier-driven durability tracking
5. short-gap catch-up vs long-gap rebuild boundary
6. recovery feasibility and reservation semantics
This prototype is for discovering:
- state complexity
- recovery correctness
- sender-loop behavior
- performance shape
It is not for shipping.
## Prototype Scope
### 1. Extent backend isolation layer
Define a clean backend interface for extent reads/writes.
Initial implementation:
- `FileBackend`
- normal Linux file
- `pread`
- `pwrite`
- optional `fallocate`
Do not start with raw-device allocation.
The point is to stabilize:
- extent semantics
- base-image import/export assumptions
- checkpoint/snapshot integration points
### 2. V2 asynchronous replication FSM
Build a pure in-memory FSM for one replica.
FSM owns:
- state
- anchor LSNs
- transition legality
- sync eligibility
- action suggestions
- recovery reservation metadata
Target state set:
- `Bootstrapping`
- `InSync`
- `Lagging`
- `CatchingUp`
- `PromotionHold`
- `NeedsRebuild`
- `Rebuilding`
- `CatchUpAfterRebuild`
- `Failed`
The FSM must not do:
- network I/O
- disk I/O
- goroutine management
### 3. Sender loop + barrier primitive
For each replica:
- one ordered sender goroutine
- one non-blocking enqueue path from primary write path
- one barrier/progress path
Primary write path:
1. allocate `LSN`
2. append local WAL/journal metadata
3. enqueue to sender loop
4. return according to durability mode
The sender loop is responsible for:
- live ordered send
- reconnect handling
- catch-up replay
- rebuild-tail replay
## Explicit Non-Goals
These are intentionally excluded from the tiny prototype:
- raw allocator
- garbage collection
- `NVMe-oF`
- `ublk`
- chain replication
- CSI / control plane
- multi-replica quorum
- encryption
- real snapshot storage optimization
These are extension layers, not the core logic being validated here.
## Design Principle
Those excluded items are not being rejected.
They are treated as:
- extensions of the core logic
The prototype should be designed so they can later plug in without rewriting the state machine.
## Suggested Layout
One reasonable layout:
- `weed/storage/blockvol/fsmv2/`
- `fsm.go`
- `events.go`
- `actions.go`
- `fsm_test.go`
- `weed/storage/blockvol/prototypev2/`
- `backend.go`
- `file_backend.go`
- `sender_loop.go`
- `barrier.go`
- `prototype_test.go`
Preferred direction:
- keep it close enough to production packages that later reuse is easy
- but clearly marked experimental
## Core Interfaces
### Extent backend
Example direction:
```go
type ExtentBackend interface {
ReadAt(p []byte, off int64) (int, error)
WriteAt(p []byte, off int64) (int, error)
Sync() error
Size() uint64
}
```
### FSM
Example direction:
```go
type ReplicaFSM struct {
// state
// epoch
// anchor LSNs
// reservation metadata
}
func (f *ReplicaFSM) Apply(evt ReplicaEvent) ([]ReplicaAction, error)
```
### Sender loop
Example direction:
```go
type SenderLoop struct {
// input queue
// FSM
// transport mock/adapter
}
```
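The ordering property the sender loop must provide can be shown with a tiny mock: one goroutine drains the queue in arrival (LSN) order while the write path only enqueues. Types here are illustrative, not the prototype's.

```go
// rec is a minimal replicated record for the ordering sketch.
type rec struct{ lsn uint64 }

// senderLoop drains its input channel in order; the "transport" is a
// mock that just records what was shipped.
type senderLoop struct {
	in   chan rec
	sent []uint64 // records shipped, in send order
}

// run is the single ordered sender goroutine for one replica.
// Reconnect, catch-up, and rebuild-tail branches would live here too.
func (s *senderLoop) run(done chan struct{}) {
	for r := range s.in {
		s.sent = append(s.sent, r.lsn)
	}
	close(done)
}
```

Because a channel is drained by exactly one goroutine, per-replica LSN order falls out of the structure rather than from locking in the write path.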
## What The Prototype Must Prove
### A. FSM correctness
The FSM must show that the state set is sufficient and coherent.
Key scenarios:
1. `Bootstrapping -> InSync`
2. `InSync -> Lagging -> CatchingUp -> PromotionHold -> InSync`
3. `Lagging -> NeedsRebuild -> Rebuilding -> CatchUpAfterRebuild -> PromotionHold -> InSync`
4. epoch change aborts catch-up
5. epoch change aborts rebuild
6. reservation-lost aborts catch-up
7. rebuild-too-slow aborts reconstruction
8. flapping replica does not instantly re-enter `InSync`
### B. Sender ordering
The sender loop must prove:
- strict LSN order per replica
- no inline ship races from concurrent writes
- decoupled foreground write path
### C. Barrier semantics
Barrier must prove:
- it waits on replica progress
- it uses `flushedLSN`, not transport guesses
- it can drive promotion eligibility cleanly
### D. Recovery boundary
Prototype must make the handoff explicit:
- recent lag -> reserved replay window
- long lag -> rebuild from base image + trailing replay
### E. Recovery reservation
Prototype must make this explicit:
- a window is not enough
- it must be provable and then reserved
- losing the reservation must abort recovery cleanly
## Performance Questions The Prototype Should Answer
Not benchmark headlines.
Instead:
1. how much contention disappears from the hot write path after removing inline ship
2. how queue depth grows under slow replicas
3. when catch-up stops converging
4. how expensive promotion hold is
5. how much complexity is added by rebuild-tail replay
6. how much complexity is added by reservation management
## Success Criteria
The tiny prototype is successful if it gives clear answers to:
1. can the V2 FSM be made explicit and testable?
2. does sender-loop ordering materially simplify the replication path?
3. is the catch-up vs rebuild boundary coherent under a moving primary head?
4. does reservation-based recoverability make the design safer and clearer?
5. does the architecture look simpler than extending WAL V1 forever?
## Failure Criteria
The prototype should be considered unsuccessful if:
1. state count explodes and remains hard to reason about
2. sender loop does not materially simplify ordering/recovery
3. promotion and recovery rules remain too coupled to ad hoc timers and network callbacks
4. rebuild-from-base + trailing replay is still ambiguous even in a controlled prototype
5. reservation handling turns into unbounded complexity
## Relationship To WAL V1
WAL V1 remains the current delivery line.
This prototype is not a replacement for:
- `CP13-6`
- `CP13-7`
- `CP13-8`
- `CP13-9`
It exists to inform what should move into WAL V2 after WAL V1 closes.
## Bottom Line
The tiny prototype should validate the core logic only:
- clean backend boundary
- explicit FSM
- ordered async sender
- recoverability as a proof-plus-reservation problem
- rebuild as a separate recovery mode, not a WAL accident

14
sw-block/private/README.md

@@ -0,0 +1,14 @@
# private
Deprecated in favor of `../.private/`.
Private working area for:
- design sketches
- draft notes
- temporary comparison docs
- prototype experiments not ready to move into shared design docs
Keep production-independent work here until it is ready to be promoted into:
- `../design/`
- `../prototype/`
- or the main repo docs under `learn/projects/sw-block/`

23
sw-block/prototype/README.md

@@ -0,0 +1,23 @@
# V2 Prototype
Experimental WAL V2 prototype code lives here.
Current prototype:
- `fsmv2/`: pure in-memory replication FSM prototype
- `volumefsm/`: volume-level orchestrator prototype above `fsmv2`
- `distsim/`: early distributed/data-correctness simulator with synthetic 4K block values
Rules:
- do not wire this directly into WAL V1 production code
- keep interfaces and tests focused on architecture learning
- promote pieces into production only after V2 design stabilizes
## Windows test workflow
Because normal `go test` may be blocked by Windows Defender when it executes temporary test binaries from `%TEMP%`, use:
```powershell
powershell -ExecutionPolicy Bypass -File .\sw-block\prototype\run-tests.ps1
```
This builds test binaries into the workspace and runs them directly.

1120
sw-block/prototype/distsim/cluster.go
File diff suppressed because it is too large

1004
sw-block/prototype/distsim/cluster_test.go
File diff suppressed because it is too large

BIN
sw-block/prototype/distsim/distsim.test.exe

266
sw-block/prototype/distsim/eventsim.go

@@ -0,0 +1,266 @@
// eventsim.go — timeout events and timer-race infrastructure.
//
// This file implements the eventsim layer within the distsim package.
// The two conceptual layers share the Cluster model but serve different purposes:
//
// distsim (protocol layer — cluster.go, protocol.go):
// - Protocol correctness: epoch fencing, barrier semantics, commit rules
// - Reference-state validation: AssertCommittedRecoverable
// - Recoverability logic: catch-up, rebuild, reservation
// - Promotion/lineage: candidate eligibility, ranking
// - Endpoint identity: address versioning, stale endpoint rejection
// - Control-plane flow: heartbeat → detect → assignment
//
// eventsim (timing/race layer — this file):
// - Explicit timeout events: barrier, catch-up, reservation
// - Timer-triggered state transitions
// - Same-tick race resolution: data events process before timeouts
// - Timeout cancellation on successful ack/convergence
//
// Boundary rule:
// - A scenario belongs in distsim tests if the bug is protocol-level
// (wrong state, wrong commit, wrong rejection).
// - A scenario belongs in eventsim tests if the bug is timing-level
// (race between ack and timeout, ordering of concurrent events).
// - Do not duplicate scenarios across both layers unless
// timer/event ordering is the actual bug surface.
package distsim
import "fmt"
// TimeoutKind identifies the type of timeout event.
type TimeoutKind string
const (
TimeoutBarrier TimeoutKind = "barrier"
TimeoutCatchup TimeoutKind = "catchup"
TimeoutReservation TimeoutKind = "reservation"
)
// PendingTimeout represents a registered timeout that has not yet fired or been cancelled.
type PendingTimeout struct {
Kind TimeoutKind
ReplicaID string
LSN uint64 // for barrier timeouts: which LSN's barrier
DeadlineAt uint64 // absolute tick when timeout fires
Cancelled bool
}
// FiredTimeout records a timeout that actually fired (was not cancelled in time).
type FiredTimeout struct {
PendingTimeout
FiredAt uint64
}
// barrierExpiredKey uniquely identifies a timed-out barrier instance.
type barrierExpiredKey struct {
ReplicaID string
LSN uint64
}
// RegisterTimeout adds a pending timeout to the cluster.
func (c *Cluster) RegisterTimeout(kind TimeoutKind, replicaID string, lsn uint64, deadline uint64) {
c.Timeouts = append(c.Timeouts, PendingTimeout{
Kind: kind,
ReplicaID: replicaID,
LSN: lsn,
DeadlineAt: deadline,
})
}
// CancelTimeout cancels a pending timeout matching the given kind, replica, and LSN.
// For catch-up/reservation timeouts, LSN is ignored (matched by kind+replica only).
func (c *Cluster) CancelTimeout(kind TimeoutKind, replicaID string, lsn uint64) {
for i := range c.Timeouts {
t := &c.Timeouts[i]
if t.Cancelled {
continue
}
if t.Kind != kind || t.ReplicaID != replicaID {
continue
}
if kind == TimeoutBarrier && t.LSN != lsn {
continue
}
t.Cancelled = true
c.logEvent(EventTimeoutCancelled, fmt.Sprintf("%s replica=%s lsn=%d", kind, replicaID, t.LSN))
}
}
// fireTimeouts checks all pending timeouts against the current tick.
// Called by Tick() AFTER message delivery, so data events (acks) get
// a chance to cancel timeouts before they fire. This is the same-tick
// race resolution rule: data before timers.
//
// State-guard rules (prevent stale timeout from mutating post-success state):
// - CatchupTimeout only fires if replica is still CatchingUp
// - ReservationTimeout only fires if replica is still CatchingUp
// - BarrierTimeout marks the barrier instance as expired (late acks rejected)
func (c *Cluster) fireTimeouts() {
var remaining []PendingTimeout
for i := range c.Timeouts {
t := c.Timeouts[i]
if t.Cancelled {
continue
}
if c.Now < t.DeadlineAt {
remaining = append(remaining, t)
continue
}
// Check whether the timeout still has authority to mutate state.
stale := false
switch t.Kind {
case TimeoutBarrier:
// Barrier timeouts always apply — they mark the instance as expired.
case TimeoutCatchup, TimeoutReservation:
// Only valid if replica is still CatchingUp. If already recovered
// or escalated, the timeout is stale and has no authority.
if n := c.Nodes[t.ReplicaID]; n == nil || n.ReplicaState != NodeStateCatchingUp {
stale = true
}
}
if stale {
c.IgnoredTimeouts = append(c.IgnoredTimeouts, FiredTimeout{
PendingTimeout: t,
FiredAt: c.Now,
})
c.logEvent(EventTimeoutIgnored, fmt.Sprintf("%s replica=%s lsn=%d (stale)", t.Kind, t.ReplicaID, t.LSN))
continue
}
// Timeout fires with authority.
c.FiredTimeouts = append(c.FiredTimeouts, FiredTimeout{
PendingTimeout: t,
FiredAt: c.Now,
})
c.logEvent(EventTimeoutFired, fmt.Sprintf("%s replica=%s lsn=%d", t.Kind, t.ReplicaID, t.LSN))
switch t.Kind {
case TimeoutBarrier:
c.removeQueuedBarrier(t.ReplicaID, t.LSN)
c.ExpiredBarriers[barrierExpiredKey{t.ReplicaID, t.LSN}] = true
case TimeoutCatchup:
c.Nodes[t.ReplicaID].ReplicaState = NodeStateNeedsRebuild
case TimeoutReservation:
c.Nodes[t.ReplicaID].ReplicaState = NodeStateNeedsRebuild
}
}
c.Timeouts = remaining
}
// removeQueuedBarrier removes a re-queuing barrier from the message queue
// after its timeout fires. Without this, the barrier would re-queue indefinitely.
func (c *Cluster) removeQueuedBarrier(replicaID string, lsn uint64) {
var kept []inFlightMessage
for _, item := range c.Queue {
if item.msg.Kind == MsgBarrier && item.msg.To == replicaID && item.msg.TargetLSN == lsn {
continue
}
kept = append(kept, item)
}
c.Queue = kept
}
// cancelRecoveryTimeouts cancels all catch-up and reservation timeouts for a replica.
// Called automatically by CatchUpWithEscalation on convergence or escalation,
// so stale timeouts cannot regress a replica that already recovered or failed.
func (c *Cluster) cancelRecoveryTimeouts(replicaID string) {
c.CancelTimeout(TimeoutCatchup, replicaID, 0)
c.CancelTimeout(TimeoutReservation, replicaID, 0)
}
// === Tick event log ===
// TickEventKind identifies the type of event within a tick.
type TickEventKind string
const (
EventDeliveryAccepted TickEventKind = "delivery_accepted"
EventDeliveryRejected TickEventKind = "delivery_rejected"
EventTimeoutFired TickEventKind = "timeout_fired"
EventTimeoutIgnored TickEventKind = "timeout_ignored"
EventTimeoutCancelled TickEventKind = "timeout_cancelled"
)
// TickEvent records a single event within a tick, in processing order.
type TickEvent struct {
Tick uint64
Kind TickEventKind
Detail string
}
// logEvent appends a tick event to the cluster's event log.
func (c *Cluster) logEvent(kind TickEventKind, detail string) {
c.TickLog = append(c.TickLog, TickEvent{Tick: c.Now, Kind: kind, Detail: detail})
}
// TickEventsAt returns all events recorded at a specific tick.
func (c *Cluster) TickEventsAt(tick uint64) []TickEvent {
var events []TickEvent
for _, e := range c.TickLog {
if e.Tick == tick {
events = append(events, e)
}
}
return events
}
// === Trace infrastructure ===
// Trace captures a snapshot of cluster state for debugging failed scenarios.
// Reusable across test files and future replay/debug tooling.
type Trace struct {
Tick uint64
CommittedLSN uint64
PrimaryID string
Epoch uint64
NodeStates map[string]string
FiredTimeouts []string
IgnoredTimeouts []string
TickEvents []TickEvent // full ordered event log
Deliveries int
Rejections int
QueueDepth int
}
// BuildTrace captures the current cluster state as a debuggable trace.
func BuildTrace(c *Cluster) Trace {
tr := Trace{
Tick: c.Now,
CommittedLSN: c.Coordinator.CommittedLSN,
PrimaryID: c.Coordinator.PrimaryID,
Epoch: c.Coordinator.Epoch,
NodeStates: map[string]string{},
TickEvents: c.TickLog,
Deliveries: len(c.Deliveries),
Rejections: len(c.Rejected),
QueueDepth: len(c.Queue),
}
for id, n := range c.Nodes {
tr.NodeStates[id] = fmt.Sprintf("role=%s state=%s epoch=%d flushed=%d running=%v",
n.Role, n.ReplicaState, n.Epoch, n.Storage.FlushedLSN, n.Running)
}
for _, ft := range c.FiredTimeouts {
tr.FiredTimeouts = append(tr.FiredTimeouts,
fmt.Sprintf("%s replica=%s lsn=%d fired_at=%d", ft.Kind, ft.ReplicaID, ft.LSN, ft.FiredAt))
}
for _, it := range c.IgnoredTimeouts {
tr.IgnoredTimeouts = append(tr.IgnoredTimeouts,
fmt.Sprintf("%s replica=%s lsn=%d stale_at=%d", it.Kind, it.ReplicaID, it.LSN, it.FiredAt))
}
return tr
}
// === Query helpers ===
// FiredTimeoutsByKind returns the count of fired timeouts of a specific kind.
func (c *Cluster) FiredTimeoutsByKind(kind TimeoutKind) int {
count := 0
for _, ft := range c.FiredTimeouts {
if ft.Kind == kind {
count++
}
}
return count
}

213
sw-block/prototype/distsim/phase02_advanced_test.go

@@ -0,0 +1,213 @@
package distsim
import (
"testing"
)
// ============================================================
// Phase 02: Item 4 — Smart WAL recovery-class transitions
// ============================================================
// Test: recovery starts with resolvable ExtentReferenced records,
// then a payload becomes unresolvable during active recovery.
// Protocol must detect the transition and abort to NeedsRebuild.
func TestP02_SmartWAL_RecoverableThenUnrecoverable(t *testing.T) {
// Build recovery records: first 3 WALInline, then 2 ExtentReferenced.
records := []RecoveryRecord{
{Write: Write{LSN: 1, Block: 1, Value: 1}, Class: RecoveryClassWALInline},
{Write: Write{LSN: 2, Block: 2, Value: 2}, Class: RecoveryClassWALInline},
{Write: Write{LSN: 3, Block: 3, Value: 3}, Class: RecoveryClassWALInline},
{Write: Write{LSN: 4, Block: 4, Value: 4}, Class: RecoveryClassExtentReferenced, PayloadResolvable: true},
{Write: Write{LSN: 5, Block: 5, Value: 5}, Class: RecoveryClassExtentReferenced, PayloadResolvable: true},
}
// Initially fully recoverable.
if !FullyRecoverable(records) {
t.Fatal("initial records should be fully recoverable")
}
// Simulate payload becoming unresolvable (e.g., extent generation GC'd).
records[4].PayloadResolvable = false
// Now NOT recoverable — must detect and abort.
if FullyRecoverable(records) {
t.Fatal("after payload loss, records should NOT be recoverable")
}
// Apply only the recoverable prefix.
state := ApplyRecoveryRecords(records[:4], 0, 4) // only first 4
if state[4] != 4 {
t.Fatalf("partial apply: block 4 should be 4, got %d", state[4])
}
if _, has5 := state[5]; has5 {
t.Fatal("block 5 should NOT be in partial state — payload was lost")
}
}
func TestP02_SmartWAL_MixedClassRecovery_FullSuccess(t *testing.T) {
records := []RecoveryRecord{
{Write: Write{LSN: 1, Block: 0, Value: 10}, Class: RecoveryClassWALInline},
{Write: Write{LSN: 2, Block: 1, Value: 20}, Class: RecoveryClassExtentReferenced, PayloadResolvable: true},
{Write: Write{LSN: 3, Block: 0, Value: 30}, Class: RecoveryClassWALInline},
{Write: Write{LSN: 4, Block: 2, Value: 40}, Class: RecoveryClassExtentReferenced, PayloadResolvable: true},
}
if !FullyRecoverable(records) {
t.Fatal("all resolvable — should be recoverable")
}
state := ApplyRecoveryRecords(records, 0, 4)
// Block 0 overwritten: 10 then 30.
if state[0] != 30 {
t.Fatalf("block 0: got %d, want 30", state[0])
}
if state[1] != 20 {
t.Fatalf("block 1: got %d, want 20", state[1])
}
if state[2] != 40 {
t.Fatalf("block 2: got %d, want 40", state[2])
}
}
func TestP02_SmartWAL_TimeVaryingAvailability(t *testing.T) {
// Simulate time-varying payload availability:
// At time T1, all records are recoverable.
// At time T2, one becomes unrecoverable.
// At time T3, it becomes recoverable again (re-pinned).
records := []RecoveryRecord{
{Write: Write{LSN: 1, Block: 0, Value: 1}, Class: RecoveryClassWALInline},
{Write: Write{LSN: 2, Block: 1, Value: 2}, Class: RecoveryClassExtentReferenced, PayloadResolvable: true},
{Write: Write{LSN: 3, Block: 2, Value: 3}, Class: RecoveryClassExtentReferenced, PayloadResolvable: true},
}
// T1: all recoverable.
if !FullyRecoverable(records) {
t.Fatal("T1: should be recoverable")
}
// T2: payload for LSN 2 lost.
records[1].PayloadResolvable = false
if FullyRecoverable(records) {
t.Fatal("T2: should NOT be recoverable after payload loss")
}
// T3: payload re-pinned (e.g., operator restores snapshot).
records[1].PayloadResolvable = true
if !FullyRecoverable(records) {
t.Fatal("T3: should be recoverable after re-pin")
}
}
// ============================================================
// Phase 02: Item 5 — Strengthen S5 (flapping replica)
// ============================================================
// S5 strengthened: repeated disconnect/reconnect with catch-up
// state tracking. If flapping exceeds budget, escalate to NeedsRebuild.
func TestP02_S5_FlappingWithStateTracking(t *testing.T) {
c := NewCluster(CommitSyncQuorum, "p", "r1", "r2")
c.MaxCatchupAttempts = 10 // generous for flapping
// Initial writes.
c.CommitWrite(1)
c.CommitWrite(2)
c.TickN(5)
r1 := c.Nodes["r1"]
// 5 flapping cycles — each creates a small gap then catches up.
for cycle := 0; cycle < 5; cycle++ {
c.Disconnect("p", "r1")
c.Disconnect("r1", "p")
c.CommitWrite(uint64(3 + cycle*2))
c.CommitWrite(uint64(4 + cycle*2))
c.TickN(3)
c.Connect("p", "r1")
c.Connect("r1", "p")
r1.ReplicaState = NodeStateCatchingUp
converged := c.CatchUpWithEscalation("r1", 100)
if !converged {
t.Fatalf("cycle %d: catch-up should converge for small gap", cycle)
}
if r1.ReplicaState != NodeStateInSync {
t.Fatalf("cycle %d: expected InSync, got %s", cycle, r1.ReplicaState)
}
}
// After 5 successful flaps, CatchupAttempts should be 0 (reset on success).
if r1.CatchupAttempts != 0 {
t.Fatalf("CatchupAttempts should be 0 after successful catch-ups, got %d", r1.CatchupAttempts)
}
// No unnecessary rebuild — r1 should NOT have a base snapshot.
if r1.Storage.BaseSnapshot != nil {
t.Fatal("flapping replica should not have been rebuilt — only WAL catch-up")
}
if err := c.AssertCommittedRecoverable("r1"); err != nil {
t.Fatal(err)
}
}
func TestP02_S5_FlappingExceedsBudget_EscalatesToNeedsRebuild(t *testing.T) {
c := NewCluster(CommitSyncQuorum, "p", "r1", "r2")
c.MaxCatchupAttempts = 3 // tight budget
c.CommitWrite(1)
c.TickN(5)
r1 := c.Nodes["r1"]
// Each flap creates a gap, but primary writes a LOT during disconnect.
// Catch-up recovers only 1 entry per attempt. After MaxCatchupAttempts
// non-convergent attempts, escalate.
for cycle := 0; cycle < 5; cycle++ {
c.Disconnect("p", "r1")
c.Disconnect("r1", "p")
// Large writes during disconnect.
for w := 0; w < 30; w++ {
c.CommitWrite(uint64(cycle*30+w+2) % 8)
}
c.TickN(3)
c.Connect("p", "r1")
c.Connect("r1", "p")
r1.ReplicaState = NodeStateCatchingUp
// Try catch-up with small batch — will not converge.
for attempt := 0; attempt < 5; attempt++ {
c.Disconnect("p", "r1")
c.Disconnect("r1", "p")
for w := 0; w < 10; w++ {
c.CommitWrite(uint64(200+cycle*50+attempt*10+w) % 8)
}
c.TickN(2)
c.Connect("p", "r1")
c.Connect("r1", "p")
c.CatchUpWithEscalation("r1", 1)
if r1.ReplicaState == NodeStateNeedsRebuild {
t.Logf("flapping escalated to NeedsRebuild at cycle %d, attempt %d", cycle, attempt)
// Verify: NeedsRebuild is sticky.
c.CatchUpWithEscalation("r1", 100)
if r1.ReplicaState != NodeStateNeedsRebuild {
t.Fatal("NeedsRebuild should be sticky — catch-up should not reset it")
}
return
}
}
}
// Reaching here means escalation never fired despite far exceeding the budget. That's wrong.
t.Fatalf("expected NeedsRebuild escalation, but state is %s with %d attempts",
r1.ReplicaState, r1.CatchupAttempts)
}

445
sw-block/prototype/distsim/phase02_candidate_test.go

@@ -0,0 +1,445 @@
package distsim
import (
"testing"
)
// ============================================================
// Phase 02: Coordinator candidate-selection tests
// Verifies promotion ranking under mixed replica states.
// ============================================================
func TestP02_CandidateSelection_AllEqual_AlphabeticalTieBreak(t *testing.T) {
c := NewCluster(CommitSyncQuorum, "p", "r1", "r2", "r3")
c.CommitWrite(1)
c.TickN(5)
// All replicas InSync with same FlushedLSN → alphabetical tie-break.
best := c.BestPromotionCandidate()
if best != "r1" {
t.Fatalf("all equal: expected r1 (alphabetical), got %q", best)
}
candidates := c.PromotionCandidates()
if len(candidates) != 3 {
t.Fatalf("expected 3 candidates, got %d", len(candidates))
}
for i, exp := range []string{"r1", "r2", "r3"} {
if candidates[i].ID != exp {
t.Fatalf("candidate[%d]: got %q, want %q", i, candidates[i].ID, exp)
}
}
}
func TestP02_CandidateSelection_HigherLSN_Wins(t *testing.T) {
c := NewCluster(CommitSyncQuorum, "p", "r1", "r2", "r3")
// Directly set FlushedLSN to simulate different progress.
// All InSync — higher LSN wins.
for _, id := range []string{"r1", "r2", "r3"} {
c.Nodes[id].ReplicaState = NodeStateInSync
}
c.Nodes["r1"].Storage.FlushedLSN = 10
c.Nodes["r2"].Storage.FlushedLSN = 20
c.Nodes["r3"].Storage.FlushedLSN = 15
best := c.BestPromotionCandidate()
if best != "r2" {
t.Fatalf("higher LSN: expected r2, got %q", best)
}
candidates := c.PromotionCandidates()
if candidates[0].ID != "r2" || candidates[1].ID != "r3" || candidates[2].ID != "r1" {
t.Fatalf("order: got [%s, %s, %s], want [r2, r3, r1]",
candidates[0].ID, candidates[1].ID, candidates[2].ID)
}
}
func TestP02_CandidateSelection_StoppedNode_Excluded(t *testing.T) {
c := NewCluster(CommitSyncQuorum, "p", "r1", "r2")
c.Nodes["r1"].Storage.FlushedLSN = 100
c.Nodes["r2"].Storage.FlushedLSN = 50
c.StopNode("r1") // highest LSN but stopped
best := c.BestPromotionCandidate()
if best != "r2" {
t.Fatalf("stopped excluded: expected r2, got %q", best)
}
// r1 should be last in ranking (not running).
candidates := c.PromotionCandidates()
if candidates[0].ID != "r2" {
t.Fatalf("first candidate should be r2, got %s", candidates[0].ID)
}
if candidates[1].Running {
t.Fatal("r1 should be marked not running")
}
}
func TestP02_CandidateSelection_InSync_Beats_CatchingUp(t *testing.T) {
c := NewCluster(CommitSyncQuorum, "p", "r1", "r2", "r3")
// r1: CatchingUp with highest LSN.
c.Nodes["r1"].ReplicaState = NodeStateCatchingUp
c.Nodes["r1"].Storage.FlushedLSN = 100
// r2: InSync with lower LSN.
c.Nodes["r2"].ReplicaState = NodeStateInSync
c.Nodes["r2"].Storage.FlushedLSN = 50
// r3: InSync with even lower LSN.
c.Nodes["r3"].ReplicaState = NodeStateInSync
c.Nodes["r3"].Storage.FlushedLSN = 40
// InSync with lower LSN beats CatchingUp with higher LSN.
best := c.BestPromotionCandidate()
if best != "r2" {
t.Fatalf("InSync beats CatchingUp: expected r2, got %q", best)
}
candidates := c.PromotionCandidates()
// r2 (InSync, 50), r3 (InSync, 40), r1 (CatchingUp, 100)
if candidates[0].ID != "r2" || candidates[1].ID != "r3" || candidates[2].ID != "r1" {
t.Fatalf("order: got [%s, %s, %s]", candidates[0].ID, candidates[1].ID, candidates[2].ID)
}
}
func TestP02_CandidateSelection_AllCatchingUp_HighestLSN_Wins(t *testing.T) {
c := NewCluster(CommitSyncQuorum, "p", "r1", "r2", "r3")
for _, id := range []string{"r1", "r2", "r3"} {
c.Nodes[id].ReplicaState = NodeStateCatchingUp
}
c.Nodes["r1"].Storage.FlushedLSN = 30
c.Nodes["r2"].Storage.FlushedLSN = 80
c.Nodes["r3"].Storage.FlushedLSN = 50
best := c.BestPromotionCandidate()
if best != "r2" {
t.Fatalf("all CatchingUp: expected r2 (highest LSN), got %q", best)
}
}
func TestP02_CandidateSelection_NeedsRebuild_Skipped(t *testing.T) {
c := NewCluster(CommitSyncQuorum, "p", "r1", "r2", "r3")
// r1: NeedsRebuild with highest LSN.
c.Nodes["r1"].ReplicaState = NodeStateNeedsRebuild
c.Nodes["r1"].Storage.FlushedLSN = 100
// r2: InSync with moderate LSN.
c.Nodes["r2"].ReplicaState = NodeStateInSync
c.Nodes["r2"].Storage.FlushedLSN = 50
// r3: CatchingUp with low LSN.
c.Nodes["r3"].ReplicaState = NodeStateCatchingUp
c.Nodes["r3"].Storage.FlushedLSN = 20
best := c.BestPromotionCandidate()
if best != "r2" {
t.Fatalf("NeedsRebuild skipped: expected r2, got %q", best)
}
candidates := c.PromotionCandidates()
// r2 (InSync, 50), r3 (CatchingUp, 20), r1 (NeedsRebuild, 100)
if candidates[0].ID != "r2" {
t.Fatalf("first should be r2, got %s", candidates[0].ID)
}
if candidates[2].ID != "r1" {
t.Fatalf("last should be r1 (NeedsRebuild), got %s", candidates[2].ID)
}
}
func TestP02_CandidateSelection_NoRunning_ReturnsEmpty(t *testing.T) {
c := NewCluster(CommitSyncQuorum, "p", "r1", "r2")
c.StopNode("r1")
c.StopNode("r2")
best := c.BestPromotionCandidate()
if best != "" {
t.Fatalf("no running: expected empty, got %q", best)
}
}
func TestP02_CandidateSelection_AfterPartition_RankingUpdates(t *testing.T) {
c := NewCluster(CommitSyncQuorum, "p", "r1", "r2", "r3")
c.CommitWrite(1)
c.CommitWrite(2)
c.TickN(5)
// All InSync at FlushedLSN=2. Best = r1 (alphabetical).
if best := c.BestPromotionCandidate(); best != "r1" {
t.Fatalf("before partition: expected r1, got %q", best)
}
// Partition r1. Write more via p+r2+r3.
c.Disconnect("p", "r1")
c.Disconnect("r1", "p")
// With 4 voting members (p, r1, r2, r3), quorum = 4/2+1 = 3; p+r2+r3 = 3, so commits still succeed (marginally).
c.CommitWrite(3)
c.CommitWrite(4)
c.CommitWrite(5)
c.TickN(5)
// r1 lagging, r2/r3 ahead.
c.Nodes["r1"].ReplicaState = NodeStateCatchingUp
// Now r2 or r3 should win (both InSync with higher LSN).
best := c.BestPromotionCandidate()
if best == "r1" {
t.Fatal("after partition: r1 should not be best (CatchingUp)")
}
if best != "r2" {
t.Fatalf("after partition: expected r2 (InSync, alphabetical tie-break), got %q", best)
}
t.Logf("after partition: best=%s", best)
}
func TestP02_CandidateSelection_MixedStates_FullRanking(t *testing.T) {
c := NewCluster(CommitSyncQuorum, "p", "r1", "r2", "r3", "r4", "r5")
// Set up a diverse state mix:
// r1: InSync, LSN=50
// r2: InSync, LSN=60 (highest InSync)
// r3: CatchingUp, LSN=80 (highest overall but CatchingUp)
// r4: NeedsRebuild, LSN=90 (highest but NeedsRebuild)
// r5: stopped, LSN=100 (highest but not running)
c.Nodes["r1"].ReplicaState = NodeStateInSync
c.Nodes["r1"].Storage.FlushedLSN = 50
c.Nodes["r2"].ReplicaState = NodeStateInSync
c.Nodes["r2"].Storage.FlushedLSN = 60
c.Nodes["r3"].ReplicaState = NodeStateCatchingUp
c.Nodes["r3"].Storage.FlushedLSN = 80
c.Nodes["r4"].ReplicaState = NodeStateNeedsRebuild
c.Nodes["r4"].Storage.FlushedLSN = 90
c.Nodes["r5"].Storage.FlushedLSN = 100
c.StopNode("r5")
best := c.BestPromotionCandidate()
if best != "r2" {
t.Fatalf("mixed states: expected r2 (InSync+highest among InSync), got %q", best)
}
candidates := c.PromotionCandidates()
// Expected order: r2(InSync,60), r1(InSync,50), r3(CatchingUp,80),
// r4(NeedsRebuild,90), r5(stopped,100)
expected := []string{"r2", "r1", "r3", "r4", "r5"}
for i, exp := range expected {
if candidates[i].ID != exp {
t.Fatalf("candidate[%d]: got %q, want %q", i, candidates[i].ID, exp)
}
}
t.Logf("full ranking: %s(%s/%d) > %s(%s/%d) > %s(%s/%d) > %s(%s/%d) > %s(%s/%d)",
candidates[0].ID, candidates[0].State, candidates[0].FlushedLSN,
candidates[1].ID, candidates[1].State, candidates[1].FlushedLSN,
candidates[2].ID, candidates[2].State, candidates[2].FlushedLSN,
candidates[3].ID, candidates[3].State, candidates[3].FlushedLSN,
candidates[4].ID, candidates[4].State, candidates[4].FlushedLSN)
}
func TestP02_CandidateSelection_AllNeedsRebuild_SafeDefaultEmpty(t *testing.T) {
c := NewCluster(CommitSyncQuorum, "p", "r1", "r2")
c.Nodes["r1"].ReplicaState = NodeStateNeedsRebuild
c.Nodes["r1"].Storage.FlushedLSN = 50
c.Nodes["r2"].ReplicaState = NodeStateNeedsRebuild
c.Nodes["r2"].Storage.FlushedLSN = 80
// Safe default: refuses NeedsRebuild candidates.
safe := c.BestPromotionCandidate()
if safe != "" {
t.Fatalf("safe default should return empty for all-NeedsRebuild, got %q", safe)
}
}
func TestP02_CandidateSelection_DesperationPromotion_ExplicitAPI(t *testing.T) {
c := NewCluster(CommitSyncQuorum, "p", "r1", "r2", "r3")
for _, id := range []string{"r1", "r2", "r3"} {
c.Nodes[id].ReplicaState = NodeStateNeedsRebuild
}
c.Nodes["r1"].Storage.FlushedLSN = 10
c.Nodes["r2"].Storage.FlushedLSN = 30
c.Nodes["r3"].Storage.FlushedLSN = 20
safe := c.BestPromotionCandidate()
if safe != "" {
t.Fatalf("safe default should return empty, got %q", safe)
}
desperate := c.BestPromotionCandidateDesperate()
if desperate != "r2" {
t.Fatalf("desperation: expected r2 (highest LSN), got %q", desperate)
}
}
// === Candidate eligibility tests ===
func TestP02_CandidateEligibility_Running(t *testing.T) {
c := NewCluster(CommitSyncQuorum, "p", "r1", "r2")
c.CommitWrite(1)
c.TickN(5)
e := c.EvaluateCandidateEligibility("r1")
if !e.Eligible {
t.Fatalf("running InSync replica should be eligible, reasons: %v", e.Reasons)
}
c.StopNode("r1")
e = c.EvaluateCandidateEligibility("r1")
if e.Eligible {
t.Fatal("stopped replica should not be eligible")
}
if e.Reasons[0] != "not_running" {
t.Fatalf("expected not_running reason, got %v", e.Reasons)
}
}
func TestP02_CandidateEligibility_EpochAlignment(t *testing.T) {
c := NewCluster(CommitSyncQuorum, "p", "r1", "r2")
c.CommitWrite(1)
c.TickN(5)
// Manually desync r1's epoch.
c.Nodes["r1"].Epoch = c.Coordinator.Epoch - 1
e := c.EvaluateCandidateEligibility("r1")
if e.Eligible {
t.Fatal("epoch-misaligned replica should not be eligible")
}
found := false
for _, r := range e.Reasons {
if r == "epoch_misaligned" {
found = true
}
}
if !found {
t.Fatalf("expected epoch_misaligned reason, got %v", e.Reasons)
}
}
func TestP02_CandidateEligibility_StateIneligible(t *testing.T) {
c := NewCluster(CommitSyncQuorum, "p", "r1", "r2")
c.CommitWrite(1)
c.TickN(5)
for _, state := range []ReplicaNodeState{NodeStateNeedsRebuild, NodeStateRebuilding} {
c.Nodes["r1"].ReplicaState = state
e := c.EvaluateCandidateEligibility("r1")
if e.Eligible {
t.Fatalf("%s should not be eligible", state)
}
}
// CatchingUp IS eligible (data may be mostly current).
c.Nodes["r1"].ReplicaState = NodeStateCatchingUp
e := c.EvaluateCandidateEligibility("r1")
if !e.Eligible {
t.Fatalf("CatchingUp should be eligible, reasons: %v", e.Reasons)
}
}
func TestP02_CandidateEligibility_InsufficientCommittedPrefix(t *testing.T) {
c := NewCluster(CommitSyncQuorum, "p", "r1", "r2")
c.CommitWrite(1)
c.TickN(5)
// r1 has FlushedLSN=1, CommittedLSN=1 → eligible.
e := c.EvaluateCandidateEligibility("r1")
if !e.Eligible {
t.Fatalf("r1 at committed prefix should be eligible, reasons: %v", e.Reasons)
}
// Manually set r1 behind committed prefix.
c.Nodes["r1"].Storage.FlushedLSN = 0
e = c.EvaluateCandidateEligibility("r1")
if e.Eligible {
t.Fatal("FlushedLSN=0 with CommittedLSN=1 should not be eligible")
}
found := false
for _, r := range e.Reasons {
if r == "insufficient_committed_prefix" {
found = true
}
}
if !found {
t.Fatalf("expected insufficient_committed_prefix reason, got %v", e.Reasons)
}
}
func TestP02_CandidateEligibility_InSyncButLagging_Rejected(t *testing.T) {
// Scenario from a review finding: r1 is InSync with correct epoch but FlushedLSN << CommittedLSN.
// r2 is CatchingUp but has the committed prefix. r2 should be selected over r1.
c := NewCluster(CommitSyncQuorum, "p", "r1", "r2", "r3")
// Set committed prefix high.
c.Coordinator.CommittedLSN = 100
// r1: InSync, correct epoch, but FlushedLSN=1. Ineligible.
c.Nodes["r1"].ReplicaState = NodeStateInSync
c.Nodes["r1"].Storage.FlushedLSN = 1
// r2: CatchingUp, correct epoch, FlushedLSN=100. Eligible.
c.Nodes["r2"].ReplicaState = NodeStateCatchingUp
c.Nodes["r2"].Storage.FlushedLSN = 100
// r3: InSync, correct epoch, FlushedLSN=100. Eligible.
c.Nodes["r3"].ReplicaState = NodeStateInSync
c.Nodes["r3"].Storage.FlushedLSN = 100
// r1 is ineligible despite being InSync.
e1 := c.EvaluateCandidateEligibility("r1")
if e1.Eligible {
t.Fatal("r1 (InSync, FlushedLSN=1, CommittedLSN=100) should be ineligible")
}
// r2 and r3 are eligible.
e2 := c.EvaluateCandidateEligibility("r2")
if !e2.Eligible {
t.Fatalf("r2 should be eligible, reasons: %v", e2.Reasons)
}
// BestPromotionCandidate should pick r3 (InSync with prefix) over r2 (CatchingUp).
best := c.BestPromotionCandidate()
if best != "r3" {
t.Fatalf("expected r3 (InSync+prefix), got %q", best)
}
// r1 must NOT be in the eligible list at all.
eligible := c.EligiblePromotionCandidates()
for _, pc := range eligible {
if pc.ID == "r1" {
t.Fatal("r1 should not appear in eligible candidates")
}
}
t.Logf("committed-prefix gate: r1(InSync/flushed=1) rejected, r3(InSync/flushed=100) selected")
}
func TestP02_CandidateEligibility_EligiblePromotionCandidates(t *testing.T) {
c := NewCluster(CommitSyncQuorum, "p", "r1", "r2", "r3", "r4")
c.CommitWrite(1)
c.TickN(5)
// r1: InSync, eligible
// r2: NeedsRebuild, ineligible
c.Nodes["r2"].ReplicaState = NodeStateNeedsRebuild
// r3: stopped, ineligible
c.StopNode("r3")
// r4: epoch misaligned, ineligible
c.Nodes["r4"].Epoch = 0
eligible := c.EligiblePromotionCandidates()
if len(eligible) != 1 {
t.Fatalf("expected 1 eligible candidate, got %d", len(eligible))
}
if eligible[0].ID != "r1" {
t.Fatalf("expected r1 as only eligible, got %s", eligible[0].ID)
}
// BestPromotionCandidate uses eligibility.
best := c.BestPromotionCandidate()
if best != "r1" {
t.Fatalf("BestPromotionCandidate should return r1, got %q", best)
}
}

371
sw-block/prototype/distsim/phase02_network_test.go

@@ -0,0 +1,371 @@
package distsim
import (
"testing"
)
// ============================================================
// Phase 02: Delayed/drop network + multi-node reservation expiry
// ============================================================
// --- Item 4: Stale delayed messages after heal/promote ---
// Scenario: messages from old primary are in-flight when partition heals
// and a new primary is promoted. The stale messages arrive AFTER the
// promotion. They must be rejected by epoch fencing.
func TestP02_DelayedStaleMessages_AfterPromote(t *testing.T) {
c := NewCluster(CommitSyncQuorum, "A", "B", "C")
// Phase 1: A writes, ships to B and C.
c.CommitWrite(1)
c.CommitWrite(2)
c.TickN(5)
// Phase 2: A writes more, but we manually enqueue delayed delivery
// to simulate in-flight messages when partition happens.
c.CommitWrite(3) // LSN 3 ships normally
// Don't tick yet — messages are in the queue.
// Phase 3: Partition A from everyone, promote B.
c.Disconnect("A", "B")
c.Disconnect("B", "A")
c.Disconnect("A", "C")
c.Disconnect("C", "A")
c.StopNode("A")
c.Promote("B")
// Phase 4: Manually inject stale messages as if they were delayed in the network.
// These represent A's write(3) + barrier(3) that were in-flight when A crashed.
staleEpoch := c.Coordinator.Epoch - 1
c.InjectMessage(Message{
Kind: MsgWrite, From: "A", To: "B", Epoch: staleEpoch,
Write: Write{LSN: 3, Block: 3, Value: 3},
}, c.Now+1)
c.InjectMessage(Message{
Kind: MsgBarrier, From: "A", To: "B", Epoch: staleEpoch,
TargetLSN: 3,
}, c.Now+2)
c.InjectMessage(Message{
Kind: MsgWrite, From: "A", To: "C", Epoch: staleEpoch,
Write: Write{LSN: 3, Block: 3, Value: 3},
}, c.Now+1)
// Phase 5: Tick to deliver stale messages.
committedBefore := c.Coordinator.CommittedLSN
c.TickN(5)
// All stale messages must be rejected — either by epoch fencing or node-down.
epochRejects := c.RejectedByReason(RejectEpochMismatch)
nodeDownRejects := c.RejectedByReason(RejectNodeDown)
totalRejects := epochRejects + nodeDownRejects
if totalRejects == 0 {
t.Fatal("stale delayed messages were not rejected")
}
// Committed prefix must not change from stale messages.
if c.Coordinator.CommittedLSN != committedBefore {
t.Fatalf("stale delayed messages changed committed prefix: before=%d after=%d",
committedBefore, c.Coordinator.CommittedLSN)
}
// Data correct on new primary.
if err := c.AssertCommittedRecoverable("B"); err != nil {
t.Fatalf("data incorrect after stale delayed messages: %v", err)
}
t.Logf("stale delayed messages: %d rejected by epoch_mismatch", epochRejects)
}
// Scenario: old barrier ACK arrives after promotion with long delay.
// This is different from S18 — the delay is network-level, not restart-level.
func TestP02_DelayedBarrierAck_LongNetworkDelay(t *testing.T) {
c := NewCluster(CommitSyncAll, "p", "r1")
c.CommitWrite(1)
c.TickN(5)
// Write 2 — barrier sent to r1.
c.CommitWrite(2)
c.TickN(2) // barrier in flight
// Promote r1 (simulate primary failure + promotion).
c.StopNode("p")
c.Promote("r1")
committedBefore := c.Coordinator.CommittedLSN
// Long-delayed barrier ack from r1 → dead primary p.
c.InjectMessage(Message{
Kind: MsgBarrierAck, From: "r1", To: "p",
Epoch: c.Coordinator.Epoch - 1, TargetLSN: 2,
}, c.Now+10)
c.TickN(15)
// Must be rejected — p is dead and epoch is stale.
nodeDownRejects := c.RejectedByReason(RejectNodeDown)
epochRejects := c.RejectedByReason(RejectEpochMismatch)
if nodeDownRejects == 0 && epochRejects == 0 {
t.Fatal("delayed barrier ack should be rejected (node down or epoch mismatch)")
}
// Stale ack must not advance committed prefix.
if c.Coordinator.CommittedLSN != committedBefore {
t.Fatalf("stale ack changed committed prefix: before=%d after=%d",
committedBefore, c.Coordinator.CommittedLSN)
}
}
// Scenario: write ships to replica, network drops the write but delivers
// the barrier. Barrier should timeout or detect missing data.
func TestP02_DroppedWrite_BarrierDelivered_Stalls(t *testing.T) {
c := NewCluster(CommitSyncAll, "p", "r1")
c.CommitWrite(1)
c.TickN(5)
// Write 2 — but drop the write message to r1 (link down for data only).
// We simulate by writing but not ticking, then dropping queued writes.
c.CommitWrite(2) // enqueues write(2) + barrier(2) to r1
// Remove only the write message from the queue (simulate selective drop).
var kept []inFlightMessage
for _, item := range c.Queue {
if item.msg.Kind == MsgWrite && item.msg.To == "r1" && item.msg.Write.LSN == 2 {
continue // drop this write
}
kept = append(kept, item)
}
c.Queue = kept
// Tick — barrier arrives at r1 but r1 doesn't have LSN 2.
// Barrier should re-queue (waiting for data).
c.TickN(10)
// Assert 1: sync_all blocked — CommittedLSN stuck at 1.
if c.Coordinator.CommittedLSN != 1 {
t.Fatalf("sync_all should be blocked at LSN 1, got committed=%d", c.Coordinator.CommittedLSN)
}
// Assert 2: LSN 2 is pending but NOT committed.
p2 := c.Pending[2]
if p2 == nil {
t.Fatal("LSN 2 should be pending")
}
if p2.Committed {
t.Fatal("LSN 2 committed under sync_all but r1 never received the write — safety violation")
}
// Assert 3: barrier still re-queuing — stall proven positively.
barrierRequeued := false
for _, item := range c.Queue {
if item.msg.Kind == MsgBarrier && item.msg.To == "r1" && item.msg.TargetLSN == 2 {
barrierRequeued = true
break
}
}
if !barrierRequeued {
t.Fatal("barrier for LSN 2 should still be re-queuing — stall not proven")
}
t.Logf("dropped write stall proven: committed=%d, pending[2].committed=%v, barrier re-queuing=%v",
c.Coordinator.CommittedLSN, p2.Committed, barrierRequeued)
}
// --- Item 5: Multi-node reservation expiry / rebuild timeout ---
// Scenario: RF=3 cluster. Two replicas need catch-up. One's reservation
// expires during recovery. Must handle correctly: one rebuilds, one catches up.
func TestP02_MultiNode_ReservationExpiry_MixedOutcome(t *testing.T) {
// 5 nodes: p+r3+r4 provide quorum (3 of 5) while r1+r2 are disconnected.
c := NewCluster(CommitSyncQuorum, "p", "r1", "r2", "r3", "r4")
// Write initial data.
for i := uint64(1); i <= 10; i++ {
c.CommitWrite(i % 4)
}
c.TickN(5)
// Take snapshot for rebuild.
c.Primary().Storage.TakeSnapshot("snap-1", c.Coordinator.CommittedLSN)
// r1+r2 disconnect. r3+r4 stay for quorum (p+r3+r4 = 3 of 5).
c.Disconnect("p", "r1")
c.Disconnect("r1", "p")
c.Disconnect("p", "r2")
c.Disconnect("r2", "p")
// Write more during disconnect — committed via p+r3+r4 quorum.
for i := uint64(11); i <= 30; i++ {
c.CommitWrite(i % 4)
}
c.TickN(5)
// Reconnect both.
c.Connect("p", "r1")
c.Connect("r1", "p")
c.Connect("p", "r2")
c.Connect("r2", "p")
// r1: reserved catch-up with tight expiry — MUST expire.
// 20 entries to replay, but only 2 ticks of budget.
r1 := c.Nodes["r1"]
shortExpiry := c.Now + 2
err := c.RecoverReplicaFromPrimaryReserved("r1", r1.Storage.FlushedLSN, c.Coordinator.CommittedLSN, shortExpiry)
if err == nil {
t.Fatal("r1 reservation must expire — 20 entries with 2-tick budget")
}
r1.ReplicaState = NodeStateNeedsRebuild
t.Logf("r1 reservation expired: %v", err)
// r2: full catch-up (no reservation pressure).
r2 := c.Nodes["r2"]
if err := c.RecoverReplicaFromPrimary("r2", r2.Storage.FlushedLSN, c.Coordinator.CommittedLSN); err != nil {
t.Fatalf("r2 full catch-up failed: %v", err)
}
r2.ReplicaState = NodeStateInSync
// Deterministic mixed outcome: r1=NeedsRebuild, r2=InSync.
if r1.ReplicaState != NodeStateNeedsRebuild {
t.Fatalf("r1 should be NeedsRebuild, got %s", r1.ReplicaState)
}
if r2.ReplicaState != NodeStateInSync {
t.Fatalf("r2 should be InSync, got %s", r2.ReplicaState)
}
// r2 data correct.
if err := c.AssertCommittedRecoverable("r2"); err != nil {
t.Fatalf("r2 data incorrect: %v", err)
}
// r1 rebuild from snapshot.
c.RebuildReplicaFromSnapshot("r1", "snap-1", c.Coordinator.CommittedLSN)
r1.ReplicaState = NodeStateInSync
if err := c.AssertCommittedRecoverable("r1"); err != nil {
t.Fatalf("r1 data incorrect after rebuild: %v", err)
}
t.Logf("mixed outcome proven: r1=NeedsRebuild→rebuilt, r2=InSync")
}
// Scenario: all replicas need rebuild but only one snapshot exists.
// First replica rebuilds from snapshot, second must wait or use first
// replica as rebuild source.
func TestP02_MultiNode_AllNeedRebuild(t *testing.T) {
// Use 5 nodes so quorum (3 of 5) can be met with p+r3+r4 while r1+r2 are down.
c := NewCluster(CommitSyncQuorum, "p", "r1", "r2", "r3", "r4")
c.MaxCatchupAttempts = 2
for i := uint64(1); i <= 5; i++ {
c.CommitWrite(i)
}
c.TickN(5)
c.Primary().Storage.TakeSnapshot("snap-all", c.Coordinator.CommittedLSN)
// r1 and r2 disconnect. r3+r4 stay connected so quorum (p+r3+r4) can commit.
c.Disconnect("p", "r1")
c.Disconnect("r1", "p")
c.Disconnect("p", "r2")
c.Disconnect("r2", "p")
for i := uint64(6); i <= 100; i++ {
c.CommitWrite(i % 8)
}
c.TickN(5)
// Try catch-up for r1 and r2 — both will escalate.
// Pattern: write while target disconnected, then try partial catch-up.
for _, id := range []string{"r1", "r2"} {
n := c.Nodes[id]
n.ReplicaState = NodeStateCatchingUp
for attempt := 0; attempt < 5; attempt++ {
// Write MORE while target is still disconnected (r3+r4 provide quorum).
for w := 0; w < 20; w++ {
c.CommitWrite(uint64(101+attempt*20+w) % 8)
}
c.TickN(3) // 3 ticks: deliver writes, barriers, then acks
// Now try catch-up (partial, batch=1). Target stays disconnected —
// RecoverReplicaFromPrimaryPartial reads directly from primary WAL.
c.CatchUpWithEscalation(id, 1)
if n.ReplicaState == NodeStateNeedsRebuild {
break
}
}
}
// Reconnect all for rebuild.
c.Connect("p", "r1")
c.Connect("r1", "p")
c.Connect("p", "r2")
c.Connect("r2", "p")
// Both should be NeedsRebuild.
if c.Nodes["r1"].ReplicaState != NodeStateNeedsRebuild {
t.Fatalf("r1: expected NeedsRebuild, got %s", c.Nodes["r1"].ReplicaState)
}
if c.Nodes["r2"].ReplicaState != NodeStateNeedsRebuild {
t.Fatalf("r2: expected NeedsRebuild, got %s", c.Nodes["r2"].ReplicaState)
}
// Rebuild both from snapshot.
c.RebuildReplicaFromSnapshot("r1", "snap-all", c.Coordinator.CommittedLSN)
c.RebuildReplicaFromSnapshot("r2", "snap-all", c.Coordinator.CommittedLSN)
c.Nodes["r1"].ReplicaState = NodeStateInSync
c.Nodes["r2"].ReplicaState = NodeStateInSync
// Both correct.
if err := c.AssertCommittedRecoverable("r1"); err != nil {
t.Fatal(err)
}
if err := c.AssertCommittedRecoverable("r2"); err != nil {
t.Fatal(err)
}
t.Logf("multi-node rebuild complete: both replicas recovered from snapshot")
}
// Scenario: rebuild timeout — rebuild takes too long, coordinator
// should be able to abort and retry or fail explicitly.
func TestP02_RebuildTimeout_PartialRebuildAborts(t *testing.T) {
c := NewCluster(CommitSyncQuorum, "p", "r1", "r2")
for i := uint64(1); i <= 20; i++ {
c.CommitWrite(i % 4)
}
c.TickN(5)
c.Primary().Storage.TakeSnapshot("snap-timeout", c.Coordinator.CommittedLSN)
// Write much more.
for i := uint64(21); i <= 100; i++ {
c.CommitWrite(i % 4)
}
c.TickN(5)
// r1 needs rebuild — use partial rebuild with small max.
lastRecovered, err := c.RebuildReplicaFromSnapshotPartial("r1", "snap-timeout", c.Coordinator.CommittedLSN, 5)
if err != nil {
t.Fatalf("partial rebuild: %v", err)
}
// Partial rebuild: not complete.
if lastRecovered >= c.Coordinator.CommittedLSN {
t.Fatal("expected partial rebuild, not complete")
}
// Mark r1 as Rebuilding: partial progress must not be treated as InSync.
c.Nodes["r1"].ReplicaState = NodeStateRebuilding
if c.Nodes["r1"].ReplicaState == NodeStateInSync {
t.Fatal("partial rebuild should not grant InSync")
}
// Full rebuild to complete.
c.RebuildReplicaFromSnapshot("r1", "snap-timeout", c.Coordinator.CommittedLSN)
c.Nodes["r1"].ReplicaState = NodeStateInSync
if err := c.AssertCommittedRecoverable("r1"); err != nil {
t.Fatal(err)
}
}

359
sw-block/prototype/distsim/phase02_test.go

@ -0,0 +1,359 @@
package distsim
import (
"testing"
)
// ============================================================
// Phase 02: Protocol-state assertions + version comparison
// ============================================================
// --- P0: Protocol-level rejection assertions ---
func TestP02_EpochFencing_AllStaleTrafficRejected(t *testing.T) {
c := NewCluster(CommitSyncQuorum, "p", "r1", "r2")
c.CommitWrite(1)
c.TickN(5)
// Partition + promote.
c.Disconnect("p", "r1")
c.Disconnect("r1", "p")
c.Disconnect("p", "r2")
c.Disconnect("r2", "p")
c.Promote("r1")
staleEpoch := c.Coordinator.Epoch - 1
c.Nodes["p"].Epoch = staleEpoch
// Stale writes through protocol.
delivered := c.StaleWrite("p", staleEpoch, 99)
// Protocol-level assertion: zero accepted, all rejected by epoch.
if delivered > 0 {
t.Fatalf("stale traffic accepted: %d messages passed fencing", delivered)
}
epochRejects := c.RejectedByReason(RejectEpochMismatch)
if epochRejects == 0 {
t.Fatal("no epoch rejections recorded — fencing not tracked")
}
// Delivery log must show explicit rejections (protocol behavior, not just final state).
totalRejected := 0
for _, d := range c.Deliveries {
if !d.Accepted {
totalRejected++
}
}
if totalRejected == 0 {
t.Fatal("delivery log has no rejections — protocol behavior not recorded")
}
t.Logf("protocol-level: %d rejected, %d epoch_mismatch", totalRejected, epochRejects)
}
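The fencing rule this test asserts can be sketched independently of the simulator. This is a minimal model under assumed semantics; `Msg` and `acceptMessage` are hypothetical names, not the distsim API:

```go
package main

import "fmt"

// Msg is a hypothetical message stamped with the sender's epoch.
type Msg struct {
	Epoch uint64
}

// acceptMessage models the rule the test checks: a receiver at
// currentEpoch rejects any message carrying an older epoch, and the
// rejection is counted explicitly rather than silently dropped.
func acceptMessage(currentEpoch uint64, m Msg, rejects *int) bool {
	if m.Epoch < currentEpoch {
		*rejects++ // explicit epoch_mismatch record
		return false
	}
	return true
}

func main() {
	rejects := 0
	// A stale writer at epoch 1 against a cluster at epoch 2.
	fmt.Println(acceptMessage(2, Msg{Epoch: 1}, &rejects), rejects) // false 1
}
```

The explicit reject counter mirrors what `RejectedByReason(RejectEpochMismatch)` surfaces: the test wants evidence of protocol behavior, not just a quiet drop.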
func TestP02_AcceptedDeliveries_Tracked(t *testing.T) {
c := NewCluster(CommitSyncQuorum, "p", "r1", "r2")
c.CommitWrite(1)
c.TickN(5)
// Should have accepted write + barrier deliveries.
accepted := c.AcceptedCount()
if accepted == 0 {
t.Fatal("no accepted deliveries recorded")
}
acceptedWrites := c.AcceptedByKind(MsgWrite)
if acceptedWrites == 0 {
t.Fatal("no accepted write deliveries")
}
t.Logf("after 1 write: %d accepted total, %d writes", accepted, acceptedWrites)
}
// --- P1: S20 protocol-level closure ---
func TestP02_S20_StaleTraffic_CommittedPrefixUnchanged(t *testing.T) {
c := NewCluster(CommitSyncQuorum, "A", "B", "C")
c.CommitWrite(1)
c.CommitWrite(2)
c.TickN(5)
// Partition A, promote B.
c.Disconnect("A", "B")
c.Disconnect("B", "A")
c.Disconnect("A", "C")
c.Disconnect("C", "A")
c.Promote("B")
c.Nodes["A"].Epoch = c.Coordinator.Epoch - 1
// B writes (new epoch).
c.CommitWrite(3)
c.TickN(5)
committedBefore := c.Coordinator.CommittedLSN
// A stale writes through protocol.
c.StaleWrite("A", c.Nodes["A"].Epoch, 99)
// Protocol assertion: committed prefix unchanged by stale traffic.
committedAfter := c.Coordinator.CommittedLSN
if committedAfter != committedBefore {
t.Fatalf("stale traffic changed committed prefix: before=%d after=%d", committedBefore, committedAfter)
}
// All stale messages rejected by epoch.
if c.RejectedByReason(RejectEpochMismatch) == 0 {
t.Fatal("no epoch rejections for stale traffic")
}
}
// --- P1: S6 protocol-level closure ---
func TestP02_S6_NonConvergent_ExplicitStateTransition(t *testing.T) {
c := NewCluster(CommitSyncQuorum, "p", "r1", "r2")
c.MaxCatchupAttempts = 3
for i := uint64(1); i <= 5; i++ {
c.CommitWrite(i)
}
c.TickN(5)
c.Disconnect("p", "r1")
c.Disconnect("r1", "p")
for i := uint64(6); i <= 100; i++ {
c.CommitWrite(i % 8)
}
c.TickN(5)
c.Connect("p", "r1")
c.Connect("r1", "p")
r1 := c.Nodes["r1"]
r1.ReplicaState = NodeStateCatchingUp
// Protocol assertion: state transitions are explicit.
// Track the state at each step.
var stateTrace []ReplicaNodeState
for attempt := 0; attempt < 10; attempt++ {
c.Disconnect("p", "r1")
c.Disconnect("r1", "p")
for w := 0; w < 20; w++ {
c.CommitWrite(uint64(101+attempt*20+w) % 8)
}
c.TickN(2)
c.Connect("p", "r1")
c.Connect("r1", "p")
c.CatchUpWithEscalation("r1", 1)
stateTrace = append(stateTrace, r1.ReplicaState)
if r1.ReplicaState == NodeStateNeedsRebuild {
break
}
}
// Must have explicit state transitions: CatchingUp → ... → NeedsRebuild.
if r1.ReplicaState != NodeStateNeedsRebuild {
t.Fatalf("expected NeedsRebuild, got %s", r1.ReplicaState)
}
// Trace must show CatchingUp before NeedsRebuild.
hasCatchingUp := false
for _, s := range stateTrace {
if s == NodeStateCatchingUp {
hasCatchingUp = true
}
}
if !hasCatchingUp {
t.Fatal("state trace should include CatchingUp before NeedsRebuild")
}
t.Logf("state trace: %v", stateTrace)
}
// --- P1: S18 protocol-level closure ---
func TestP02_S18_DelayedAck_ExplicitRejection(t *testing.T) {
c := NewCluster(CommitSyncQuorum, "p", "r1", "r2")
c.CommitWrite(1)
c.TickN(5)
// Write 2 without r1 ack.
c.Disconnect("r1", "p")
c.CommitWrite(2)
c.TickN(3)
// Restart primary with epoch bump.
c.StopNode("p")
c.Coordinator.Epoch++
for _, n := range c.Nodes {
if n.Running {
n.Epoch = c.Coordinator.Epoch
}
}
c.StartNode("p")
committedBefore := c.Coordinator.CommittedLSN
deliveriesBefore := len(c.Deliveries)
// Reconnect r1, deliver stale ack.
c.Connect("r1", "p")
c.Connect("p", "r1")
oldAck := Message{
Kind: MsgBarrierAck, From: "r1", To: "p",
Epoch: c.Coordinator.Epoch - 1, TargetLSN: 2,
}
c.deliver(oldAck)
c.refreshCommits()
// Protocol assertion 1: committed prefix unchanged.
if c.Coordinator.CommittedLSN > committedBefore {
t.Fatalf("stale ack advanced prefix: %d → %d", committedBefore, c.Coordinator.CommittedLSN)
}
// Protocol assertion 2: the delivery was explicitly recorded as rejected.
newDeliveries := c.Deliveries[deliveriesBefore:]
found := false
for _, d := range newDeliveries {
if !d.Accepted && d.Reason == RejectEpochMismatch && d.Msg.Kind == MsgBarrierAck {
found = true
}
}
if !found {
t.Fatal("stale ack not recorded as epoch_mismatch rejection in delivery log")
}
}
// --- P2: Version comparison ---
func TestP02_VersionComparison_BriefDisconnect(t *testing.T) {
// Same scenario under V1, V1.5, V2 — different expected outcomes.
for _, tc := range []struct {
version ProtocolVersion
expectCatchup bool
expectRebuild bool
}{
{ProtocolV1, false, false}, // V1: no catch-up, stays degraded
{ProtocolV15, true, false}, // V1.5: catch-up possible if address stable
{ProtocolV2, true, false}, // V2: catch-up allowed for this recoverable short-gap case
} {
t.Run(string(tc.version), func(t *testing.T) {
c := NewClusterWithProtocol(CommitSyncQuorum, tc.version, "p", "r1", "r2")
c.MaxCatchupAttempts = 5
c.CommitWrite(1)
c.CommitWrite(2)
c.TickN(5)
// Brief disconnect.
c.Disconnect("p", "r1")
c.Disconnect("r1", "p")
c.CommitWrite(3)
c.CommitWrite(4)
c.TickN(5)
c.Connect("p", "r1")
c.Connect("r1", "p")
canCatchup := c.Protocol.CanAttemptCatchup(true)
if canCatchup != tc.expectCatchup {
t.Fatalf("CanAttemptCatchup: got %v, want %v", canCatchup, tc.expectCatchup)
}
if canCatchup {
// Catch up r1.
r1 := c.Nodes["r1"]
r1.ReplicaState = NodeStateCatchingUp
converged := c.CatchUpWithEscalation("r1", 100)
if !converged {
t.Fatal("expected catch-up to converge for short gap")
}
if r1.ReplicaState != NodeStateInSync {
t.Fatalf("expected InSync after catch-up, got %s", r1.ReplicaState)
}
}
})
}
}
func TestP02_VersionComparison_BriefDisconnectActions(t *testing.T) {
for _, tc := range []struct {
version ProtocolVersion
addrStable bool
recoverable bool
expectAction string
}{
{ProtocolV1, true, true, "degrade_or_rebuild"},
{ProtocolV15, true, true, "catchup_if_history_survives"},
{ProtocolV15, false, true, "stall_or_control_plane_recovery"},
{ProtocolV2, true, true, "reserved_catchup"},
{ProtocolV2, false, false, "explicit_rebuild"},
} {
t.Run(string(tc.version)+"_stable="+boolStr(tc.addrStable)+"_recoverable="+boolStr(tc.recoverable), func(t *testing.T) {
policy := ProtocolPolicy{Version: tc.version}
action := policy.BriefDisconnectAction(tc.addrStable, tc.recoverable)
if action != tc.expectAction {
t.Fatalf("BriefDisconnectAction(%v,%v): got %q, want %q", tc.addrStable, tc.recoverable, action, tc.expectAction)
}
})
}
}
func TestP02_VersionComparison_TailChasing(t *testing.T) {
for _, tc := range []struct {
version ProtocolVersion
expectAction string
}{
{ProtocolV1, "degrade"},
{ProtocolV15, "stall_or_rebuild"},
{ProtocolV2, "abort_to_rebuild"},
} {
t.Run(string(tc.version), func(t *testing.T) {
policy := ProtocolPolicy{Version: tc.version}
action := policy.TailChasingAction(false) // non-convergent
if action != tc.expectAction {
t.Fatalf("TailChasingAction(false): got %q, want %q", action, tc.expectAction)
}
})
}
}
func TestP02_VersionComparison_RestartRejoin(t *testing.T) {
for _, tc := range []struct {
version ProtocolVersion
addrStable bool
expectAction string
}{
{ProtocolV1, true, "control_plane_only"},
{ProtocolV1, false, "control_plane_only"},
{ProtocolV15, true, "background_reconnect_or_control_plane"},
{ProtocolV15, false, "control_plane_only"},
{ProtocolV2, true, "direct_reconnect_or_control_plane"},
{ProtocolV2, false, "explicit_reassignment_or_rebuild"},
} {
t.Run(string(tc.version)+"_stable="+boolStr(tc.addrStable), func(t *testing.T) {
policy := ProtocolPolicy{Version: tc.version}
action := policy.RestartRejoinAction(tc.addrStable)
if action != tc.expectAction {
t.Fatalf("RestartRejoinAction(%v): got %q, want %q", tc.addrStable, action, tc.expectAction)
}
})
}
}
func TestP02_VersionComparison_V15RestartAddressInstability(t *testing.T) {
v15 := ProtocolPolicy{Version: ProtocolV15}
v2 := ProtocolPolicy{Version: ProtocolV2}
if got := v15.RestartRejoinAction(false); got != "control_plane_only" {
t.Fatalf("v1.5 changed-address restart should fall back to control plane, got %q", got)
}
if got := v2.ChangedAddressRestartAction(true); got != "explicit_reassignment_then_catchup" {
t.Fatalf("v2 changed-address recoverable restart should use explicit reassignment + catch-up, got %q", got)
}
if got := v2.ChangedAddressRestartAction(false); got != "explicit_reassignment_or_rebuild" {
t.Fatalf("v2 changed-address unrecoverable restart should go to explicit reassignment/rebuild, got %q", got)
}
}
func boolStr(b bool) string {
if b {
return "true"
}
return "false"
}

434
sw-block/prototype/distsim/phase02_v1_failures_test.go

@ -0,0 +1,434 @@
package distsim
import (
"testing"
)
// ============================================================
// Phase 02 P2: Real V1/V1.5 failure reproductions
// Source: actual Phase 13 hardware behavior and CP13-8 findings
// ============================================================
// --- Scenario: Changed-address restart (CP13-8 T4b) ---
// Real bug: replica restarts on a different port. V1.5 shipper retries
// the old address forever. Catch-up never succeeds because the old
// address is dead.
func TestP02_V1_ChangedAddressRestart_NeverRecovers(t *testing.T) {
c := NewClusterWithProtocol(CommitSyncQuorum, ProtocolV1, "p", "r1", "r2")
c.CommitWrite(1)
c.CommitWrite(2)
c.TickN(5)
// r1 restarts with changed address — endpoint version bumps.
c.StopNode("r1")
c.Coordinator.Epoch++
for _, n := range c.Nodes {
if n.Running {
n.Epoch = c.Coordinator.Epoch
}
}
c.RestartNodeWithNewAddress("r1")
// Messages from primary to r1 now rejected: stale endpoint.
staleRejects := c.RejectedByReason(RejectStaleEndpoint)
// Writes accumulate — r1 can't receive (endpoint mismatch).
for i := uint64(3); i <= 12; i++ {
c.CommitWrite(i)
}
c.TickN(5)
// Verify: messages rejected by stale endpoint, not just link down.
newStaleRejects := c.RejectedByReason(RejectStaleEndpoint) - staleRejects
if newStaleRejects == 0 {
t.Fatal("V1: writes to r1 should be rejected by stale_endpoint")
}
// V1: no recovery trigger available.
trigger, _, ok := c.TriggerRecoverySession("r1")
if ok {
t.Fatalf("V1 should not trigger recovery, got %s", trigger)
}
// Gap confirmed.
r1 := c.Nodes["r1"]
if err := c.AssertCommittedRecoverable("r1"); err == nil {
t.Fatal("V1: r1 should have data inconsistency")
}
t.Logf("V1: gap=%d, %d stale_endpoint rejections, no recovery path",
c.Coordinator.CommittedLSN-r1.Storage.FlushedLSN, newStaleRejects)
}
func TestP02_V15_ChangedAddressRestart_RetriesToStaleAddress(t *testing.T) {
c := NewClusterWithProtocol(CommitSyncQuorum, ProtocolV15, "p", "r1", "r2")
c.CommitWrite(1)
c.CommitWrite(2)
c.TickN(5)
// r1 restarts with changed address — endpoint version bumps.
c.StopNode("r1")
c.Coordinator.Epoch++
for _, n := range c.Nodes {
if n.Running {
n.Epoch = c.Coordinator.Epoch
}
}
c.RestartNodeWithNewAddress("r1")
// Writes accumulate — rejected by stale endpoint.
for i := uint64(3); i <= 12; i++ {
c.CommitWrite(i)
}
c.TickN(5)
// V1.5: recovery trigger fails — address mismatch detected.
trigger, _, ok := c.TriggerRecoverySession("r1")
if ok {
t.Fatalf("V1.5 should not trigger recovery with changed address, got %s", trigger)
}
// Heartbeat reveals new endpoint, but V1.5 can only do control_plane_only.
report := c.ReportHeartbeat("r1")
update := c.CoordinatorDetectEndpointChange(report)
if update == nil {
t.Fatal("coordinator should detect endpoint change")
}
// V1.5: does NOT apply assignment update — no mechanism to update primary.
if got := c.Protocol.ChangedAddressRestartAction(true); got != "control_plane_only" {
t.Fatalf("V1.5: got %q, want control_plane_only", got)
}
// Gap persists, data inconsistency.
r1 := c.Nodes["r1"]
if err := c.AssertCommittedRecoverable("r1"); err == nil {
t.Fatal("V1.5: r1 should have data inconsistency")
}
t.Logf("V1.5: gap=%d, stale endpoint blocks recovery — control_plane_only",
c.Coordinator.CommittedLSN-r1.Storage.FlushedLSN)
}
func TestP02_V2_ChangedAddressRestart_ExplicitReassignment(t *testing.T) {
c := NewClusterWithProtocol(CommitSyncQuorum, ProtocolV2, "p", "r1", "r2")
c.MaxCatchupAttempts = 5
c.CommitWrite(1)
c.CommitWrite(2)
c.TickN(5)
// r1 restarts with changed address — endpoint version bumps.
c.StopNode("r1")
c.Coordinator.Epoch++
for _, n := range c.Nodes {
if n.Running {
n.Epoch = c.Coordinator.Epoch
}
}
c.RestartNodeWithNewAddress("r1")
// Writes accumulate — rejected by stale endpoint.
for i := uint64(3); i <= 12; i++ {
c.CommitWrite(i)
}
c.TickN(5)
// Before control-plane flow: recovery trigger fails (stale endpoint).
trigger, _, ok := c.TriggerRecoverySession("r1")
if ok {
t.Fatalf("V2: recovery should fail before assignment update, got %s", trigger)
}
// Step 1: heartbeat discovers new endpoint.
report := c.ReportHeartbeat("r1")
update := c.CoordinatorDetectEndpointChange(report)
if update == nil {
t.Fatal("coordinator should detect endpoint change")
}
// Step 2: coordinator applies assignment — primary learns new address.
c.ApplyAssignmentUpdate(*update)
// Step 3: recovery trigger now succeeds (endpoint matches).
trigger, _, ok = c.TriggerRecoverySession("r1")
if !ok || trigger != TriggerReassignment {
t.Fatalf("V2: expected reassignment trigger after update, got %s/%v", trigger, ok)
}
// Step 4: catch-up via protocol.
converged := c.CatchUpWithEscalation("r1", 100)
if !converged {
t.Fatal("V2: catch-up should converge after reassignment")
}
// Data correct after full control-plane flow.
if err := c.AssertCommittedRecoverable("r1"); err != nil {
t.Fatalf("V2: data incorrect after reassignment+catchup: %v", err)
}
t.Logf("V2: recovered via heartbeat→detect→assignment→trigger→catchup")
}
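The gating this test walks through can be summarized as an ordered flow: recovery is blocked until an assignment update refreshes the primary's view of the replica's endpoint. A hypothetical sketch (none of these names are the distsim API):

```go
package main

import "fmt"

// recoveryFlow models the V2 changed-address path: when the primary's
// view of the replica endpoint is stale, the control-plane leg
// (heartbeat -> detect -> assignment) must run before the recovery
// trigger can succeed; with a matching endpoint it proceeds directly.
func recoveryFlow(primaryView, actual string) []string {
	var steps []string
	if primaryView != actual {
		// Only the control plane can refresh the primary's view.
		steps = append(steps, "heartbeat", "detect_endpoint_change", "apply_assignment")
	}
	steps = append(steps, "trigger_reassignment", "catchup")
	return steps
}

func main() {
	fmt.Println(recoveryFlow("r1:9001", "r1:9002"))
	// [heartbeat detect_endpoint_change apply_assignment trigger_reassignment catchup]
}
```

This also explains why the earlier `TriggerRecoverySession` call in the test must fail: before `ApplyAssignmentUpdate`, the primary still holds the stale endpoint.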
// --- Scenario: Same-address transient outage ---
// Common case: brief network hiccup, same ports.
func TestP02_V1_TransientOutage_Degrades(t *testing.T) {
c := NewClusterWithProtocol(CommitSyncQuorum, ProtocolV1, "p", "r1", "r2")
c.CommitWrite(1)
c.TickN(5)
// Brief partition.
c.Disconnect("p", "r1")
c.Disconnect("r1", "p")
c.CommitWrite(2)
c.CommitWrite(3)
c.TickN(5)
// Heal.
c.Connect("p", "r1")
c.Connect("r1", "p")
// V1: no catch-up. r1 stays at flushed=1.
if c.Protocol.CanAttemptCatchup(true) {
t.Fatal("V1 should not catch up even with a stable address")
}
c.TickN(5)
r1 := c.Nodes["r1"]
// Note: r1 may still reach CommittedLSN here if messages enqueued before
// the disconnect are delivered after healing. That is a model artifact (a
// V1 "accident"), not a protocol guarantee, so we log instead of asserting.
t.Logf("V1 transient outage: flushed=%d committed=%d action=%s",
r1.Storage.FlushedLSN, c.Coordinator.CommittedLSN,
c.Protocol.BriefDisconnectAction(true, true))
}
func TestP02_V15_TransientOutage_CatchesUp(t *testing.T) {
c := NewClusterWithProtocol(CommitSyncQuorum, ProtocolV15, "p", "r1", "r2")
c.MaxCatchupAttempts = 5
c.CommitWrite(1)
c.TickN(5)
c.Disconnect("p", "r1")
c.Disconnect("r1", "p")
c.CommitWrite(2)
c.CommitWrite(3)
c.TickN(5)
c.Connect("p", "r1")
c.Connect("r1", "p")
// V1.5: catch-up works if address stable.
if !c.Protocol.CanAttemptCatchup(true) {
t.Fatal("V1.5 should catch up with a stable address")
}
r1 := c.Nodes["r1"]
r1.ReplicaState = NodeStateCatchingUp
converged := c.CatchUpWithEscalation("r1", 100)
if !converged {
t.Fatal("V1.5: should converge for short gap with stable address")
}
if r1.ReplicaState != NodeStateInSync {
t.Fatalf("V1.5: expected InSync, got %s", r1.ReplicaState)
}
if err := c.AssertCommittedRecoverable("r1"); err != nil {
t.Fatal(err)
}
t.Logf("V1.5 transient outage: recovered via catch-up, flushed=%d", r1.Storage.FlushedLSN)
}
func TestP02_V2_TransientOutage_ReservedCatchup(t *testing.T) {
c := NewClusterWithProtocol(CommitSyncQuorum, ProtocolV2, "p", "r1", "r2")
c.MaxCatchupAttempts = 5
c.CommitWrite(1)
c.TickN(5)
c.Disconnect("p", "r1")
c.Disconnect("r1", "p")
c.CommitWrite(2)
c.CommitWrite(3)
c.TickN(5)
c.Connect("p", "r1")
c.Connect("r1", "p")
// V2: reserved catch-up — explicit recoverability check.
action := c.Protocol.BriefDisconnectAction(true, true)
if action != "reserved_catchup" {
t.Fatalf("V2 brief disconnect: got %q, want reserved_catchup", action)
}
r1 := c.Nodes["r1"]
r1.ReplicaState = NodeStateCatchingUp
converged := c.CatchUpWithEscalation("r1", 100)
if !converged {
t.Fatal("V2: should converge for short gap")
}
if err := c.AssertCommittedRecoverable("r1"); err != nil {
t.Fatal(err)
}
t.Logf("V2 transient outage: reserved catch-up succeeded")
}
// --- Scenario: Slow control-plane recovery ---
// Source: real Phase 13 hardware behavior.
// Data path recovers fast. Control plane (master) is slow to re-issue
// assignments. During this window, V1/V1.5 behavior differs from V2.
func TestP02_SlowControlPlane_V1_WaitsForMaster(t *testing.T) {
c := NewClusterWithProtocol(CommitSyncQuorum, ProtocolV1, "p", "r1", "r2")
c.CommitWrite(1)
c.TickN(5)
// r1 disconnects. Stays disconnected through outage + control-plane delay.
c.Disconnect("p", "r1")
c.Disconnect("r1", "p")
// Writes accumulate: outage write + delay-window writes. r1 misses all.
for i := uint64(2); i <= 10; i++ {
c.CommitWrite(i)
}
c.TickN(5)
// Data path heals — but V1 has no catch-up protocol.
c.Connect("p", "r1")
c.Connect("r1", "p")
// V1: no recovery trigger even with address stable.
trigger, _, ok := c.TriggerRecoverySession("r1")
if ok {
t.Fatalf("V1 should not trigger recovery, got %s", trigger)
}
// r1 is behind: FlushedLSN=1, CommittedLSN=10. Gap = 9.
r1 := c.Nodes["r1"]
gap := c.Coordinator.CommittedLSN - r1.Storage.FlushedLSN
if gap < 9 {
t.Fatalf("V1: expected gap >= 9, got %d", gap)
}
// V1 data inconsistency: r1 missed writes 2-10. No self-heal mechanism.
err := c.AssertCommittedRecoverable("r1")
if err == nil {
t.Fatal("V1: r1 should have data inconsistency — no catch-up mechanism")
}
t.Logf("V1 slow control-plane: gap=%d, data inconsistency — %v", gap, err)
}
func TestP02_SlowControlPlane_V15_BackgroundReconnect(t *testing.T) {
c := NewClusterWithProtocol(CommitSyncQuorum, ProtocolV15, "p", "r1", "r2")
c.MaxCatchupAttempts = 5
c.CommitWrite(1)
c.TickN(5)
// r1 disconnects. Stays disconnected through outage + delay window.
c.Disconnect("p", "r1")
c.Disconnect("r1", "p")
// Writes accumulate while r1 is disconnected.
for i := uint64(2); i <= 10; i++ {
c.CommitWrite(i)
}
c.TickN(5)
// Data path heals.
c.Connect("p", "r1")
c.Connect("r1", "p")
// Before catch-up: r1 is behind (FlushedLSN=1, CommittedLSN=10).
r1 := c.Nodes["r1"]
if r1.Storage.FlushedLSN >= c.Coordinator.CommittedLSN {
t.Fatal("V1.5: r1 should be behind before catch-up")
}
if err := c.AssertCommittedRecoverable("r1"); err == nil {
t.Fatal("V1.5: r1 should have data gap before catch-up")
}
// V1.5 policy: background reconnect if address stable.
if c.Protocol.RestartRejoinAction(true) != "background_reconnect_or_control_plane" {
t.Fatal("V1.5 stable-address should be background_reconnect_or_control_plane")
}
// V1.5 recovery trigger: background reconnect (address stable → endpoint matches).
trigger, _, ok := c.TriggerRecoverySession("r1")
if !ok || trigger != TriggerBackgroundReconnect {
t.Fatalf("V1.5: expected background_reconnect trigger, got %s/%v", trigger, ok)
}
// r1.ReplicaState is now CatchingUp (set by TriggerRecoverySession).
converged := c.CatchUpWithEscalation("r1", 100)
if !converged {
t.Fatal("V1.5: should catch up with stable address")
}
// After catch-up: data correct.
if err := c.AssertCommittedRecoverable("r1"); err != nil {
t.Fatalf("V1.5: data should be correct after catch-up — %v", err)
}
// V1.5 changed-address: falls back to control plane.
if c.Protocol.RestartRejoinAction(false) != "control_plane_only" {
t.Fatal("V1.5 changed-address should fall back to control_plane_only")
}
t.Logf("V1.5 slow control-plane: caught up %d entries via background reconnect",
c.Coordinator.CommittedLSN-1)
}
func TestP02_SlowControlPlane_V2_DirectReconnect(t *testing.T) {
c := NewClusterWithProtocol(CommitSyncQuorum, ProtocolV2, "p", "r1", "r2")
c.MaxCatchupAttempts = 5
c.CommitWrite(1)
c.TickN(5)
// r1 disconnects. Stays disconnected through outage + delay window.
c.Disconnect("p", "r1")
c.Disconnect("r1", "p")
// Writes accumulate while r1 is disconnected.
for i := uint64(2); i <= 10; i++ {
c.CommitWrite(i)
}
c.TickN(5)
// Data path heals.
c.Connect("p", "r1")
c.Connect("r1", "p")
// Before catch-up: r1 is behind.
r1 := c.Nodes["r1"]
if r1.Storage.FlushedLSN >= c.Coordinator.CommittedLSN {
t.Fatal("V2: r1 should be behind before direct reconnect")
}
if err := c.AssertCommittedRecoverable("r1"); err == nil {
t.Fatal("V2: r1 should have data gap before direct reconnect")
}
// V2 policy: direct reconnect, doesn't wait for master.
if c.Protocol.RestartRejoinAction(true) != "direct_reconnect_or_control_plane" {
t.Fatal("V2 should be direct_reconnect_or_control_plane")
}
// V2 recovery trigger: reassignment (address stable → endpoint matches).
trigger, _, ok := c.TriggerRecoverySession("r1")
if !ok || trigger != TriggerReassignment {
t.Fatalf("V2: expected reassignment trigger, got %s/%v", trigger, ok)
}
converged := c.CatchUpWithEscalation("r1", 100)
if !converged {
t.Fatal("V2: should catch up directly without master intervention")
}
// After catch-up: data correct.
if err := c.AssertCommittedRecoverable("r1"); err != nil {
t.Fatalf("V2: data should be correct after direct reconnect — %v", err)
}
t.Logf("V2 slow control-plane: caught up %d entries immediately via direct reconnect",
c.Coordinator.CommittedLSN-1)
}

287
sw-block/prototype/distsim/phase03_p2_race_test.go

@ -0,0 +1,287 @@
package distsim
import (
"testing"
)
// ============================================================
// Phase 03 P2: Timer-ordering races
// ============================================================
// --- Race 1: Concurrent barrier timeouts under sync_quorum ---
func TestP03_P2_ConcurrentBarrierTimeout_QuorumEdge(t *testing.T) {
// RF=3 (p, r1, r2). sync_quorum (quorum=2).
// Both r1 and r2 have barrier timeouts. r1's ack arrives in the same tick
// as r2's timeout fires. The "data before timers" rule means:
// r1 ack processed → cancels r1 timeout → r2 timeout fires → quorum = p+r1 = 2 → committed.
c := NewCluster(CommitSyncQuorum, "p", "r1", "r2")
c.BarrierTimeoutTicks = 5
// r2 disconnected — barrier will time out. r1 connected — will ack.
c.Disconnect("p", "r2")
c.Disconnect("r2", "p")
c.CommitWrite(1) // barrier to r1 at Now+2, barrier to r2 at Now+2
// Barrier timeout for both at Now+5.
c.TickN(10)
// r1 ack arrived → cancelled r1 timeout.
// r2 barrier timed out (link down, no ack).
firedBarriers := c.FiredTimeoutsByKind(TimeoutBarrier)
if firedBarriers != 1 {
t.Fatalf("expected 1 barrier timeout (r2), got %d", firedBarriers)
}
// Event log: r1's barrier timeout was cancelled (ack arrived earlier).
// r2's barrier timeout fired. Verify both are in the TickLog.
var cancelCount, fireCount int
for _, e := range c.TickLog {
if e.Kind == EventTimeoutCancelled {
cancelCount++
}
if e.Kind == EventTimeoutFired {
fireCount++
}
}
if cancelCount != 1 {
t.Fatalf("expected 1 timeout cancel (r1 ack), got %d", cancelCount)
}
if fireCount != 1 {
t.Fatalf("expected 1 timeout fire (r2), got %d", fireCount)
}
// Quorum: p + r1 = 2 of 3 → committed.
if c.Coordinator.CommittedLSN != 1 {
t.Fatalf("LSN 1 should commit via quorum (p+r1), committed=%d", c.Coordinator.CommittedLSN)
}
// DurableOn: p=true (self-ack), r1=true (ack), r2 NOT set (timed out).
p1 := c.Pending[1]
if !p1.DurableOn["p"] || !p1.DurableOn["r1"] {
t.Fatal("DurableOn should have p and r1")
}
if p1.DurableOn["r2"] {
t.Fatal("DurableOn should NOT have r2 (timed out)")
}
t.Logf("concurrent timeout: r1 acked, r2 timed out, quorum met, committed=%d", c.Coordinator.CommittedLSN)
}
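The quorum arithmetic asserted above (p+r1 = 2 of 3 commits; r2's timeout is irrelevant once quorum is met) reduces to counting durable acks against a majority threshold. A minimal sketch with hypothetical names:

```go
package main

import "fmt"

// committedAt models the sync_quorum commit rule the test exercises: an
// LSN commits once durable acks (including the primary's self-ack)
// reach a majority of the replication factor.
func committedAt(durableOn map[string]bool, rf int) bool {
	quorum := rf/2 + 1
	acks := 0
	for _, ok := range durableOn {
		if ok {
			acks++
		}
	}
	return acks >= quorum
}

func main() {
	// p self-acked, r1 acked, r2's barrier timed out: 2 of 3 meets quorum.
	fmt.Println(committedAt(map[string]bool{"p": true, "r1": true}, 3)) // true
	// p alone: 1 of 3 does not.
	fmt.Println(committedAt(map[string]bool{"p": true}, 3)) // false
}
```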
func TestP03_P2_ConcurrentBarrierTimeout_BothTimeout_NoQuorum(t *testing.T) {
// Both r1 and r2 disconnected. Both timeouts fire. Quorum = p alone = 1 < 2.
c := NewCluster(CommitSyncQuorum, "p", "r1", "r2")
c.BarrierTimeoutTicks = 5
c.Disconnect("p", "r1")
c.Disconnect("r1", "p")
c.Disconnect("p", "r2")
c.Disconnect("r2", "p")
c.CommitWrite(1)
c.TickN(10)
// Both barriers timed out.
if c.FiredTimeoutsByKind(TimeoutBarrier) != 2 {
t.Fatalf("expected 2 barrier timeouts, got %d", c.FiredTimeoutsByKind(TimeoutBarrier))
}
// No quorum — uncommitted.
if c.Coordinator.CommittedLSN != 0 {
t.Fatalf("LSN 1 should not commit without quorum, committed=%d", c.Coordinator.CommittedLSN)
}
// Neither r1 nor r2 in DurableOn.
p1 := c.Pending[1]
if p1.DurableOn["r1"] || p1.DurableOn["r2"] {
t.Fatal("DurableOn should not have r1 or r2")
}
t.Logf("both timeouts: no quorum, LSN 1 uncommitted")
}
func TestP03_P2_ConcurrentBarrierTimeout_SameTick_AckAndTimeout(t *testing.T) {
// The precise same-tick race: r1 ack arrives at exactly the tick when r2's
// timeout fires. Verify data-before-timers ordering in the event log.
c := NewCluster(CommitSyncQuorum, "p", "r1", "r2")
c.BarrierTimeoutTicks = 4 // timeout at Now+4
c.Disconnect("p", "r2")
c.Disconnect("r2", "p")
c.CommitWrite(1)
// Write at Now+1, barrier at Now+2, ack back at Now+3.
// Timeout for r1 at Now+4, timeout for r2 at Now+4.
// Tick to barrier ack arrival (tick 3): r1 ack delivered, cancels r1 timeout.
// Tick 4: r2 timeout fires. r1 timeout already cancelled.
c.TickN(6)
// Check event ordering at the timeout tick.
timeoutTick := uint64(0)
for _, ft := range c.FiredTimeouts {
timeoutTick = ft.FiredAt
}
events := c.TickEventsAt(timeoutTick)
// At the timeout tick, we should see: r2 timeout fired (r1 was cancelled earlier).
var firedDetails []string
for _, e := range events {
if e.Kind == EventTimeoutFired {
firedDetails = append(firedDetails, e.Detail)
}
}
if len(firedDetails) != 1 {
t.Fatalf("expected 1 timeout fire at tick %d, got %d: %v", timeoutTick, len(firedDetails), firedDetails)
}
// Committed via quorum.
if c.Coordinator.CommittedLSN != 1 {
t.Fatalf("committed=%d, want 1", c.Coordinator.CommittedLSN)
}
t.Logf("same-tick race: r1 ack cancelled at tick 3, r2 timeout fired at tick %d, committed=1", timeoutTick)
}
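The "data before timers" rule this race depends on can be isolated into a two-phase tick. This is a toy model of the assumed ordering, not the simulator's actual tick loop:

```go
package main

import "fmt"

// sim sketches the same-tick race: within one tick, deliveries are
// processed before timers, so an ack arriving at exactly its timeout
// deadline cancels the timer before the fire phase runs.
type sim struct {
	pendingAck   bool // an ack is deliverable this tick
	timerArmed   bool
	timeoutFired bool
}

func (s *sim) tick() {
	// Phase 1: deliver data. The ack cancels its barrier timeout.
	if s.pendingAck {
		s.pendingAck = false
		s.timerArmed = false
	}
	// Phase 2: fire timers that are still armed.
	if s.timerArmed {
		s.timerArmed = false
		s.timeoutFired = true
	}
}

func main() {
	s := &sim{pendingAck: true, timerArmed: true}
	s.tick()
	fmt.Println(s.timeoutFired) // false: the ack won the same-tick race
}
```

Reversing the two phases would flip the outcome, which is exactly the nondeterminism the event-log assertions above are guarding against.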
// --- Race 2: Epoch bump during active barrier timeout window ---
func TestP03_P2_EpochBumpDuringBarrierTimeout_CrossSurface(t *testing.T) {
// Three cleanup mechanisms interact for the same barrier:
// 1. Epoch fencing in deliver() rejects old-epoch messages
// 2. Barrier timeout in fireTimeouts() removes queued barriers + marks expired
// 3. ExpiredBarriers in deliver() rejects late acks
//
// Scenario: barrier re-queues (r1 missing data), epoch bumps, then timeout fires.
c := NewCluster(CommitSyncQuorum, "p", "r1", "r2")
c.BarrierTimeoutTicks = 10
c.CommitWrite(1) // write+barrier to r1, r2
// Drop write to r1 so barrier keeps re-queuing.
var kept []inFlightMessage
for _, item := range c.Queue {
if item.msg.Kind == MsgWrite && item.msg.To == "r1" && item.msg.Write.LSN == 1 {
continue
}
kept = append(kept, item)
}
c.Queue = kept
// Tick 1-3: r1's barrier delivers but r1 doesn't have data → re-queues.
// r2 gets write+barrier normally → acks.
c.TickN(3)
// Epoch bump: promote r2 (p stays running as demoted replica).
// This ensures the old-epoch barrier hits epoch fencing, not node_down.
if err := c.Promote("r2"); err != nil {
t.Fatal(err)
}
// Record state before timeout window.
epochRejectsBefore := c.RejectedByReason(RejectEpochMismatch)
// Tick 4-5: old-epoch barrier (p→r1) is in queue. deliver() rejects
// with epoch_mismatch (msg epoch=1 vs coordinator epoch=2).
c.TickN(2)
// Old barrier rejected by epoch fencing.
epochRejectsAfter := c.RejectedByReason(RejectEpochMismatch)
newEpochRejects := epochRejectsAfter - epochRejectsBefore
if newEpochRejects == 0 {
t.Fatal("old-epoch barrier should be rejected by epoch fencing")
}
// Tick past barrier timeout deadline.
c.TickN(10)
// Barrier timeout fires for r1/LSN 1 (removes any remaining queued copies).
if c.FiredTimeoutsByKind(TimeoutBarrier) == 0 {
t.Fatal("barrier timeout should fire for r1/LSN 1")
}
// Expired barrier marked.
if !c.ExpiredBarriers[barrierExpiredKey{"r1", 1}] {
t.Fatal("r1/LSN 1 should be in ExpiredBarriers")
}
// Inject late ack from r1 for LSN 1 at current epoch (to new primary r2).
// The barrier is expired — ack should be rejected by barrier_expired.
deliveriesBefore := len(c.Deliveries)
c.InjectMessage(Message{
Kind: MsgBarrierAck, From: "r1", To: "r2",
Epoch: c.Coordinator.Epoch, TargetLSN: 1,
}, c.Now+1)
c.TickN(2)
// Late ack rejected by barrier_expired.
lateRejected := false
for _, d := range c.Deliveries[deliveriesBefore:] {
if d.Msg.Kind == MsgBarrierAck && d.Msg.From == "r1" && d.Msg.TargetLSN == 1 {
if !d.Accepted && d.Reason == RejectBarrierExpired {
lateRejected = true
}
}
}
if !lateRejected {
t.Fatal("late ack for expired barrier should be rejected as barrier_expired")
}
// Verify event log shows the cross-surface interaction.
var epochRejectEvents, timeoutFireEvents int
for _, e := range c.TickLog {
if e.Kind == EventDeliveryRejected {
epochRejectEvents++
}
if e.Kind == EventTimeoutFired {
timeoutFireEvents++
}
}
if epochRejectEvents == 0 || timeoutFireEvents == 0 {
t.Fatalf("event log should show both epoch rejections (%d) and timeout fires (%d)",
epochRejectEvents, timeoutFireEvents)
}
t.Logf("cross-surface: epoch_rejects=%d, timeout_fires=%d, expired_barrier=true, late_ack_rejected=true",
newEpochRejects, c.FiredTimeoutsByKind(TimeoutBarrier))
}
// --- TickEvents trace verification ---
func TestP03_P2_TickEvents_OrderingVerifiable(t *testing.T) {
// Verify that TickEvents captures delivery → timeout ordering within a tick.
c := NewCluster(CommitSyncAll, "p", "r1")
c.BarrierTimeoutTicks = 5
c.CommitWrite(1)
c.TickN(10) // normal flow: ack cancels timeout
// TickLog should have events.
if len(c.TickLog) == 0 {
t.Fatal("TickLog should record events")
}
// Find delivery events and timeout events.
var deliveries, cancels int
for _, e := range c.TickLog {
switch e.Kind {
case EventDeliveryAccepted:
deliveries++
case EventTimeoutCancelled:
cancels++
}
}
if deliveries == 0 {
t.Fatal("should have delivery events")
}
if cancels == 0 {
t.Fatal("should have timeout cancel events (ack cancelled barrier timeout)")
}
// BuildTrace includes TickEvents.
trace := BuildTrace(c)
if len(trace.TickEvents) == 0 {
t.Fatal("BuildTrace should include TickEvents")
}
t.Logf("tick events: %d deliveries, %d cancels, %d total events",
deliveries, cancels, len(c.TickLog))
}

281
sw-block/prototype/distsim/phase03_race_test.go

@ -0,0 +1,281 @@
package distsim
import (
"testing"
)
// ============================================================
// Phase 03 P1: Race-focused tests with trace quality
// ============================================================
// --- Race 1: Promotion vs delayed catch-up timeout ---
func TestP03_Race_PromotionThenStaleCatchupTimeout(t *testing.T) {
// r1 is CatchingUp with a catch-up timeout registered.
// Before the timeout fires, primary crashes and r1 is promoted.
// The stale catch-up timeout must not regress r1 (now primary) to NeedsRebuild.
c := NewCluster(CommitSyncQuorum, "p", "r1", "r2")
c.CommitWrite(1)
c.CommitWrite(2)
c.TickN(5)
// r1 falls behind, starts catching up.
c.Disconnect("p", "r1")
c.Disconnect("r1", "p")
c.CommitWrite(3)
c.TickN(5)
c.Connect("p", "r1")
c.Connect("r1", "p")
r1 := c.Nodes["r1"]
r1.ReplicaState = NodeStateCatchingUp
c.RegisterTimeout(TimeoutCatchup, "r1", 0, c.Now+10)
// r1 catches up successfully.
converged := c.CatchUpWithEscalation("r1", 100)
if !converged {
t.Fatal("r1 should converge before promotion")
}
// Primary crashes. Promote r1.
c.StopNode("p")
if err := c.Promote("r1"); err != nil {
t.Fatal(err)
}
// Tick past the catch-up timeout deadline.
c.TickN(15)
// Stale timeout must not fire (was auto-cancelled on convergence).
if c.FiredTimeoutsByKind(TimeoutCatchup) != 0 {
t.Fatal("stale catch-up timeout must not fire after promotion")
}
// r1 must remain primary and running.
if r1.Role != RolePrimary {
t.Fatalf("r1 should be primary, got %s", r1.Role)
}
if r1.ReplicaState == NodeStateNeedsRebuild {
t.Fatal("stale timeout regressed promoted r1 to NeedsRebuild")
}
t.Logf("promotion vs timeout: stale catch-up timeout suppressed, r1 is primary")
}
func TestP03_Race_PromotionThenStaleBarrierTimeout(t *testing.T) {
// Barrier timeout registered for r1 at old epoch.
// Promotion bumps epoch. The stale barrier timeout fires but must not
// affect the new epoch's commit state.
c := NewCluster(CommitSyncAll, "p", "r1")
c.BarrierTimeoutTicks = 8
// Write 1 — barrier to r1. Disconnect r1 so barrier can't ack.
c.Disconnect("p", "r1")
c.Disconnect("r1", "p")
c.CommitWrite(1)
// Tick 2 — barrier timeout registered at Now+8.
c.TickN(2)
// Primary crashes, promote r1 (even though it doesn't have write 1).
c.StopNode("p")
c.StartNode("r1")
if err := c.Promote("r1"); err != nil {
t.Fatal(err)
}
// Snapshot committed prefix before stale timeout window.
committedBefore := c.Coordinator.CommittedLSN
// r1 is now primary at new epoch. Write new data.
c.CommitWrite(10)
c.TickN(10) // well past barrier timeout deadline
// Stale barrier timeout fires (from old epoch, old primary "p" → old replica "r1").
barriersFired := c.FiredTimeoutsByKind(TimeoutBarrier)
// Assert 1: old timed-out barrier did not change committed prefix unexpectedly.
// CommittedLSN may advance from r1's new-epoch writes, but must not regress
// or be influenced by the stale barrier timeout.
if c.Coordinator.CommittedLSN < committedBefore {
t.Fatalf("committed prefix regressed: before=%d after=%d",
committedBefore, c.Coordinator.CommittedLSN)
}
// Assert 2: old-epoch barrier did not set DurableOn for new-epoch writes.
// LSN 1 was written by old primary "p". Under the new epoch, DurableOn
// should not have been modified by the stale barrier's timeout path.
if p1 := c.Pending[1]; p1 != nil {
if p1.DurableOn["r1"] {
t.Fatal("stale barrier timeout should not set DurableOn[r1] for old-epoch LSN 1")
}
}
// Assert 3: old-epoch LSN 1 barrier is marked expired (stale timeout fired correctly).
if !c.ExpiredBarriers[barrierExpiredKey{"r1", 1}] {
t.Fatal("old-epoch barrier for r1/LSN 1 should be in ExpiredBarriers")
}
t.Logf("promotion vs barrier timeout: committed=%d, fired=%d, DurableOn[r1]=%v, expired[r1/1]=%v",
c.Coordinator.CommittedLSN, barriersFired,
c.Pending[1] != nil && c.Pending[1].DurableOn["r1"],
c.ExpiredBarriers[barrierExpiredKey{"r1", 1}])
}
// --- Race 2: Rebuild completion vs epoch bump ---
func TestP03_Race_RebuildCompletes_ThenEpochBumps(t *testing.T) {
// r1 needs rebuild. Rebuild completes, but before r1 can rejoin,
// epoch bumps (another failover). The rebuild result is valid but
// the replica must re-validate against the new epoch before rejoining.
c := NewCluster(CommitSyncQuorum, "p", "r1", "r2")
for i := uint64(1); i <= 10; i++ {
c.CommitWrite(i)
}
c.TickN(5)
c.Primary().Storage.TakeSnapshot("snap-1", c.Coordinator.CommittedLSN)
// r1 needs rebuild.
r1 := c.Nodes["r1"]
r1.ReplicaState = NodeStateNeedsRebuild
// Rebuild from snapshot — succeeds.
c.RebuildReplicaFromSnapshot("r1", "snap-1", c.Coordinator.CommittedLSN)
r1.ReplicaState = NodeStateRebuilding // transitional
// Before r1 can rejoin: epoch bumps (simulate another failure/promotion).
epochBefore := c.Coordinator.Epoch
c.StopNode("p")
if err := c.Promote("r2"); err != nil {
t.Fatal(err)
}
epochAfter := c.Coordinator.Epoch
if epochAfter <= epochBefore {
t.Fatal("epoch should have bumped")
}
// Promote() sets the epoch of all running nodes to the new coordinator
// epoch. r1 is still running, so r1.Epoch == epochAfter while r1.Role
// remains RoleReplica. Its rebuilt data, however, reflects the OLD
// epoch's committed prefix, which may differ under the new primary (r2).
// r1 must NOT be marked InSync until validated against the new epoch.
// Eligibility check: r1 is Rebuilding — ineligible for promotion.
e := c.EvaluateCandidateEligibility("r1")
if e.Eligible {
t.Fatal("r1 in Rebuilding state should not be eligible")
}
// r1 should NOT be InSync until it completes catch-up from new primary.
if r1.ReplicaState == NodeStateInSync {
t.Fatal("r1 should not be InSync after epoch bump during rebuild")
}
// After catch-up from new primary (r2), r1 can rejoin.
r1.ReplicaState = NodeStateCatchingUp
converged := c.CatchUpWithEscalation("r1", 100)
if !converged {
t.Fatal("r1 should converge from new primary")
}
if err := c.AssertCommittedRecoverable("r1"); err != nil {
t.Fatalf("r1 data incorrect after post-epoch-bump catch-up: %v", err)
}
t.Logf("rebuild vs epoch bump: r1 rebuilt at epoch %d, bumped to %d, caught up from r2",
epochBefore, epochAfter)
}
func TestP03_Race_EpochBumpsDuringCatchupTimeout(t *testing.T) {
// Catch-up timeout registered. Epoch bumps before timeout fires.
// The timeout is now stale (different epoch context).
// Must not mutate state under the new epoch.
c := NewCluster(CommitSyncQuorum, "p", "r1", "r2")
c.CommitWrite(1)
c.TickN(5)
c.Disconnect("p", "r1")
c.Disconnect("r1", "p")
c.CommitWrite(2)
c.TickN(5)
c.Connect("p", "r1")
c.Connect("r1", "p")
r1 := c.Nodes["r1"]
r1.ReplicaState = NodeStateCatchingUp
c.RegisterTimeout(TimeoutCatchup, "r1", 0, c.Now+10)
// Epoch bumps (promotion) before timeout.
c.StopNode("p")
if err := c.Promote("r1"); err != nil {
t.Fatal(err)
}
// r1 is now primary. In production, promotion sets the role and resets
// the replica state; the simulator leaves ReplicaState to the test, so
// set it to InSync explicitly (as the new primary).
r1.ReplicaState = NodeStateInSync
// Tick past timeout deadline.
c.TickN(15)
// Timeout should be ignored (r1 is InSync, not CatchingUp).
if c.FiredTimeoutsByKind(TimeoutCatchup) != 0 {
t.Fatal("catch-up timeout should not fire after epoch bump + promotion")
}
if len(c.IgnoredTimeouts) != 1 {
t.Fatalf("expected 1 ignored (stale) timeout, got %d", len(c.IgnoredTimeouts))
}
if r1.ReplicaState != NodeStateInSync {
t.Fatalf("r1 should remain InSync, got %s", r1.ReplicaState)
}
t.Logf("epoch bump vs timeout: stale catch-up timeout correctly ignored")
}
// --- Trace quality: dump state on failure ---
func TestP03_TraceQuality_FailingScenarioDumpsState(t *testing.T) {
// Verify that the timeout model produces debuggable traces.
// This test does NOT intentionally fail — it verifies that trace
// information is available for inspection.
c := NewCluster(CommitSyncAll, "p", "r1")
c.BarrierTimeoutTicks = 5
c.CommitWrite(1)
c.TickN(3)
c.Disconnect("p", "r1")
c.Disconnect("r1", "p")
c.CommitWrite(2)
c.TickN(10)
// Build trace.
trace := BuildTrace(c)
// Trace must contain key debugging information.
if trace.Tick == 0 {
t.Fatal("trace should have non-zero tick")
}
if trace.CommittedLSN == 0 && len(c.Pending) == 0 {
t.Fatal("trace should reflect cluster state")
}
if len(trace.FiredTimeouts) == 0 {
t.Fatal("trace should include fired timeouts")
}
if len(trace.NodeStates) < 2 {
t.Fatal("trace should include all node states")
}
if trace.Deliveries == 0 {
t.Fatal("trace should include deliveries")
}
t.Logf("trace: tick=%d committed=%d fired_timeouts=%d deliveries=%d nodes=%v",
trace.Tick, trace.CommittedLSN, len(trace.FiredTimeouts),
trace.Deliveries, trace.NodeStates)
}
// Trace infrastructure lives in eventsim.go (BuildTrace / Trace type).
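Note for reviewers: eventsim.go is not part of this chunk. From the fields the trace tests above read (`Tick`, `CommittedLSN`, `FiredTimeouts`, `NodeStates`, `Deliveries`, `TickEvents`), the `Trace` type presumably has roughly this shape; all field types below are assumptions inferred from the assertions, not the actual definition:

```go
package main

import "fmt"

// Sketch of the Trace shape implied by the assertions in the tests above.
// The real definition lives in eventsim.go (not in this diff); every field
// type here is an assumption inferred from how the tests read the fields.
type Trace struct {
	Tick          uint64            // tick at which BuildTrace snapshotted the cluster
	CommittedLSN  uint64            // coordinator's committed prefix
	FiredTimeouts []string          // descriptions of timeouts that fired
	NodeStates    map[string]string // node ID -> role/replica-state summary
	Deliveries    int               // count of delivered messages
	TickEvents    []string          // per-tick event log (mirrors Cluster.TickLog)
}

func main() {
	tr := Trace{Tick: 13, CommittedLSN: 1, Deliveries: 4,
		NodeStates: map[string]string{"p": "primary/InSync", "r1": "replica/Lagging"}}
	fmt.Printf("trace: tick=%d committed=%d nodes=%d\n",
		tr.Tick, tr.CommittedLSN, len(tr.NodeStates))
}
```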

333
sw-block/prototype/distsim/phase03_timeout_test.go

@@ -0,0 +1,333 @@
package distsim
import (
"testing"
)
// ============================================================
// Phase 03 P0: Timeout-backed scenarios
// ============================================================
// --- Barrier timeout ---
func TestP03_BarrierTimeout_SyncAllBlocked(t *testing.T) {
// Barrier sent to replica, link goes down, ack never arrives.
// Barrier timeout fires → barrier removed from queue.
// sync_all: write stays uncommitted.
c := NewCluster(CommitSyncAll, "p", "r1")
c.BarrierTimeoutTicks = 5
c.CommitWrite(1)
c.TickN(10) // enough for barrier timeout to fire and normal commit
// LSN 1: p self-acks. r1 acks. sync_all: both must ack. Should commit.
if c.Coordinator.CommittedLSN != 1 {
t.Fatalf("LSN 1 should commit normally, got committed=%d", c.Coordinator.CommittedLSN)
}
// No timeouts fired for LSN 1 (ack arrived in time).
if c.FiredTimeoutsByKind(TimeoutBarrier) != 0 {
t.Fatal("no barrier timeouts should have fired for LSN 1")
}
// Now disconnect r1. Write LSN 2. Barrier can't be acked.
c.Disconnect("p", "r1")
c.Disconnect("r1", "p")
c.CommitWrite(2)
c.TickN(10) // barrier timeout fires after 5 ticks
// Barrier timeout should have fired for r1/LSN 2.
if c.FiredTimeoutsByKind(TimeoutBarrier) != 1 {
t.Fatalf("expected 1 barrier timeout, got %d", c.FiredTimeoutsByKind(TimeoutBarrier))
}
// sync_all: LSN 2 NOT committed (r1 never acked).
if c.Coordinator.CommittedLSN != 1 {
t.Fatalf("LSN 2 should not commit under sync_all without r1 ack, committed=%d",
c.Coordinator.CommittedLSN)
}
// Barrier removed from queue (no indefinite re-queuing).
for _, item := range c.Queue {
if item.msg.Kind == MsgBarrier && item.msg.To == "r1" && item.msg.TargetLSN == 2 {
t.Fatal("timed-out barrier should be removed from queue")
}
}
t.Logf("barrier timeout: LSN 2 uncommitted, barrier cleaned from queue")
}
func TestP03_BarrierTimeout_SyncQuorum_StillCommits(t *testing.T) {
// RF=3 sync_quorum: r1 times out, but r2 acks → quorum met → commits.
c := NewCluster(CommitSyncQuorum, "p", "r1", "r2")
c.BarrierTimeoutTicks = 5
// Disconnect r1 only. r2 stays connected.
c.Disconnect("p", "r1")
c.Disconnect("r1", "p")
c.CommitWrite(1)
c.TickN(10)
// r1 barrier times out, but r2 acked. quorum = p + r2 = 2 of 3.
if c.FiredTimeoutsByKind(TimeoutBarrier) != 1 {
t.Fatalf("expected 1 barrier timeout (r1), got %d", c.FiredTimeoutsByKind(TimeoutBarrier))
}
if c.Coordinator.CommittedLSN != 1 {
t.Fatalf("LSN 1 should commit via quorum (p+r2), committed=%d", c.Coordinator.CommittedLSN)
}
t.Logf("barrier timeout: r1 timed out, LSN 1 committed via quorum")
}
// --- Catch-up timeout ---
func TestP03_CatchupTimeout_EscalatesToNeedsRebuild(t *testing.T) {
c := NewCluster(CommitSyncQuorum, "p", "r1", "r2")
c.CommitWrite(1)
c.TickN(5)
// r1 disconnects, primary writes more.
c.Disconnect("p", "r1")
c.Disconnect("r1", "p")
for i := uint64(2); i <= 20; i++ {
c.CommitWrite(i)
}
c.TickN(5)
c.Connect("p", "r1")
c.Connect("r1", "p")
// Register catch-up timeout: 3 ticks from now.
r1 := c.Nodes["r1"]
r1.ReplicaState = NodeStateCatchingUp
c.RegisterTimeout(TimeoutCatchup, "r1", 0, c.Now+3)
// Tick 3 times — timeout fires before catch-up completes.
c.TickN(3)
if r1.ReplicaState != NodeStateNeedsRebuild {
t.Fatalf("catch-up timeout should escalate to NeedsRebuild, got %s", r1.ReplicaState)
}
if c.FiredTimeoutsByKind(TimeoutCatchup) != 1 {
t.Fatalf("expected 1 catchup timeout, got %d", c.FiredTimeoutsByKind(TimeoutCatchup))
}
t.Logf("catch-up timeout: escalated to NeedsRebuild after 3 ticks")
}
// --- Reservation expiry as timeout event ---
func TestP03_ReservationTimeout_AbortsCatchup(t *testing.T) {
c := NewCluster(CommitSyncQuorum, "p", "r1", "r2")
for i := uint64(1); i <= 10; i++ {
c.CommitWrite(i)
}
c.TickN(5)
// r1 disconnects, more writes.
c.Disconnect("p", "r1")
c.Disconnect("r1", "p")
for i := uint64(11); i <= 30; i++ {
c.CommitWrite(i)
}
c.TickN(5)
c.Connect("p", "r1")
c.Connect("r1", "p")
// Register reservation timeout: 2 ticks.
r1 := c.Nodes["r1"]
r1.ReplicaState = NodeStateCatchingUp
c.RegisterTimeout(TimeoutReservation, "r1", 0, c.Now+2)
c.TickN(2)
if r1.ReplicaState != NodeStateNeedsRebuild {
t.Fatalf("reservation timeout should escalate to NeedsRebuild, got %s", r1.ReplicaState)
}
if c.FiredTimeoutsByKind(TimeoutReservation) != 1 {
t.Fatalf("expected 1 reservation timeout, got %d", c.FiredTimeoutsByKind(TimeoutReservation))
}
}
// --- Timer-race scenarios: same-tick resolution ---
func TestP03_Race_AckArrivesBeforeTimeout_Cancels(t *testing.T) {
// Barrier ack arrives in the same tick as the timeout deadline.
// Rule: data events (ack) process before timeouts → timeout is cancelled.
c := NewCluster(CommitSyncAll, "p", "r1")
c.BarrierTimeoutTicks = 4 // timeout at Now+4
c.CommitWrite(1) // barrier enqueued at Now+2, ack back at Now+3
// Barrier timeout registered at Now+4.
// Tick 1: write delivered.
// Tick 2: barrier delivered, ack enqueued at Now+1 = tick 3.
// Tick 3: ack delivered → cancels timeout.
// Tick 4: timeout deadline reached — but already cancelled.
c.TickN(5)
// Ack arrived first → timeout cancelled → LSN 1 committed.
if c.FiredTimeoutsByKind(TimeoutBarrier) != 0 {
t.Fatal("barrier timeout should be cancelled by ack arriving first")
}
if c.Coordinator.CommittedLSN != 1 {
t.Fatalf("LSN 1 should commit (ack arrived before timeout), committed=%d",
c.Coordinator.CommittedLSN)
}
t.Logf("race resolved: ack cancelled timeout, LSN 1 committed")
}
func TestP03_Race_TimeoutBeforeAck_Fires(t *testing.T) {
// Timeout fires before barrier can deliver (timeout < barrier delivery time).
// CommitWrite enqueues barrier at Now+2. Timeout at Now+1 fires first.
c := NewCluster(CommitSyncAll, "p", "r1")
c.BarrierTimeoutTicks = 1 // timeout at Now+1 — before barrier delivers at Now+2
c.CommitWrite(1)
c.TickN(5)
// Timeout fires at tick 1. Barrier would deliver at tick 2, but timeout
// removes it from queue first.
if c.FiredTimeoutsByKind(TimeoutBarrier) != 1 {
t.Fatalf("expected barrier timeout to fire, got %d", c.FiredTimeoutsByKind(TimeoutBarrier))
}
// sync_all: uncommitted (r1 never acked).
if c.Coordinator.CommittedLSN != 0 {
t.Fatalf("LSN 1 should not commit (timeout before barrier delivery), committed=%d",
c.Coordinator.CommittedLSN)
}
t.Logf("race resolved: timeout fired before barrier delivery, LSN 1 uncommitted")
}
func TestP03_Race_CatchupConverges_CancelsTimeout(t *testing.T) {
// Catch-up completes before the timeout fires.
c := NewCluster(CommitSyncQuorum, "p", "r1", "r2")
c.CommitWrite(1)
c.TickN(5)
c.Disconnect("p", "r1")
c.Disconnect("r1", "p")
c.CommitWrite(2)
c.CommitWrite(3)
c.TickN(5)
c.Connect("p", "r1")
c.Connect("r1", "p")
// Register catch-up timeout: 10 ticks (generous).
r1 := c.Nodes["r1"]
r1.ReplicaState = NodeStateCatchingUp
c.RegisterTimeout(TimeoutCatchup, "r1", 0, c.Now+10)
// Catch-up completes immediately (small gap).
// CatchUpWithEscalation auto-cancels recovery timeouts on convergence.
converged := c.CatchUpWithEscalation("r1", 100)
if !converged {
t.Fatal("catch-up should converge for small gap")
}
// Tick past deadline — timeout should already be cancelled.
c.TickN(15)
// Timeout should NOT have fired (was cancelled).
if c.FiredTimeoutsByKind(TimeoutCatchup) != 0 {
t.Fatal("catch-up timeout should be cancelled on convergence")
}
if r1.ReplicaState != NodeStateInSync {
t.Fatalf("r1 should be InSync after convergence, got %s", r1.ReplicaState)
}
t.Logf("race resolved: catch-up converged, timeout auto-cancelled")
}
// --- Stale timeout hardening ---
func TestP03_StaleReservationTimeout_AfterRecoverySuccess(t *testing.T) {
// Reservation timeout registered, but recovery completes before deadline.
// The stale timeout must NOT regress state from InSync back to NeedsRebuild.
c := NewCluster(CommitSyncQuorum, "p", "r1", "r2")
c.CommitWrite(1)
c.CommitWrite(2)
c.TickN(5)
c.Disconnect("p", "r1")
c.Disconnect("r1", "p")
c.CommitWrite(3)
c.TickN(5)
c.Connect("p", "r1")
c.Connect("r1", "p")
// Register reservation timeout: 10 ticks.
r1 := c.Nodes["r1"]
r1.ReplicaState = NodeStateCatchingUp
c.RegisterTimeout(TimeoutReservation, "r1", 0, c.Now+10)
// Catch-up succeeds immediately — auto-cancels reservation timeout.
converged := c.CatchUpWithEscalation("r1", 100)
if !converged {
t.Fatal("catch-up should converge")
}
if r1.ReplicaState != NodeStateInSync {
t.Fatalf("expected InSync after convergence, got %s", r1.ReplicaState)
}
// Tick well past the deadline.
c.TickN(20)
// Stale reservation timeout must NOT fire (cancelled by convergence).
if c.FiredTimeoutsByKind(TimeoutReservation) != 0 {
t.Fatal("stale reservation timeout should not fire after recovery success")
}
if r1.ReplicaState != NodeStateInSync {
t.Fatalf("stale timeout regressed state: expected InSync, got %s", r1.ReplicaState)
}
t.Logf("stale reservation timeout correctly suppressed after recovery")
}
func TestP03_LateBarrierAck_AfterTimeout_Rejected(t *testing.T) {
// Barrier times out, then a late ack arrives. The late ack must be
// rejected — it must not count toward DurableOn.
c := NewCluster(CommitSyncAll, "p", "r1")
c.BarrierTimeoutTicks = 1 // timeout at Now+1
c.CommitWrite(1)
// Tick 1: write delivered, timeout fires (barrier at Now+2 not yet delivered).
c.TickN(1)
if c.FiredTimeoutsByKind(TimeoutBarrier) != 1 {
t.Fatalf("expected barrier timeout to fire, got %d", c.FiredTimeoutsByKind(TimeoutBarrier))
}
// LSN 1 should NOT be committed.
if c.Coordinator.CommittedLSN != 0 {
t.Fatalf("LSN 1 should not be committed after timeout, got %d", c.Coordinator.CommittedLSN)
}
// Now inject a late barrier ack (as if the network delayed it massively).
c.InjectMessage(Message{
Kind: MsgBarrierAck,
From: "r1",
To: "p",
Epoch: c.Coordinator.Epoch,
TargetLSN: 1,
}, c.Now+1)
c.TickN(5)
// Late ack must be rejected with barrier_expired reason.
expiredRejects := c.RejectedByReason(RejectBarrierExpired)
if expiredRejects == 0 {
t.Fatal("late barrier ack should be rejected as barrier_expired")
}
// LSN 1 must still be uncommitted (late ack did not count).
if c.Coordinator.CommittedLSN != 0 {
t.Fatalf("late ack should not commit LSN 1, got committed=%d", c.Coordinator.CommittedLSN)
}
// DurableOn should NOT include r1.
p1 := c.Pending[1]
if p1 != nil && p1.DurableOn["r1"] {
t.Fatal("late ack should not set DurableOn for r1")
}
t.Logf("late barrier ack: rejected as barrier_expired, LSN 1 stays uncommitted")
}

243
sw-block/prototype/distsim/phase04a_ownership_test.go

@@ -0,0 +1,243 @@
package distsim
import "testing"
// ============================================================
// Phase 04a: Session ownership validation in distsim
// ============================================================
// --- Scenario 1: Endpoint change during active catch-up ---
func TestP04a_EndpointChangeDuringCatchup_InvalidatesSession(t *testing.T) {
c := NewCluster(CommitSyncQuorum, "p", "r1", "r2")
c.CommitWrite(1)
c.TickN(5)
// Start catch-up session for r1.
c.Disconnect("p", "r1")
c.Disconnect("r1", "p")
c.CommitWrite(2)
c.TickN(5)
c.Connect("p", "r1")
c.Connect("r1", "p")
trigger, sessID, ok := c.TriggerRecoverySession("r1")
if !ok || trigger != TriggerReassignment {
t.Fatalf("should trigger reassignment, got %s/%v", trigger, ok)
}
// Session is active.
sess := c.Sessions["r1"]
if !sess.Active {
t.Fatal("session should be active")
}
// Endpoint changes (replica restarts on new address).
c.StopNode("r1")
c.RestartNodeWithNewAddress("r1")
// Session invalidated by endpoint change.
if sess.Active {
t.Fatal("session should be invalidated after endpoint change")
}
if sess.Reason != "endpoint_changed" {
t.Fatalf("invalidation reason: got %q, want endpoint_changed", sess.Reason)
}
// Stale completion from old session is rejected.
if c.CompleteRecoverySession("r1", sessID) {
t.Fatal("stale session completion should be rejected")
}
t.Logf("endpoint change: session %d invalidated, stale completion rejected", sessID)
}
// --- Scenario 2: Epoch bump during active catch-up ---
func TestP04a_EpochBumpDuringCatchup_InvalidatesSession(t *testing.T) {
c := NewCluster(CommitSyncQuorum, "p", "r1", "r2")
c.CommitWrite(1)
c.TickN(5)
c.Disconnect("p", "r1")
c.Disconnect("r1", "p")
c.CommitWrite(2)
c.TickN(5)
c.Connect("p", "r1")
c.Connect("r1", "p")
_, sessID, ok := c.TriggerRecoverySession("r1")
if !ok {
t.Fatal("trigger should succeed")
}
sess := c.Sessions["r1"]
// Epoch bumps (promotion).
c.StopNode("p")
c.Promote("r2")
// Session invalidated by epoch bump.
if sess.Active {
t.Fatal("session should be invalidated after epoch bump")
}
if sess.Reason != "epoch_bump_promotion" {
t.Fatalf("reason: got %q", sess.Reason)
}
// Stale completion rejected.
if c.CompleteRecoverySession("r1", sessID) {
t.Fatal("stale completion after epoch bump should be rejected")
}
t.Logf("epoch bump: session %d invalidated, completion rejected", sessID)
}
// --- Scenario 3: Stale late completion from old session ---
func TestP04a_StaleCompletion_AfterSupersede_Rejected(t *testing.T) {
c := NewCluster(CommitSyncQuorum, "p", "r1", "r2")
c.CommitWrite(1)
c.TickN(5)
c.Disconnect("p", "r1")
c.Disconnect("r1", "p")
c.CommitWrite(2)
c.TickN(5)
c.Connect("p", "r1")
c.Connect("r1", "p")
// First session.
_, oldSessID, _ := c.TriggerRecoverySession("r1")
oldSess := c.Sessions["r1"]
// Invalidate old session manually (simulate timeout or abort).
c.InvalidateReplicaSession("r1", "timeout")
if oldSess.Active {
t.Fatal("old session should be invalidated")
}
// New session triggered.
c.Nodes["r1"].ReplicaState = NodeStateLagging // reset state to allow retrigger
_, newSessID, ok := c.TriggerRecoverySession("r1")
if !ok {
t.Fatal("second trigger should succeed after invalidation")
}
newSess := c.Sessions["r1"]
// Old session completion attempt — must be rejected by ID mismatch.
if c.CompleteRecoverySession("r1", oldSessID) {
t.Fatal("old session completion must be rejected")
}
// New session still active.
if !newSess.Active {
t.Fatal("new session should still be active")
}
// New session completion succeeds.
if !c.CompleteRecoverySession("r1", newSessID) {
t.Fatal("new session completion should succeed")
}
if c.Nodes["r1"].ReplicaState != NodeStateInSync {
t.Fatalf("r1 should be InSync after new session completes, got %s", c.Nodes["r1"].ReplicaState)
}
t.Logf("stale completion: old=%d rejected, new=%d accepted", oldSessID, newSessID)
}
// --- Scenario 4: Duplicate recovery trigger while session active ---
func TestP04a_DuplicateTrigger_WhileActive_Rejected(t *testing.T) {
c := NewCluster(CommitSyncQuorum, "p", "r1", "r2")
c.CommitWrite(1)
c.TickN(5)
c.Disconnect("p", "r1")
c.Disconnect("r1", "p")
c.CommitWrite(2)
c.TickN(5)
c.Connect("p", "r1")
c.Connect("r1", "p")
// First trigger succeeds.
_, _, ok := c.TriggerRecoverySession("r1")
if !ok {
t.Fatal("first trigger should succeed")
}
// Duplicate trigger while session active — rejected.
_, _, ok = c.TriggerRecoverySession("r1")
if ok {
t.Fatal("duplicate trigger should be rejected while session active")
}
// Session count: only one in history.
sessCount := 0
for _, s := range c.SessionHistory {
if s.ReplicaID == "r1" {
sessCount++
}
}
if sessCount != 1 {
t.Fatalf("should have exactly 1 session in history, got %d", sessCount)
}
t.Logf("duplicate trigger correctly rejected")
}
// --- Scenario 5: Session tracking through full lifecycle ---
func TestP04a_FullLifecycle_SessionTracking(t *testing.T) {
c := NewCluster(CommitSyncQuorum, "p", "r1", "r2")
c.CommitWrite(1)
c.CommitWrite(2)
c.TickN(5)
// Disconnect, write, reconnect.
c.Disconnect("p", "r1")
c.Disconnect("r1", "p")
for i := uint64(3); i <= 10; i++ {
c.CommitWrite(i)
}
c.TickN(5)
c.Connect("p", "r1")
c.Connect("r1", "p")
// Trigger session.
trigger, sessID, ok := c.TriggerRecoverySession("r1")
if !ok {
t.Fatal("trigger failed")
}
if trigger != TriggerReassignment {
t.Fatalf("expected reassignment, got %s", trigger)
}
// Catch up.
converged := c.CatchUpWithEscalation("r1", 100)
if !converged {
t.Fatal("catch-up should converge")
}
// Complete session.
if !c.CompleteRecoverySession("r1", sessID) {
t.Fatal("completion should succeed")
}
// Verify final state.
if c.Nodes["r1"].ReplicaState != NodeStateInSync {
t.Fatalf("r1 should be InSync, got %s", c.Nodes["r1"].ReplicaState)
}
if err := c.AssertCommittedRecoverable("r1"); err != nil {
t.Fatalf("data incorrect: %v", err)
}
// Session in history, not active.
sess := c.Sessions["r1"]
if sess.Active {
t.Fatal("session should not be active after completion")
}
if len(c.SessionHistory) != 1 {
t.Fatalf("expected 1 session in history, got %d", len(c.SessionHistory))
}
t.Logf("full lifecycle: trigger=%s session=%d → catch-up → complete → InSync", trigger, sessID)
}
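The five scenarios above all reduce to one fencing rule: a completion is honored only if it names the currently active session's ID. A minimal sketch of that check (the `Session` fields mirror what the tests exercise; the function name is illustrative, not the simulator's `CompleteRecoverySession`):

```go
package main

import "fmt"

// Session mirrors the ownership fields the tests above exercise.
type Session struct {
	ID     uint64
	Active bool
	Reason string // set on invalidation, e.g. "endpoint_changed"
}

// completeByID applies ID-based stale fencing: only the currently
// active session with a matching ID may complete. Completions from
// superseded or invalidated sessions are rejected.
func completeByID(s *Session, id uint64) bool {
	if s == nil || !s.Active || s.ID != id {
		return false
	}
	s.Active = false
	return true
}

func main() {
	old := &Session{ID: 1, Active: false, Reason: "timeout"} // superseded
	cur := &Session{ID: 2, Active: true}
	fmt.Println(completeByID(old, 1), completeByID(cur, 1), completeByID(cur, 2))
	// stale session rejected, wrong ID rejected, matching active ID accepted
}
```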

102
sw-block/prototype/distsim/protocol.go

@@ -0,0 +1,102 @@
package distsim
type ProtocolVersion string
const (
ProtocolV1 ProtocolVersion = "v1"
ProtocolV15 ProtocolVersion = "v1_5"
ProtocolV2 ProtocolVersion = "v2"
)
type ProtocolPolicy struct {
Version ProtocolVersion
}
func (p ProtocolPolicy) CanAttemptCatchup(addressStable bool) bool {
switch p.Version {
case ProtocolV1:
return false
case ProtocolV15:
return addressStable
case ProtocolV2:
return true
default:
return false
}
}
func (p ProtocolPolicy) BriefDisconnectAction(addressStable, recoverable bool) string {
switch p.Version {
case ProtocolV1:
return "degrade_or_rebuild"
case ProtocolV15:
if addressStable && recoverable {
return "catchup_if_history_survives"
}
return "stall_or_control_plane_recovery"
case ProtocolV2:
if recoverable {
return "reserved_catchup"
}
return "explicit_rebuild"
default:
return "unknown"
}
}
func (p ProtocolPolicy) TailChasingAction(converged bool) string {
switch p.Version {
case ProtocolV1:
if converged {
return "unexpected_catchup"
}
return "degrade"
case ProtocolV15:
if converged {
return "catchup"
}
return "stall_or_rebuild"
case ProtocolV2:
if converged {
return "catchup"
}
return "abort_to_rebuild"
default:
return "unknown"
}
}
func (p ProtocolPolicy) RestartRejoinAction(addressStable bool) string {
switch p.Version {
case ProtocolV1:
return "control_plane_only"
case ProtocolV15:
if addressStable {
return "background_reconnect_or_control_plane"
}
return "control_plane_only"
case ProtocolV2:
if addressStable {
return "direct_reconnect_or_control_plane"
}
return "explicit_reassignment_or_rebuild"
default:
return "unknown"
}
}
func (p ProtocolPolicy) ChangedAddressRestartAction(recoverable bool) string {
switch p.Version {
case ProtocolV1:
return "control_plane_only"
case ProtocolV15:
return "control_plane_only"
case ProtocolV2:
if recoverable {
return "explicit_reassignment_then_catchup"
}
return "explicit_reassignment_or_rebuild"
default:
return "unknown"
}
}
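protocol.go is a pure decision table with no cluster state, so it can be exercised directly. A usage sketch for the changed-address, recoverable-history case (the one that separates the three versions); the block inlines a condensed copy of the two methods it calls so it runs standalone:

```go
package main

import "fmt"

// Condensed copy of the ProtocolPolicy decision table from protocol.go,
// reduced to the two methods exercised below.
type ProtocolVersion string

const (
	ProtocolV1  ProtocolVersion = "v1"
	ProtocolV15 ProtocolVersion = "v1_5"
	ProtocolV2  ProtocolVersion = "v2"
)

type ProtocolPolicy struct{ Version ProtocolVersion }

func (p ProtocolPolicy) CanAttemptCatchup(addressStable bool) bool {
	switch p.Version {
	case ProtocolV15:
		return addressStable
	case ProtocolV2:
		return true
	default: // v1 and unknown versions expose no meaningful catch-up path
		return false
	}
}

func (p ProtocolPolicy) BriefDisconnectAction(addressStable, recoverable bool) string {
	switch p.Version {
	case ProtocolV1:
		return "degrade_or_rebuild"
	case ProtocolV15:
		if addressStable && recoverable {
			return "catchup_if_history_survives"
		}
		return "stall_or_control_plane_recovery"
	case ProtocolV2:
		if recoverable {
			return "reserved_catchup"
		}
		return "explicit_rebuild"
	default:
		return "unknown"
	}
}

func main() {
	// addressStable=false, recoverable=true: only v2 still attempts catch-up.
	for _, v := range []ProtocolVersion{ProtocolV1, ProtocolV15, ProtocolV2} {
		p := ProtocolPolicy{Version: v}
		fmt.Printf("%s: catchup=%v action=%s\n",
			v, p.CanAttemptCatchup(false), p.BriefDisconnectAction(false, true))
	}
}
```

This prints `v1: catchup=false action=degrade_or_rebuild`, `v1_5: catchup=false action=stall_or_control_plane_recovery`, and `v2: catchup=true action=reserved_catchup`, matching the assertions in protocol_test.go below.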

84
sw-block/prototype/distsim/protocol_test.go

@@ -0,0 +1,84 @@
package distsim
import "testing"
func TestProtocolV1CannotAttemptCatchup(t *testing.T) {
p := ProtocolPolicy{Version: ProtocolV1}
if p.CanAttemptCatchup(true) {
t.Fatal("v1 should not expose meaningful catch-up path")
}
}
func TestProtocolV15CatchupDependsOnStableAddress(t *testing.T) {
p := ProtocolPolicy{Version: ProtocolV15}
if !p.CanAttemptCatchup(true) {
t.Fatal("v1.5 should allow catch-up when address is stable")
}
if p.CanAttemptCatchup(false) {
t.Fatal("v1.5 should not assume reconnect with changed address")
}
}
func TestProtocolV2AllowsCatchupByPolicy(t *testing.T) {
p := ProtocolPolicy{Version: ProtocolV2}
if !p.CanAttemptCatchup(true) || !p.CanAttemptCatchup(false) {
t.Fatal("v2 policy should allow catch-up attempt subject to explicit recoverability checks")
}
}
func TestProtocolBriefDisconnectActions(t *testing.T) {
if got := (ProtocolPolicy{Version: ProtocolV1}).BriefDisconnectAction(true, true); got != "degrade_or_rebuild" {
t.Fatalf("v1 brief-disconnect action = %s", got)
}
if got := (ProtocolPolicy{Version: ProtocolV15}).BriefDisconnectAction(true, true); got != "catchup_if_history_survives" {
t.Fatalf("v1.5 brief-disconnect action = %s", got)
}
if got := (ProtocolPolicy{Version: ProtocolV15}).BriefDisconnectAction(false, true); got != "stall_or_control_plane_recovery" {
t.Fatalf("v1.5 changed-address brief-disconnect action = %s", got)
}
if got := (ProtocolPolicy{Version: ProtocolV2}).BriefDisconnectAction(true, false); got != "explicit_rebuild" {
t.Fatalf("v2 unrecoverable brief-disconnect action = %s", got)
}
if got := (ProtocolPolicy{Version: ProtocolV2}).BriefDisconnectAction(false, true); got != "reserved_catchup" {
t.Fatalf("v2 recoverable brief-disconnect action = %s", got)
}
}
func TestProtocolTailChasingActions(t *testing.T) {
if got := (ProtocolPolicy{Version: ProtocolV1}).TailChasingAction(false); got != "degrade" {
t.Fatalf("v1 tail-chasing action = %s", got)
}
if got := (ProtocolPolicy{Version: ProtocolV15}).TailChasingAction(false); got != "stall_or_rebuild" {
t.Fatalf("v1.5 tail-chasing action = %s", got)
}
if got := (ProtocolPolicy{Version: ProtocolV2}).TailChasingAction(false); got != "abort_to_rebuild" {
t.Fatalf("v2 tail-chasing action = %s", got)
}
}
func TestProtocolRestartRejoinActions(t *testing.T) {
if got := (ProtocolPolicy{Version: ProtocolV1}).RestartRejoinAction(true); got != "control_plane_only" {
t.Fatalf("v1 restart action = %s", got)
}
if got := (ProtocolPolicy{Version: ProtocolV15}).RestartRejoinAction(false); got != "control_plane_only" {
t.Fatalf("v1.5 changed-address restart action = %s", got)
}
if got := (ProtocolPolicy{Version: ProtocolV2}).RestartRejoinAction(false); got != "explicit_reassignment_or_rebuild" {
t.Fatalf("v2 changed-address restart action = %s", got)
}
}
func TestProtocolChangedAddressRestartActions(t *testing.T) {
if got := (ProtocolPolicy{Version: ProtocolV1}).ChangedAddressRestartAction(true); got != "control_plane_only" {
t.Fatalf("v1 changed-address restart action = %s", got)
}
if got := (ProtocolPolicy{Version: ProtocolV15}).ChangedAddressRestartAction(true); got != "control_plane_only" {
t.Fatalf("v1.5 changed-address restart action = %s", got)
}
if got := (ProtocolPolicy{Version: ProtocolV2}).ChangedAddressRestartAction(true); got != "explicit_reassignment_then_catchup" {
t.Fatalf("v2 recoverable changed-address restart action = %s", got)
}
if got := (ProtocolPolicy{Version: ProtocolV2}).ChangedAddressRestartAction(false); got != "explicit_reassignment_or_rebuild" {
t.Fatalf("v2 unrecoverable changed-address restart action = %s", got)
}
}

256
sw-block/prototype/distsim/random.go

@@ -0,0 +1,256 @@
package distsim
import (
"fmt"
"math/rand"
"sort"
)
type RandomEvent string
const (
RandomCommitWrite RandomEvent = "commit_write"
RandomTick RandomEvent = "tick"
RandomDisconnect RandomEvent = "disconnect"
RandomReconnect RandomEvent = "reconnect"
RandomStopNode RandomEvent = "stop_node"
RandomStartNode RandomEvent = "start_node"
RandomPromote RandomEvent = "promote"
RandomTakeSnapshot RandomEvent = "take_snapshot"
RandomCatchup RandomEvent = "catchup"
RandomRebuild RandomEvent = "rebuild"
)
type RandomStep struct {
Step int
Event RandomEvent
Detail string
}
type RandomResult struct {
Seed int64
Steps []RandomStep
Cluster *Cluster
Snapshots []string
}
func RunRandomScenario(seed int64, steps int) (*RandomResult, error) {
rng := rand.New(rand.NewSource(seed))
cluster := NewCluster(CommitSyncQuorum, "p", "r1", "r2")
result := &RandomResult{
Seed: seed,
Cluster: cluster,
}
for i := 0; i < steps; i++ {
step, err := runRandomStep(cluster, rng, i)
if err != nil {
result.Steps = append(result.Steps, step)
return result, err
}
result.Steps = append(result.Steps, step)
if err := assertClusterInvariants(cluster); err != nil {
return result, fmt.Errorf("seed=%d step=%d event=%s detail=%s: %w", seed, i, step.Event, step.Detail, err)
}
}
return result, assertClusterInvariants(cluster)
}
func runRandomStep(c *Cluster, rng *rand.Rand, step int) (RandomStep, error) {
events := []RandomEvent{
RandomCommitWrite,
RandomTick,
RandomDisconnect,
RandomReconnect,
RandomStopNode,
RandomStartNode,
RandomPromote,
RandomTakeSnapshot,
RandomCatchup,
RandomRebuild,
}
ev := events[rng.Intn(len(events))]
rs := RandomStep{Step: step, Event: ev}
switch ev {
case RandomCommitWrite:
block := uint64(rng.Intn(8) + 1)
lsn := c.CommitWrite(block)
rs.Detail = fmt.Sprintf("block=%d lsn=%d", block, lsn)
case RandomTick:
n := rng.Intn(3) + 1
c.TickN(n)
rs.Detail = fmt.Sprintf("ticks=%d", n)
case RandomDisconnect:
from, to := randomPair(c, rng)
c.Disconnect(from, to)
rs.Detail = fmt.Sprintf("%s->%s", from, to)
case RandomReconnect:
from, to := randomPair(c, rng)
c.Connect(from, to)
rs.Detail = fmt.Sprintf("%s->%s", from, to)
case RandomStopNode:
id := randomNodeID(c, rng)
c.StopNode(id)
rs.Detail = id
case RandomStartNode:
id := randomNodeID(c, rng)
c.StartNode(id)
rs.Detail = id
case RandomPromote:
if primary := c.Primary(); primary != nil && primary.Running {
rs.Detail = "primary_still_running"
return rs, nil
}
candidates := promotableNodes(c)
if len(candidates) == 0 {
rs.Detail = "no_candidate"
return rs, nil
}
id := candidates[rng.Intn(len(candidates))]
rs.Detail = id
if err := c.Promote(id); err != nil {
return rs, err
}
case RandomTakeSnapshot:
primary := c.Primary()
if primary == nil || !primary.Running {
rs.Detail = "no_primary"
return rs, nil
}
lsn := c.Coordinator.CommittedLSN
id := fmt.Sprintf("snap-%s-%d", primary.ID, lsn)
primary.Storage.TakeSnapshot(id, lsn)
rs.Detail = fmt.Sprintf("%s@%d", id, lsn)
case RandomCatchup:
id := randomReplicaID(c, rng)
if id == "" {
rs.Detail = "no_replica"
return rs, nil
}
node := c.Nodes[id]
if node == nil || !node.Running {
rs.Detail = id + ":down"
return rs, nil
}
start := node.Storage.FlushedLSN
end := c.Coordinator.CommittedLSN
if end <= start {
rs.Detail = fmt.Sprintf("%s:no_gap", id)
return rs, nil
}
rs.Detail = fmt.Sprintf("%s:%d..%d", id, start+1, end)
if err := c.RecoverReplicaFromPrimary(id, start, end); err != nil {
return rs, err
}
case RandomRebuild:
id := randomReplicaID(c, rng)
if id == "" {
rs.Detail = "no_replica"
return rs, nil
}
primary := c.Primary()
node := c.Nodes[id]
if primary == nil || node == nil || !primary.Running || !node.Running {
rs.Detail = id + ":unavailable"
return rs, nil
}
snapshotIDs := make([]string, 0, len(primary.Storage.Snapshots))
for snapID := range primary.Storage.Snapshots {
snapshotIDs = append(snapshotIDs, snapID)
}
if len(snapshotIDs) == 0 {
rs.Detail = id + ":no_snapshot"
return rs, nil
}
sort.Strings(snapshotIDs)
snapID := snapshotIDs[rng.Intn(len(snapshotIDs))]
rs.Detail = fmt.Sprintf("%s:%s->%d", id, snapID, c.Coordinator.CommittedLSN)
if err := c.RebuildReplicaFromSnapshot(id, snapID, c.Coordinator.CommittedLSN); err != nil {
return rs, err
}
default:
return rs, fmt.Errorf("unknown random event %s", ev)
}
return rs, nil
}
func randomNodeID(c *Cluster, rng *rand.Rand) string {
ids := append([]string(nil), c.Coordinator.Members...)
sort.Strings(ids)
if len(ids) == 0 {
return ""
}
return ids[rng.Intn(len(ids))]
}
func randomReplicaID(c *Cluster, rng *rand.Rand) string {
ids := c.replicaIDs()
if len(ids) == 0 {
return ""
}
return ids[rng.Intn(len(ids))]
}
func randomPair(c *Cluster, rng *rand.Rand) (string, string) {
from := randomNodeID(c, rng)
to := randomNodeID(c, rng)
if from == to {
ids := append([]string(nil), c.Coordinator.Members...)
sort.Strings(ids)
for _, id := range ids {
if id != from {
to = id
break
}
}
}
return from, to
}
func promotableNodes(c *Cluster) []string {
out := make([]string, 0)
want := c.Reference.StateAt(c.Coordinator.CommittedLSN)
for _, id := range c.Coordinator.Members {
n := c.Nodes[id]
if n == nil || !n.Running || n.Storage.FlushedLSN < c.Coordinator.CommittedLSN {
continue
}
if !EqualState(n.Storage.StateAt(c.Coordinator.CommittedLSN), want) {
continue
}
out = append(out, id)
}
sort.Strings(out)
return out
}
func assertClusterInvariants(c *Cluster) error {
committed := c.Coordinator.CommittedLSN
want := c.Reference.StateAt(committed)
for lsn, p := range c.Pending {
if p.Committed && lsn > committed {
return fmt.Errorf("pending lsn %d marked committed above coordinator committed lsn %d", lsn, committed)
}
}
for _, id := range promotableNodes(c) {
n := c.Nodes[id]
got := n.Storage.StateAt(committed)
if !EqualState(got, want) {
return fmt.Errorf("promotable node %s mismatch at committed lsn %d: got=%v want=%v", id, committed, got, want)
}
}
primary := c.Primary()
if primary != nil && primary.Running && primary.Epoch == c.Coordinator.Epoch {
got := primary.Storage.StateAt(committed)
if !EqualState(got, want) {
return fmt.Errorf("primary %s mismatch at committed lsn %d: got=%v want=%v", primary.ID, committed, got, want)
}
}
return nil
}

43
sw-block/prototype/distsim/random_test.go

@@ -0,0 +1,43 @@
package distsim
import "testing"
func TestRandomScenarioSeeds(t *testing.T) {
seeds := []int64{
1, 2, 3, 4, 5,
11, 21, 34, 55, 89,
101, 202, 303, 404, 505,
}
for _, seed := range seeds {
seed := seed
t.Run("seed_"+itoa64(seed), func(t *testing.T) {
t.Parallel()
if _, err := RunRandomScenario(seed, 60); err != nil {
t.Fatal(err)
}
})
}
}
// itoa64 is a dependency-free equivalent of strconv.FormatInt(v, 10).
func itoa64(v int64) string {
if v == 0 {
return "0"
}
neg := v < 0
if neg {
v = -v
}
buf := make([]byte, 0, 20)
for v > 0 {
buf = append(buf, byte('0'+v%10))
v /= 10
}
if neg {
buf = append(buf, '-')
}
for i, j := 0, len(buf)-1; i < j; i, j = i+1, j-1 {
buf[i], buf[j] = buf[j], buf[i]
}
return string(buf)
}

95
sw-block/prototype/distsim/reference.go

@@ -0,0 +1,95 @@
package distsim
type Write struct {
LSN uint64
Block uint64
Value uint64
}
type Snapshot struct {
LSN uint64
State map[uint64]uint64
}
type Reference struct {
writes []Write
snapshots map[uint64]Snapshot
}
func NewReference() *Reference {
return &Reference{snapshots: map[uint64]Snapshot{}}
}
func (r *Reference) Apply(w Write) {
r.writes = append(r.writes, w)
}
// StateAt folds all writes with LSN <= lsn into a block->value map,
// assuming writes were applied in ascending LSN order.
func (r *Reference) StateAt(lsn uint64) map[uint64]uint64 {
state := make(map[uint64]uint64)
for _, w := range r.writes {
if w.LSN > lsn {
break
}
state[w.Block] = w.Value
}
return state
}
func cloneMap(in map[uint64]uint64) map[uint64]uint64 {
out := make(map[uint64]uint64, len(in))
for k, v := range in {
out[k] = v
}
return out
}
func (r *Reference) TakeSnapshot(lsn uint64) Snapshot {
s := Snapshot{LSN: lsn, State: cloneMap(r.StateAt(lsn))}
r.snapshots[lsn] = s
return s
}
func (r *Reference) SnapshotAt(lsn uint64) (Snapshot, bool) {
s, ok := r.snapshots[lsn]
return s, ok
}
type Node struct {
Extent map[uint64]uint64
}
func NewNode() *Node {
return &Node{Extent: map[uint64]uint64{}}
}
func (n *Node) ApplyWrite(w Write) {
n.Extent[w.Block] = w.Value
}
func (n *Node) LoadSnapshot(s Snapshot) {
n.Extent = cloneMap(s.State)
}
// ReplayFromWrites applies every write with LSN in (startExclusive, endInclusive],
// assuming the slice is sorted by ascending LSN.
func (n *Node) ReplayFromWrites(writes []Write, startExclusive, endInclusive uint64) {
for _, w := range writes {
if w.LSN <= startExclusive {
continue
}
if w.LSN > endInclusive {
break
}
n.ApplyWrite(w)
}
}
func EqualState(a, b map[uint64]uint64) bool {
if len(a) != len(b) {
return false
}
for k, v := range a {
if b[k] != v {
return false
}
}
return true
}

66
sw-block/prototype/distsim/reference_test.go

@@ -0,0 +1,66 @@
package distsim
import "testing"
func TestWALReplayPreservesHistoricalValue(t *testing.T) {
ref := NewReference()
ref.Apply(Write{LSN: 10, Block: 7, Value: 10})
ref.Apply(Write{LSN: 12, Block: 7, Value: 12})
node := NewNode()
node.ReplayFromWrites(ref.writes, 0, 10)
want := ref.StateAt(10)
if !EqualState(node.Extent, want) {
t.Fatalf("replay mismatch: got=%v want=%v", node.Extent, want)
}
}
func TestCurrentExtentCannotRecoverOldLSN(t *testing.T) {
ref := NewReference()
ref.Apply(Write{LSN: 10, Block: 7, Value: 10})
ref.Apply(Write{LSN: 12, Block: 7, Value: 12})
primary := NewNode()
for _, w := range ref.writes {
primary.ApplyWrite(w)
}
wantOld := ref.StateAt(10)
if EqualState(primary.Extent, wantOld) {
t.Fatalf("latest extent should not equal old LSN state: latest=%v old=%v", primary.Extent, wantOld)
}
}
func TestSnapshotAtCpLSNRecoversCorrectHistoricalValue(t *testing.T) {
ref := NewReference()
ref.Apply(Write{LSN: 10, Block: 7, Value: 10})
snap := ref.TakeSnapshot(10)
ref.Apply(Write{LSN: 12, Block: 7, Value: 12})
node := NewNode()
node.LoadSnapshot(snap)
want := ref.StateAt(10)
if !EqualState(node.Extent, want) {
t.Fatalf("snapshot mismatch: got=%v want=%v", node.Extent, want)
}
}
func TestSnapshotPlusTrailingReplayReachesTargetLSN(t *testing.T) {
ref := NewReference()
ref.Apply(Write{LSN: 10, Block: 7, Value: 10})
ref.Apply(Write{LSN: 11, Block: 2, Value: 11})
snap := ref.TakeSnapshot(11)
ref.Apply(Write{LSN: 12, Block: 7, Value: 12})
ref.Apply(Write{LSN: 13, Block: 9, Value: 13})
node := NewNode()
node.LoadSnapshot(snap)
node.ReplayFromWrites(ref.writes, 11, 13)
want := ref.StateAt(13)
if !EqualState(node.Extent, want) {
t.Fatalf("snapshot+replay mismatch: got=%v want=%v", node.Extent, want)
}
}

581
sw-block/prototype/distsim/simulator.go

@@ -0,0 +1,581 @@
package distsim
import (
"container/heap"
"fmt"
"math/rand"
"strings"
)
// --- Event types ---
type EventKind int
const (
EvWriteStart EventKind = iota // client writes to primary
EvShipEntry // primary sends WAL entry to replica
EvShipDeliver // entry arrives at replica
EvBarrierSend // primary sends barrier to replica
EvBarrierDeliver // barrier arrives at replica
EvBarrierFsync // replica fsync completes
EvBarrierAck // ack arrives back at primary
EvNodeCrash // node crashes
EvNodeRestart // node restarts
EvLinkDown // network link drops
EvLinkUp // network link restores
EvFlusherTick // flusher checkpoint cycle
EvPromote // coordinator promotes a node
EvLockAcquire // thread tries to acquire lock
EvLockRelease // thread releases lock
)
func (k EventKind) String() string {
names := [...]string{
"WriteStart", "ShipEntry", "ShipDeliver",
"BarrierSend", "BarrierDeliver", "BarrierFsync", "BarrierAck",
"NodeCrash", "NodeRestart", "LinkDown", "LinkUp",
"FlusherTick", "Promote",
"LockAcquire", "LockRelease",
}
if int(k) < len(names) {
return names[k]
}
return fmt.Sprintf("Event(%d)", k)
}
type Event struct {
Time uint64
ID uint64 // unique, for stable ordering
Kind EventKind
NodeID string
Payload EventPayload
}
type EventPayload struct {
Write Write // for WriteStart, ShipEntry, ShipDeliver
TargetLSN uint64 // for barriers
FromNode string // for delivered messages
ToNode string
LockName string // for lock events
ThreadID string
PromoteID string // for EvPromote
}
// --- Priority queue ---
type eventHeap []Event
func (h eventHeap) Len() int { return len(h) }
func (h eventHeap) Swap(i, j int) { h[i], h[j] = h[j], h[i] }
func (h eventHeap) Less(i, j int) bool {
if h[i].Time != h[j].Time {
return h[i].Time < h[j].Time
}
return h[i].ID < h[j].ID // stable tie-break
}
func (h *eventHeap) Push(x interface{}) { *h = append(*h, x.(Event)) }
func (h *eventHeap) Pop() interface{} {
old := *h
n := len(old)
e := old[n-1]
*h = old[:n-1]
return e
}
// --- Lock model ---
type lockState struct {
held bool
holder string // threadID
waiting []Event // parked EvLockAcquire events
}
// --- Trace ---
type TraceEntry struct {
Time uint64
Event Event
Note string
}
// --- Simulator ---
type Simulator struct {
Cluster *Cluster
rng *rand.Rand
queue eventHeap
nextID uint64
locks map[string]*lockState // lockName -> state
trace []TraceEntry
Errors []string
maxTime uint64
jitterMax uint64 // max random delay added to message delivery
// Config
FaultRate float64 // probability of injecting a fault per step [0,1]
MaxEvents int // stop after this many events
eventsRun int
}
func NewSimulator(cluster *Cluster, seed int64) *Simulator {
return &Simulator{
Cluster: cluster,
rng: rand.New(rand.NewSource(seed)),
locks: map[string]*lockState{},
maxTime: 100000,
jitterMax: 3,
FaultRate: 0.05,
MaxEvents: 5000,
}
}
// Enqueue adds an event to the priority queue.
func (s *Simulator) Enqueue(e Event) {
s.nextID++
e.ID = s.nextID
heap.Push(&s.queue, e)
}
// EnqueueAt is a convenience for enqueueing at a specific time.
func (s *Simulator) EnqueueAt(time uint64, kind EventKind, nodeID string, payload EventPayload) {
s.Enqueue(Event{Time: time, Kind: kind, NodeID: nodeID, Payload: payload})
}
// jitter returns a random delay in [1, jitterMax].
func (s *Simulator) jitter() uint64 {
if s.jitterMax <= 1 {
return 1
}
return 1 + uint64(s.rng.Int63n(int64(s.jitterMax)))
}
// --- Main loop ---
// Step executes the next event. It returns false when the queue is empty,
// a time or event limit is reached, or an invariant has been violated.
// When multiple events share the same timestamp, one is chosen randomly
// to explore different interleavings across runs with different seeds.
func (s *Simulator) Step() bool {
if s.queue.Len() == 0 || s.eventsRun >= s.MaxEvents {
return false
}
// Collect all events at the earliest timestamp.
earliest := s.queue[0].Time
if earliest > s.maxTime {
return false
}
var ready []Event
for s.queue.Len() > 0 && s.queue[0].Time == earliest {
ready = append(ready, heap.Pop(&s.queue).(Event))
}
// Shuffle to randomize interleaving of equal-time events.
s.rng.Shuffle(len(ready), func(i, j int) { ready[i], ready[j] = ready[j], ready[i] })
// Execute the first, re-enqueue the rest.
e := ready[0]
for _, r := range ready[1:] {
heap.Push(&s.queue, r)
}
s.Cluster.Now = e.Time
s.eventsRun++
s.execute(e)
s.checkInvariants(e)
return len(s.Errors) == 0
}
// Run executes until queue empty, limit reached, or invariant violated.
func (s *Simulator) Run() {
for s.Step() {
}
}
// --- Event execution ---
func (s *Simulator) execute(e Event) {
node := s.Cluster.Nodes[e.NodeID]
switch e.Kind {
case EvWriteStart:
s.executeWriteStart(e)
case EvShipEntry:
// Primary ships entry to a replica. Enqueue delivery with jitter.
if node != nil && node.Running {
deliverTime := s.Cluster.Now + s.jitter()
s.EnqueueAt(deliverTime, EvShipDeliver, e.Payload.ToNode, EventPayload{
Write: e.Payload.Write,
FromNode: e.NodeID,
})
s.record(e, fmt.Sprintf("ship LSN=%d to %s, deliver@%d", e.Payload.Write.LSN, e.Payload.ToNode, deliverTime))
}
case EvShipDeliver:
if node != nil && node.Running && node.Epoch == s.Cluster.Coordinator.Epoch {
if s.Cluster.Links[e.Payload.FromNode] != nil && s.Cluster.Links[e.Payload.FromNode][e.NodeID] {
node.Storage.AppendWrite(e.Payload.Write)
s.record(e, fmt.Sprintf("deliver LSN=%d on %s, receivedLSN=%d", e.Payload.Write.LSN, e.NodeID, node.Storage.ReceivedLSN))
} else {
s.record(e, fmt.Sprintf("drop LSN=%d to %s (link down)", e.Payload.Write.LSN, e.NodeID))
}
}
case EvBarrierSend:
if node != nil && node.Running {
deliverTime := s.Cluster.Now + s.jitter()
s.EnqueueAt(deliverTime, EvBarrierDeliver, e.Payload.ToNode, EventPayload{
TargetLSN: e.Payload.TargetLSN,
FromNode: e.NodeID,
})
s.record(e, fmt.Sprintf("barrier LSN=%d to %s", e.Payload.TargetLSN, e.Payload.ToNode))
}
case EvBarrierDeliver:
if node != nil && node.Running && node.Epoch == s.Cluster.Coordinator.Epoch {
if s.Cluster.Links[e.Payload.FromNode] != nil && s.Cluster.Links[e.Payload.FromNode][e.NodeID] {
if node.Storage.ReceivedLSN >= e.Payload.TargetLSN {
// Can fsync now. Enqueue fsync completion with small delay.
s.EnqueueAt(s.Cluster.Now+1, EvBarrierFsync, e.NodeID, EventPayload{
TargetLSN: e.Payload.TargetLSN,
FromNode: e.Payload.FromNode,
})
s.record(e, fmt.Sprintf("barrier deliver LSN=%d, fsync scheduled", e.Payload.TargetLSN))
} else {
// Not enough entries yet. Re-enqueue barrier with delay (retry).
s.EnqueueAt(s.Cluster.Now+1, EvBarrierDeliver, e.NodeID, e.Payload)
s.record(e, fmt.Sprintf("barrier LSN=%d waiting (received=%d)", e.Payload.TargetLSN, node.Storage.ReceivedLSN))
}
}
}
case EvBarrierFsync:
if node != nil && node.Running {
node.Storage.AdvanceFlush(e.Payload.TargetLSN)
// Send ack back to primary.
deliverTime := s.Cluster.Now + s.jitter()
s.EnqueueAt(deliverTime, EvBarrierAck, e.Payload.FromNode, EventPayload{
TargetLSN: e.Payload.TargetLSN,
FromNode: e.NodeID,
})
s.record(e, fmt.Sprintf("fsync LSN=%d on %s, flushedLSN=%d", e.Payload.TargetLSN, e.NodeID, node.Storage.FlushedLSN))
}
case EvBarrierAck:
// Only process acks on running nodes in the current epoch.
// After crash+promote, stale acks for the old primary must not advance commits.
if node != nil && node.Running && node.Epoch == s.Cluster.Coordinator.Epoch {
if pending := s.Cluster.Pending[e.Payload.TargetLSN]; pending != nil {
pending.DurableOn[e.Payload.FromNode] = true
s.Cluster.refreshCommits()
s.record(e, fmt.Sprintf("ack LSN=%d from %s, durable=%d", e.Payload.TargetLSN, e.Payload.FromNode, s.Cluster.durableAckCount(pending)))
}
} else {
s.record(e, fmt.Sprintf("ack LSN=%d from %s DROPPED (node down or stale epoch)", e.Payload.TargetLSN, e.Payload.FromNode))
}
case EvNodeCrash:
if node != nil {
node.Running = false
// Drop all pending events for this node.
s.dropEventsForNode(e.NodeID)
s.record(e, fmt.Sprintf("CRASH %s", e.NodeID))
}
case EvNodeRestart:
if node != nil {
node.Running = true
node.Epoch = s.Cluster.Coordinator.Epoch
s.record(e, fmt.Sprintf("RESTART %s epoch=%d", e.NodeID, node.Epoch))
}
case EvLinkDown:
s.Cluster.Disconnect(e.Payload.FromNode, e.Payload.ToNode)
s.Cluster.Disconnect(e.Payload.ToNode, e.Payload.FromNode)
s.record(e, fmt.Sprintf("LINK DOWN %s <-> %s", e.Payload.FromNode, e.Payload.ToNode))
case EvLinkUp:
s.Cluster.Connect(e.Payload.FromNode, e.Payload.ToNode)
s.Cluster.Connect(e.Payload.ToNode, e.Payload.FromNode)
s.record(e, fmt.Sprintf("LINK UP %s <-> %s", e.Payload.FromNode, e.Payload.ToNode))
case EvFlusherTick:
if node != nil && node.Running {
node.Storage.AdvanceCheckpoint(node.Storage.FlushedLSN)
s.record(e, fmt.Sprintf("flusher tick %s checkpoint=%d", e.NodeID, node.Storage.CheckpointLSN))
}
case EvPromote:
if err := s.Cluster.Promote(e.Payload.PromoteID); err != nil {
s.record(e, fmt.Sprintf("promote %s FAILED: %v", e.Payload.PromoteID, err))
} else {
s.record(e, fmt.Sprintf("PROMOTE %s epoch=%d", e.Payload.PromoteID, s.Cluster.Coordinator.Epoch))
}
case EvLockAcquire:
s.executeLockAcquire(e)
case EvLockRelease:
s.executeLockRelease(e)
}
}
func (s *Simulator) executeWriteStart(e Event) {
c := s.Cluster
primary := c.Primary()
if primary == nil || !primary.Running || primary.Epoch != c.Coordinator.Epoch {
s.record(e, "write rejected: no valid primary")
return
}
c.nextLSN++
w := Write{LSN: c.nextLSN, Block: e.Payload.Write.Block, Value: c.nextLSN}
primary.Storage.AppendWrite(w)
primary.Storage.AdvanceFlush(w.LSN)
c.Reference.Apply(w)
c.Pending[w.LSN] = &PendingCommit{
Write: w,
DurableOn: map[string]bool{primary.ID: true},
}
c.refreshCommits()
// Ship to each replica with jitter.
for _, rid := range c.replicaIDs() {
shipTime := s.Cluster.Now + s.jitter()
s.EnqueueAt(shipTime, EvShipEntry, primary.ID, EventPayload{
Write: w,
ToNode: rid,
})
}
// Barrier after ship.
for _, rid := range c.replicaIDs() {
barrierTime := s.Cluster.Now + s.jitter() + 2
s.EnqueueAt(barrierTime, EvBarrierSend, primary.ID, EventPayload{
TargetLSN: w.LSN,
ToNode: rid,
})
}
s.record(e, fmt.Sprintf("write block=%d LSN=%d", w.Block, w.LSN))
}
func (s *Simulator) executeLockAcquire(e Event) {
name := e.Payload.LockName
ls, ok := s.locks[name]
if !ok {
ls = &lockState{}
s.locks[name] = ls
}
if !ls.held {
ls.held = true
ls.holder = e.Payload.ThreadID
s.record(e, fmt.Sprintf("lock %s acquired by %s", name, e.Payload.ThreadID))
} else {
// Park — will be released when current holder releases.
ls.waiting = append(ls.waiting, e)
s.record(e, fmt.Sprintf("lock %s BLOCKED %s (held by %s)", name, e.Payload.ThreadID, ls.holder))
}
}
func (s *Simulator) executeLockRelease(e Event) {
name := e.Payload.LockName
ls := s.locks[name]
if ls == nil || !ls.held {
return
}
// Validate: only the holder can release.
if ls.holder != e.Payload.ThreadID {
s.record(e, fmt.Sprintf("lock %s release REJECTED: %s is not holder (held by %s)", name, e.Payload.ThreadID, ls.holder))
return
}
s.record(e, fmt.Sprintf("lock %s released by %s", name, ls.holder))
ls.held = false
ls.holder = ""
// Grant to next waiter (random pick among waiters for interleaving exploration).
if len(ls.waiting) > 0 {
idx := s.rng.Intn(len(ls.waiting))
next := ls.waiting[idx]
ls.waiting = append(ls.waiting[:idx], ls.waiting[idx+1:]...)
ls.held = true
ls.holder = next.Payload.ThreadID
s.record(next, fmt.Sprintf("lock %s granted to %s (was waiting)", name, next.Payload.ThreadID))
}
}
func (s *Simulator) dropEventsForNode(nodeID string) {
var kept eventHeap
for _, e := range s.queue {
if e.NodeID != nodeID {
kept = append(kept, e)
}
}
s.queue = kept
heap.Init(&s.queue)
}
// --- Invariant checking ---
func (s *Simulator) checkInvariants(after Event) {
// 1. Commit safety: committed LSN must be durable on policy-required nodes.
for lsn := uint64(1); lsn <= s.Cluster.Coordinator.CommittedLSN; lsn++ {
p := s.Cluster.Pending[lsn]
if p == nil {
continue
}
if !s.Cluster.commitSatisfied(p) {
s.addError(after, fmt.Sprintf("committed LSN %d not durable per policy", lsn))
}
}
// 2. No false commit on promoted node.
primary := s.Cluster.Primary()
if primary != nil && primary.Running {
committedLSN := s.Cluster.Coordinator.CommittedLSN
for lsn := committedLSN + 1; lsn <= s.Cluster.nextLSN; lsn++ {
p := s.Cluster.Pending[lsn]
if p != nil && !p.Committed && p.DurableOn[primary.ID] {
// Uncommitted-but-durable is expected on the original primary and only
// suspect after a promotion; this walk documents the condition but
// deliberately asserts nothing yet.
}
}
}
// 3. Data correctness: primary state matches reference at the LSN it actually has.
// After promotion, the new primary may not have all writes the old primary committed.
// Verify correctness only up to what the current primary has durably received.
if primary != nil && primary.Running {
checkLSN := primary.Storage.FlushedLSN
if checkLSN > s.Cluster.Coordinator.CommittedLSN {
checkLSN = s.Cluster.Coordinator.CommittedLSN
}
if checkLSN > 0 {
refState := s.Cluster.Reference.StateAt(checkLSN)
nodeState := primary.Storage.StateAt(checkLSN)
if !EqualState(refState, nodeState) {
s.addError(after, fmt.Sprintf("data divergence on primary %s at LSN=%d",
primary.ID, checkLSN))
}
}
}
// 4. Epoch fencing: no node has accepted a stale epoch.
for id, node := range s.Cluster.Nodes {
if node.Running && node.Epoch > s.Cluster.Coordinator.Epoch {
s.addError(after, fmt.Sprintf("node %s has future epoch %d > coordinator %d", id, node.Epoch, s.Cluster.Coordinator.Epoch))
}
}
// 5. Lock consistency: a held lock must always have a recorded holder.
for name, ls := range s.locks {
if ls.held && ls.holder == "" {
s.addError(after, fmt.Sprintf("lock %s held but no holder", name))
}
}
}
func (s *Simulator) addError(after Event, msg string) {
s.Errors = append(s.Errors, fmt.Sprintf("t=%d after %s on %s: %s",
after.Time, after.Kind, after.NodeID, msg))
}
func (s *Simulator) record(e Event, note string) {
s.trace = append(s.trace, TraceEntry{Time: e.Time, Event: e, Note: note})
}
// --- Random fault injection ---
// InjectRandomFault schedules a random fault (crash, partition, heal)
// at a random future time within [Now+1, Now+spread).
func (s *Simulator) InjectRandomFault() {
s.InjectRandomFaultWithin(30)
}
// InjectRandomFaultWithin schedules a random fault at a random time
// within [Now+1, Now+spread).
func (s *Simulator) InjectRandomFaultWithin(spread uint64) {
if s.rng.Float64() > s.FaultRate {
return
}
members := s.Cluster.Coordinator.Members
if len(members) == 0 {
return
}
faultTime := s.Cluster.Now + 1 + uint64(s.rng.Int63n(int64(spread)))
switch s.rng.Intn(3) {
case 0: // crash a random node
id := members[s.rng.Intn(len(members))]
s.EnqueueAt(faultTime, EvNodeCrash, id, EventPayload{})
case 1: // drop a link
from := members[s.rng.Intn(len(members))]
to := members[s.rng.Intn(len(members))]
if from != to {
s.EnqueueAt(faultTime, EvLinkDown, from, EventPayload{FromNode: from, ToNode: to})
}
case 2: // restore a link
from := members[s.rng.Intn(len(members))]
to := members[s.rng.Intn(len(members))]
if from != to {
s.EnqueueAt(faultTime, EvLinkUp, from, EventPayload{FromNode: from, ToNode: to})
}
}
}
// --- Scenario helpers ---
// ScheduleWrites enqueues n writes at random times in [start, start+spread).
func (s *Simulator) ScheduleWrites(n int, start, spread uint64) {
for i := 0; i < n; i++ {
t := start + uint64(s.rng.Int63n(int64(spread)))
block := uint64(s.rng.Intn(16))
s.EnqueueAt(t, EvWriteStart, s.Cluster.Coordinator.PrimaryID, EventPayload{
Write: Write{Block: block},
})
}
}
// ScheduleCrashAndPromote enqueues a primary crash at crashTime and promotes promoteID at promoteTime.
func (s *Simulator) ScheduleCrashAndPromote(crashTime uint64, promoteID string, promoteTime uint64) {
s.EnqueueAt(crashTime, EvNodeCrash, s.Cluster.Coordinator.PrimaryID, EventPayload{})
s.EnqueueAt(promoteTime, EvPromote, "", EventPayload{PromoteID: promoteID})
}
// ScheduleFlusherTicks enqueues periodic flusher ticks for a node.
func (s *Simulator) ScheduleFlusherTicks(nodeID string, start, interval uint64, count int) {
for i := 0; i < count; i++ {
s.EnqueueAt(start+uint64(i)*interval, EvFlusherTick, nodeID, EventPayload{})
}
}
// --- Output ---
// TraceString returns the full trace as a string.
func (s *Simulator) TraceString() string {
var sb strings.Builder
for _, te := range s.trace {
fmt.Fprintf(&sb, "[t=%d] %s on %s: %s\n", te.Time, te.Event.Kind, te.Event.NodeID, te.Note)
}
return sb.String()
}
// ErrorString returns all errors.
func (s *Simulator) ErrorString() string {
return strings.Join(s.Errors, "\n")
}
// AssertCommittedDataCorrect checks that the current primary's state matches the reference.
func (s *Simulator) AssertCommittedDataCorrect() error {
primary := s.Cluster.Primary()
if primary == nil {
return fmt.Errorf("no primary")
}
committedLSN := s.Cluster.Coordinator.CommittedLSN
if committedLSN == 0 {
return nil
}
refState := s.Cluster.Reference.StateAt(committedLSN)
nodeState := primary.Storage.StateAt(committedLSN)
if !EqualState(refState, nodeState) {
return fmt.Errorf("data divergence on %s at LSN=%d: ref=%v node=%v",
primary.ID, committedLSN, refState, nodeState)
}
return nil
}

285
sw-block/prototype/distsim/simulator_test.go

@@ -0,0 +1,285 @@
package distsim
import (
"fmt"
"strings"
"testing"
)
// --- Fixed scenarios ---
func TestSim_BasicWriteAndCommit(t *testing.T) {
c := NewCluster(CommitSyncQuorum, "p", "r1", "r2")
sim := NewSimulator(c, 42)
sim.ScheduleWrites(3, 1, 5)
sim.Run()
if c.Coordinator.CommittedLSN < 1 {
t.Fatalf("expected at least 1 committed write, got %d", c.Coordinator.CommittedLSN)
}
if err := sim.AssertCommittedDataCorrect(); err != nil {
t.Fatal(err)
}
if len(sim.Errors) > 0 {
t.Fatalf("invariant violations:\n%s", sim.ErrorString())
}
}
func TestSim_CrashAfterCommit_DataSurvives(t *testing.T) {
c := NewCluster(CommitSyncQuorum, "p", "r1", "r2")
sim := NewSimulator(c, 99)
// Write, let it commit, then crash primary, promote r1.
sim.ScheduleWrites(5, 1, 3)
sim.ScheduleCrashAndPromote(20, "r1", 22)
sim.Run()
if len(sim.Errors) > 0 {
t.Fatalf("invariant violations:\n%s\nTrace:\n%s", sim.ErrorString(), sim.TraceString())
}
if err := sim.AssertCommittedDataCorrect(); err != nil {
t.Fatal(err)
}
}
func TestSim_PartitionThenHeal(t *testing.T) {
c := NewCluster(CommitSyncQuorum, "p", "r1", "r2")
sim := NewSimulator(c, 777)
// Write some data.
sim.ScheduleWrites(3, 1, 3)
// Partition r2 at time 5.
sim.EnqueueAt(5, EvLinkDown, "p", EventPayload{FromNode: "p", ToNode: "r2"})
// Write more during partition.
sim.ScheduleWrites(3, 8, 3)
// Heal at time 15.
sim.EnqueueAt(15, EvLinkUp, "p", EventPayload{FromNode: "p", ToNode: "r2"})
// Write after heal.
sim.ScheduleWrites(2, 18, 3)
sim.Run()
if len(sim.Errors) > 0 {
t.Fatalf("invariant violations:\n%s", sim.ErrorString())
}
if err := sim.AssertCommittedDataCorrect(); err != nil {
t.Fatal(err)
}
}
func TestSim_SyncAll_UncommittedNotVisible(t *testing.T) {
c := NewCluster(CommitSyncAll, "p", "r1")
sim := NewSimulator(c, 123)
// Partition r1 so nothing can commit under sync_all.
sim.EnqueueAt(0, EvLinkDown, "p", EventPayload{FromNode: "p", ToNode: "r1"})
sim.ScheduleWrites(3, 1, 3)
sim.Run()
// Nothing should be committed.
if c.Coordinator.CommittedLSN != 0 {
t.Fatalf("sync_all with partitioned replica should not commit, got %d", c.Coordinator.CommittedLSN)
}
if len(sim.Errors) > 0 {
t.Fatalf("invariant violations:\n%s", sim.ErrorString())
}
}
func TestSim_MessageReorderingDoesNotBreakSafety(t *testing.T) {
c := NewCluster(CommitSyncQuorum, "p", "r1", "r2")
sim := NewSimulator(c, 555)
sim.jitterMax = 8 // high jitter to force reordering
sim.ScheduleWrites(10, 1, 5)
sim.Run()
if len(sim.Errors) > 0 {
t.Fatalf("invariant violations with high jitter:\n%s", sim.ErrorString())
}
if err := sim.AssertCommittedDataCorrect(); err != nil {
t.Fatal(err)
}
}
// --- Randomized property-based testing ---
func TestSim_Randomized_CommitSafety(t *testing.T) {
const numSeeds = 500
const numWrites = 20
failures := 0
for seed := int64(0); seed < numSeeds; seed++ {
c := NewCluster(CommitSyncQuorum, "p", "r1", "r2")
sim := NewSimulator(c, seed)
sim.MaxEvents = 2000
// Random writes.
sim.ScheduleWrites(numWrites, 1, 30)
// Random crash + promote somewhere in the middle.
crashTime := uint64(sim.rng.Intn(25) + 5)
sim.ScheduleCrashAndPromote(crashTime, "r1", crashTime+3)
sim.Run()
if len(sim.Errors) > 0 {
t.Errorf("seed %d: invariant violation:\n%s\nTrace (last 20):\n%s",
seed, sim.ErrorString(), lastN(sim.trace, 20))
failures++
if failures >= 3 {
t.Fatal("too many failures, stopping")
}
}
}
t.Logf("randomized: %d/%d seeds passed", numSeeds-failures, numSeeds)
}
func TestSim_Randomized_WithFaults(t *testing.T) {
const numSeeds = 300
failures := 0
for seed := int64(0); seed < numSeeds; seed++ {
c := NewCluster(CommitSyncQuorum, "p", "r1", "r2")
sim := NewSimulator(c, seed)
sim.FaultRate = 0.08
sim.MaxEvents = 1500
sim.jitterMax = 5
// Interleave writes and random faults.
for i := 0; i < 15; i++ {
t := uint64(i*3 + 1)
sim.EnqueueAt(t, EvWriteStart, "p", EventPayload{
Write: Write{Block: uint64(sim.rng.Intn(8))},
})
sim.InjectRandomFault()
}
sim.Run()
if len(sim.Errors) > 0 {
t.Errorf("seed %d: invariant violation:\n%s", seed, sim.ErrorString())
failures++
if failures >= 3 {
t.Fatal("too many failures, stopping")
}
}
}
t.Logf("randomized+faults: %d/%d seeds passed", numSeeds-failures, numSeeds)
}
func TestSim_Randomized_SyncAll(t *testing.T) {
const numSeeds = 200
failures := 0
for seed := int64(0); seed < numSeeds; seed++ {
c := NewCluster(CommitSyncAll, "p", "r1")
sim := NewSimulator(c, seed)
sim.MaxEvents = 1000
sim.ScheduleWrites(10, 1, 20)
// Random partition/heal.
if sim.rng.Float64() < 0.5 {
pTime := uint64(sim.rng.Intn(15) + 1)
sim.EnqueueAt(pTime, EvLinkDown, "p", EventPayload{FromNode: "p", ToNode: "r1"})
sim.EnqueueAt(pTime+uint64(sim.rng.Intn(10)+3), EvLinkUp, "p", EventPayload{FromNode: "p", ToNode: "r1"})
}
sim.Run()
if len(sim.Errors) > 0 {
t.Errorf("seed %d: invariant violation:\n%s", seed, sim.ErrorString())
failures++
if failures >= 3 {
t.Fatal("too many failures, stopping")
}
}
}
t.Logf("sync_all randomized: %d/%d seeds passed", numSeeds-failures, numSeeds)
}
// --- Lock contention tests ---
func TestSim_LockContention_NoDoubleHold(t *testing.T) {
c := NewCluster(CommitSyncQuorum, "p", "r1", "r2")
sim := NewSimulator(c, 42)
// Two threads try to acquire the same lock at the same time.
sim.EnqueueAt(5, EvLockAcquire, "p", EventPayload{LockName: "shipMu", ThreadID: "writer-1"})
sim.EnqueueAt(5, EvLockAcquire, "p", EventPayload{LockName: "shipMu", ThreadID: "writer-2"})
// First release.
sim.EnqueueAt(8, EvLockRelease, "p", EventPayload{LockName: "shipMu", ThreadID: "writer-1"})
// Second release (whoever got granted after writer-1 releases).
sim.EnqueueAt(11, EvLockRelease, "p", EventPayload{LockName: "shipMu", ThreadID: "writer-2"})
sim.Run()
if len(sim.Errors) > 0 {
t.Fatalf("lock invariant violated:\n%s\nTrace:\n%s", sim.ErrorString(), sim.TraceString())
}
// Verify the trace shows one blocked, one granted.
trace := sim.TraceString()
if !containsStr(trace, "BLOCKED") {
t.Fatal("expected one thread to be BLOCKED on lock contention")
}
if !containsStr(trace, "granted to") {
t.Fatal("expected blocked thread to be granted after release")
}
}
func TestSim_LockContention_Randomized(t *testing.T) {
// Run many seeds with concurrent lock acquires at the same time.
// The simulator should pick a random winner each time (seed-dependent).
winners := map[string]int{}
for seed := int64(0); seed < 100; seed++ {
c := NewCluster(CommitSyncQuorum, "p", "r1", "r2")
sim := NewSimulator(c, seed)
sim.EnqueueAt(1, EvLockAcquire, "p", EventPayload{LockName: "mu", ThreadID: "A"})
sim.EnqueueAt(1, EvLockAcquire, "p", EventPayload{LockName: "mu", ThreadID: "B"})
sim.EnqueueAt(3, EvLockRelease, "p", EventPayload{LockName: "mu", ThreadID: "A"})
sim.EnqueueAt(3, EvLockRelease, "p", EventPayload{LockName: "mu", ThreadID: "B"})
sim.Run()
if len(sim.Errors) > 0 {
t.Fatalf("seed %d: %s", seed, sim.ErrorString())
}
// Check who got the lock first by looking at the trace.
for _, te := range sim.trace {
if te.Event.Kind == EvLockAcquire && containsStr(te.Note, "acquired") {
winners[te.Event.Payload.ThreadID]++
break
}
}
}
// Both threads should win at least some seeds (randomization works).
if winners["A"] == 0 || winners["B"] == 0 {
t.Fatalf("lock winner not randomized: A=%d B=%d", winners["A"], winners["B"])
}
t.Logf("lock winner distribution: A=%d B=%d", winners["A"], winners["B"])
}
// containsStr reports whether substr occurs within s.
func containsStr(s, substr string) bool {
	return strings.Contains(s, substr)
}
// --- Helpers ---
func lastN(trace []TraceEntry, n int) string {
	start := len(trace) - n
	if start < 0 {
		start = 0
	}
	var b strings.Builder
	for _, te := range trace[start:] {
		fmt.Fprintf(&b, "[t=%d] %s on %s: %s\n", te.Time, te.Event.Kind, te.Event.NodeID, te.Note)
	}
	return b.String()
}

129
sw-block/prototype/distsim/storage.go

@ -0,0 +1,129 @@
package distsim
import "sort"
type SnapshotState struct {
ID string
LSN uint64
State map[uint64]uint64
}
type Storage struct {
WAL []Write
Extent map[uint64]uint64
ReceivedLSN uint64
FlushedLSN uint64
CheckpointLSN uint64
Snapshots map[string]SnapshotState
BaseSnapshot *SnapshotState
}
func NewStorage() *Storage {
return &Storage{
Extent: map[uint64]uint64{},
Snapshots: map[string]SnapshotState{},
}
}
func (s *Storage) AppendWrite(w Write) {
	// Insert in LSN order (handles out-of-order delivery from jitter).
	inserted := false
	for i, existing := range s.WAL {
		if w.LSN == existing.LSN {
			return // duplicate, skip
		}
		if w.LSN < existing.LSN {
			s.WAL = append(s.WAL[:i], append([]Write{w}, s.WAL[i:]...)...)
			inserted = true
			break
		}
	}
	if !inserted {
		s.WAL = append(s.WAL, w)
	}
	if w.LSN >= s.ReceivedLSN {
		// In-order arrival: apply directly.
		s.ReceivedLSN = w.LSN
		s.Extent[w.Block] = w.Value
		return
	}
	// Out-of-order arrival: replay the block's newest value from the
	// (now ordered) WAL so an older write cannot clobber newer data.
	for i := len(s.WAL) - 1; i >= 0; i-- {
		if s.WAL[i].Block == w.Block {
			s.Extent[w.Block] = s.WAL[i].Value
			break
		}
	}
}
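The ordered-insert loop is the core of the duplicate and jitter handling. A standalone sketch (with a minimal hypothetical `Write` and helper `appendOrdered` mirroring the insertion logic, not the real simulator types) shows LSN order restored and duplicates dropped:

```go
package main

import "fmt"

// Write is a minimal stand-in for the simulator's WAL record.
type Write struct{ LSN, Block, Value uint64 }

// appendOrdered inserts w into an LSN-sorted WAL, skipping duplicate LSNs.
// This mirrors the insertion loop in Storage.AppendWrite.
func appendOrdered(wal []Write, w Write) []Write {
	for i, existing := range wal {
		if w.LSN == existing.LSN {
			return wal // duplicate, skip
		}
		if w.LSN < existing.LSN {
			// Inner append copies wal[i:] into a fresh slice first,
			// so the outer append cannot alias over it.
			return append(wal[:i], append([]Write{w}, wal[i:]...)...)
		}
	}
	return append(wal, w)
}

func main() {
	var wal []Write
	// Deliver out of order: 2, 1, 3, then a duplicate of LSN 2.
	for _, w := range []Write{{2, 0, 20}, {1, 0, 10}, {3, 1, 30}, {2, 0, 99}} {
		wal = appendOrdered(wal, w)
	}
	fmt.Println(wal) // → [{1 0 10} {2 0 20} {3 1 30}]
}
```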
func (s *Storage) AdvanceFlush(lsn uint64) {
if lsn > s.ReceivedLSN {
lsn = s.ReceivedLSN
}
if lsn > s.FlushedLSN {
s.FlushedLSN = lsn
}
}
func (s *Storage) AdvanceCheckpoint(lsn uint64) {
if lsn > s.FlushedLSN {
lsn = s.FlushedLSN
}
if lsn > s.CheckpointLSN {
s.CheckpointLSN = lsn
}
}
func (s *Storage) StateAt(lsn uint64) map[uint64]uint64 {
state := map[uint64]uint64{}
	if s.BaseSnapshot != nil {
		if s.BaseSnapshot.LSN > lsn {
			// The base snapshot is already past the requested LSN; its
			// state is the closest reconstructable view.
			return cloneMap(s.BaseSnapshot.State)
		}
		state = cloneMap(s.BaseSnapshot.State)
	}
for _, w := range s.WAL {
if w.LSN > lsn {
break
}
if s.BaseSnapshot != nil && w.LSN <= s.BaseSnapshot.LSN {
continue
}
state[w.Block] = w.Value
}
return state
}
func (s *Storage) TakeSnapshot(id string, lsn uint64) SnapshotState {
	snap := SnapshotState{
		ID:    id,
		LSN:   lsn,
		State: s.StateAt(lsn), // StateAt already returns a fresh map
	}
s.Snapshots[id] = snap
return snap
}
func (s *Storage) LoadSnapshot(snap SnapshotState) {
s.Extent = cloneMap(snap.State)
s.FlushedLSN = snap.LSN
s.ReceivedLSN = snap.LSN
s.CheckpointLSN = snap.LSN
s.BaseSnapshot = &SnapshotState{
ID: snap.ID,
LSN: snap.LSN,
State: cloneMap(snap.State),
}
s.WAL = nil
}
func (s *Storage) ReplaceWAL(writes []Write) {
s.WAL = append([]Write(nil), writes...)
sort.Slice(s.WAL, func(i, j int) bool { return s.WAL[i].LSN < s.WAL[j].LSN })
s.Extent = s.StateAt(s.ReceivedLSN)
}
// writesInRange returns the writes with startExclusive < LSN <= endInclusive.
// It assumes the input is sorted ascending by LSN and stops at the first
// write past endInclusive.
func writesInRange(writes []Write, startExclusive, endInclusive uint64) []Write {
out := make([]Write, 0)
for _, w := range writes {
if w.LSN <= startExclusive {
continue
}
if w.LSN > endInclusive {
break
}
out = append(out, w)
}
return out
}

64
sw-block/prototype/enginev2/assignment.go

@ -0,0 +1,64 @@
package enginev2
// AssignmentIntent represents a coordinator-driven assignment update.
// It specifies the desired replica set and which replicas need recovery.
type AssignmentIntent struct {
Endpoints map[string]Endpoint // desired replica set
Epoch uint64 // current epoch
RecoveryTargets map[string]SessionKind // replicas that need recovery (nil = no recovery)
}
// AssignmentResult records what the SenderGroup did in response to an assignment.
type AssignmentResult struct {
Added []string // new senders created
Removed []string // old senders stopped
SessionsCreated []string // fresh recovery sessions attached
SessionsSuperseded []string // existing sessions superseded by new ones
SessionsFailed []string // recovery sessions that couldn't be created
}
// ApplyAssignment processes a coordinator assignment intent:
// 1. Reconcile endpoints — add/remove/update senders
// 2. For each recovery target, create a recovery session on the sender
//
// Epoch fencing: if intent.Epoch < sender.Epoch for any target, that target
// is rejected. Stale assignment intent cannot create live sessions.
func (sg *SenderGroup) ApplyAssignment(intent AssignmentIntent) AssignmentResult {
var result AssignmentResult
// Step 1: reconcile topology.
result.Added, result.Removed = sg.Reconcile(intent.Endpoints, intent.Epoch)
// Step 2: create recovery sessions for designated targets.
if intent.RecoveryTargets == nil {
return result
}
sg.mu.RLock()
defer sg.mu.RUnlock()
for replicaID, kind := range intent.RecoveryTargets {
sender, ok := sg.senders[replicaID]
if !ok {
result.SessionsFailed = append(result.SessionsFailed, replicaID)
continue
}
// Reject stale assignment: intent epoch must not be older than the sender's.
if intent.Epoch < sender.Epoch {
result.SessionsFailed = append(result.SessionsFailed, replicaID)
continue
}
_, err := sender.AttachSession(intent.Epoch, kind)
if err != nil {
// Session already active at current epoch — supersede it.
sess := sender.SupersedeSession(kind, "assignment_intent")
if sess != nil {
result.SessionsSuperseded = append(result.SessionsSuperseded, replicaID)
} else {
result.SessionsFailed = append(result.SessionsFailed, replicaID)
}
continue
}
result.SessionsCreated = append(result.SessionsCreated, replicaID)
}
return result
}
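The two failure rules in step 2 (unknown replica, stale epoch) can be modeled in isolation. A minimal sketch, with a hypothetical `group` and `admitTargets` standing in for the real SenderGroup and ApplyAssignment:

```go
package main

import "fmt"

// group is a hypothetical stand-in for SenderGroup: all that matters
// for fencing is each replica's current sender epoch.
type group struct{ epochs map[string]uint64 }

// admitTargets applies ApplyAssignment's per-target failure rules:
// a target fails if the replica has no sender, or if the intent's
// epoch is older than that sender's epoch.
func (g *group) admitTargets(intentEpoch uint64, targets []string) (created, failed []string) {
	for _, id := range targets {
		senderEpoch, ok := g.epochs[id]
		if !ok || intentEpoch < senderEpoch {
			failed = append(failed, id)
			continue
		}
		created = append(created, id)
	}
	return created, failed
}

func main() {
	g := &group{epochs: map[string]uint64{"r1:9333": 2}}
	// A stale epoch-1 intent against an epoch-2 sender fails;
	// so does a target the group has never seen.
	created, failed := g.admitTargets(1, []string{"r1:9333", "r99:9333"})
	fmt.Println(created, failed) // → [] [r1:9333 r99:9333]
}
```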

420
sw-block/prototype/enginev2/execution_test.go

@ -0,0 +1,420 @@
package enginev2
import "testing"
// ============================================================
// Phase 04 P1: Session execution and sender-group orchestration
// ============================================================
// --- Execution API: full lifecycle ---
func TestExec_FullRecoveryLifecycle(t *testing.T) {
s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", Version: 1}, 1)
sess, _ := s.AttachSession(1, SessionCatchUp)
id := sess.ID
// init → connecting
if err := s.BeginConnect(id); err != nil {
t.Fatalf("BeginConnect: %v", err)
}
if s.State != StateConnecting {
t.Fatalf("state=%s, want connecting", s.State)
}
// connecting → handshake
if err := s.RecordHandshake(id, 5, 20); err != nil {
t.Fatalf("RecordHandshake: %v", err)
}
if sess.StartLSN != 5 || sess.TargetLSN != 20 {
t.Fatalf("range: start=%d target=%d", sess.StartLSN, sess.TargetLSN)
}
// handshake → catchup
if err := s.BeginCatchUp(id); err != nil {
t.Fatalf("BeginCatchUp: %v", err)
}
if s.State != StateCatchingUp {
t.Fatalf("state=%s, want catching_up", s.State)
}
// progress
if err := s.RecordCatchUpProgress(id, 15); err != nil {
t.Fatalf("progress to 15: %v", err)
}
if err := s.RecordCatchUpProgress(id, 20); err != nil {
t.Fatalf("progress to 20: %v", err)
}
if !sess.Converged() {
t.Fatal("should be converged at 20/20")
}
// complete
if !s.CompleteSessionByID(id) {
t.Fatal("completion should succeed")
}
if s.State != StateInSync {
t.Fatalf("state=%s, want in_sync", s.State)
}
}
// --- Stale sessionID rejection across all execution APIs ---
func TestExec_StaleID_AllAPIsReject(t *testing.T) {
s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", Version: 1}, 1)
sess1, _ := s.AttachSession(1, SessionCatchUp)
oldID := sess1.ID
// Supersede with new session.
s.UpdateEpoch(2)
sess2, _ := s.AttachSession(2, SessionCatchUp)
_ = sess2
// All APIs must reject oldID.
if err := s.BeginConnect(oldID); err == nil {
t.Fatal("BeginConnect should reject stale ID")
}
if err := s.RecordHandshake(oldID, 0, 10); err == nil {
t.Fatal("RecordHandshake should reject stale ID")
}
if err := s.BeginCatchUp(oldID); err == nil {
t.Fatal("BeginCatchUp should reject stale ID")
}
if err := s.RecordCatchUpProgress(oldID, 5); err == nil {
t.Fatal("RecordCatchUpProgress should reject stale ID")
}
if s.CompleteSessionByID(oldID) {
t.Fatal("CompleteSessionByID should reject stale ID")
}
}
// --- Phase ordering enforcement ---
func TestExec_WrongPhaseOrder_Rejected(t *testing.T) {
s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", Version: 1}, 1)
sess, _ := s.AttachSession(1, SessionCatchUp)
id := sess.ID
// Skip connecting → go directly to handshake: rejected.
if err := s.RecordHandshake(id, 0, 10); err == nil {
t.Fatal("handshake from init should be rejected")
}
// Skip to catch-up from init: rejected.
if err := s.BeginCatchUp(id); err == nil {
t.Fatal("catch-up from init should be rejected")
}
// Progress from init: rejected (not in catch-up phase).
if err := s.RecordCatchUpProgress(id, 5); err == nil {
t.Fatal("progress from init should be rejected")
}
// Correct path: init → connecting.
s.BeginConnect(id)
// Now try catch-up from connecting: rejected (must handshake first).
if err := s.BeginCatchUp(id); err == nil {
t.Fatal("catch-up from connecting should be rejected")
}
}
// --- Progress regression rejection ---
func TestExec_ProgressRegression_Rejected(t *testing.T) {
s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", Version: 1}, 1)
sess, _ := s.AttachSession(1, SessionCatchUp)
id := sess.ID
s.BeginConnect(id)
s.RecordHandshake(id, 0, 100)
s.BeginCatchUp(id)
s.RecordCatchUpProgress(id, 50)
// Regression: 30 < 50.
if err := s.RecordCatchUpProgress(id, 30); err == nil {
t.Fatal("progress regression should be rejected")
}
// Same value: 50 = 50.
if err := s.RecordCatchUpProgress(id, 50); err == nil {
t.Fatal("non-advancing progress should be rejected")
}
// Advance: 60 > 50.
if err := s.RecordCatchUpProgress(id, 60); err != nil {
t.Fatalf("valid progress should succeed: %v", err)
}
}
// --- Epoch bump during execution ---
func TestExec_EpochBumpDuringExecution_InvalidatesAuthority(t *testing.T) {
s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", Version: 1}, 1)
sess, _ := s.AttachSession(1, SessionCatchUp)
id := sess.ID
s.BeginConnect(id)
s.RecordHandshake(id, 0, 100)
s.BeginCatchUp(id)
s.RecordCatchUpProgress(id, 50)
// Epoch bumps mid-execution.
s.UpdateEpoch(2)
// All further execution on old session rejected.
if err := s.RecordCatchUpProgress(id, 60); err == nil {
t.Fatal("progress after epoch bump should be rejected")
}
if s.CompleteSessionByID(id) {
t.Fatal("completion after epoch bump should be rejected")
}
// Sender is disconnected, ready for new session.
if s.State != StateDisconnected {
t.Fatalf("state=%s, want disconnected", s.State)
}
}
// --- Endpoint change during execution ---
func TestExec_EndpointChangeDuringExecution_InvalidatesAuthority(t *testing.T) {
s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", CtrlAddr: "r1:9334", Version: 1}, 1)
sess, _ := s.AttachSession(1, SessionCatchUp)
id := sess.ID
s.BeginConnect(id)
s.RecordHandshake(id, 0, 50)
s.BeginCatchUp(id)
// Endpoint changes mid-execution.
s.UpdateEndpoint(Endpoint{DataAddr: "r1:9444", CtrlAddr: "r1:9445", Version: 2})
// All further execution rejected.
if err := s.RecordCatchUpProgress(id, 10); err == nil {
t.Fatal("progress after endpoint change should be rejected")
}
if s.CompleteSessionByID(id) {
t.Fatal("completion after endpoint change should be rejected")
}
}
// --- Completion authority enforcement ---
func TestExec_CompletionRejected_FromInit(t *testing.T) {
s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", Version: 1}, 1)
sess, _ := s.AttachSession(1, SessionCatchUp)
if s.CompleteSessionByID(sess.ID) {
t.Fatal("completion from PhaseInit should be rejected")
}
}
func TestExec_CompletionRejected_FromConnecting(t *testing.T) {
s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", Version: 1}, 1)
sess, _ := s.AttachSession(1, SessionCatchUp)
s.BeginConnect(sess.ID)
if s.CompleteSessionByID(sess.ID) {
t.Fatal("completion from PhaseConnecting should be rejected")
}
}
func TestExec_CompletionRejected_FromHandshakeWithGap(t *testing.T) {
s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", Version: 1}, 1)
sess, _ := s.AttachSession(1, SessionCatchUp)
s.BeginConnect(sess.ID)
s.RecordHandshake(sess.ID, 5, 20) // gap exists: 5 → 20
if s.CompleteSessionByID(sess.ID) {
t.Fatal("completion from PhaseHandshake with gap should be rejected")
}
}
func TestExec_CompletionAllowed_FromHandshakeZeroGap(t *testing.T) {
// Fast path: handshake shows replica already at target (zero gap).
s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", Version: 1}, 1)
sess, _ := s.AttachSession(1, SessionCatchUp)
s.BeginConnect(sess.ID)
s.RecordHandshake(sess.ID, 10, 10) // zero gap: start == target
if !s.CompleteSessionByID(sess.ID) {
t.Fatal("completion from handshake with zero gap should be allowed")
}
if s.State != StateInSync {
t.Fatalf("state=%s, want in_sync", s.State)
}
}
func TestExec_CompletionRejected_FromCatchUpNotConverged(t *testing.T) {
s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", Version: 1}, 1)
sess, _ := s.AttachSession(1, SessionCatchUp)
s.BeginConnect(sess.ID)
s.RecordHandshake(sess.ID, 0, 100)
s.BeginCatchUp(sess.ID)
s.RecordCatchUpProgress(sess.ID, 50) // not converged (50 < 100)
if s.CompleteSessionByID(sess.ID) {
t.Fatal("completion before convergence should be rejected")
}
}
func TestExec_HandshakeInvalidRange_Rejected(t *testing.T) {
s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", Version: 1}, 1)
sess, _ := s.AttachSession(1, SessionCatchUp)
s.BeginConnect(sess.ID)
if err := s.RecordHandshake(sess.ID, 20, 5); err == nil {
t.Fatal("handshake with target < start should be rejected")
}
}
// --- SenderGroup orchestration ---
func TestOrch_RepeatedReconnectCycles_PreserveSenderIdentity(t *testing.T) {
sg := NewSenderGroup()
sg.Reconcile(map[string]Endpoint{
"r1:9333": {DataAddr: "r1:9333", Version: 1},
}, 1)
s := sg.Sender("r1:9333")
original := s // save pointer
// 5 reconnect cycles — sender identity preserved.
for cycle := 0; cycle < 5; cycle++ {
sess, err := s.AttachSession(1, SessionCatchUp)
if err != nil {
t.Fatalf("cycle %d attach: %v", cycle, err)
}
s.BeginConnect(sess.ID)
s.RecordHandshake(sess.ID, 0, 10)
s.BeginCatchUp(sess.ID)
s.RecordCatchUpProgress(sess.ID, 10)
s.CompleteSessionByID(sess.ID)
if s.State != StateInSync {
t.Fatalf("cycle %d: state=%s, want in_sync", cycle, s.State)
}
}
// Same pointer — identity preserved.
if sg.Sender("r1:9333") != original {
t.Fatal("sender identity should be preserved across cycles")
}
}
func TestOrch_EndpointUpdateSupersedesActiveSession(t *testing.T) {
sg := NewSenderGroup()
sg.Reconcile(map[string]Endpoint{
"r1:9333": {DataAddr: "r1:9333", CtrlAddr: "r1:9334", Version: 1},
}, 1)
s := sg.Sender("r1:9333")
sess, _ := s.AttachSession(1, SessionCatchUp)
s.BeginConnect(sess.ID)
// Endpoint update via reconcile — session invalidated.
sg.Reconcile(map[string]Endpoint{
"r1:9333": {DataAddr: "r1:9333", CtrlAddr: "r1:9334", Version: 2},
}, 1)
if sess.Active() {
t.Fatal("session should be invalidated by endpoint update")
}
// Sender preserved, session gone.
if sg.Sender("r1:9333") != s {
t.Fatal("sender identity should be preserved")
}
if s.Session() != nil {
t.Fatal("session should be nil after endpoint invalidation")
}
}
func TestOrch_ReconcileMixedAddRemoveUpdate(t *testing.T) {
sg := NewSenderGroup()
sg.Reconcile(map[string]Endpoint{
"r1:9333": {DataAddr: "r1:9333", Version: 1},
"r2:9333": {DataAddr: "r2:9333", Version: 1},
"r3:9333": {DataAddr: "r3:9333", Version: 1},
}, 1)
r1 := sg.Sender("r1:9333")
r2 := sg.Sender("r2:9333")
// Attach sessions to r1 and r2.
r1Sess, _ := r1.AttachSession(1, SessionCatchUp)
r2Sess, _ := r2.AttachSession(1, SessionCatchUp)
// Reconcile: keep r1, remove r2, update r3, add r4.
added, removed := sg.Reconcile(map[string]Endpoint{
"r1:9333": {DataAddr: "r1:9333", Version: 1}, // kept
"r3:9333": {DataAddr: "r3:9333", Version: 2}, // updated
"r4:9333": {DataAddr: "r4:9333", Version: 1}, // added
}, 1)
if len(added) != 1 || added[0] != "r4:9333" {
t.Fatalf("added=%v", added)
}
if len(removed) != 1 || removed[0] != "r2:9333" {
t.Fatalf("removed=%v", removed)
}
// r1: preserved with active session.
if sg.Sender("r1:9333") != r1 {
t.Fatal("r1 should be preserved")
}
if !r1Sess.Active() {
t.Fatal("r1 session should still be active")
}
// r2: stopped and removed.
if sg.Sender("r2:9333") != nil {
t.Fatal("r2 should be removed")
}
	if !r2.Stopped() {
		t.Fatal("r2 should be stopped")
	}
if r2Sess.Active() {
t.Fatal("r2 session should be invalidated (sender stopped)")
}
// r4: new sender, no session.
if sg.Sender("r4:9333") == nil {
t.Fatal("r4 should exist")
}
}
func TestOrch_EpochBumpInvalidatesExecutingSessions(t *testing.T) {
sg := NewSenderGroup()
sg.Reconcile(map[string]Endpoint{
"r1:9333": {DataAddr: "r1:9333", Version: 1},
"r2:9333": {DataAddr: "r2:9333", Version: 1},
}, 1)
r1 := sg.Sender("r1:9333")
r2 := sg.Sender("r2:9333")
sess1, _ := r1.AttachSession(1, SessionCatchUp)
r1.BeginConnect(sess1.ID)
r1.RecordHandshake(sess1.ID, 0, 50)
r1.BeginCatchUp(sess1.ID)
r1.RecordCatchUpProgress(sess1.ID, 25) // mid-execution
sess2, _ := r2.AttachSession(1, SessionCatchUp)
r2.BeginConnect(sess2.ID)
// Epoch bump.
count := sg.InvalidateEpoch(2)
if count != 2 {
t.Fatalf("should invalidate 2 sessions, got %d", count)
}
// Both sessions dead.
if sess1.Active() || sess2.Active() {
t.Fatal("both sessions should be invalidated")
}
// r1's mid-execution progress cannot continue.
if err := r1.RecordCatchUpProgress(sess1.ID, 30); err == nil {
t.Fatal("progress on invalidated session should be rejected")
}
}

3
sw-block/prototype/enginev2/go.mod

@ -0,0 +1,3 @@
module github.com/seaweedfs/seaweedfs/sw-block/prototype/enginev2
go 1.23.0

39
sw-block/prototype/enginev2/outcome.go

@ -0,0 +1,39 @@
package enginev2
// HandshakeResult captures what the reconnect handshake reveals about a
// replica's state relative to the primary's lineage-safe boundary.
type HandshakeResult struct {
ReplicaFlushedLSN uint64 // highest LSN durably persisted on replica
CommittedLSN uint64 // lineage-safe recovery target (committed prefix)
RetentionStartLSN uint64 // oldest LSN still available in primary WAL
}
// RecoveryOutcome classifies the gap between replica and primary.
type RecoveryOutcome string
const (
OutcomeZeroGap RecoveryOutcome = "zero_gap" // replica has full committed prefix
OutcomeCatchUp RecoveryOutcome = "catchup" // gap within WAL retention
OutcomeNeedsRebuild RecoveryOutcome = "needs_rebuild" // gap exceeds retention
)
// ClassifyRecoveryOutcome determines the recovery path from handshake data.
//
// Uses CommittedLSN (not WAL head) as the target boundary. This is the
// lineage-safe recovery point — only acknowledged data counts. A replica
// with FlushedLSN > CommittedLSN has divergent/uncommitted tail that must
// NOT be treated as "already in sync."
//
// Decision matrix (matches CP13-5 gap analysis):
//   - ReplicaFlushedLSN >= CommittedLSN → zero gap, has full committed prefix
//   - RetentionStartLSN == 0 (nothing trimmed yet) or ReplicaFlushedLSN+1 >=
//     RetentionStartLSN → recoverable via WAL catch-up
//   - otherwise → gap too large, needs rebuild
func ClassifyRecoveryOutcome(result HandshakeResult) RecoveryOutcome {
if result.ReplicaFlushedLSN >= result.CommittedLSN {
return OutcomeZeroGap
}
if result.RetentionStartLSN == 0 || result.ReplicaFlushedLSN+1 >= result.RetentionStartLSN {
return OutcomeCatchUp
}
return OutcomeNeedsRebuild
}

482
sw-block/prototype/enginev2/p2_test.go

@ -0,0 +1,482 @@
package enginev2
import "testing"
// ============================================================
// Phase 04 P2: Outcome branching, assignment intent, end-to-end
// ============================================================
// --- Recovery outcome classification ---
func TestOutcome_ZeroGap(t *testing.T) {
o := ClassifyRecoveryOutcome(HandshakeResult{
ReplicaFlushedLSN: 100,
CommittedLSN: 100,
RetentionStartLSN: 50,
})
if o != OutcomeZeroGap {
t.Fatalf("got %s, want zero_gap", o)
}
}
func TestOutcome_ZeroGap_ReplicaAheadOfCommitted(t *testing.T) {
	// Replica is ahead of the committed prefix: still zero gap.
	// Its uncommitted tail beyond CommittedLSN is handled by truncation,
	// not by recovery classification.
	o := ClassifyRecoveryOutcome(HandshakeResult{
		ReplicaFlushedLSN: 120,
		CommittedLSN:      100,
		RetentionStartLSN: 50,
	})
	if o != OutcomeZeroGap {
		t.Fatalf("got %s, want zero_gap", o)
	}
}
func TestOutcome_CatchUp(t *testing.T) {
o := ClassifyRecoveryOutcome(HandshakeResult{
ReplicaFlushedLSN: 80,
CommittedLSN: 100,
RetentionStartLSN: 50,
})
if o != OutcomeCatchUp {
t.Fatalf("got %s, want catchup", o)
}
}
func TestOutcome_CatchUp_ExactBoundary(t *testing.T) {
// ReplicaFlushedLSN+1 == RetentionStartLSN → recoverable (just barely).
o := ClassifyRecoveryOutcome(HandshakeResult{
ReplicaFlushedLSN: 49,
CommittedLSN: 100,
RetentionStartLSN: 50,
})
if o != OutcomeCatchUp {
t.Fatalf("got %s, want catchup (exact boundary)", o)
}
}
func TestOutcome_NeedsRebuild(t *testing.T) {
o := ClassifyRecoveryOutcome(HandshakeResult{
ReplicaFlushedLSN: 10,
CommittedLSN: 100,
RetentionStartLSN: 50,
})
if o != OutcomeNeedsRebuild {
t.Fatalf("got %s, want needs_rebuild", o)
}
}
func TestOutcome_NeedsRebuild_OffByOne(t *testing.T) {
// ReplicaFlushedLSN+1 < RetentionStartLSN → unrecoverable.
o := ClassifyRecoveryOutcome(HandshakeResult{
ReplicaFlushedLSN: 48,
CommittedLSN: 100,
RetentionStartLSN: 50,
})
if o != OutcomeNeedsRebuild {
t.Fatalf("got %s, want needs_rebuild (off-by-one)", o)
}
}
// --- RecordHandshakeWithOutcome execution ---
func TestExec_HandshakeOutcome_ZeroGap_FastComplete(t *testing.T) {
s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", Version: 1}, 1)
sess, _ := s.AttachSession(1, SessionCatchUp)
s.BeginConnect(sess.ID)
outcome, err := s.RecordHandshakeWithOutcome(sess.ID, HandshakeResult{
ReplicaFlushedLSN: 100,
CommittedLSN: 100,
RetentionStartLSN: 50,
})
if err != nil {
t.Fatal(err)
}
if outcome != OutcomeZeroGap {
t.Fatalf("outcome=%s, want zero_gap", outcome)
}
// Zero-gap: can complete directly from handshake phase.
if !s.CompleteSessionByID(sess.ID) {
t.Fatal("zero-gap fast completion should succeed")
}
if s.State != StateInSync {
t.Fatalf("state=%s, want in_sync", s.State)
}
}
func TestExec_HandshakeOutcome_CatchUp_NormalPath(t *testing.T) {
s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", Version: 1}, 1)
sess, _ := s.AttachSession(1, SessionCatchUp)
s.BeginConnect(sess.ID)
outcome, err := s.RecordHandshakeWithOutcome(sess.ID, HandshakeResult{
ReplicaFlushedLSN: 80,
CommittedLSN: 100,
RetentionStartLSN: 50,
})
if err != nil {
t.Fatal(err)
}
if outcome != OutcomeCatchUp {
t.Fatalf("outcome=%s, want catchup", outcome)
}
// Must catch up before completing.
if s.CompleteSessionByID(sess.ID) {
t.Fatal("completion should be rejected before catch-up")
}
s.BeginCatchUp(sess.ID)
s.RecordCatchUpProgress(sess.ID, 100)
if !s.CompleteSessionByID(sess.ID) {
t.Fatal("completion should succeed after convergence")
}
}
func TestExec_HandshakeOutcome_NeedsRebuild_InvalidatesSession(t *testing.T) {
s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", Version: 1}, 1)
sess, _ := s.AttachSession(1, SessionCatchUp)
s.BeginConnect(sess.ID)
outcome, err := s.RecordHandshakeWithOutcome(sess.ID, HandshakeResult{
ReplicaFlushedLSN: 10,
CommittedLSN: 100,
RetentionStartLSN: 50,
})
if err != nil {
t.Fatal(err)
}
if outcome != OutcomeNeedsRebuild {
t.Fatalf("outcome=%s, want needs_rebuild", outcome)
}
// Session invalidated, sender at NeedsRebuild.
if sess.Active() {
t.Fatal("session should be invalidated")
}
if s.State != StateNeedsRebuild {
t.Fatalf("state=%s, want needs_rebuild", s.State)
}
if s.Session() != nil {
t.Fatal("session should be nil after NeedsRebuild")
}
}
// --- Assignment-intent orchestration ---
func TestAssignment_CreatesSessionsForTargets(t *testing.T) {
sg := NewSenderGroup()
result := sg.ApplyAssignment(AssignmentIntent{
Endpoints: map[string]Endpoint{
"r1:9333": {DataAddr: "r1:9333", Version: 1},
"r2:9333": {DataAddr: "r2:9333", Version: 1},
},
Epoch: 1,
RecoveryTargets: map[string]SessionKind{
"r1:9333": SessionCatchUp,
},
})
if len(result.Added) != 2 {
t.Fatalf("added=%d, want 2", len(result.Added))
}
if len(result.SessionsCreated) != 1 || result.SessionsCreated[0] != "r1:9333" {
t.Fatalf("sessions created=%v", result.SessionsCreated)
}
// r1 has session, r2 does not.
r1 := sg.Sender("r1:9333")
if r1.Session() == nil {
t.Fatal("r1 should have a session")
}
r2 := sg.Sender("r2:9333")
if r2.Session() != nil {
t.Fatal("r2 should not have a session")
}
}
func TestAssignment_SupersedesExistingSession(t *testing.T) {
sg := NewSenderGroup()
// First assignment with catch-up session.
sg.ApplyAssignment(AssignmentIntent{
Endpoints: map[string]Endpoint{
"r1:9333": {DataAddr: "r1:9333", Version: 1},
},
Epoch: 1,
RecoveryTargets: map[string]SessionKind{"r1:9333": SessionCatchUp},
})
oldSess := sg.Sender("r1:9333").Session()
// Second assignment with rebuild session — supersedes.
result := sg.ApplyAssignment(AssignmentIntent{
Endpoints: map[string]Endpoint{
"r1:9333": {DataAddr: "r1:9333", Version: 1},
},
Epoch: 1,
RecoveryTargets: map[string]SessionKind{"r1:9333": SessionRebuild},
})
newSess := sg.Sender("r1:9333").Session()
if oldSess.Active() {
t.Fatal("old session should be invalidated")
}
if !newSess.Active() {
t.Fatal("new session should be active")
}
if newSess.Kind != SessionRebuild {
t.Fatalf("new session kind=%s, want rebuild", newSess.Kind)
}
if len(result.SessionsSuperseded) != 1 || result.SessionsSuperseded[0] != "r1:9333" {
t.Fatalf("superseded=%v, want [r1:9333]", result.SessionsSuperseded)
}
}
func TestAssignment_FailsForUnknownReplica(t *testing.T) {
sg := NewSenderGroup()
result := sg.ApplyAssignment(AssignmentIntent{
Endpoints: map[string]Endpoint{
"r1:9333": {DataAddr: "r1:9333", Version: 1},
},
Epoch: 1,
RecoveryTargets: map[string]SessionKind{"r99:9333": SessionCatchUp},
})
if len(result.SessionsFailed) != 1 || result.SessionsFailed[0] != "r99:9333" {
t.Fatalf("sessions failed=%v, want [r99:9333]", result.SessionsFailed)
}
}
func TestAssignment_StaleEpoch_Rejected(t *testing.T) {
sg := NewSenderGroup()
// Epoch 2 assignment.
sg.ApplyAssignment(AssignmentIntent{
Endpoints: map[string]Endpoint{
"r1:9333": {DataAddr: "r1:9333", Version: 1},
},
Epoch: 2,
})
// Stale epoch 1 assignment with recovery — must be rejected.
result := sg.ApplyAssignment(AssignmentIntent{
Endpoints: map[string]Endpoint{
"r1:9333": {DataAddr: "r1:9333", Version: 1},
},
Epoch: 1,
RecoveryTargets: map[string]SessionKind{"r1:9333": SessionCatchUp},
})
if len(result.SessionsFailed) != 1 || result.SessionsFailed[0] != "r1:9333" {
t.Fatalf("stale epoch should fail: failed=%v created=%v", result.SessionsFailed, result.SessionsCreated)
}
if sg.Sender("r1:9333").Session() != nil {
t.Fatal("stale intent must not create a session")
}
}
// --- End-to-end prototype recovery flows ---
func TestE2E_CatchUpRecovery_FullFlow(t *testing.T) {
sg := NewSenderGroup()
// Step 1: Assignment creates replicas + recovery intent.
sg.ApplyAssignment(AssignmentIntent{
Endpoints: map[string]Endpoint{
"r1:9333": {DataAddr: "r1:9333", Version: 1},
"r2:9333": {DataAddr: "r2:9333", Version: 1},
},
Epoch: 1,
RecoveryTargets: map[string]SessionKind{"r1:9333": SessionCatchUp},
})
r1 := sg.Sender("r1:9333")
sess := r1.Session()
// Step 2: Execute recovery.
r1.BeginConnect(sess.ID)
outcome, _ := r1.RecordHandshakeWithOutcome(sess.ID, HandshakeResult{
ReplicaFlushedLSN: 80,
CommittedLSN: 100,
RetentionStartLSN: 50,
})
if outcome != OutcomeCatchUp {
t.Fatalf("outcome=%s", outcome)
}
r1.BeginCatchUp(sess.ID)
r1.RecordCatchUpProgress(sess.ID, 90)
r1.RecordCatchUpProgress(sess.ID, 100) // converged
// Step 3: Complete.
if !r1.CompleteSessionByID(sess.ID) {
t.Fatal("completion should succeed")
}
// Step 4: Verify final state.
if r1.State != StateInSync {
t.Fatalf("r1 state=%s, want in_sync", r1.State)
}
if r1.Session() != nil {
t.Fatal("session should be nil after completion")
}
t.Logf("e2e catch-up: assignment → connect → handshake(catchup) → progress → complete → InSync")
}
func TestE2E_NeedsRebuild_Escalation(t *testing.T) {
sg := NewSenderGroup()
// Step 1: Assignment with catch-up intent.
sg.ApplyAssignment(AssignmentIntent{
Endpoints: map[string]Endpoint{
"r1:9333": {DataAddr: "r1:9333", Version: 1},
},
Epoch: 1,
RecoveryTargets: map[string]SessionKind{"r1:9333": SessionCatchUp},
})
r1 := sg.Sender("r1:9333")
sess := r1.Session()
// Step 2: Connect + handshake → unrecoverable gap.
r1.BeginConnect(sess.ID)
outcome, _ := r1.RecordHandshakeWithOutcome(sess.ID, HandshakeResult{
ReplicaFlushedLSN: 10,
CommittedLSN: 100,
RetentionStartLSN: 50,
})
if outcome != OutcomeNeedsRebuild {
t.Fatalf("outcome=%s", outcome)
}
// Step 3: Sender is at NeedsRebuild, session dead.
if r1.State != StateNeedsRebuild {
t.Fatalf("state=%s", r1.State)
}
// Step 4: New assignment with rebuild intent.
sg.ApplyAssignment(AssignmentIntent{
Endpoints: map[string]Endpoint{
"r1:9333": {DataAddr: "r1:9333", Version: 1},
},
Epoch: 1,
RecoveryTargets: map[string]SessionKind{"r1:9333": SessionRebuild},
})
rebuildSess := r1.Session()
if rebuildSess == nil || rebuildSess.Kind != SessionRebuild {
t.Fatal("should have rebuild session")
}
// Step 5: Execute rebuild recovery (simulated).
r1.BeginConnect(rebuildSess.ID)
r1.RecordHandshake(rebuildSess.ID, 0, 100) // full rebuild range
r1.BeginCatchUp(rebuildSess.ID)
r1.RecordCatchUpProgress(rebuildSess.ID, 100)
if !r1.CompleteSessionByID(rebuildSess.ID) {
t.Fatal("rebuild completion should succeed")
}
if r1.State != StateInSync {
t.Fatalf("after rebuild: state=%s, want in_sync", r1.State)
}
t.Logf("e2e rebuild: catch-up→NeedsRebuild→rebuild assignment→recover→InSync")
}
func TestE2E_ZeroGap_FastPath(t *testing.T) {
sg := NewSenderGroup()
sg.ApplyAssignment(AssignmentIntent{
Endpoints: map[string]Endpoint{
"r1:9333": {DataAddr: "r1:9333", Version: 1},
},
Epoch: 1,
RecoveryTargets: map[string]SessionKind{"r1:9333": SessionCatchUp},
})
r1 := sg.Sender("r1:9333")
sess := r1.Session()
r1.BeginConnect(sess.ID)
outcome, _ := r1.RecordHandshakeWithOutcome(sess.ID, HandshakeResult{
ReplicaFlushedLSN: 100,
CommittedLSN: 100,
RetentionStartLSN: 50,
})
if outcome != OutcomeZeroGap {
t.Fatalf("outcome=%s", outcome)
}
// Fast path: complete directly from handshake.
if !r1.CompleteSessionByID(sess.ID) {
t.Fatal("zero-gap fast completion should succeed")
}
if r1.State != StateInSync {
t.Fatalf("state=%s, want in_sync", r1.State)
}
t.Logf("e2e zero-gap: assignment → connect → handshake(zero_gap) → complete → InSync")
}
func TestE2E_EpochBump_MidRecovery_FullCycle(t *testing.T) {
sg := NewSenderGroup()
// Epoch 1: start recovery.
sg.ApplyAssignment(AssignmentIntent{
Endpoints: map[string]Endpoint{
"r1:9333": {DataAddr: "r1:9333", Version: 1},
},
Epoch: 1,
RecoveryTargets: map[string]SessionKind{"r1:9333": SessionCatchUp},
})
r1 := sg.Sender("r1:9333")
sess1 := r1.Session()
r1.BeginConnect(sess1.ID)
// Epoch bumps mid-recovery.
sg.InvalidateEpoch(2)
// Must also update sender epoch for the new assignment.
r1.UpdateEpoch(2)
// Old session dead.
if sess1.Active() {
t.Fatal("epoch-1 session should be invalidated")
}
// Epoch 2: new assignment, new session.
sg.ApplyAssignment(AssignmentIntent{
Endpoints: map[string]Endpoint{
"r1:9333": {DataAddr: "r1:9333", Version: 1},
},
Epoch: 2,
RecoveryTargets: map[string]SessionKind{"r1:9333": SessionCatchUp},
})
sess2 := r1.Session()
if sess2 == nil || sess2.Epoch != 2 {
t.Fatal("should have new session at epoch 2")
}
// Complete at epoch 2.
r1.BeginConnect(sess2.ID)
r1.RecordHandshakeWithOutcome(sess2.ID, HandshakeResult{
ReplicaFlushedLSN: 100,
CommittedLSN: 100,
RetentionStartLSN: 50,
})
r1.CompleteSessionByID(sess2.ID)
if r1.State != StateInSync {
t.Fatalf("state=%s", r1.State)
}
t.Logf("e2e epoch bump: epoch1 recovery → bump → epoch2 recovery → InSync")
}

347
sw-block/prototype/enginev2/sender.go

@@ -0,0 +1,347 @@
// Package enginev2 implements V2 per-replica sender/session ownership.
//
// Each replica has exactly one Sender that owns its identity (canonical address)
// and at most one active RecoverySession per epoch. The Sender survives topology
// changes; the session does not survive epoch bumps.
package enginev2
import (
"fmt"
"sync"
)
// ReplicaState tracks the per-replica replication state machine.
type ReplicaState string
const (
StateDisconnected ReplicaState = "disconnected"
StateConnecting ReplicaState = "connecting"
StateCatchingUp ReplicaState = "catching_up"
StateInSync ReplicaState = "in_sync"
StateDegraded ReplicaState = "degraded"
StateNeedsRebuild ReplicaState = "needs_rebuild"
)
// Endpoint represents a replica's network identity.
type Endpoint struct {
DataAddr string
CtrlAddr string
Version uint64 // bumped on address change
}
// Sender owns the replication channel to one replica. It is identified
// by ReplicaID (canonical data address at creation time) and survives
// topology changes as long as the replica stays in the set.
//
// A Sender holds at most one active RecoverySession. Normal in-sync
// operation does not require a session — Ship/Barrier work directly.
type Sender struct {
mu sync.Mutex
ReplicaID string // canonical identity — stable across reconnects
Endpoint Endpoint // current network address (may change via UpdateEndpoint)
Epoch uint64 // current epoch
State ReplicaState
session *RecoverySession // nil when in-sync or disconnected without recovery
stopped bool
}
// NewSender creates a sender for a replica at the given endpoint and epoch.
func NewSender(replicaID string, endpoint Endpoint, epoch uint64) *Sender {
return &Sender{
ReplicaID: replicaID,
Endpoint: endpoint,
Epoch: epoch,
State: StateDisconnected,
}
}
// UpdateEpoch updates the sender's epoch. If a recovery session is active
// at a stale epoch, it is invalidated.
func (s *Sender) UpdateEpoch(epoch uint64) {
s.mu.Lock()
defer s.mu.Unlock()
if s.stopped || epoch <= s.Epoch {
return
}
oldEpoch := s.Epoch
s.Epoch = epoch
if s.session != nil && s.session.Epoch < epoch {
s.session.invalidate(fmt.Sprintf("epoch_advanced_%d_to_%d", oldEpoch, epoch))
s.session = nil
s.State = StateDisconnected
}
}
// UpdateEndpoint updates the sender's target address after a control-plane
// assignment refresh. If a recovery session is active and the address changed,
// the session is invalidated (the new address needs a fresh session).
func (s *Sender) UpdateEndpoint(ep Endpoint) {
s.mu.Lock()
defer s.mu.Unlock()
if s.stopped {
return
}
addrChanged := s.Endpoint.DataAddr != ep.DataAddr || s.Endpoint.CtrlAddr != ep.CtrlAddr || s.Endpoint.Version != ep.Version
s.Endpoint = ep
if addrChanged && s.session != nil {
s.session.invalidate("endpoint_changed")
s.session = nil
s.State = StateDisconnected
}
}
// AttachSession creates and attaches a new recovery session for this sender.
// The session epoch must match the sender's current epoch — stale or future
// epoch sessions are rejected. Returns an error if a session is already active,
// the sender is stopped, or the epoch doesn't match.
func (s *Sender) AttachSession(epoch uint64, kind SessionKind) (*RecoverySession, error) {
s.mu.Lock()
defer s.mu.Unlock()
if s.stopped {
return nil, fmt.Errorf("sender stopped")
}
if epoch != s.Epoch {
return nil, fmt.Errorf("epoch mismatch: sender=%d session=%d", s.Epoch, epoch)
}
if s.session != nil && s.session.Active() {
return nil, fmt.Errorf("session already active (epoch=%d kind=%s)", s.session.Epoch, s.session.Kind)
}
sess := newRecoverySession(s.ReplicaID, epoch, kind)
s.session = sess
// Ownership established but execution not started.
// BeginConnect() is the first execution-state transition.
return sess, nil
}
// SupersedeSession invalidates the current session (if any) and attaches
// a new one at the sender's current epoch. Used when an assignment change
// requires a fresh recovery path. The old session is invalidated with the
// given reason. Always uses s.Epoch — does not accept an epoch parameter
// to prevent epoch coherence drift.
//
// Establishes ownership only — does not mutate sender state.
// BeginConnect() starts execution.
func (s *Sender) SupersedeSession(kind SessionKind, reason string) *RecoverySession {
s.mu.Lock()
defer s.mu.Unlock()
if s.stopped {
return nil
}
if s.session != nil {
s.session.invalidate(reason)
}
sess := newRecoverySession(s.ReplicaID, s.Epoch, kind)
s.session = sess
return sess
}
// Session returns the current recovery session, or nil if none.
func (s *Sender) Session() *RecoverySession {
s.mu.Lock()
defer s.mu.Unlock()
return s.session
}
// CompleteSessionByID marks the session as completed and transitions the
// sender to InSync. Requires:
// - sessionID matches the current active session
// - session is in PhaseCatchUp and has Converged (normal path)
// - OR session is in PhaseHandshake and gap is zero (fast path: already in sync)
//
// Returns false if any check fails (stale ID, wrong phase, not converged).
func (s *Sender) CompleteSessionByID(sessionID uint64) bool {
s.mu.Lock()
defer s.mu.Unlock()
if err := s.checkSessionAuthority(sessionID); err != nil {
return false
}
sess := s.session
switch sess.Phase {
case PhaseCatchUp:
if !sess.Converged() {
return false // not converged yet
}
case PhaseHandshake:
if sess.TargetLSN != sess.StartLSN {
return false // has a gap — must catch up first
}
// Zero-gap fast path: handshake showed replica already at target.
default:
return false // not at a completion-ready phase
}
sess.complete()
s.session = nil
s.State = StateInSync
return true
}
// === Execution APIs — sender-owned authority gate ===
//
// All execution APIs validate the sessionID against the current active session.
// This prevents stale results from old/superseded sessions from mutating state.
// The sender is the authority boundary, not the session object.
// BeginConnect transitions the session from init to connecting.
// Mutates: session.Phase → PhaseConnecting. Sender.State → StateConnecting.
// Rejects: wrong sessionID, stopped sender, session not in PhaseInit.
func (s *Sender) BeginConnect(sessionID uint64) error {
s.mu.Lock()
defer s.mu.Unlock()
if err := s.checkSessionAuthority(sessionID); err != nil {
return err
}
if !s.session.Advance(PhaseConnecting) {
return fmt.Errorf("cannot begin connect: session phase=%s", s.session.Phase)
}
s.State = StateConnecting
return nil
}
// RecordHandshake records a successful handshake result and sets the catch-up range.
// Mutates: session.Phase → PhaseHandshake, session.StartLSN/TargetLSN.
// Rejects: wrong sessionID, wrong phase, invalid range.
func (s *Sender) RecordHandshake(sessionID uint64, startLSN, targetLSN uint64) error {
s.mu.Lock()
defer s.mu.Unlock()
if err := s.checkSessionAuthority(sessionID); err != nil {
return err
}
if targetLSN < startLSN {
return fmt.Errorf("invalid handshake range: target=%d < start=%d", targetLSN, startLSN)
}
if !s.session.Advance(PhaseHandshake) {
return fmt.Errorf("cannot record handshake: session phase=%s", s.session.Phase)
}
s.session.SetRange(startLSN, targetLSN)
return nil
}
// RecordHandshakeWithOutcome records the handshake AND classifies the recovery
// outcome. This is the preferred handshake API — it determines the recovery
// path in one step:
// - OutcomeZeroGap: sets zero range, ready for fast completion
// - OutcomeCatchUp: sets catch-up range, ready for BeginCatchUp
// - OutcomeNeedsRebuild: invalidates session, transitions sender to NeedsRebuild
//
// Returns the outcome. On NeedsRebuild, the session is dead and the caller
// should not attempt further execution.
func (s *Sender) RecordHandshakeWithOutcome(sessionID uint64, result HandshakeResult) (RecoveryOutcome, error) {
outcome := ClassifyRecoveryOutcome(result)
s.mu.Lock()
defer s.mu.Unlock()
if err := s.checkSessionAuthority(sessionID); err != nil {
return outcome, err
}
// Must be in PhaseConnecting — require valid execution entry point.
if s.session.Phase != PhaseConnecting {
return outcome, fmt.Errorf("handshake requires PhaseConnecting, got %s", s.session.Phase)
}
if outcome == OutcomeNeedsRebuild {
s.session.invalidate("gap_exceeds_retention")
s.session = nil
s.State = StateNeedsRebuild
return outcome, nil
}
if !s.session.Advance(PhaseHandshake) {
return outcome, fmt.Errorf("cannot record handshake: session phase=%s", s.session.Phase)
}
switch outcome {
case OutcomeZeroGap:
s.session.SetRange(result.ReplicaFlushedLSN, result.ReplicaFlushedLSN)
case OutcomeCatchUp:
s.session.SetRange(result.ReplicaFlushedLSN, result.CommittedLSN)
}
return outcome, nil
}
// BeginCatchUp transitions the session from handshake to catch-up phase.
// Mutates: session.Phase → PhaseCatchUp. Sender.State → StateCatchingUp.
// Rejects: wrong sessionID, wrong phase.
func (s *Sender) BeginCatchUp(sessionID uint64) error {
s.mu.Lock()
defer s.mu.Unlock()
if err := s.checkSessionAuthority(sessionID); err != nil {
return err
}
if !s.session.Advance(PhaseCatchUp) {
return fmt.Errorf("cannot begin catch-up: session phase=%s", s.session.Phase)
}
s.State = StateCatchingUp
return nil
}
// RecordCatchUpProgress records catch-up progress (highest LSN recovered).
// Mutates: session.RecoveredTo (monotonic only).
// Rejects: wrong sessionID, wrong phase, progress regression, invalidated session.
func (s *Sender) RecordCatchUpProgress(sessionID uint64, recoveredTo uint64) error {
s.mu.Lock()
defer s.mu.Unlock()
if err := s.checkSessionAuthority(sessionID); err != nil {
return err
}
if s.session.Phase != PhaseCatchUp {
return fmt.Errorf("cannot record progress: session phase=%s, want catchup", s.session.Phase)
}
if recoveredTo <= s.session.RecoveredTo {
return fmt.Errorf("progress regression: current=%d proposed=%d", s.session.RecoveredTo, recoveredTo)
}
s.session.UpdateProgress(recoveredTo)
return nil
}
// checkSessionAuthority validates that the sender has an active session
// matching the given ID. Must be called with s.mu held.
func (s *Sender) checkSessionAuthority(sessionID uint64) error {
if s.stopped {
return fmt.Errorf("sender stopped")
}
if s.session == nil {
return fmt.Errorf("no active session")
}
if s.session.ID != sessionID {
return fmt.Errorf("session ID mismatch: active=%d requested=%d", s.session.ID, sessionID)
}
if !s.session.Active() {
return fmt.Errorf("session %d is no longer active (phase=%s)", sessionID, s.session.Phase)
}
return nil
}
// InvalidateSession invalidates the current session with a reason.
// Transitions the sender to the given target state.
func (s *Sender) InvalidateSession(reason string, targetState ReplicaState) {
s.mu.Lock()
defer s.mu.Unlock()
if s.session != nil {
s.session.invalidate(reason)
s.session = nil
}
s.State = targetState
}
// Stop shuts down the sender and any active session.
func (s *Sender) Stop() {
s.mu.Lock()
defer s.mu.Unlock()
if s.stopped {
return
}
s.stopped = true
if s.session != nil {
s.session.invalidate("sender_stopped")
s.session = nil
}
}
// Stopped returns true if the sender has been stopped.
func (s *Sender) Stopped() bool {
s.mu.Lock()
defer s.mu.Unlock()
return s.stopped
}

119
sw-block/prototype/enginev2/sender_group.go

@@ -0,0 +1,119 @@
package enginev2
import (
"sort"
"sync"
)
// SenderGroup manages per-replica Senders with identity-preserving reconciliation.
// It is the V2 equivalent of ShipperGroup.
type SenderGroup struct {
mu sync.RWMutex
senders map[string]*Sender // keyed by ReplicaID
}
// NewSenderGroup creates an empty SenderGroup.
func NewSenderGroup() *SenderGroup {
return &SenderGroup{
senders: map[string]*Sender{},
}
}
// Reconcile diffs the current sender set against newEndpoints.
// Matching senders (same ReplicaID) are preserved with all state.
// Removed senders are stopped. New senders are created at the given epoch.
// Returns lists of added and removed ReplicaIDs.
func (sg *SenderGroup) Reconcile(newEndpoints map[string]Endpoint, epoch uint64) (added, removed []string) {
sg.mu.Lock()
defer sg.mu.Unlock()
// Stop and remove senders not in the new set.
for id, s := range sg.senders {
if _, keep := newEndpoints[id]; !keep {
s.Stop()
delete(sg.senders, id)
removed = append(removed, id)
}
}
// Add new senders; update endpoints and epoch for existing.
for id, ep := range newEndpoints {
if existing, ok := sg.senders[id]; ok {
existing.UpdateEndpoint(ep)
existing.UpdateEpoch(epoch)
} else {
sg.senders[id] = NewSender(id, ep, epoch)
added = append(added, id)
}
}
sort.Strings(added)
sort.Strings(removed)
return added, removed
}
// Sender returns the sender for a ReplicaID, or nil.
func (sg *SenderGroup) Sender(replicaID string) *Sender {
sg.mu.RLock()
defer sg.mu.RUnlock()
return sg.senders[replicaID]
}
// All returns all senders in deterministic order (sorted by ReplicaID).
func (sg *SenderGroup) All() []*Sender {
sg.mu.RLock()
defer sg.mu.RUnlock()
out := make([]*Sender, 0, len(sg.senders))
for _, s := range sg.senders {
out = append(out, s)
}
sort.Slice(out, func(i, j int) bool {
return out[i].ReplicaID < out[j].ReplicaID
})
return out
}
// Len returns the number of senders.
func (sg *SenderGroup) Len() int {
sg.mu.RLock()
defer sg.mu.RUnlock()
return len(sg.senders)
}
// StopAll stops all senders.
func (sg *SenderGroup) StopAll() {
sg.mu.Lock()
defer sg.mu.Unlock()
for _, s := range sg.senders {
s.Stop()
}
}
// InSyncCount returns the number of senders in StateInSync.
func (sg *SenderGroup) InSyncCount() int {
sg.mu.RLock()
defer sg.mu.RUnlock()
count := 0
for _, s := range sg.senders {
if s.State == StateInSync {
count++
}
}
return count
}
// InvalidateEpoch invalidates all active sessions that are bound to
// a stale epoch. Called after promotion/epoch bump.
func (sg *SenderGroup) InvalidateEpoch(currentEpoch uint64) int {
sg.mu.RLock()
defer sg.mu.RUnlock()
count := 0
for _, s := range sg.senders {
sess := s.Session()
if sess != nil && sess.Epoch < currentEpoch && sess.Active() {
s.InvalidateSession("epoch_bump", StateDisconnected)
count++
}
}
return count
}

203
sw-block/prototype/enginev2/sender_group_test.go

@@ -0,0 +1,203 @@
package enginev2
import "testing"
// === SenderGroup reconciliation ===
func TestSenderGroup_Reconcile_AddNew(t *testing.T) {
sg := NewSenderGroup()
eps := map[string]Endpoint{
"r1:9333": {DataAddr: "r1:9333", Version: 1},
"r2:9333": {DataAddr: "r2:9333", Version: 1},
}
added, removed := sg.Reconcile(eps, 1)
if len(added) != 2 || len(removed) != 0 {
t.Fatalf("added=%v removed=%v", added, removed)
}
if sg.Len() != 2 {
t.Fatalf("len: got %d, want 2", sg.Len())
}
}
func TestSenderGroup_Reconcile_RemoveStale(t *testing.T) {
sg := NewSenderGroup()
sg.Reconcile(map[string]Endpoint{
"r1:9333": {DataAddr: "r1:9333", Version: 1},
"r2:9333": {DataAddr: "r2:9333", Version: 1},
}, 1)
// Remove r2, keep r1.
_, removed := sg.Reconcile(map[string]Endpoint{
"r1:9333": {DataAddr: "r1:9333", Version: 1},
}, 1)
if len(removed) != 1 || removed[0] != "r2:9333" {
t.Fatalf("removed=%v, want [r2:9333]", removed)
}
if sg.Sender("r2:9333") != nil {
t.Fatal("r2 should be removed")
}
if sg.Sender("r1:9333") == nil {
t.Fatal("r1 should be preserved")
}
}
func TestSenderGroup_Reconcile_PreservesState(t *testing.T) {
sg := NewSenderGroup()
sg.Reconcile(map[string]Endpoint{
"r1:9333": {DataAddr: "r1:9333", Version: 1},
}, 1)
// Attach session and advance.
s := sg.Sender("r1:9333")
sess, _ := s.AttachSession(1, SessionCatchUp)
sess.SetRange(0, 100)
sess.UpdateProgress(50)
// Reconcile with same address — sender preserved.
sg.Reconcile(map[string]Endpoint{
"r1:9333": {DataAddr: "r1:9333", Version: 1},
}, 1)
s2 := sg.Sender("r1:9333")
if s2 != s {
t.Fatal("reconcile should preserve the same sender object")
}
if s2.Session() != sess {
t.Fatal("reconcile should preserve the session")
}
if !sess.Active() {
t.Fatal("session should still be active after same-address reconcile")
}
}
func TestSenderGroup_Reconcile_MixedUpdate(t *testing.T) {
sg := NewSenderGroup()
sg.Reconcile(map[string]Endpoint{
"r1:9333": {DataAddr: "r1:9333", Version: 1},
"r2:9333": {DataAddr: "r2:9333", Version: 1},
}, 1)
// Keep r1, remove r2, add r3.
added, removed := sg.Reconcile(map[string]Endpoint{
"r1:9333": {DataAddr: "r1:9333", Version: 1},
"r3:9333": {DataAddr: "r3:9333", Version: 1},
}, 1)
if len(added) != 1 || added[0] != "r3:9333" {
t.Fatalf("added=%v, want [r3:9333]", added)
}
if len(removed) != 1 || removed[0] != "r2:9333" {
t.Fatalf("removed=%v, want [r2:9333]", removed)
}
if sg.Len() != 2 {
t.Fatalf("len=%d, want 2", sg.Len())
}
}
func TestSenderGroup_Reconcile_EndpointChange_InvalidatesSession(t *testing.T) {
sg := NewSenderGroup()
sg.Reconcile(map[string]Endpoint{
"r1:9333": {DataAddr: "r1:9333", Version: 1},
}, 1)
s := sg.Sender("r1:9333")
sess, _ := s.AttachSession(1, SessionCatchUp)
// Same ReplicaID but new endpoint version.
sg.Reconcile(map[string]Endpoint{
"r1:9333": {DataAddr: "r1:9333", Version: 2},
}, 1)
if sess.Active() {
t.Fatal("endpoint version change should invalidate session")
}
if s.Session() != nil {
t.Fatal("session should be nil after endpoint change")
}
}
// === Epoch invalidation ===
func TestSenderGroup_InvalidateEpoch(t *testing.T) {
sg := NewSenderGroup()
sg.Reconcile(map[string]Endpoint{
"r1:9333": {DataAddr: "r1:9333", Version: 1},
"r2:9333": {DataAddr: "r2:9333", Version: 1},
}, 1)
// Both have sessions at epoch 1.
s1 := sg.Sender("r1:9333")
s2 := sg.Sender("r2:9333")
sess1, _ := s1.AttachSession(1, SessionCatchUp)
sess2, _ := s2.AttachSession(1, SessionCatchUp)
// Epoch bumps to 2. Both sessions stale.
count := sg.InvalidateEpoch(2)
if count != 2 {
t.Fatalf("should invalidate 2 sessions, got %d", count)
}
if sess1.Active() || sess2.Active() {
t.Fatal("both sessions should be invalidated")
}
if s1.State != StateDisconnected || s2.State != StateDisconnected {
t.Fatal("senders should be disconnected after epoch invalidation")
}
}
func TestSenderGroup_InvalidateEpoch_SkipsCurrentEpoch(t *testing.T) {
sg := NewSenderGroup()
sg.Reconcile(map[string]Endpoint{
"r1:9333": {DataAddr: "r1:9333", Version: 1},
}, 2)
s := sg.Sender("r1:9333")
sess, _ := s.AttachSession(2, SessionCatchUp) // epoch 2 session
// Invalidate epoch 2 — session AT epoch 2 should NOT be invalidated.
count := sg.InvalidateEpoch(2)
if count != 0 {
t.Fatalf("should not invalidate current-epoch session, got %d", count)
}
if !sess.Active() {
t.Fatal("current-epoch session should remain active")
}
}
func TestSenderGroup_StopAll(t *testing.T) {
sg := NewSenderGroup()
sg.Reconcile(map[string]Endpoint{
"r1:9333": {DataAddr: "r1:9333", Version: 1},
"r2:9333": {DataAddr: "r2:9333", Version: 1},
}, 1)
sg.StopAll()
for _, s := range sg.All() {
if !s.Stopped() {
t.Fatalf("%s should be stopped", s.ReplicaID)
}
}
}
func TestSenderGroup_All_DeterministicOrder(t *testing.T) {
sg := NewSenderGroup()
sg.Reconcile(map[string]Endpoint{
"r3:9333": {DataAddr: "r3:9333", Version: 1},
"r1:9333": {DataAddr: "r1:9333", Version: 1},
"r2:9333": {DataAddr: "r2:9333", Version: 1},
}, 1)
all := sg.All()
if len(all) != 3 {
t.Fatalf("len=%d, want 3", len(all))
}
expected := []string{"r1:9333", "r2:9333", "r3:9333"}
for i, exp := range expected {
if all[i].ReplicaID != exp {
t.Fatalf("all[%d]=%s, want %s", i, all[i].ReplicaID, exp)
}
}
}

407
sw-block/prototype/enginev2/sender_test.go

@@ -0,0 +1,407 @@
package enginev2
import "testing"
// === Sender lifecycle ===
func TestSender_NewSender_Disconnected(t *testing.T) {
s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", CtrlAddr: "r1:9334", Version: 1}, 1)
if s.State != StateDisconnected {
t.Fatalf("new sender should be Disconnected, got %s", s.State)
}
if s.Session() != nil {
t.Fatal("new sender should have no session")
}
}
func TestSender_AttachSession_Success(t *testing.T) {
s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", Version: 1}, 1)
sess, err := s.AttachSession(1, SessionCatchUp)
if err != nil {
t.Fatal(err)
}
if sess.Kind != SessionCatchUp {
t.Fatalf("session kind: got %s, want catchup", sess.Kind)
}
if sess.Epoch != 1 {
t.Fatalf("session epoch: got %d, want 1", sess.Epoch)
}
if !sess.Active() {
t.Fatal("session should be active")
}
// AttachSession is ownership-only — sender stays Disconnected until BeginConnect.
if s.State != StateDisconnected {
t.Fatalf("sender state after attach: got %s, want disconnected (ownership-only)", s.State)
}
}
func TestSender_AttachSession_RejectsDouble(t *testing.T) {
s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", Version: 1}, 1)
_, err := s.AttachSession(1, SessionCatchUp)
if err != nil {
t.Fatal(err)
}
_, err = s.AttachSession(1, SessionBootstrap)
if err == nil {
t.Fatal("should reject second attach while session active")
}
}
func TestSender_CompleteSession_TransitionsInSync(t *testing.T) {
s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", Version: 1}, 1)
sess, _ := s.AttachSession(1, SessionCatchUp)
// Must execute full lifecycle before completing.
s.BeginConnect(sess.ID)
s.RecordHandshake(sess.ID, 5, 10)
s.BeginCatchUp(sess.ID)
s.RecordCatchUpProgress(sess.ID, 10) // converged
if !s.CompleteSessionByID(sess.ID) {
t.Fatal("completion should succeed when converged")
}
if s.State != StateInSync {
t.Fatalf("after complete: got %s, want in_sync", s.State)
}
if s.Session() != nil {
t.Fatal("session should be nil after complete")
}
if sess.Active() {
t.Fatal("completed session should not be active")
}
}
func TestSender_SupersedeSession(t *testing.T) {
s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", Version: 1}, 1)
old, _ := s.AttachSession(1, SessionCatchUp)
s.UpdateEpoch(2) // epoch bumps — old session invalidated by UpdateEpoch
fresh := s.SupersedeSession(SessionReassign, "explicit_supersede")
if old.Active() {
t.Fatal("old session should be invalidated")
}
// Invalidated by UpdateEpoch, not by SupersedeSession (already dead).
if old.InvalidateReason == "" {
t.Fatal("old session should have invalidation reason")
}
if !fresh.Active() {
t.Fatal("new session should be active")
}
if fresh.Epoch != 2 {
t.Fatalf("new session epoch: got %d, want 2", fresh.Epoch)
}
}
func TestSender_UpdateEndpoint_InvalidatesSession(t *testing.T) {
s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", Version: 1}, 1)
sess, _ := s.AttachSession(1, SessionCatchUp)
s.UpdateEndpoint(Endpoint{DataAddr: "r1:9444", Version: 2})
if sess.Active() {
t.Fatal("session should be invalidated after endpoint change")
}
if sess.InvalidateReason != "endpoint_changed" {
t.Fatalf("invalidation reason: got %q", sess.InvalidateReason)
}
if s.State != StateDisconnected {
t.Fatalf("sender should be disconnected after endpoint change, got %s", s.State)
}
if s.Session() != nil {
t.Fatal("session should be nil after endpoint change")
}
}
func TestSender_UpdateEndpoint_SameAddr_PreservesSession(t *testing.T) {
s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", Version: 1}, 1)
sess, _ := s.AttachSession(1, SessionCatchUp)
s.UpdateEndpoint(Endpoint{DataAddr: "r1:9333", Version: 1})
if !sess.Active() {
t.Fatal("same-address update should preserve session")
}
}
func TestSender_UpdateEndpoint_CtrlAddrOnly_InvalidatesSession(t *testing.T) {
s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", CtrlAddr: "r1:9334", Version: 1}, 1)
sess, _ := s.AttachSession(1, SessionCatchUp)
s.UpdateEndpoint(Endpoint{DataAddr: "r1:9333", CtrlAddr: "r1:9444", Version: 1})
if sess.Active() {
t.Fatal("CtrlAddr-only change should invalidate session")
}
if s.State != StateDisconnected {
t.Fatalf("sender should be disconnected, got %s", s.State)
}
}
func TestSender_Stop_InvalidatesSession(t *testing.T) {
s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", Version: 1}, 1)
sess, _ := s.AttachSession(1, SessionCatchUp)
s.Stop()
if sess.Active() {
t.Fatal("session should be invalidated after stop")
}
if !s.Stopped() {
t.Fatal("sender should be stopped")
}
// Attach after stop fails.
_, err := s.AttachSession(1, SessionBootstrap)
if err == nil {
t.Fatal("attach after stop should fail")
}
}
func TestSender_InvalidateSession_TargetState(t *testing.T) {
s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", Version: 1}, 1)
sess, _ := s.AttachSession(1, SessionCatchUp)
s.InvalidateSession("timeout", StateNeedsRebuild)
if sess.Active() {
t.Fatal("session should be invalidated")
}
if s.State != StateNeedsRebuild {
t.Fatalf("sender state: got %s, want needs_rebuild", s.State)
}
}
// === Session lifecycle ===
func TestSession_Advance_ValidTransitions(t *testing.T) {
sess := newRecoverySession("r1", 1, SessionCatchUp)
if !sess.Advance(PhaseConnecting) {
t.Fatal("init → connecting should succeed")
}
if !sess.Advance(PhaseHandshake) {
t.Fatal("connecting → handshake should succeed")
}
if !sess.Advance(PhaseCatchUp) {
t.Fatal("handshake → catchup should succeed")
}
if !sess.Advance(PhaseCompleted) {
t.Fatal("catchup → completed should succeed")
}
}
func TestSession_Advance_RejectsInvalidJump(t *testing.T) {
sess := newRecoverySession("r1", 1, SessionCatchUp)
// init → catchup is not valid (must go through connecting, handshake)
if sess.Advance(PhaseCatchUp) {
t.Fatal("init → catchup should be rejected")
}
// init → completed is not valid
if sess.Advance(PhaseCompleted) {
t.Fatal("init → completed should be rejected")
}
}
func TestSession_Advance_StopsOnInvalidate(t *testing.T) {
sess := newRecoverySession("r1", 1, SessionCatchUp)
sess.Advance(PhaseConnecting)
sess.Advance(PhaseHandshake)
sess.invalidate("test")
if sess.Advance(PhaseCatchUp) {
t.Fatal("advance after invalidate should fail")
}
}
func TestSender_AttachSession_RejectsEpochMismatch(t *testing.T) {
s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", Version: 1}, 1)
_, err := s.AttachSession(2, SessionCatchUp)
if err == nil {
t.Fatal("should reject session at epoch 2 when sender is at epoch 1")
}
}
func TestSender_UpdateEpoch_InvalidatesStaleSession(t *testing.T) {
s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", Version: 1}, 1)
sess, _ := s.AttachSession(1, SessionCatchUp)
s.UpdateEpoch(2)
if sess.Active() {
t.Fatal("session at epoch 1 should be invalidated after UpdateEpoch(2)")
}
if s.Epoch != 2 {
t.Fatalf("sender epoch should be 2, got %d", s.Epoch)
}
if s.State != StateDisconnected {
t.Fatalf("sender should be disconnected after epoch bump, got %s", s.State)
}
// Can now attach at epoch 2.
sess2, err := s.AttachSession(2, SessionCatchUp)
if err != nil {
t.Fatalf("attach at new epoch should succeed: %v", err)
}
if sess2.Epoch != 2 {
t.Fatalf("new session epoch: got %d, want 2", sess2.Epoch)
}
}
func TestSession_Progress_StopsOnComplete(t *testing.T) {
sess := newRecoverySession("r1", 1, SessionCatchUp)
sess.SetRange(0, 100)
sess.UpdateProgress(50)
if sess.Converged() {
t.Fatal("should not converge at 50/100")
}
sess.complete()
if sess.UpdateProgress(100) {
t.Fatal("update after complete should return false")
}
}
func TestSession_Converged(t *testing.T) {
sess := newRecoverySession("r1", 1, SessionCatchUp)
sess.SetRange(0, 10)
sess.UpdateProgress(9)
if sess.Converged() {
t.Fatal("9 < 10: not converged")
}
sess.UpdateProgress(10)
if !sess.Converged() {
t.Fatal("10 >= 10: should be converged")
}
}
// === Bridge tests: ownership invariants matching distsim scenarios ===
func TestBridge_StaleCompletion_AfterSupersede_HasNoEffect(t *testing.T) {
// Matches distsim TestP04a_StaleCompletion_AfterSupersede_Rejected.
s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", Version: 1}, 1)
// First session.
sess1, _ := s.AttachSession(1, SessionCatchUp)
sess1.Advance(PhaseConnecting)
sess1.Advance(PhaseHandshake)
sess1.Advance(PhaseCatchUp)
// Supersede with new session.
s.UpdateEpoch(2)
sess2, _ := s.AttachSession(2, SessionCatchUp)
// Old session: advance/complete has no effect (already invalidated).
if sess1.Advance(PhaseCompleted) {
t.Fatal("stale session should not advance to completed")
}
if sess1.Active() {
t.Fatal("old session should be inactive")
}
// New session: still active and owns the sender.
if !sess2.Active() {
t.Fatal("new session should be active")
}
if s.Session() != sess2 {
t.Fatal("sender should own the new session")
}
// Stale completion by OLD session ID — REJECTED by identity check.
if s.CompleteSessionByID(sess1.ID) {
t.Fatal("stale completion with old session ID must be rejected")
}
// Sender must NOT have moved to InSync.
if s.State == StateInSync {
t.Fatal("sender must not be InSync after stale completion")
}
// New session must still be active.
if !sess2.Active() {
t.Fatal("new session must still be active after stale completion rejected")
}
// Correct completion by NEW session ID — requires full execution path.
s.BeginConnect(sess2.ID)
s.RecordHandshake(sess2.ID, 0, 10)
s.BeginCatchUp(sess2.ID)
s.RecordCatchUpProgress(sess2.ID, 10)
if !s.CompleteSessionByID(sess2.ID) {
t.Fatal("completion with correct session ID should succeed after convergence")
}
if s.State != StateInSync {
t.Fatalf("sender should be InSync after correct completion, got %s", s.State)
}
}
func TestBridge_EpochBump_RejectedCompletion(t *testing.T) {
// Matches distsim TestP04a_EpochBumpDuringCatchup_InvalidatesSession.
s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", Version: 1}, 1)
sess, _ := s.AttachSession(1, SessionCatchUp)
sess.Advance(PhaseConnecting)
// Epoch bumps — session invalidated.
s.UpdateEpoch(2)
// Attempting to advance the old session fails.
if sess.Advance(PhaseHandshake) {
t.Fatal("stale session should not advance after epoch bump")
}
// Attempting to attach at old epoch fails.
_, err := s.AttachSession(1, SessionCatchUp)
if err == nil {
t.Fatal("attach at stale epoch should fail")
}
// Attach at new epoch succeeds.
sess2, err := s.AttachSession(2, SessionCatchUp)
if err != nil {
t.Fatalf("attach at new epoch should succeed: %v", err)
}
if sess2.Epoch != 2 {
t.Fatalf("new session epoch=%d, want 2", sess2.Epoch)
}
}
func TestBridge_EndpointChange_InvalidatesAndAllowsNewSession(t *testing.T) {
// Matches distsim TestP04a_EndpointChangeDuringCatchup_InvalidatesSession.
s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", Version: 1}, 1)
sess, _ := s.AttachSession(1, SessionCatchUp)
// Endpoint changes.
s.UpdateEndpoint(Endpoint{DataAddr: "r1:9444", Version: 2})
// Old session dead.
if sess.Active() {
t.Fatal("session should be invalidated")
}
// New session can be attached (same epoch, new endpoint).
sess2, err := s.AttachSession(1, SessionCatchUp)
if err != nil {
t.Fatalf("new session after endpoint change: %v", err)
}
if !sess2.Active() {
t.Fatal("new session should be active")
}
}
func TestSession_DoubleInvalidate_Safe(t *testing.T) {
sess := newRecoverySession("r1", 1, SessionCatchUp)
sess.invalidate("first")
sess.invalidate("second") // should not panic or change reason
if sess.InvalidateReason != "first" {
t.Fatalf("reason should be first, got %q", sess.InvalidateReason)
}
}

151
sw-block/prototype/enginev2/session.go

@@ -0,0 +1,151 @@
package enginev2
import (
"sync"
"sync/atomic"
)
// SessionKind identifies how the recovery session was created.
type SessionKind string
const (
SessionBootstrap SessionKind = "bootstrap" // fresh replica, no prior state
SessionCatchUp SessionKind = "catchup" // WAL gap recovery
SessionRebuild SessionKind = "rebuild" // full extent + WAL rebuild
SessionReassign SessionKind = "reassign" // address change recovery
)
// SessionPhase tracks progress within a recovery session.
type SessionPhase string
const (
PhaseInit SessionPhase = "init"
PhaseConnecting SessionPhase = "connecting"
PhaseHandshake SessionPhase = "handshake"
PhaseCatchUp SessionPhase = "catchup"
PhaseCompleted SessionPhase = "completed"
PhaseInvalidated SessionPhase = "invalidated"
)
// sessionIDCounter generates unique session IDs across all senders.
var sessionIDCounter atomic.Uint64
// RecoverySession represents one recovery attempt for a specific replica
// at a specific epoch. It is owned by a Sender and has exclusive authority
// to transition the replica through connecting → handshake → catchup → complete.
//
// Each session has a unique ID. Stale completions are rejected by ID, not
// by pointer comparison. This prevents old sessions from mutating state
// even if they retain a reference to the sender.
//
// Lifecycle rules:
// - At most one active session per Sender
// - Session is bound to an epoch; epoch bump invalidates it
// - Session is bound to an endpoint; address change invalidates it
// - Completed sessions release ownership back to the Sender
// - Invalidated sessions are dead and cannot be reused
type RecoverySession struct {
mu sync.Mutex
ID uint64 // unique, monotonic, never reused
ReplicaID string
Epoch uint64
Kind SessionKind
Phase SessionPhase
InvalidateReason string // non-empty when invalidated
// Progress tracking.
StartLSN uint64 // gap start (exclusive)
TargetLSN uint64 // gap end (inclusive)
RecoveredTo uint64 // highest LSN recovered so far
}
func newRecoverySession(replicaID string, epoch uint64, kind SessionKind) *RecoverySession {
return &RecoverySession{
ID: sessionIDCounter.Add(1),
ReplicaID: replicaID,
Epoch: epoch,
Kind: kind,
Phase: PhaseInit,
}
}
// Active returns true if the session has not been completed or invalidated.
func (rs *RecoverySession) Active() bool {
rs.mu.Lock()
defer rs.mu.Unlock()
return rs.Phase != PhaseCompleted && rs.Phase != PhaseInvalidated
}
// validTransitions defines the allowed phase transitions.
// Each phase maps to the set of phases it can transition to.
var validTransitions = map[SessionPhase]map[SessionPhase]bool{
PhaseInit: {PhaseConnecting: true, PhaseInvalidated: true},
PhaseConnecting: {PhaseHandshake: true, PhaseInvalidated: true},
PhaseHandshake: {PhaseCatchUp: true, PhaseCompleted: true, PhaseInvalidated: true},
PhaseCatchUp: {PhaseCompleted: true, PhaseInvalidated: true},
}
// Advance moves the session to the next phase. Returns false if the
// transition is not valid (wrong source phase, already terminal, or
// illegal jump). Enforces the lifecycle:
//
// init → connecting → handshake → catchup → completed
// ↘ invalidated (from any non-terminal)
func (rs *RecoverySession) Advance(phase SessionPhase) bool {
rs.mu.Lock()
defer rs.mu.Unlock()
if rs.Phase == PhaseCompleted || rs.Phase == PhaseInvalidated {
return false
}
allowed := validTransitions[rs.Phase]
if !allowed[phase] {
return false
}
rs.Phase = phase
return true
}
// UpdateProgress records catch-up progress. Returns false if the session
// is already terminal (completed or invalidated).
func (rs *RecoverySession) UpdateProgress(recoveredTo uint64) bool {
rs.mu.Lock()
defer rs.mu.Unlock()
if rs.Phase == PhaseCompleted || rs.Phase == PhaseInvalidated {
return false
}
if recoveredTo > rs.RecoveredTo {
rs.RecoveredTo = recoveredTo
}
return true
}
// SetRange sets the recovery LSN range.
func (rs *RecoverySession) SetRange(start, target uint64) {
rs.mu.Lock()
defer rs.mu.Unlock()
rs.StartLSN = start
rs.TargetLSN = target
}
// Converged returns true if recovery has reached the target.
func (rs *RecoverySession) Converged() bool {
rs.mu.Lock()
defer rs.mu.Unlock()
return rs.TargetLSN > 0 && rs.RecoveredTo >= rs.TargetLSN
}
func (rs *RecoverySession) complete() {
rs.mu.Lock()
defer rs.mu.Unlock()
rs.Phase = PhaseCompleted
}
func (rs *RecoverySession) invalidate(reason string) {
rs.mu.Lock()
defer rs.mu.Unlock()
if rs.Phase == PhaseCompleted || rs.Phase == PhaseInvalidated {
return
}
rs.Phase = PhaseInvalidated
rs.InvalidateReason = reason
}

162
sw-block/prototype/fsmv2/apply.go

@ -0,0 +1,162 @@
package fsmv2
func (f *FSM) Apply(evt Event) ([]Action, error) {
switch evt.Kind {
case EventEpochChanged:
if evt.Epoch <= f.Epoch {
return nil, nil
}
f.Epoch = evt.Epoch
switch f.State {
case StateInSync:
f.State = StateLagging
f.clearCatchup()
f.clearRebuild()
return []Action{ActionRevokeSyncEligibility}, nil
case StateCatchingUp, StatePromotionHold, StateRebuilding, StateCatchUpAfterBuild:
f.State = StateLagging
f.clearCatchup()
f.clearRebuild()
return []Action{ActionAbortRecovery, ActionRevokeSyncEligibility}, nil
default:
f.clearCatchup()
f.clearRebuild()
return nil, nil
}
case EventFatal:
f.State = StateFailed
f.clearCatchup()
f.clearRebuild()
return []Action{ActionFailReplica, ActionRevokeSyncEligibility}, nil
}
switch f.State {
case StateBootstrapping:
switch evt.Kind {
case EventBootstrapComplete:
f.ReplicaFlushedLSN = evt.ReplicaFlushedLSN
f.State = StateInSync
return []Action{ActionGrantSyncEligibility}, nil
case EventDisconnect:
f.State = StateLagging
return nil, nil
}
case StateInSync:
switch evt.Kind {
case EventDurableProgress:
if evt.ReplicaFlushedLSN < f.ReplicaFlushedLSN {
return nil, invalid(f.State, evt.Kind)
}
f.ReplicaFlushedLSN = evt.ReplicaFlushedLSN
return nil, nil
case EventDisconnect:
f.State = StateLagging
return []Action{ActionRevokeSyncEligibility}, nil
}
case StateLagging:
switch evt.Kind {
case EventReconnectCatchup:
f.ReplicaFlushedLSN = evt.ReplicaFlushedLSN
f.CatchupStartLSN = evt.ReplicaFlushedLSN
f.CatchupTargetLSN = evt.TargetLSN
f.PromotionBarrierLSN = evt.TargetLSN
f.RecoveryReservationID = evt.ReservationID
f.ReservationExpiry = evt.ReservationTTL
f.State = StateCatchingUp
return []Action{ActionStartCatchup}, nil
case EventReconnectRebuild:
f.State = StateNeedsRebuild
return []Action{ActionRevokeSyncEligibility}, nil
}
case StateCatchingUp:
switch evt.Kind {
case EventCatchupProgress:
if evt.ReplicaFlushedLSN < f.ReplicaFlushedLSN {
return nil, invalid(f.State, evt.Kind)
}
f.ReplicaFlushedLSN = evt.ReplicaFlushedLSN
if evt.ReplicaFlushedLSN >= f.PromotionBarrierLSN {
f.State = StatePromotionHold
f.PromotionHoldUntil = evt.PromotionHoldTill
return []Action{ActionEnterPromotionHold}, nil
}
return nil, nil
case EventRetentionLost, EventCatchupTimeout:
f.State = StateNeedsRebuild
f.clearCatchup()
return []Action{ActionAbortRecovery}, nil
case EventDisconnect:
f.State = StateLagging
f.clearCatchup()
return []Action{ActionAbortRecovery}, nil
}
case StatePromotionHold:
switch evt.Kind {
case EventDurableProgress:
if evt.ReplicaFlushedLSN < f.ReplicaFlushedLSN {
return nil, invalid(f.State, evt.Kind)
}
f.ReplicaFlushedLSN = evt.ReplicaFlushedLSN
return nil, nil
case EventPromotionHealthy:
if evt.Now < f.PromotionHoldUntil {
return nil, nil
}
f.State = StateInSync
f.clearCatchup()
return []Action{ActionGrantSyncEligibility}, nil
case EventDisconnect:
f.State = StateLagging
f.clearCatchup()
return []Action{ActionRevokeSyncEligibility}, nil
}
case StateNeedsRebuild:
switch evt.Kind {
case EventStartRebuild:
f.State = StateRebuilding
f.SnapshotID = evt.SnapshotID
f.SnapshotCpLSN = evt.SnapshotCpLSN
f.RecoveryReservationID = evt.ReservationID
f.ReservationExpiry = evt.ReservationTTL
return []Action{ActionStartRebuild}, nil
}
case StateRebuilding:
switch evt.Kind {
case EventRebuildBaseApplied:
f.State = StateCatchUpAfterBuild
f.ReplicaFlushedLSN = f.SnapshotCpLSN
f.CatchupStartLSN = f.SnapshotCpLSN
f.CatchupTargetLSN = evt.TargetLSN
f.PromotionBarrierLSN = evt.TargetLSN
return []Action{ActionStartCatchup}, nil
case EventRetentionLost, EventRebuildTooSlow, EventDisconnect:
f.State = StateNeedsRebuild
f.clearCatchup()
f.clearRebuild()
return []Action{ActionAbortRecovery}, nil
}
case StateCatchUpAfterBuild:
switch evt.Kind {
case EventCatchupProgress:
if evt.ReplicaFlushedLSN < f.ReplicaFlushedLSN {
return nil, invalid(f.State, evt.Kind)
}
f.ReplicaFlushedLSN = evt.ReplicaFlushedLSN
if evt.ReplicaFlushedLSN >= f.PromotionBarrierLSN {
f.State = StatePromotionHold
f.PromotionHoldUntil = evt.PromotionHoldTill
return []Action{ActionEnterPromotionHold}, nil
}
return nil, nil
case EventRetentionLost, EventCatchupTimeout, EventDisconnect:
f.State = StateNeedsRebuild
f.clearCatchup()
f.clearRebuild()
return []Action{ActionAbortRecovery}, nil
}
case StateFailed:
return nil, nil
}
return nil, invalid(f.State, evt.Kind)
}

37
sw-block/prototype/fsmv2/events.go

@ -0,0 +1,37 @@
package fsmv2
type EventKind string
const (
EventBootstrapComplete EventKind = "BootstrapComplete"
EventDisconnect EventKind = "Disconnect"
EventReconnectCatchup EventKind = "ReconnectCatchup"
EventReconnectRebuild EventKind = "ReconnectRebuild"
EventDurableProgress EventKind = "DurableProgress"
EventCatchupProgress EventKind = "CatchupProgress"
EventPromotionHealthy EventKind = "PromotionHealthy"
EventStartRebuild EventKind = "StartRebuild"
EventRebuildBaseApplied EventKind = "RebuildBaseApplied"
EventRetentionLost EventKind = "RetentionLost"
EventCatchupTimeout EventKind = "CatchupTimeout"
EventRebuildTooSlow EventKind = "RebuildTooSlow"
EventEpochChanged EventKind = "EpochChanged"
EventFatal EventKind = "Fatal"
)
type Event struct {
Kind EventKind
Epoch uint64
Now uint64
ReplicaFlushedLSN uint64
TargetLSN uint64
PromotionHoldTill uint64
SnapshotID string
SnapshotCpLSN uint64
ReservationID string
ReservationTTL uint64
}

73
sw-block/prototype/fsmv2/fsm.go

@ -0,0 +1,73 @@
package fsmv2
import "fmt"
type State string
const (
StateBootstrapping State = "Bootstrapping"
StateInSync State = "InSync"
StateLagging State = "Lagging"
StateCatchingUp State = "CatchingUp"
StatePromotionHold State = "PromotionHold"
StateNeedsRebuild State = "NeedsRebuild"
StateRebuilding State = "Rebuilding"
StateCatchUpAfterBuild State = "CatchUpAfterRebuild"
StateFailed State = "Failed"
)
type Action string
const (
ActionNone Action = "None"
ActionGrantSyncEligibility Action = "GrantSyncEligibility"
ActionRevokeSyncEligibility Action = "RevokeSyncEligibility"
ActionStartCatchup Action = "StartCatchup"
ActionEnterPromotionHold Action = "EnterPromotionHold"
ActionStartRebuild Action = "StartRebuild"
ActionAbortRecovery Action = "AbortRecovery"
ActionFailReplica Action = "FailReplica"
)
type FSM struct {
State State
Epoch uint64
ReplicaFlushedLSN uint64
CatchupStartLSN uint64
CatchupTargetLSN uint64
PromotionBarrierLSN uint64
PromotionHoldUntil uint64
SnapshotID string
SnapshotCpLSN uint64
RecoveryReservationID string
ReservationExpiry uint64
}
func New(epoch uint64) *FSM {
return &FSM{State: StateBootstrapping, Epoch: epoch}
}
func (f *FSM) IsSyncEligible() bool {
return f.State == StateInSync
}
func (f *FSM) clearCatchup() {
f.CatchupStartLSN = 0
f.CatchupTargetLSN = 0
f.PromotionBarrierLSN = 0
f.PromotionHoldUntil = 0
f.RecoveryReservationID = ""
f.ReservationExpiry = 0
}
func (f *FSM) clearRebuild() {
f.SnapshotID = ""
f.SnapshotCpLSN = 0
}
func invalid(state State, kind EventKind) error {
return fmt.Errorf("fsmv2: invalid event %s in state %s", kind, state)
}

95
sw-block/prototype/fsmv2/fsm_test.go

@ -0,0 +1,95 @@
package fsmv2
import "testing"
func mustApply(t *testing.T, f *FSM, evt Event) []Action {
t.Helper()
actions, err := f.Apply(evt)
if err != nil {
t.Fatalf("apply %s: %v", evt.Kind, err)
}
return actions
}
func TestFSMBootstrapToInSync(t *testing.T) {
f := New(7)
mustApply(t, f, Event{Kind: EventBootstrapComplete, ReplicaFlushedLSN: 10})
if f.State != StateInSync || !f.IsSyncEligible() || f.ReplicaFlushedLSN != 10 {
t.Fatalf("unexpected bootstrap result: state=%s eligible=%v lsn=%d", f.State, f.IsSyncEligible(), f.ReplicaFlushedLSN)
}
}
func TestFSMCatchupPromotionHoldFlow(t *testing.T) {
f := New(3)
mustApply(t, f, Event{Kind: EventBootstrapComplete, ReplicaFlushedLSN: 5})
mustApply(t, f, Event{Kind: EventDisconnect})
mustApply(t, f, Event{Kind: EventReconnectCatchup, ReplicaFlushedLSN: 5, TargetLSN: 20, ReservationID: "r1", ReservationTTL: 100})
if f.State != StateCatchingUp {
t.Fatalf("expected catching up, got %s", f.State)
}
mustApply(t, f, Event{Kind: EventCatchupProgress, ReplicaFlushedLSN: 20, PromotionHoldTill: 30})
if f.State != StatePromotionHold {
t.Fatalf("expected promotion hold, got %s", f.State)
}
mustApply(t, f, Event{Kind: EventPromotionHealthy, Now: 29})
if f.State != StatePromotionHold {
t.Fatalf("hold exited too early: %s", f.State)
}
mustApply(t, f, Event{Kind: EventPromotionHealthy, Now: 30})
if f.State != StateInSync || !f.IsSyncEligible() {
t.Fatalf("expected insync after hold, got %s eligible=%v", f.State, f.IsSyncEligible())
}
}
func TestFSMRebuildFlow(t *testing.T) {
f := New(11)
mustApply(t, f, Event{Kind: EventDisconnect})
mustApply(t, f, Event{Kind: EventReconnectRebuild})
if f.State != StateNeedsRebuild {
t.Fatalf("expected needs rebuild, got %s", f.State)
}
mustApply(t, f, Event{Kind: EventStartRebuild, SnapshotID: "snap-1", SnapshotCpLSN: 100, ReservationID: "rr", ReservationTTL: 200})
if f.State != StateRebuilding {
t.Fatalf("expected rebuilding, got %s", f.State)
}
mustApply(t, f, Event{Kind: EventRebuildBaseApplied, TargetLSN: 140})
if f.State != StateCatchUpAfterBuild || f.ReplicaFlushedLSN != 100 {
t.Fatalf("unexpected rebuild-base state=%s lsn=%d", f.State, f.ReplicaFlushedLSN)
}
mustApply(t, f, Event{Kind: EventCatchupProgress, ReplicaFlushedLSN: 140, PromotionHoldTill: 150})
mustApply(t, f, Event{Kind: EventPromotionHealthy, Now: 150})
if f.State != StateInSync || f.SnapshotID != "snap-1" {
t.Fatalf("expected insync after rebuild, got state=%s snapshot=%q", f.State, f.SnapshotID)
}
}
func TestFSMEpochChangeAbortsRecovery(t *testing.T) {
f := New(1)
mustApply(t, f, Event{Kind: EventBootstrapComplete, ReplicaFlushedLSN: 1})
mustApply(t, f, Event{Kind: EventDisconnect})
mustApply(t, f, Event{Kind: EventReconnectCatchup, ReplicaFlushedLSN: 1, TargetLSN: 5, ReservationID: "r1", ReservationTTL: 99})
mustApply(t, f, Event{Kind: EventEpochChanged, Epoch: 2})
if f.State != StateLagging || f.RecoveryReservationID != "" || f.IsSyncEligible() {
t.Fatalf("unexpected state after epoch change: state=%s reservation=%q eligible=%v", f.State, f.RecoveryReservationID, f.IsSyncEligible())
}
}
func TestFSMReservationLostNeedsRebuild(t *testing.T) {
f := New(5)
mustApply(t, f, Event{Kind: EventBootstrapComplete, ReplicaFlushedLSN: 9})
mustApply(t, f, Event{Kind: EventDisconnect})
mustApply(t, f, Event{Kind: EventReconnectCatchup, ReplicaFlushedLSN: 9, TargetLSN: 15, ReservationID: "r2", ReservationTTL: 80})
mustApply(t, f, Event{Kind: EventRetentionLost})
if f.State != StateNeedsRebuild {
t.Fatalf("expected needs rebuild after reservation lost, got %s", f.State)
}
}
func TestFSMDurableProgressWhileInSync(t *testing.T) {
f := New(2)
mustApply(t, f, Event{Kind: EventBootstrapComplete, ReplicaFlushedLSN: 4})
mustApply(t, f, Event{Kind: EventDurableProgress, ReplicaFlushedLSN: 8})
if f.ReplicaFlushedLSN != 8 || f.State != StateInSync {
t.Fatalf("unexpected in-sync durable progress: state=%s lsn=%d", f.State, f.ReplicaFlushedLSN)
}
}

BIN
sw-block/prototype/fsmv2/fsmv2.test.exe

37
sw-block/prototype/run-tests.ps1

@ -0,0 +1,37 @@
param(
[string[]]$Packages = @(
'./sw-block/prototype/fsmv2',
'./sw-block/prototype/volumefsm',
'./sw-block/prototype/distsim'
)
)
$ErrorActionPreference = 'Stop'
$root = Split-Path -Parent (Split-Path -Parent $PSScriptRoot)
Set-Location $root
$cacheDir = Join-Path $root '.gocache_v2'
$tmpDir = Join-Path $root '.gotmp_v2'
New-Item -ItemType Directory -Force -Path $cacheDir,$tmpDir | Out-Null
$env:GOCACHE = $cacheDir
$env:GOTMPDIR = $tmpDir
foreach ($pkg in $Packages) {
$name = Split-Path $pkg -Leaf
$out = Join-Path $root ("sw-block\prototype\{0}\{0}.test.exe" -f $name)
Write-Host "==> building $pkg"
go test -c -o $out $pkg
if (!(Test-Path $out)) {
throw "go test -c build failed for $pkg"
}
if ($LASTEXITCODE -ne 0) {
Write-Warning "go test -c reported a non-zero exit code for $pkg, but the test binary was produced. Continuing."
}
Write-Host "==> running $out"
cmd /c "cd /d $root && $out -test.v -test.count=1"
if ($LASTEXITCODE -ne 0) {
throw "test binary failed for $pkg"
}
}
Write-Host "Done."

148
sw-block/prototype/volumefsm/events.go

@ -0,0 +1,148 @@
package volumefsm
import fsmv2 "github.com/seaweedfs/seaweedfs/sw-block/prototype/fsmv2"
type EventKind string
const (
EventWriteCommitted EventKind = "WriteCommitted"
EventCheckpointAdvanced EventKind = "CheckpointAdvanced"
EventBarrierCompleted EventKind = "BarrierCompleted"
EventBootstrapReplica EventKind = "BootstrapReplica"
EventReplicaDisconnect EventKind = "ReplicaDisconnect"
EventReplicaReconnect EventKind = "ReplicaReconnect"
EventReplicaNeedsRebuild EventKind = "ReplicaNeedsRebuild"
EventReplicaCatchupProgress EventKind = "ReplicaCatchupProgress"
EventReplicaPromotionHealthy EventKind = "ReplicaPromotionHealthy"
EventReplicaStartRebuild EventKind = "ReplicaStartRebuild"
EventReplicaRebuildBaseApplied EventKind = "ReplicaRebuildBaseApplied"
EventReplicaReservationLost EventKind = "ReplicaReservationLost"
EventReplicaCatchupTimeout EventKind = "ReplicaCatchupTimeout"
EventReplicaRebuildTooSlow EventKind = "ReplicaRebuildTooSlow"
EventPrimaryLeaseLost EventKind = "PrimaryLeaseLost"
EventPromoteReplica EventKind = "PromoteReplica"
)
type Event struct {
Kind EventKind
ReplicaID string
LSN uint64
CheckpointLSN uint64
ReplicaFlushedLSN uint64
TargetLSN uint64
Now uint64
HoldUntil uint64
SnapshotID string
SnapshotCpLSN uint64
ReservationID string
ReservationTTL uint64
}
func (m *Model) Apply(evt Event) error {
switch evt.Kind {
case EventWriteCommitted:
// Advance the head to the caller-supplied LSN; if the event does not
// carry an LSN ahead of the head (e.g. LSN zero), auto-assign the next.
if evt.LSN > m.HeadLSN {
m.HeadLSN = evt.LSN
} else {
m.HeadLSN++
}
return nil
case EventCheckpointAdvanced:
if evt.CheckpointLSN > m.CheckpointLSN {
m.CheckpointLSN = evt.CheckpointLSN
}
return nil
case EventPrimaryLeaseLost:
m.PrimaryState = PrimaryLost
m.Epoch++
for _, r := range m.Replicas {
_, err := r.FSM.Apply(fsmv2.Event{Kind: fsmv2.EventEpochChanged, Epoch: m.Epoch})
if err != nil {
return err
}
}
return nil
case EventPromoteReplica:
m.PrimaryID = evt.ReplicaID
m.PrimaryState = PrimaryServing
m.Epoch++
for _, r := range m.Replicas {
_, err := r.FSM.Apply(fsmv2.Event{Kind: fsmv2.EventEpochChanged, Epoch: m.Epoch})
if err != nil {
return err
}
}
return nil
}
r := m.Replica(evt.ReplicaID)
if r == nil {
return nil
}
var fEvt fsmv2.Event
switch evt.Kind {
case EventBarrierCompleted:
fEvt = fsmv2.Event{Kind: fsmv2.EventDurableProgress, ReplicaFlushedLSN: evt.ReplicaFlushedLSN}
case EventBootstrapReplica:
fEvt = fsmv2.Event{Kind: fsmv2.EventBootstrapComplete, ReplicaFlushedLSN: evt.ReplicaFlushedLSN}
case EventReplicaDisconnect:
fEvt = fsmv2.Event{Kind: fsmv2.EventDisconnect}
case EventReplicaReconnect:
if evt.ReservationID != "" {
fEvt = fsmv2.Event{Kind: fsmv2.EventReconnectCatchup, ReplicaFlushedLSN: evt.ReplicaFlushedLSN, TargetLSN: evt.TargetLSN, ReservationID: evt.ReservationID, ReservationTTL: evt.ReservationTTL}
} else {
fEvt = fsmv2.Event{Kind: fsmv2.EventReconnectRebuild}
}
case EventReplicaNeedsRebuild:
fEvt = fsmv2.Event{Kind: fsmv2.EventReconnectRebuild}
case EventReplicaCatchupProgress:
fEvt = fsmv2.Event{Kind: fsmv2.EventCatchupProgress, ReplicaFlushedLSN: evt.ReplicaFlushedLSN, PromotionHoldTill: evt.HoldUntil}
case EventReplicaPromotionHealthy:
fEvt = fsmv2.Event{Kind: fsmv2.EventPromotionHealthy, Now: evt.Now}
case EventReplicaStartRebuild:
fEvt = fsmv2.Event{Kind: fsmv2.EventStartRebuild, SnapshotID: evt.SnapshotID, SnapshotCpLSN: evt.SnapshotCpLSN, ReservationID: evt.ReservationID, ReservationTTL: evt.ReservationTTL}
case EventReplicaRebuildBaseApplied:
fEvt = fsmv2.Event{Kind: fsmv2.EventRebuildBaseApplied, TargetLSN: evt.TargetLSN}
case EventReplicaReservationLost:
fEvt = fsmv2.Event{Kind: fsmv2.EventRetentionLost}
case EventReplicaCatchupTimeout:
fEvt = fsmv2.Event{Kind: fsmv2.EventCatchupTimeout}
case EventReplicaRebuildTooSlow:
fEvt = fsmv2.Event{Kind: fsmv2.EventRebuildTooSlow}
default:
return nil
}
_, err := r.FSM.Apply(fEvt)
return err
}
func (m *Model) EvaluateReconnect(replicaID string, flushedLSN, targetLSN uint64) (RecoveryDecision, error) {
decision := m.Planner.PlanReconnect(replicaID, flushedLSN, targetLSN)
r := m.Replica(replicaID)
if r == nil {
return decision, nil
}
switch decision.Disposition {
case RecoveryCatchup:
err := m.Apply(Event{
Kind: EventReplicaReconnect,
ReplicaID: replicaID,
ReplicaFlushedLSN: flushedLSN,
TargetLSN: targetLSN,
ReservationID: decision.ReservationID,
ReservationTTL: decision.ReservationTTL,
})
return decision, err
default:
if r.FSM.State == fsmv2.StateNeedsRebuild {
return decision, nil
}
err := m.Apply(Event{
Kind: EventReplicaNeedsRebuild,
ReplicaID: replicaID,
})
return decision, err
}
}

38
sw-block/prototype/volumefsm/format.go

@ -0,0 +1,38 @@
package volumefsm
import (
"fmt"
"sort"
"strings"
)
func FormatSnapshot(s Snapshot) string {
ids := make([]string, 0, len(s.Replicas))
for id := range s.Replicas {
ids = append(ids, id)
}
sort.Strings(ids)
parts := []string{
fmt.Sprintf("step=%s", s.Step),
fmt.Sprintf("epoch=%d", s.Epoch),
fmt.Sprintf("primary=%s/%s", s.PrimaryID, s.PrimaryState),
fmt.Sprintf("head=%d", s.HeadLSN),
fmt.Sprintf("write=%t:%s", s.WriteGate.Allowed, s.WriteGate.Reason),
fmt.Sprintf("ack=%t:%s", s.AckGate.Allowed, s.AckGate.Reason),
}
for _, id := range ids {
r := s.Replicas[id]
parts = append(parts, fmt.Sprintf("%s=%s@%d", id, r.State, r.FlushedLSN))
}
return strings.Join(parts, " ")
}
func FormatTrace(trace []Snapshot) string {
lines := make([]string, 0, len(trace))
for _, s := range trace {
lines = append(lines, FormatSnapshot(s))
}
return strings.Join(lines, "\n")
}

142
sw-block/prototype/volumefsm/model.go

@ -0,0 +1,142 @@
package volumefsm
import fsmv2 "github.com/seaweedfs/seaweedfs/sw-block/prototype/fsmv2"
type Mode string
const (
ModeBestEffort Mode = "best_effort"
ModeSyncAll Mode = "sync_all"
ModeSyncQuorum Mode = "sync_quorum"
)
type PrimaryState string
const (
PrimaryServing PrimaryState = "serving"
PrimaryDraining PrimaryState = "draining"
PrimaryLost PrimaryState = "lost"
)
type Replica struct {
ID string
FSM *fsmv2.FSM
}
type Model struct {
Epoch uint64
PrimaryID string
PrimaryState PrimaryState
Mode Mode
HeadLSN uint64
CheckpointLSN uint64
RequiredReplicaIDs []string
Replicas map[string]*Replica
Planner RecoveryPlanner
}
func New(primaryID string, mode Mode, epoch uint64, replicaIDs ...string) *Model {
m := &Model{
Epoch: epoch,
PrimaryID: primaryID,
PrimaryState: PrimaryServing,
Mode: mode,
Replicas: make(map[string]*Replica, len(replicaIDs)),
Planner: StaticRecoveryPlanner{},
}
for _, id := range replicaIDs {
m.Replicas[id] = &Replica{ID: id, FSM: fsmv2.New(epoch)}
m.RequiredReplicaIDs = append(m.RequiredReplicaIDs, id)
}
return m
}
func (m *Model) Replica(id string) *Replica {
return m.Replicas[id]
}
func (m *Model) SyncEligibleCount() int {
count := 0
for _, id := range m.RequiredReplicaIDs {
r := m.Replicas[id]
if r != nil && r.FSM.IsSyncEligible() {
count++
}
}
return count
}
func (m *Model) DurableReplicaCount(targetLSN uint64) int {
count := 0
for _, id := range m.RequiredReplicaIDs {
r := m.Replicas[id]
if r != nil && r.FSM.IsSyncEligible() && r.FSM.ReplicaFlushedLSN >= targetLSN {
count++
}
}
return count
}
func (m *Model) Quorum() int {
// Replication factor counts the primary plus all required replicas;
// a majority of that set is rf/2 + 1.
rf := len(m.RequiredReplicaIDs) + 1
return rf/2 + 1
}
func (m *Model) CanServeWrite() bool {
return m.WriteAdmission().Allowed
}
type AdmissionDecision struct {
Allowed bool
Reason string
}
func (m *Model) WriteAdmission() AdmissionDecision {
if m.PrimaryState != PrimaryServing {
return AdmissionDecision{Allowed: false, Reason: "primary_not_serving"}
}
switch m.Mode {
case ModeBestEffort:
return AdmissionDecision{Allowed: true, Reason: "best_effort_local_durable"}
case ModeSyncAll:
if m.SyncEligibleCount() == len(m.RequiredReplicaIDs) {
return AdmissionDecision{Allowed: true, Reason: "all_replicas_sync_eligible"}
}
return AdmissionDecision{Allowed: false, Reason: "required_replica_not_in_sync"}
case ModeSyncQuorum:
if 1+m.SyncEligibleCount() >= m.Quorum() {
return AdmissionDecision{Allowed: true, Reason: "quorum_sync_eligible"}
}
return AdmissionDecision{Allowed: false, Reason: "quorum_not_available"}
default:
return AdmissionDecision{Allowed: false, Reason: "unknown_mode"}
}
}
func (m *Model) CanAcknowledgeLSN(targetLSN uint64) bool {
return m.AckAdmission(targetLSN).Allowed
}
func (m *Model) AckAdmission(targetLSN uint64) AdmissionDecision {
if m.PrimaryState != PrimaryServing {
return AdmissionDecision{Allowed: false, Reason: "primary_not_serving"}
}
switch m.Mode {
case ModeBestEffort:
return AdmissionDecision{Allowed: true, Reason: "best_effort_local_durable"}
case ModeSyncAll:
if m.DurableReplicaCount(targetLSN) == len(m.RequiredReplicaIDs) {
return AdmissionDecision{Allowed: true, Reason: "all_replicas_durable"}
}
return AdmissionDecision{Allowed: false, Reason: "required_replica_not_durable"}
case ModeSyncQuorum:
if 1+m.DurableReplicaCount(targetLSN) >= m.Quorum() {
return AdmissionDecision{Allowed: true, Reason: "quorum_durable"}
}
return AdmissionDecision{Allowed: false, Reason: "durable_quorum_not_available"}
default:
return AdmissionDecision{Allowed: false, Reason: "unknown_mode"}
}
}

421
sw-block/prototype/volumefsm/model_test.go

@ -0,0 +1,421 @@
package volumefsm
import (
"strings"
"testing"
fsmv2 "github.com/seaweedfs/seaweedfs/sw-block/prototype/fsmv2"
)
type scriptedPlanner struct {
decision RecoveryDecision
}
func (s scriptedPlanner) PlanReconnect(replicaID string, flushedLSN, targetLSN uint64) RecoveryDecision {
return s.decision
}
func mustApply(t *testing.T, m *Model, evt Event) {
t.Helper()
if err := m.Apply(evt); err != nil {
t.Fatalf("apply %s: %v", evt.Kind, err)
}
}
func TestModelSyncAllBlocksOnLaggingReplica(t *testing.T) {
m := New("p1", ModeSyncAll, 1, "r1", "r2")
mustApply(t, m, Event{Kind: EventBootstrapReplica, ReplicaID: "r1", ReplicaFlushedLSN: 1})
mustApply(t, m, Event{Kind: EventBootstrapReplica, ReplicaID: "r2", ReplicaFlushedLSN: 1})
if !m.CanServeWrite() {
t.Fatal("sync_all should serve when all replicas are in sync")
}
mustApply(t, m, Event{Kind: EventReplicaDisconnect, ReplicaID: "r2"})
if m.CanServeWrite() {
t.Fatal("sync_all should block when one required replica lags")
}
}
func TestModelSyncQuorumSurvivesOneLaggingReplica(t *testing.T) {
m := New("p1", ModeSyncQuorum, 1, "r1", "r2")
mustApply(t, m, Event{Kind: EventBootstrapReplica, ReplicaID: "r1", ReplicaFlushedLSN: 1})
mustApply(t, m, Event{Kind: EventBootstrapReplica, ReplicaID: "r2", ReplicaFlushedLSN: 1})
mustApply(t, m, Event{Kind: EventReplicaDisconnect, ReplicaID: "r2"})
if !m.CanServeWrite() {
t.Fatal("sync_quorum should still serve with primary + one in-sync replica")
}
}
func TestModelCatchupFlowRestoresEligibility(t *testing.T) {
m := New("p1", ModeSyncAll, 1, "r1")
mustApply(t, m, Event{Kind: EventBootstrapReplica, ReplicaID: "r1", ReplicaFlushedLSN: 1})
mustApply(t, m, Event{Kind: EventReplicaDisconnect, ReplicaID: "r1"})
mustApply(t, m, Event{Kind: EventWriteCommitted, LSN: 10})
mustApply(t, m, Event{Kind: EventReplicaReconnect, ReplicaID: "r1", ReplicaFlushedLSN: 1, TargetLSN: 10, ReservationID: "res-1", ReservationTTL: 100})
if got := m.Replica("r1").FSM.State; got != fsmv2.StateCatchingUp {
t.Fatalf("expected catching up, got %s", got)
}
mustApply(t, m, Event{Kind: EventReplicaCatchupProgress, ReplicaID: "r1", ReplicaFlushedLSN: 10, HoldUntil: 20})
mustApply(t, m, Event{Kind: EventReplicaPromotionHealthy, ReplicaID: "r1", Now: 20})
if got := m.Replica("r1").FSM.State; got != fsmv2.StateInSync {
t.Fatalf("expected in sync, got %s", got)
}
if !m.CanServeWrite() {
t.Fatal("sync_all should serve after replica returns to in-sync")
}
}
func TestModelLongGapRebuildFlow(t *testing.T) {
m := New("p1", ModeBestEffort, 1, "r1")
mustApply(t, m, Event{Kind: EventReplicaDisconnect, ReplicaID: "r1"})
mustApply(t, m, Event{Kind: EventReplicaNeedsRebuild, ReplicaID: "r1"})
mustApply(t, m, Event{Kind: EventReplicaStartRebuild, ReplicaID: "r1", SnapshotID: "snap1", SnapshotCpLSN: 100, ReservationID: "rebuild-1", ReservationTTL: 200})
mustApply(t, m, Event{Kind: EventReplicaRebuildBaseApplied, ReplicaID: "r1", TargetLSN: 130})
mustApply(t, m, Event{Kind: EventReplicaCatchupProgress, ReplicaID: "r1", ReplicaFlushedLSN: 130, HoldUntil: 150})
mustApply(t, m, Event{Kind: EventReplicaPromotionHealthy, ReplicaID: "r1", Now: 150})
if got := m.Replica("r1").FSM.State; got != fsmv2.StateInSync {
t.Fatalf("expected in sync after rebuild, got %s", got)
}
}
func TestModelPrimaryLeaseLostFencesRecovery(t *testing.T) {
m := New("p1", ModeSyncAll, 1, "r1")
mustApply(t, m, Event{Kind: EventBootstrapReplica, ReplicaID: "r1", ReplicaFlushedLSN: 1})
mustApply(t, m, Event{Kind: EventReplicaDisconnect, ReplicaID: "r1"})
mustApply(t, m, Event{Kind: EventReplicaReconnect, ReplicaID: "r1", ReplicaFlushedLSN: 1, TargetLSN: 5, ReservationID: "res-2", ReservationTTL: 100})
mustApply(t, m, Event{Kind: EventPrimaryLeaseLost})
if m.PrimaryState != PrimaryLost {
t.Fatalf("expected lost primary, got %s", m.PrimaryState)
}
if got := m.Replica("r1").FSM.State; got != fsmv2.StateLagging {
t.Fatalf("expected lagging after fencing, got %s", got)
}
}
func TestModelPromoteReplicaChangesEpoch(t *testing.T) {
m := New("p1", ModeSyncQuorum, 1, "r1", "r2")
mustApply(t, m, Event{Kind: EventBootstrapReplica, ReplicaID: "r1", ReplicaFlushedLSN: 10})
oldEpoch := m.Epoch
mustApply(t, m, Event{Kind: EventPromoteReplica, ReplicaID: "r1"})
if m.PrimaryID != "r1" {
t.Fatalf("expected promoted primary r1, got %s", m.PrimaryID)
}
if m.Epoch != oldEpoch+1 {
t.Fatalf("expected epoch increment, got %d want %d", m.Epoch, oldEpoch+1)
}
}
func TestModelSyncQuorumWithThreeReplicasMixedStates(t *testing.T) {
m := New("p1", ModeSyncQuorum, 1, "r1", "r2", "r3")
mustApply(t, m, Event{Kind: EventBootstrapReplica, ReplicaID: "r1", ReplicaFlushedLSN: 1})
mustApply(t, m, Event{Kind: EventBootstrapReplica, ReplicaID: "r2", ReplicaFlushedLSN: 1})
mustApply(t, m, Event{Kind: EventBootstrapReplica, ReplicaID: "r3", ReplicaFlushedLSN: 1})
mustApply(t, m, Event{Kind: EventReplicaDisconnect, ReplicaID: "r2"})
if !m.CanServeWrite() {
t.Fatal("sync_quorum should serve with primary + two in-sync replicas out of RF=4")
}
}
func TestModelFailoverFencesMixedReplicaStates(t *testing.T) {
	m := New("p1", ModeSyncQuorum, 10, "r1", "r2", "r3")
	mustApply(t, m, Event{Kind: EventBootstrapReplica, ReplicaID: "r1", ReplicaFlushedLSN: 8})
	mustApply(t, m, Event{Kind: EventBootstrapReplica, ReplicaID: "r2", ReplicaFlushedLSN: 8})
	mustApply(t, m, Event{Kind: EventBootstrapReplica, ReplicaID: "r3", ReplicaFlushedLSN: 6})
	mustApply(t, m, Event{Kind: EventReplicaDisconnect, ReplicaID: "r2"})
	mustApply(t, m, Event{Kind: EventReplicaReconnect, ReplicaID: "r2", ReplicaFlushedLSN: 8, TargetLSN: 12, ReservationID: "catch-r2", ReservationTTL: 100})
	mustApply(t, m, Event{Kind: EventReplicaDisconnect, ReplicaID: "r3"})
	mustApply(t, m, Event{Kind: EventReplicaNeedsRebuild, ReplicaID: "r3"})
	mustApply(t, m, Event{Kind: EventReplicaStartRebuild, ReplicaID: "r3", SnapshotID: "snap-x", SnapshotCpLSN: 6, ReservationID: "rebuild-r3", ReservationTTL: 200})
	mustApply(t, m, Event{Kind: EventPrimaryLeaseLost})
	mustApply(t, m, Event{Kind: EventPromoteReplica, ReplicaID: "r1"})
	if m.PrimaryID != "r1" {
		t.Fatalf("expected r1 promoted, got %s", m.PrimaryID)
	}
	if got := m.Replica("r2").FSM.State; got != fsmv2.StateLagging {
		t.Fatalf("expected r2 fenced back to lagging, got %s", got)
	}
	if got := m.Replica("r3").FSM.State; got != fsmv2.StateLagging {
		t.Fatalf("expected r3 fenced back to lagging, got %s", got)
	}
}

func TestModelRebuildInterruptedByEpochChange(t *testing.T) {
	m := New("p1", ModeBestEffort, 1, "r1")
	mustApply(t, m, Event{Kind: EventReplicaDisconnect, ReplicaID: "r1"})
	mustApply(t, m, Event{Kind: EventReplicaNeedsRebuild, ReplicaID: "r1"})
	mustApply(t, m, Event{Kind: EventReplicaStartRebuild, ReplicaID: "r1", SnapshotID: "snap-2", SnapshotCpLSN: 100, ReservationID: "rebuild-2", ReservationTTL: 200})
	if got := m.Replica("r1").FSM.State; got != fsmv2.StateRebuilding {
		t.Fatalf("expected rebuilding, got %s", got)
	}
	mustApply(t, m, Event{Kind: EventPromoteReplica, ReplicaID: "r1"})
	if got := m.Replica("r1").FSM.State; got != fsmv2.StateLagging {
		t.Fatalf("expected lagging after epoch change fencing, got %s", got)
	}
}

func TestModelReservationLostDuringCatchupAfterRebuild(t *testing.T) {
	m := New("p1", ModeBestEffort, 1, "r1")
	mustApply(t, m, Event{Kind: EventReplicaDisconnect, ReplicaID: "r1"})
	mustApply(t, m, Event{Kind: EventReplicaNeedsRebuild, ReplicaID: "r1"})
	mustApply(t, m, Event{Kind: EventReplicaStartRebuild, ReplicaID: "r1", SnapshotID: "snap-3", SnapshotCpLSN: 50, ReservationID: "rebuild-3", ReservationTTL: 200})
	mustApply(t, m, Event{Kind: EventReplicaRebuildBaseApplied, ReplicaID: "r1", TargetLSN: 80})
	if got := m.Replica("r1").FSM.State; got != fsmv2.StateCatchUpAfterBuild {
		t.Fatalf("expected catch-up-after-rebuild, got %s", got)
	}
	mustApply(t, m, Event{Kind: EventReplicaReservationLost, ReplicaID: "r1"})
	if got := m.Replica("r1").FSM.State; got != fsmv2.StateNeedsRebuild {
		t.Fatalf("expected needs rebuild after reservation loss, got %s", got)
	}
}

func TestModelSyncAllBarrierAcknowledgeTargetLSN(t *testing.T) {
	m := New("p1", ModeSyncAll, 1, "r1", "r2")
	mustApply(t, m, Event{Kind: EventBootstrapReplica, ReplicaID: "r1", ReplicaFlushedLSN: 5})
	mustApply(t, m, Event{Kind: EventBootstrapReplica, ReplicaID: "r2", ReplicaFlushedLSN: 5})
	mustApply(t, m, Event{Kind: EventWriteCommitted, LSN: 10})
	if m.CanAcknowledgeLSN(10) {
		t.Fatal("sync_all should not acknowledge target LSN before barriers advance replica durability")
	}
	mustApply(t, m, Event{Kind: EventBarrierCompleted, ReplicaID: "r1", ReplicaFlushedLSN: 10})
	if m.CanAcknowledgeLSN(10) {
		t.Fatal("sync_all should still wait for second replica durability")
	}
	mustApply(t, m, Event{Kind: EventBarrierCompleted, ReplicaID: "r2", ReplicaFlushedLSN: 10})
	if !m.CanAcknowledgeLSN(10) {
		t.Fatal("sync_all should acknowledge once all required replicas are durable at target LSN")
	}
}

func TestModelSyncQuorumBarrierAcknowledgeTargetLSN(t *testing.T) {
	m := New("p1", ModeSyncQuorum, 1, "r1", "r2", "r3")
	mustApply(t, m, Event{Kind: EventBootstrapReplica, ReplicaID: "r1", ReplicaFlushedLSN: 5})
	mustApply(t, m, Event{Kind: EventBootstrapReplica, ReplicaID: "r2", ReplicaFlushedLSN: 5})
	mustApply(t, m, Event{Kind: EventBootstrapReplica, ReplicaID: "r3", ReplicaFlushedLSN: 5})
	mustApply(t, m, Event{Kind: EventWriteCommitted, LSN: 9})
	if m.CanAcknowledgeLSN(9) {
		t.Fatal("sync_quorum should not acknowledge before any replica reaches target durability")
	}
	mustApply(t, m, Event{Kind: EventBarrierCompleted, ReplicaID: "r1", ReplicaFlushedLSN: 9})
	if m.CanAcknowledgeLSN(9) {
		t.Fatal("sync_quorum should still wait because RF=4 quorum needs primary + two durable replicas")
	}
	mustApply(t, m, Event{Kind: EventBarrierCompleted, ReplicaID: "r2", ReplicaFlushedLSN: 9})
	if !m.CanAcknowledgeLSN(9) {
		t.Fatal("sync_quorum should acknowledge with primary + two durable replicas in RF=4")
	}
}

func TestModelWriteAdmissionReasons(t *testing.T) {
	m := New("p1", ModeSyncAll, 1, "r1")
	dec := m.WriteAdmission()
	if dec.Allowed || dec.Reason != "required_replica_not_in_sync" {
		t.Fatalf("unexpected admission before bootstrap: %+v", dec)
	}
	mustApply(t, m, Event{Kind: EventBootstrapReplica, ReplicaID: "r1", ReplicaFlushedLSN: 1})
	dec = m.WriteAdmission()
	if !dec.Allowed || dec.Reason != "all_replicas_sync_eligible" {
		t.Fatalf("unexpected admission after bootstrap: %+v", dec)
	}
}

func TestModelEvaluateReconnectUsesPlanner(t *testing.T) {
	m := New("p1", ModeBestEffort, 1, "r1")
	mustApply(t, m, Event{Kind: EventBootstrapReplica, ReplicaID: "r1", ReplicaFlushedLSN: 2})
	mustApply(t, m, Event{Kind: EventReplicaDisconnect, ReplicaID: "r1"})
	decision, err := m.EvaluateReconnect("r1", 2, 8)
	if err != nil {
		t.Fatalf("evaluate reconnect: %v", err)
	}
	if decision.Disposition != RecoveryCatchup {
		t.Fatalf("expected catchup decision, got %+v", decision)
	}
	if got := m.Replica("r1").FSM.State; got != fsmv2.StateCatchingUp {
		t.Fatalf("expected catching up, got %s", got)
	}
}

func TestModelEvaluateReconnectNeedsRebuildFromPlanner(t *testing.T) {
	m := New("p1", ModeBestEffort, 1, "r1")
	m.Planner = scriptedPlanner{decision: RecoveryDecision{
		Disposition: RecoveryNeedsRebuild,
		Reason:      "payload_not_resolvable",
	}}
	mustApply(t, m, Event{Kind: EventBootstrapReplica, ReplicaID: "r1", ReplicaFlushedLSN: 2})
	mustApply(t, m, Event{Kind: EventReplicaDisconnect, ReplicaID: "r1"})
	decision, err := m.EvaluateReconnect("r1", 2, 8)
	if err != nil {
		t.Fatalf("evaluate reconnect: %v", err)
	}
	if decision.Disposition != RecoveryNeedsRebuild || decision.Reason != "payload_not_resolvable" {
		t.Fatalf("unexpected decision: %+v", decision)
	}
	if got := m.Replica("r1").FSM.State; got != fsmv2.StateNeedsRebuild {
		t.Fatalf("expected needs rebuild, got %s", got)
	}
}

func TestModelEvaluateReconnectCarriesRecoveryClasses(t *testing.T) {
	m := New("p1", ModeBestEffort, 1, "r1")
	m.Planner = scriptedPlanner{decision: RecoveryDecision{
		Disposition:    RecoveryCatchup,
		ReservationID:  "extent-resv",
		ReservationTTL: 42,
		Reason:         "extent_payload_resolvable",
		Classes:        []RecoveryClass{RecoveryClassExtentReferenced},
	}}
	mustApply(t, m, Event{Kind: EventBootstrapReplica, ReplicaID: "r1", ReplicaFlushedLSN: 3})
	mustApply(t, m, Event{Kind: EventReplicaDisconnect, ReplicaID: "r1"})
	decision, err := m.EvaluateReconnect("r1", 3, 9)
	if err != nil {
		t.Fatalf("evaluate reconnect: %v", err)
	}
	if len(decision.Classes) != 1 || decision.Classes[0] != RecoveryClassExtentReferenced {
		t.Fatalf("unexpected recovery classes: %+v", decision.Classes)
	}
	if got := m.Replica("r1").FSM.State; got != fsmv2.StateCatchingUp {
		t.Fatalf("expected catching up, got %s", got)
	}
	if got := m.Replica("r1").FSM.RecoveryReservationID; got != "extent-resv" {
		t.Fatalf("expected reservation extent-resv, got %q", got)
	}
}

func TestModelEvaluateReconnectCanChangeOverTime(t *testing.T) {
	m := New("p1", ModeBestEffort, 1, "r1")
	mustApply(t, m, Event{Kind: EventBootstrapReplica, ReplicaID: "r1", ReplicaFlushedLSN: 4})
	mustApply(t, m, Event{Kind: EventReplicaDisconnect, ReplicaID: "r1"})
	m.Planner = scriptedPlanner{decision: RecoveryDecision{
		Disposition:    RecoveryCatchup,
		ReservationID:  "resv-1",
		ReservationTTL: 10,
		Reason:         "temporarily_recoverable",
		Classes:        []RecoveryClass{RecoveryClassWALInline},
	}}
	decision, err := m.EvaluateReconnect("r1", 4, 12)
	if err != nil {
		t.Fatalf("first evaluate reconnect: %v", err)
	}
	if decision.Disposition != RecoveryCatchup {
		t.Fatalf("expected catchup on first evaluation, got %+v", decision)
	}
	mustApply(t, m, Event{Kind: EventReplicaReservationLost, ReplicaID: "r1"})
	if got := m.Replica("r1").FSM.State; got != fsmv2.StateNeedsRebuild {
		t.Fatalf("expected needs rebuild after reservation loss, got %s", got)
	}
	m.Planner = scriptedPlanner{decision: RecoveryDecision{
		Disposition: RecoveryNeedsRebuild,
		Reason:      "recoverability_expired",
	}}
	decision, err = m.EvaluateReconnect("r1", 4, 12)
	if err != nil {
		t.Fatalf("second evaluate reconnect: %v", err)
	}
	if decision.Reason != "recoverability_expired" {
		t.Fatalf("unexpected second decision: %+v", decision)
	}
}

func TestRunScenarioProducesStateTrace(t *testing.T) {
	m := New("p1", ModeSyncAll, 1, "r1")
	trace, err := RunScenario(m, []ScenarioStep{
		{Name: "bootstrap", Event: Event{Kind: EventBootstrapReplica, ReplicaID: "r1", ReplicaFlushedLSN: 1}},
		{Name: "write10", Event: Event{Kind: EventWriteCommitted, LSN: 10}},
		{Name: "barrier10", Event: Event{Kind: EventBarrierCompleted, ReplicaID: "r1", ReplicaFlushedLSN: 10}},
	})
	if err != nil {
		t.Fatalf("run scenario: %v", err)
	}
	if len(trace) != 4 {
		t.Fatalf("expected 4 snapshots, got %d", len(trace))
	}
	last := trace[len(trace)-1]
	if last.HeadLSN != 10 {
		t.Fatalf("expected head 10, got %d", last.HeadLSN)
	}
	if got := last.Replicas["r1"].FlushedLSN; got != 10 {
		t.Fatalf("expected replica flushed 10, got %d", got)
	}
	if !last.AckGate.Allowed {
		t.Fatalf("expected ack gate allowed at final step, got %+v", last.AckGate)
	}
}

func TestScriptedRecoveryPlannerChangesDecisionOverTime(t *testing.T) {
	m := New("p1", ModeBestEffort, 1, "r1")
	m.Planner = &ScriptedRecoveryPlanner{
		Decisions: []RecoveryDecision{
			{
				Disposition:    RecoveryCatchup,
				ReservationID:  "resv-a",
				ReservationTTL: 10,
				Reason:         "recoverable_now",
				Classes:        []RecoveryClass{RecoveryClassWALInline},
			},
			{
				Disposition: RecoveryNeedsRebuild,
				Reason:      "recoverability_expired",
			},
		},
	}
	mustApply(t, m, Event{Kind: EventBootstrapReplica, ReplicaID: "r1", ReplicaFlushedLSN: 4})
	mustApply(t, m, Event{Kind: EventReplicaDisconnect, ReplicaID: "r1"})
	first, err := m.EvaluateReconnect("r1", 4, 12)
	if err != nil {
		t.Fatalf("first reconnect: %v", err)
	}
	if first.Disposition != RecoveryCatchup {
		t.Fatalf("unexpected first decision: %+v", first)
	}
	mustApply(t, m, Event{Kind: EventReplicaReservationLost, ReplicaID: "r1"})
	second, err := m.EvaluateReconnect("r1", 4, 12)
	if err != nil {
		t.Fatalf("second reconnect: %v", err)
	}
	if second.Disposition != RecoveryNeedsRebuild || second.Reason != "recoverability_expired" {
		t.Fatalf("unexpected second decision: %+v", second)
	}
}

func TestFormatTraceIncludesReplicaStatesAndGates(t *testing.T) {
	m := New("p1", ModeSyncAll, 1, "r1")
	trace, err := RunScenario(m, []ScenarioStep{
		{Name: "bootstrap", Event: Event{Kind: EventBootstrapReplica, ReplicaID: "r1", ReplicaFlushedLSN: 1}},
		{Name: "write10", Event: Event{Kind: EventWriteCommitted, LSN: 10}},
		{Name: "barrier10", Event: Event{Kind: EventBarrierCompleted, ReplicaID: "r1", ReplicaFlushedLSN: 10}},
	})
	if err != nil {
		t.Fatalf("run scenario: %v", err)
	}
	got := FormatTrace(trace)
	wantParts := []string{
		"step=bootstrap",
		"write=true:all_replicas_sync_eligible",
		"step=barrier10",
		"ack=true:all_replicas_durable",
		"r1=InSync@10",
	}
	for _, part := range wantParts {
		if !strings.Contains(got, part) {
			t.Fatalf("trace missing %q:\n%s", part, got)
		}
	}
}

sw-block/prototype/volumefsm/recovery.go

@@ -0,0 +1,70 @@
package volumefsm

type RecoveryClass string

const (
	RecoveryClassWALInline        RecoveryClass = "wal_inline"
	RecoveryClassExtentReferenced RecoveryClass = "extent_referenced"
)

type RecoveryDisposition string

const (
	RecoveryCatchup      RecoveryDisposition = "catchup"
	RecoveryNeedsRebuild RecoveryDisposition = "needs_rebuild"
)

type RecoveryDecision struct {
	Disposition    RecoveryDisposition
	ReservationID  string
	ReservationTTL uint64
	Reason         string
	Classes        []RecoveryClass
}

type RecoveryPlanner interface {
	PlanReconnect(replicaID string, flushedLSN, targetLSN uint64) RecoveryDecision
}

// StaticRecoveryPlanner is the minimal default planner for the prototype.
// If targetLSN > flushedLSN (the caller provided a real target ahead of the
// replica's durable progress), reconnect is treated as catch-up; otherwise
// rebuild is required.
type StaticRecoveryPlanner struct{}

func (StaticRecoveryPlanner) PlanReconnect(replicaID string, flushedLSN, targetLSN uint64) RecoveryDecision {
	if targetLSN > flushedLSN {
		return RecoveryDecision{
			Disposition:    RecoveryCatchup,
			ReservationID:  replicaID + "-resv",
			ReservationTTL: 100,
			Reason:         "static_recoverable_window",
			Classes:        []RecoveryClass{RecoveryClassWALInline},
		}
	}
	return RecoveryDecision{
		Disposition: RecoveryNeedsRebuild,
		Reason:      "static_no_recoverable_window",
	}
}

// ScriptedRecoveryPlanner returns pre-seeded reconnect decisions in order.
// Once the scripted list is exhausted, the last decision is reused.
type ScriptedRecoveryPlanner struct {
	Decisions []RecoveryDecision
	index     int
}

func (s *ScriptedRecoveryPlanner) PlanReconnect(replicaID string, flushedLSN, targetLSN uint64) RecoveryDecision {
	if len(s.Decisions) == 0 {
		return RecoveryDecision{
			Disposition: RecoveryNeedsRebuild,
			Reason:      "scripted_no_decision",
		}
	}
	if s.index >= len(s.Decisions) {
		return s.Decisions[len(s.Decisions)-1]
	}
	d := s.Decisions[s.index]
	s.index++
	return d
}

sw-block/prototype/volumefsm/scenario.go

@@ -0,0 +1,61 @@
package volumefsm

import (
	"fmt"

	fsmv2 "github.com/seaweedfs/seaweedfs/sw-block/prototype/fsmv2"
)

type ScenarioStep struct {
	Name  string
	Event Event
}

type ReplicaSnapshot struct {
	State      fsmv2.State
	FlushedLSN uint64
}

type Snapshot struct {
	Step         string
	Epoch        uint64
	PrimaryID    string
	PrimaryState PrimaryState
	HeadLSN      uint64
	WriteGate    AdmissionDecision
	AckGate      AdmissionDecision
	Replicas     map[string]ReplicaSnapshot
}

func (m *Model) Snapshot(step string) Snapshot {
	replicas := make(map[string]ReplicaSnapshot, len(m.Replicas))
	for id, r := range m.Replicas {
		replicas[id] = ReplicaSnapshot{
			State:      r.FSM.State,
			FlushedLSN: r.FSM.ReplicaFlushedLSN,
		}
	}
	return Snapshot{
		Step:         step,
		Epoch:        m.Epoch,
		PrimaryID:    m.PrimaryID,
		PrimaryState: m.PrimaryState,
		HeadLSN:      m.HeadLSN,
		WriteGate:    m.WriteAdmission(),
		AckGate:      m.AckAdmission(m.HeadLSN),
		Replicas:     replicas,
	}
}

func RunScenario(m *Model, steps []ScenarioStep) ([]Snapshot, error) {
	trace := make([]Snapshot, 0, len(steps)+1)
	trace = append(trace, m.Snapshot("initial"))
	for _, step := range steps {
		if err := m.Apply(step.Event); err != nil {
			return trace, fmt.Errorf("scenario step %q: %w", step.Name, err)
		}
		trace = append(trace, m.Snapshot(step.Name))
	}
	return trace, nil
}

BIN
sw-block/prototype/volumefsm/volumefsm.test.exe

sw-block/test/README.md

@@ -0,0 +1,17 @@
# V2 Test Reference

This directory holds V2-facing test reference material copied from the project test database.

Files:

- `test_db.md`
  - copied from `learn/projects/sw-block/test/test_db.md`
  - full block-service test inventory
- `v2_selected.md`
  - V2-focused working subset
  - includes the currently selected simulator-relevant cases and the 4 Phase 13 V2-boundary tests

Use:

- `learn/projects/sw-block/test/test_db.md` as the project-wide source inventory
- `sw-block/test/v2_selected.md` as the active V2 reference/worklist

sw-block/test/test_db.md
File diff suppressed because it is too large

sw-block/test/test_db_v2.md

@@ -0,0 +1,105 @@
# V2 Test Database
Date: 2026-03-27
Status: working subset
## Purpose
This is the V2-focused review subset derived from:
- `sw-block/test/test_db.md`
- `learn/projects/sw-block/phases/phase13_test.md`
- `learn/projects/sw-block/phases/phase-13-v2-boundary-tests.md`
Use this file to review and track the tests that most directly inform:
- V2 protocol design
- simulator coverage
- V1 / V1.5 / V2 comparison
- V2 acceptance boundaries
This is intentionally much smaller than the full `test_db.md`.
## Review Codes
### Status
- `picked`
- `reviewed`
- `mapped`
### Sim
- `sim_core`
- `sim_reduced`
- `real_only`
- `v2_boundary`
- `sim_not_needed_yet`
## V2 Boundary Tests
| # | Test Name | File | Line | Level | Status | Sim | Notes |
|---|---|---|---|---|---|---|---|
| 1 | `TestAdversarial_ReconnectUsesHandshakeNotBootstrap` | `sync_all_adversarial_test.go` | | `unit` | `picked` | `v2_boundary` | V1/V1.5 sender identity loss; should become V2 acceptance case |
| 2 | `TestAdversarial_NeedsRebuildBlocksAllPaths` | `sync_all_adversarial_test.go` | | `unit` | `picked` | `v2_boundary` | `NeedsRebuild` must remain sticky under stable per-replica sender identity |
| 3 | `TestAdversarial_CatchupDoesNotOverwriteNewerData` | `sync_all_adversarial_test.go` | | `unit` | `picked` | `v2_boundary` | Catch-up correctness depends on identity continuity and proper recovery ownership |
| 4 | `TestAdversarial_CatchupMultipleDisconnects` | `sync_all_adversarial_test.go` | | `unit` | `picked` | `v2_boundary` | Multiple reconnect cycles are a V2 sender-loop / recovery-session acceptance target |
## Core Protocol Tests
| # | Test Name | File | Line | Level | Status | Sim | Notes |
|---|---|---|---|---|---|---|---|
| 1 | `TestRecovery` | `recovery_test.go` | | `unit` | `picked` | `sim_core` | Crash recovery correctness is fundamental to block protocol reasoning |
| 2 | `TestReplicaProgress_BarrierUsesFlushedLSN` | `sync_all_protocol_test.go` | | `unit` | `picked` | `sim_core` | Durable-progress truth; barrier must count flushed progress, not send progress |
| 3 | `TestReplicaProgress_FlushedLSNMonotonicWithinEpoch` | `sync_all_protocol_test.go` | | `unit` | `picked` | `sim_core` | Progress monotonicity invariant |
| 4 | `TestBarrier_RejectsReplicaNotInSync` | `sync_all_protocol_test.go` | | `unit` | `picked` | `sim_core` | Only eligible replica states count for strict durability |
| 5 | `TestBarrier_EpochMismatchRejected` | `sync_all_protocol_test.go` | | `unit` | `picked` | `sim_core` | Epoch fencing on barrier path |
| 6 | `TestBug1_SyncAll_WriteDuringDegraded_SyncCacheMustFail` | `sync_all_bug_test.go` | | `unit` | `picked` | `sim_core` | `sync_all` strictness during degraded state |
| 7 | `TestSyncAll_FullRoundTrip_WriteAndFlush` | `sync_all_bug_test.go` | | `unit` | `picked` | `sim_core` | End-to-end strict replication contract |
| 8 | `TestBestEffort_FlushSucceeds_ReplicaDown` | `sync_all_protocol_test.go` | | `unit` | `picked` | `sim_core` | Contrasts best_effort vs strict modes |
| 9 | `TestShip_DegradedDoesNotSilentlyCountAsHealthy` | `sync_all_protocol_test.go` | | `unit` | `picked` | `sim_core` | No false durability from degraded shipper |
| 10 | `TestDistSync_SyncAll_AllDegraded_Fails` | `dist_group_commit_test.go` | | `unit` | `picked` | `sim_core` | Availability semantics under strict mode |
| 11 | `TestBug2_SyncAll_SyncCache_AfterDegradedShipperRecovers` | `sync_all_bug_test.go` | | `unit` | `picked` | `sim_core` | Recoverability after degraded shipper |
| 12 | `TestReconnect_CatchupFromRetainedWal` | `sync_all_protocol_test.go` | | `unit` | `picked` | `sim_core` | Short-gap catch-up |
| 13 | `TestReconnect_GapBeyondRetainedWal_NeedsRebuild` | `sync_all_protocol_test.go` | | `unit` | `picked` | `sim_core` | Catch-up vs rebuild boundary |
| 14 | `TestReconnect_EpochChangeDuringCatchup_Aborts` | `sync_all_protocol_test.go` | | `unit` | `picked` | `sim_core` | Recovery fencing during catch-up |
| 15 | `TestCatchupReplay_DataIntegrity_AllBlocksMatch` | `sync_all_protocol_test.go` | | `unit` | `picked` | `sim_core` | Recovery data correctness |
| 16 | `TestCatchupReplay_DuplicateEntry_Idempotent` | `sync_all_protocol_test.go` | | `unit` | `picked` | `sim_core` | Replay idempotence |
| 17 | `TestBarrier_DuringCatchup_Rejected` | `sync_all_protocol_test.go` | | `unit` | `picked` | `sim_core` | State-machine correctness during recovery |
| 18 | `TestReplicaState_RebuildComplete_ReentersInSync` | `rebuild_v1_test.go` | | `unit` | `picked` | `sim_core` | Rebuild lifecycle closure |
| 19 | `TestRebuild_AbortOnEpochChange` | `rebuild_v1_test.go` | | `unit` | `picked` | `sim_core` | Rebuild fencing |
| 20 | `TestRebuild_MissingTailRestartsOrFailsCleanly` | `rebuild_v1_test.go` | | `unit` | `picked` | `sim_core` | Safe rebuild failure behavior |
| 21 | `TestWalRetention_RequiredReplicaBlocksReclaim` | `sync_all_protocol_test.go` | | `unit` | `picked` | `sim_core` | Retention rule under lag |
| 22 | `TestWalRetention_TimeoutTriggersNeedsRebuild` | `sync_all_protocol_test.go` | | `unit` | `picked` | `sim_core` | Retention timeout boundary |
| 23 | `TestWalRetention_MaxBytesTriggersNeedsRebuild` | `sync_all_protocol_test.go` | | `unit` | `picked` | `sim_core` | Retention budget boundary |
| 24 | `TestComponent_FailoverPromote` | `component_test.go` | | `component` | `picked` | `sim_core` | Core failover baseline |
| 25 | `TestCP13_SyncAll_FailoverPromotesReplica` | `cp13_protocol_test.go` | | `component` | `picked` | `sim_core` | Strict-mode failover |
| 26 | `TestCP13_SyncAll_ReplicaRestart_Rejoin` | `cp13_protocol_test.go` | | `component` | `picked` | `sim_core` | Restart/rejoin lifecycle |
| 27 | `TestQA_LSNLag_StaleReplicaSkipped` | `qa_block_edge_cases_test.go` | | `unit` | `picked` | `sim_core` | Promotion safety and stale candidate rejection |
| 28 | `TestQA_CascadeFailover_RF3_EpochChain` | `qa_block_edge_cases_test.go` | | `unit` | `picked` | `sim_core` | Multi-promotion lineage |
| 29 | `TestDurabilityMode_Validate_SyncQuorum_RF2_Rejected` | `durability_mode_test.go` | | `unit` | `picked` | `sim_core` | Mode normalization |
| 30 | `TestCP13_BestEffort_SurvivesReplicaDeath` | `cp13_protocol_test.go` | | `component` | `picked` | `sim_core` | Best-effort contract |
| 31 | `CP13-8 T4a: sync_all blocks during outage` | `manual` | | `integration` | `picked` | `sim_core` | Strict outage semantics |
## Reduced / Supporting Tests
| # | Test Name | File | Line | Level | Status | Sim | Notes |
|---|---|---|---|---|---|---|---|
| 1 | `testRecoverExtendedScanPastStaleHead` | `recovery_test.go` | | `unit` | `picked` | `sim_reduced` | Advisory WAL-head recovery shape |
| 2 | `testRecoverNoSuperblockPersist` | `recovery_test.go` | | `unit` | `picked` | `sim_reduced` | Recovery despite optimized persist behavior |
| 3 | `TestQAGroupCommitter` | `blockvol_qa_test.go` | | `unit` | `picked` | `sim_reduced` | Commit batching semantics |
| 4 | `TestQA_Admission_WriteLBAIntegration` | `qa_wal_admission_test.go` | | `unit` | `picked` | `sim_reduced` | Backpressure behavior |
| 5 | `TestSyncAll_MultipleFlush_NoWritesBetween` | `sync_all_bug_test.go` | | `unit` | `picked` | `sim_reduced` | Idempotent flush shape |
| 6 | `TestRebuild_PostRebuild_FlushedLSN_IsCheckpoint` | `rebuild_v1_test.go` | | `unit` | `picked` | `sim_reduced` | Progress initialization after rebuild |
| 7 | `TestComponent_ManualPromote` | `component_test.go` | | `component` | `picked` | `sim_reduced` | Manual control-path shape |
| 8 | `TestHeartbeat_ReportsPerReplicaState` | `rebuild_v1_test.go` | | `unit` | `picked` | `sim_reduced` | Heartbeat observability |
| 9 | `TestHeartbeat_ReportsNeedsRebuild` | `rebuild_v1_test.go` | | `unit` | `picked` | `sim_reduced` | Control-plane visibility |
| 10 | `TestComponent_ExpandThenFailover` | `component_test.go` | | `component` | `picked` | `sim_reduced` | State continuity across operations |
| 11 | `TestCP13_DurabilityModeDefault` | `cp13_protocol_test.go` | | `component` | `picked` | `sim_reduced` | Default mode behavior |
| 12 | `CP13-8 T4b: recovery after restart` | `manual` | | `integration` | `picked` | `sim_reduced` | Recovery-time shape and control-plane/local-reconnect interaction |
## Notes
- This file is the actionable V2 subset, not the master inventory.
- If `tester` later finalizes a broader 70-case picked set, expand this file from that selection.
- The 4 V2-boundary tests must remain present even if they fail on V1/V1.5.

sw-block/test/v2_selected.md

@@ -0,0 +1,115 @@
# V2-Selected Test Worklist
Date: 2026-03-27
Status: working
## Purpose
This is the V2-facing subset of the larger block-service test database.
Sources:
- `sw-block/test/test_db.md`
- `learn/projects/sw-block/phases/phase13_test.md`
- `learn/projects/sw-block/phases/phase-13-v2-boundary-tests.md`
This file is for:
- tests that should help V2 design and simulator work
- explicit inclusion of the 4 Phase 13 V2-boundary failures
- a working set that `tester`, `sw`, and design can refine further
## Current Inclusion Rule
Include tests that are:
- `sim_core`
- `sim_reduced`
- `v2_boundary`
Prefer tests that directly inform:
- barriers and durability truth
- catch-up vs rebuild
- failover / promotion
- WAL retention / tail-chasing
- mode semantics
- endpoint / identity / reassignment behavior
## Phase 13 V2-Boundary Tests
These must stay visible in the V2 worklist:
| Test | File | Why It Matters To V2 |
|---|---|---|
| `TestAdversarial_ReconnectUsesHandshakeNotBootstrap` | `sync_all_adversarial_test.go` | Sender identity and reconnect ownership |
| `TestAdversarial_NeedsRebuildBlocksAllPaths` | `sync_all_adversarial_test.go` | `NeedsRebuild` must remain sticky and identity-safe |
| `TestAdversarial_CatchupDoesNotOverwriteNewerData` | `sync_all_adversarial_test.go` | Catch-up must preserve data correctness under identity continuity |
| `TestAdversarial_CatchupMultipleDisconnects` | `sync_all_adversarial_test.go` | Multiple reconnect cycles require stable per-replica sender ownership |
## High-Value V2 Working Set
This is the current distilled working set from `phase13_test.md`.
| Test | File | Current Result | Mapping | Why It Helps V2 |
|---|---|---|---|---|
| `TestRecovery` | `recovery_test.go` | PASS | `sim_core` | Crash recovery correctness |
| `TestReplicaProgress_BarrierUsesFlushedLSN` | `sync_all_protocol_test.go` | PASS | `sim_core` | Barrier truth / durable progress |
| `TestReplicaProgress_FlushedLSNMonotonicWithinEpoch` | `sync_all_protocol_test.go` | PASS | `sim_core` | Monotonic progress invariant |
| `TestBarrier_RejectsReplicaNotInSync` | `sync_all_protocol_test.go` | PASS | `sim_core` | State-gated strict durability |
| `TestBarrier_EpochMismatchRejected` | `sync_all_protocol_test.go` | PASS | `sim_core` | Epoch fencing |
| `TestBug1_SyncAll_WriteDuringDegraded_SyncCacheMustFail` | `sync_all_bug_test.go` | PASS | `sim_core` | `sync_all` strictness during outage |
| `TestSyncAll_FullRoundTrip_WriteAndFlush` | `sync_all_bug_test.go` | PASS | `sim_core` | End-to-end strict replication |
| `TestBestEffort_FlushSucceeds_ReplicaDown` | `sync_all_protocol_test.go` | PASS | `sim_core` | Mode difference vs strict sync |
| `TestShip_DegradedDoesNotSilentlyCountAsHealthy` | `sync_all_protocol_test.go` | PASS | `sim_core` | No false durability |
| `TestDistSync_SyncAll_AllDegraded_Fails` | `dist_group_commit_test.go` | PASS | `sim_core` | Availability semantics |
| `TestBug2_SyncAll_SyncCache_AfterDegradedShipperRecovers` | `sync_all_bug_test.go` | PASS | `sim_core` | Recoverability after degraded shipper |
| `TestReconnect_CatchupFromRetainedWal` | `sync_all_protocol_test.go` | PASS | `sim_core` | Short-gap catch-up |
| `TestReconnect_GapBeyondRetainedWal_NeedsRebuild` | `sync_all_protocol_test.go` | PASS | `sim_core` | Catch-up vs rebuild boundary |
| `TestReconnect_EpochChangeDuringCatchup_Aborts` | `sync_all_protocol_test.go` | PASS | `sim_core` | Recovery fencing |
| `TestCatchupReplay_DataIntegrity_AllBlocksMatch` | `sync_all_protocol_test.go` | PASS | `sim_core` | Recovery data correctness |
| `TestCatchupReplay_DuplicateEntry_Idempotent` | `sync_all_protocol_test.go` | PASS | `sim_core` | Replay idempotence |
| `TestBarrier_DuringCatchup_Rejected` | `sync_all_protocol_test.go` | PASS | `sim_core` | State-machine correctness |
| `TestReplicaState_RebuildComplete_ReentersInSync` | `rebuild_v1_test.go` | PASS | `sim_core` | Rebuild lifecycle |
| `TestRebuild_AbortOnEpochChange` | `rebuild_v1_test.go` | PASS | `sim_core` | Rebuild fencing |
| `TestRebuild_MissingTailRestartsOrFailsCleanly` | `rebuild_v1_test.go` | PASS | `sim_core` | No partial/unsafe rebuild success |
| `TestWalRetention_RequiredReplicaBlocksReclaim` | `sync_all_protocol_test.go` | PASS | `sim_core` | Retention rule |
| `TestWalRetention_TimeoutTriggersNeedsRebuild` | `sync_all_protocol_test.go` | PASS | `sim_core` | Retention timeout boundary |
| `TestWalRetention_MaxBytesTriggersNeedsRebuild` | `sync_all_protocol_test.go` | PASS | `sim_core` | Retention budget boundary |
| `TestComponent_FailoverPromote` | `component_test.go` | PASS | `sim_core` | Failover baseline |
| `TestCP13_SyncAll_FailoverPromotesReplica` | `cp13_protocol_test.go` | PASS | `sim_core` | Strict-mode failover |
| `TestCP13_SyncAll_ReplicaRestart_Rejoin` | `cp13_protocol_test.go` | PASS | `sim_core` | Restart/rejoin lifecycle |
| `TestQA_LSNLag_StaleReplicaSkipped` | `qa_block_edge_cases_test.go` | PASS | `sim_core` | Promotion safety |
| `TestQA_CascadeFailover_RF3_EpochChain` | `qa_block_edge_cases_test.go` | PASS | `sim_core` | Multi-promotion lineage |
| `TestDurabilityMode_Validate_SyncQuorum_RF2_Rejected` | `durability_mode_test.go` | PASS | `sim_core` | Mode normalization |
| `TestCP13_BestEffort_SurvivesReplicaDeath` | `cp13_protocol_test.go` | PASS | `sim_core` | Best-effort contract |
| `CP13-8 T4a: sync_all blocks during outage` | `manual` | PASS | `sim_core` | Strict outage semantics |
| `CP13-8 T4b: recovery after restart` | `manual` | PASS | `sim_reduced` | Recovery-time shape |
## Reduced / Supporting Cases To Keep In View
| Test | File | Current Result | Mapping | Why It Helps V2 |
|---|---|---|---|---|
| `testRecoverExtendedScanPastStaleHead` | `recovery_test.go` | PASS | `sim_reduced` | Advisory WAL-head recovery shape |
| `testRecoverNoSuperblockPersist` | `recovery_test.go` | PASS | `sim_reduced` | Recoverability despite optimized persist behavior |
| `TestQAGroupCommitter` | `blockvol_qa_test.go` | PASS | `sim_reduced` | Commit batching semantics |
| `TestQA_Admission_WriteLBAIntegration` | `qa_wal_admission_test.go` | PASS | `sim_reduced` | Backpressure behavior |
| `TestSyncAll_MultipleFlush_NoWritesBetween` | `sync_all_bug_test.go` | PASS | `sim_reduced` | Idempotent flush shape |
| `TestRebuild_PostRebuild_FlushedLSN_IsCheckpoint` | `rebuild_v1_test.go` | PASS | `sim_reduced` | Progress initialization |
| `TestComponent_ManualPromote` | `component_test.go` | PASS | `sim_reduced` | Manual control-path shape |
| `TestHeartbeat_ReportsPerReplicaState` | `rebuild_v1_test.go` | PASS | `sim_reduced` | Heartbeat observability |
| `TestHeartbeat_ReportsNeedsRebuild` | `rebuild_v1_test.go` | PASS | `sim_reduced` | Control-plane visibility |
| `TestComponent_ExpandThenFailover` | `component_test.go` | PASS | `sim_reduced` | Cross-operation state continuity |
| `TestCP13_DurabilityModeDefault` | `cp13_protocol_test.go` | PASS | `sim_reduced` | Default mode behavior |
## Working Note
`phase13_test.md` currently contains the mapped subset from the real test inventory.
This V2 copy is intentionally narrower:
- preserve the core tests that define the protocol story
- preserve the 4 V2-boundary tests explicitly
- keep a smaller reduced set for supporting invariants
If `tester` finalizes a broader 70-case working set, extend this file rather than editing the full copied database directly.