
fix: stable ServerID in assignments, fail-closed on missing identity, wire into ProcessAssignments

Finding 1: Identity no longer address-derived
- ReplicaAddr.ServerID field added (stable server identity from registry)
- BlockVolumeAssignment.ReplicaServerID field added (scalar RF=2 path)
- ControlBridge uses ServerID, NOT address, for ReplicaID
- Missing ServerID → replica skipped (fail closed), logged
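A minimal sketch of Finding 1's fail-closed conversion. `ReplicaAddr.ServerID` is the field named in the commit; the `Replica` type and `convertReplicas` helper are hypothetical stand-ins for the real `v2bridge` structures:

```go
package main

import (
	"fmt"
	"log"
)

// ReplicaAddr mirrors the shape described in the commit: a mutable
// address plus a stable ServerID from the registry.
type ReplicaAddr struct {
	Address  string // mutable network endpoint
	ServerID string // stable server identity; may be empty on old senders
}

// Replica is the converted form: identity comes from ServerID, never Address.
type Replica struct {
	ReplicaID string
	Endpoint  string
}

// convertReplicas drops (fails closed on) any replica without a stable
// ServerID instead of falling back to the address as identity.
func convertReplicas(addrs []ReplicaAddr) []Replica {
	var out []Replica
	for _, a := range addrs {
		if a.ServerID == "" {
			log.Printf("skipping replica %s: missing ServerID (fail closed)", a.Address)
			continue
		}
		out = append(out, Replica{ReplicaID: a.ServerID, Endpoint: a.Address})
	}
	return out
}

func main() {
	got := convertReplicas([]ReplicaAddr{
		{Address: "10.0.0.1:8080", ServerID: "srv-a"},
		{Address: "10.0.0.2:8080"}, // no identity: skipped, never address-derived
	})
	fmt.Println(len(got), got[0].ReplicaID)
}
```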

Finding 2: Wired into real ProcessAssignments
- BlockService.v2Bridge field initialized in StartBlockService
- ProcessAssignments converts each assignment via v2Bridge.ConvertAssignment
  BEFORE existing V1 processing (parallel, not replacing yet)
- Logged at glog V(1)
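The wiring in Finding 2 can be sketched as below. `ProcessAssignments` and `v2Bridge.ConvertAssignment` are names from the commit; the minimal `Assignment`/`V2Bridge` shapes and signatures here are assumptions for illustration only:

```go
package main

import "fmt"

// Hypothetical stand-ins for the real assignment and bridge types.
type Assignment struct{ VolumeID uint32 }

type V2Bridge struct{ converted []uint32 }

func (b *V2Bridge) ConvertAssignment(a Assignment) error {
	b.converted = append(b.converted, a.VolumeID)
	return nil
}

// processV1 is a placeholder for the existing V1 assignment handling.
func processV1(a Assignment) { /* unchanged V1 path */ }

// ProcessAssignments runs the V2 conversion for each assignment BEFORE the
// V1 path, observe-only and in parallel, not replacing V1 yet.
func ProcessAssignments(b *V2Bridge, as []Assignment) {
	for _, a := range as {
		if b != nil {
			if err := b.ConvertAssignment(a); err != nil {
				fmt.Printf("v2 convert failed for volume %d: %v\n", a.VolumeID, err)
			}
		}
		processV1(a) // existing V1 behavior is preserved
	}
}

func main() {
	b := &V2Bridge{}
	ProcessAssignments(b, []Assignment{{VolumeID: 7}, {VolumeID: 9}})
	fmt.Println(b.converted)
}
```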

Finding 3: Fail-closed on missing identity
- Empty ServerID in ReplicaAddrs → replica skipped with log
- Empty ReplicaServerID in scalar path → no replica created
- Test: MissingServerID_FailsClosed verifies both paths
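The scalar RF=2 path in Finding 3 reduces to a guard like the following. `BlockVolumeAssignment.ReplicaServerID` is from the commit; the `replicaFor` helper is hypothetical:

```go
package main

import "fmt"

// Hypothetical scalar-path assignment (RF=2): one replica server field.
type BlockVolumeAssignment struct {
	ReplicaServerID string
	ReplicaAddr     string
}

// replicaFor returns (replicaID, ok). An empty ReplicaServerID fails
// closed: no replica is created from the address alone.
func replicaFor(a BlockVolumeAssignment) (string, bool) {
	if a.ReplicaServerID == "" {
		return "", false
	}
	return a.ReplicaServerID, true
}

func main() {
	_, ok := replicaFor(BlockVolumeAssignment{ReplicaAddr: "10.0.0.2:8080"})
	fmt.Println(ok) // identity missing: no replica created
	id, ok := replicaFor(BlockVolumeAssignment{ReplicaServerID: "srv-b", ReplicaAddr: "10.0.0.2:8080"})
	fmt.Println(id, ok)
}
```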

7 tests: StableServerID, AddressChange_IdentityPreserved,
MultiReplica_StableServerIDs, MissingServerID_FailsClosed,
EpochFencing_IntegratedPath, RebuildAssignment, ReplicaAssignment

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feature/sw-block
pingqiu committed 1 day ago, commit 46ef79ce35
  1. 105  sw-block/.private/phase/phase-04-decisions.md
  2. 40  sw-block/.private/phase/phase-04-log.md
  3. 81  sw-block/.private/phase/phase-04.md
  4. 94  sw-block/.private/phase/phase-05-decisions.md
  5. 78  sw-block/.private/phase/phase-05-log.md
  6. 356  sw-block/.private/phase/phase-05.md
  7. 68  sw-block/.private/phase/phase-06-decisions.md
  8. 51  sw-block/.private/phase/phase-06-log.md
  9. 193  sw-block/.private/phase/phase-06.md
  10. 119  sw-block/.private/phase/phase-07-decisions.md
  11. 63  sw-block/.private/phase/phase-07-log.md
  12. 220  sw-block/.private/phase/phase-07.md
  13. 78  sw-block/.private/phase/phase-08-decisions.md
  14. 21  sw-block/.private/phase/phase-08-log.md
  15. 254  sw-block/.private/phase/phase-08.md
  16. 59  sw-block/.private/phase/phase-4.5-decisions.md
  17. 33  sw-block/.private/phase/phase-4.5-log.md
  18. 397  sw-block/.private/phase/phase-4.5-reason.md
  19. 356  sw-block/.private/phase/phase-4.5.md
  20. 18  sw-block/design/README.md
  21. 117  sw-block/design/a5-a8-traceability.md
  22. 304  sw-block/design/agent_dev_process.md
  23. 403  sw-block/design/phase-07-service-slice-plan.md
  24. 686  sw-block/design/v2-algorithm-overview.md
  25. 660  sw-block/design/v2-algorithm-overview.zh.md
  26. 1068  sw-block/design/v2-detailed-algorithm.zh.md
  27. 170  sw-block/design/v2-engine-readiness-review.md
  28. 191  sw-block/design/v2-engine-slicing-plan.md
  29. 199  sw-block/design/v2-production-roadmap.md
  30. 561  sw-block/design/v2-protocol-truths.md
  31. 13  sw-block/prototype/distsim/cluster.go
  32. 2  sw-block/prototype/distsim/cluster_test.go
  33. 6  sw-block/prototype/distsim/phase02_candidate_test.go
  34. 219  sw-block/prototype/distsim/phase045_adversarial_test.go
  35. 334  sw-block/prototype/distsim/phase045_crash_test.go
  36. 160  sw-block/prototype/distsim/predicates.go
  37. 7  sw-block/prototype/distsim/simulator.go
  38. 242  sw-block/prototype/distsim/storage.go
  39. 118  weed/server/master_block_failover.go
  40. 56  weed/server/master_block_registry.go
  41. 60  weed/server/master_block_registry_test.go
  42. 7  weed/server/master_grpc_server.go
  43. 481  weed/server/qa_block_edge_cases_test.go
  44. 10  weed/server/volume_grpc_client_to_master.go
  45. 5  weed/server/volume_server.go
  46. 32  weed/server/volume_server_block.go
  47. 77  weed/server/volume_server_block_debug.go
  48. 1  weed/storage/blockvol/block_heartbeat.go
  49. 21  weed/storage/blockvol/blockvol.go
  50. 11  weed/storage/blockvol/shipper_group.go
  51. 58  weed/storage/blockvol/testrunner/scenarios/internal/robust-slow-replica.yaml
  52. 77  weed/storage/blockvol/v2bridge/control.go
  53. 220  weed/storage/blockvol/v2bridge/control_test.go
  54. 25  weed/storage/blockvol/wal_shipper.go
  55. 9  weed/storage/store_blockvol.go

105  sw-block/.private/phase/phase-04-decisions.md

@@ -1,7 +1,7 @@
# Phase 04 Decisions
Date: 2026-03-27
-Status: initial
+Status: complete
## First Slice Decision
@@ -95,3 +95,106 @@ It is:
- recovery outcome branching
- assignment-intent orchestration
- prototype-level end-to-end recovery flow
## Accepted P2 Refinements
### Recovery boundary
Recovery classification must use a lineage-safe boundary, not a raw primary WAL head.
So:
- handshake outcome classification uses committed/safe recovery boundary
- stale or divergent extra tail must not be treated as zero-gap by default
### Stale assignment fencing
Assignment intent must not create current live sessions from stale epoch input.
So:
- stale assignment epoch is rejected
- assignment result distinguishes:
- created
- superseded
- failed
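The fencing rule above (stale epoch rejected; created/superseded/failed distinguished) can be sketched as follows. All type and method names here are hypothetical distillations, not the prototype's actual API:

```go
package main

import "fmt"

type AssignResult int

const (
	AssignFailed AssignResult = iota // stale or duplicate: no live session created
	AssignCreated
	AssignSuperseded
)

// session holds the epoch of the currently live recovery session.
type session struct{ epoch uint64 }

type orchestrator struct{ live *session }

// Apply rejects stale epochs outright, supersedes an older live session on
// a newer epoch, and creates a session when none is live.
func (o *orchestrator) Apply(epoch uint64) AssignResult {
	if o.live != nil && epoch < o.live.epoch {
		return AssignFailed // stale assignment epoch: fenced out
	}
	if o.live != nil && epoch > o.live.epoch {
		o.live = &session{epoch: epoch}
		return AssignSuperseded
	}
	if o.live == nil {
		o.live = &session{epoch: epoch}
		return AssignCreated
	}
	return AssignFailed // same epoch replayed: no new live session
}

func main() {
	o := &orchestrator{}
	fmt.Println(o.Apply(5)) // created
	fmt.Println(o.Apply(4)) // failed: stale epoch
	fmt.Println(o.Apply(6)) // superseded
}
```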
### Phase discipline on outcome classification
The outcome API must respect execution entry rules.
So:
- handshake-with-outcome requires valid connecting phase before acting
## P3 Direction
The next prototype step is:
- minimal historical-data model
- recoverability proof
- explicit safe-boundary / divergent-tail handling
## Accepted P3 Refinements
### Recoverability proof
The historical-data prototype must prove why catch-up is allowed.
So:
- recoverability now checks retained start, end within head, and contiguous coverage
- rebuild fallback is backed by executable unrecoverability
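The three-part recoverability check above (retained start, end within head, contiguous coverage) amounts to a predicate like this; `WALHistory` is named in the log below, but the field names and `Recoverable` signature here are assumptions:

```go
package main

import "fmt"

// Hypothetical WAL retention window: entries [RetainedStart, Head] are
// available; earlier entries have been recycled.
type WALHistory struct {
	RetainedStart uint64
	Head          uint64
}

// Recoverable proves why catch-up is allowed: the gap (from, to] must
// start inside the retained window, end within the head, and the window
// itself must be contiguous.
func Recoverable(h WALHistory, from, to uint64) bool {
	if h.RetainedStart > h.Head {
		return false // no contiguous retained coverage at all
	}
	if from+1 < h.RetainedStart {
		return false // needed prefix already recycled: rebuild fallback
	}
	if to > h.Head {
		return false // target beyond what history can replay
	}
	return true
}

func main() {
	h := WALHistory{RetainedStart: 10, Head: 50}
	fmt.Println(Recoverable(h, 20, 40)) // true: fully inside retained window
	fmt.Println(Recoverable(h, 3, 40))  // false: gap starts before retention
	fmt.Println(Recoverable(h, 20, 60)) // false: end beyond head
}
```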
### Historical state after recycling
Retained-prefix modeling needs a base state, not only remaining WAL entries.
So:
- tail advance captures a base snapshot
- historical state reconstruction uses snapshot + retained WAL
### Divergent tail handling
Replica-ahead state must not collapse directly to `InSync`.
So:
- divergent tail requires explicit truncation
- completion is gated on recorded truncation when required
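The truncation gate above can be expressed as a guarded completion step. This is a hypothetical minimal model, not the prototype's execution API:

```go
package main

import (
	"errors"
	"fmt"
)

// replica models a recovery target whose WAL tail may run past the
// committed boundary of the primary lineage (divergent tail).
type replica struct {
	tail      uint64 // replica's last WAL position
	committed uint64 // committed/safe boundary
	truncated bool
}

var errTruncationRequired = errors.New("divergent tail: explicit truncation required before InSync")

// Truncate records the divergent-tail removal back to the committed boundary.
func (r *replica) Truncate() {
	r.tail = r.committed
	r.truncated = true
}

// CompleteInSync is gated: replica-ahead must not collapse directly to InSync.
func (r *replica) CompleteInSync() error {
	if r.tail > r.committed && !r.truncated {
		return errTruncationRequired
	}
	return nil
}

func main() {
	r := &replica{tail: 120, committed: 100}
	fmt.Println(r.CompleteInSync()) // gated: truncation required first
	r.Truncate()
	fmt.Println(r.CompleteInSync()) // safe to complete
}
```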
## P4 Direction
The next prototype step is:
- prototype scenario closure
- acceptance-criteria to prototype traceability
- explicit expression of the 4 V2-boundary cases against `enginev2`
## Accepted P4 Refinements
### Prototype scenario closure
The prototype must stop being only a set of local mechanisms.
So:
- acceptance criteria are mapped to prototype evidence
- key V2-boundary scenarios are expressed directly against `enginev2`
- prototype behavior is reviewable scenario-by-scenario
### Phase 04 completion decision
Phase 04 has now met its intended prototype scope:
- ownership
- execution gating
- outcome branching
- minimal historical-data model
- prototype scenario closure
So:
- no broad new Phase 04 work should be added
- next work should move to `Phase 4.5` gate-hardening

40  sw-block/.private/phase/phase-04-log.md

@@ -1,7 +1,7 @@
# Phase 04 Log
Date: 2026-03-27
-Status: active
+Status: complete
## 2026-03-27
@@ -40,7 +40,37 @@ Status: active
- attach/supersede now establish ownership only
- handshake range validation added
- enginev2 tests increased to 46 passing
- Next phase focus narrowed to P2:
- recovery outcome branching
- assignment-intent orchestration
- prototype end-to-end recovery flow
- Phase 04 P2 delivered and accepted:
- outcome branching added:
- `OutcomeZeroGap`
- `OutcomeCatchUp`
- `OutcomeNeedsRebuild`
- assignment-intent orchestration added
- stale assignment epoch now rejected
- assignment result now distinguishes created / superseded / failed
- end-to-end prototype recovery tests added
- zero-gap classification tightened:
- exact equality to committed boundary only
- replica-ahead is not zero-gap
- enginev2 tests increased to 63 passing
- Phase 04 P3 delivered and accepted:
- `WALHistory` added as minimal historical-data model
- recoverability proof strengthened:
- retained start
- end within head
- contiguous coverage
- base snapshot added for correct `StateAt()` after tail advance
- divergent-tail truncation made explicit in sender/session execution
- WAL-backed prototype recovery tests added
- enginev2 tests increased to 83 passing
- Phase 04 P4 delivered and accepted:
- acceptance criteria mapped to prototype evidence
- V2-boundary scenarios expressed against `enginev2`
- prototype scenario closure achieved
- enginev2 tests increased to 95 passing
- Phase 04 is now complete for its intended prototype scope.
- Next recommended phase:
- `Phase 4.5`
- tighten bounded `CatchUp`
- formalize `Rebuild`
- strengthen crash-consistency / recoverability / liveness proof
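The zero-gap tightening recorded in this log (exact equality to the committed boundary only; replica-ahead is never zero-gap) can be sketched as a classification function. The outcome names are from the log; the `classify` signature and the replica-ahead routing to the truncate-then-catch-up path are illustrative assumptions:

```go
package main

import "fmt"

type Outcome string

const (
	OutcomeZeroGap      Outcome = "ZeroGap"
	OutcomeCatchUp      Outcome = "CatchUp"
	OutcomeNeedsRebuild Outcome = "NeedsRebuild"
)

// classify branches on the replica tail relative to the committed boundary
// and the retained WAL window. Zero-gap requires EXACT equality; a replica
// ahead of the boundary is divergent, not zero-gap.
func classify(replicaTail, committed, retainedStart uint64) Outcome {
	switch {
	case replicaTail == committed:
		return OutcomeZeroGap
	case replicaTail > committed:
		return OutcomeCatchUp // divergent tail: truncate, then catch up
	case replicaTail+1 >= retainedStart:
		return OutcomeCatchUp // gap fully covered by retained WAL
	default:
		return OutcomeNeedsRebuild // needed prefix recycled: replay impossible
	}
}

func main() {
	fmt.Println(classify(100, 100, 10)) // ZeroGap (exact equality only)
	fmt.Println(classify(120, 100, 10)) // CatchUp (ahead is NOT zero-gap)
	fmt.Println(classify(5, 100, 50))   // NeedsRebuild (prefix recycled)
}
```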

81  sw-block/.private/phase/phase-04.md

@@ -1,7 +1,7 @@
# Phase 04
Date: 2026-03-27
-Status: active
+Status: complete
Purpose: start the first standalone V2 implementation slice under `sw-block/`, centered on per-replica sender ownership and explicit recovery-session ownership
## Goal
@@ -93,6 +93,7 @@ Delivered in this phase so far:
- execution APIs implemented:
- `BeginConnect`
- `RecordHandshake`
- `RecordHandshakeWithOutcome`
- `BeginCatchUp`
- `RecordCatchUpProgress`
- `CompleteSessionByID`
@@ -101,15 +102,42 @@ Delivered in this phase so far:
- zero-gap handshake fast path allowed
- attach/supersede now establish ownership only
- sender-group orchestration tests added
- recovery outcome branching implemented:
- `OutcomeZeroGap`
- `OutcomeCatchUp`
- `OutcomeNeedsRebuild`
- assignment-intent orchestration implemented:
- reconcile + recovery target session creation
- stale assignment epoch rejected
- created/superseded/failed outcomes distinguished
- P2 data-boundary correction accepted:
- zero-gap now requires exact equality to committed boundary
- replica-ahead is not zero-gap
- minimal historical-data prototype implemented:
- `WALHistory`
- retained-prefix / recycled-range semantics
- executable recoverability proof
- base snapshot for historical state after tail advance
- explicit safe-boundary handling implemented:
- divergent tail requires truncation before `InSync`
- truncation recorded via sender-owned execution API
- WAL-backed prototype tests added:
- catch-up recovery with data verification
- rebuild fallback with proof of unrecoverability
- truncate-then-`InSync` with committed-boundary verification
- current `enginev2` test state at latest review:
- 46 tests passing
Next focus for `sw`:
- continue Phase 04 beyond execution gating:
- recovery outcome branching
- sender-group orchestration from assignment intent
- prototype-level end-to-end recovery flow
- 95 tests passing
- prototype scenario closure completed:
- acceptance criteria mapped to prototype evidence
- V2-boundary scenarios expressed against `enginev2`
- small end-to-end prototype harness added
Next phase:
- `Phase 4.5`
- bounded `CatchUp`
- first-class `Rebuild`
- crash-consistency / recoverability / liveness proof hardening
- do not integrate into V1 production tree yet
### P1
@@ -141,6 +169,39 @@ Next focus for `sw`:
- completion / invalidation
- rebuild escalation
### P3
10. add minimal historical-data prototype
- retained prefix/window
- minimal recoverability state
- explicit "why catch-up is allowed" proof
11. make safe-boundary data handling explicit
- divergent tail cleanup / truncate rule
- or equivalent explicit boundary handling before `InSync`
12. strengthen recoverability/rebuild tests
- executable proof of:
- recoverable gap
- unrecoverable gap
- rebuild fallback boundary
### P4
13. close prototype scenario coverage
- map key acceptance criteria onto `enginev2` scenarios/tests
- make prototype evidence reviewable scenario-by-scenario
14. express the 4 V2-boundary cases against the prototype
- changed-address identity-preserving recovery
- `NeedsRebuild` persistence
- catch-up without overwriting safe data
- repeated disconnect/reconnect cycles
15. add one small prototype harness if needed
- enough to show assignment -> recovery -> outcome flow end-to-end
- no product/backend integration yet
## Exit Criteria
Phase 04 is done when:
@@ -151,3 +212,5 @@ Phase 04 is done when:
4. endpoint update and epoch invalidation are tested
5. sender-owned execution flow is validated
6. recovery outcome branching exists at prototype level
7. minimal historical-data / recoverability model exists at prototype level
8. prototype scenario closure is achieved for key V2 acceptance cases

94  sw-block/.private/phase/phase-05-decisions.md

@@ -0,0 +1,94 @@
# Phase 05 Decisions
## Decision 1: Real V2 engine work lives under `sw-block/engine/replication/`
The first real engine slice is established under:
- `sw-block/engine/replication/`
This keeps V2 separate from:
- `sw-block/prototype/`
- `weed/storage/blockvol/`
## Decision 2: Slice 1 is accepted
Accepted scope:
1. stable per-replica sender identity
2. stable recovery-session identity
3. stale authority fencing
4. endpoint / epoch invalidation
5. ownership registry
## Decision 3: Stable identity must not be address-shaped
The engine registry is now keyed by stable `ReplicaID`, not mutable endpoint address.
This is a required structural break from the V1/V1.5 identity-loss pattern.
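Decision 3 reduces to a registry keyed by `ReplicaID` with the endpoint held as a mutable attribute. This is a minimal hypothetical sketch; the real engine registry under `sw-block/engine/replication/` carries more state:

```go
package main

import "fmt"

// ReplicaID is the stable engine identity; the endpoint is mutable data,
// never a map key.
type ReplicaID string

type Sender struct {
	ID       ReplicaID
	Endpoint string
}

type Registry struct{ senders map[ReplicaID]*Sender }

func NewRegistry() *Registry { return &Registry{senders: map[ReplicaID]*Sender{}} }

func (r *Registry) Attach(id ReplicaID, ep string) *Sender {
	s := &Sender{ID: id, Endpoint: ep}
	r.senders[id] = s
	return s
}

// UpdateEndpoint changes the address while preserving identity: the same
// sender is found under the same key, so ownership and recovery state
// survive a changed DataAddr.
func (r *Registry) UpdateEndpoint(id ReplicaID, ep string) bool {
	s, ok := r.senders[id]
	if !ok {
		return false
	}
	s.Endpoint = ep
	return true
}

func main() {
	reg := NewRegistry()
	reg.Attach("replica-1", "10.0.0.1:8080")
	reg.UpdateEndpoint("replica-1", "10.0.0.9:8080")
	s := reg.senders["replica-1"]
	fmt.Println(s.ID, s.Endpoint)
}
```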
## Decision 4: Slice 2 is accepted
Accepted scope:
1. connect / handshake / catch-up flow
2. zero-gap / catch-up / needs-rebuild branching
3. stale execution rejection during active recovery
4. bounded catch-up semantics in engine path
5. rebuild execution shell
## Decision 5: Slice 3 owns real recoverability inputs
Slice 3 should be the point where:
1. recoverable vs unrecoverable gap uses real engine inputs
2. trusted-base / rebuild-source decision uses real engine data inputs
3. truncation / safe-boundary handling is tied to real engine state
4. historical correctness at recovery target is validated from engine inputs
## Decision 6: Slice 3 is accepted
Accepted scope:
1. real engine recoverability input path
2. trusted-base / rebuild-source decision from engine data inputs
3. truncation / safe-boundary handling tied to engine state
4. recoverability gating without overclaiming full historical reconstruction in engine
## Decision 7: Slice 3 should replace carried-forward heuristics where appropriate
In particular:
1. simple rebuild-source heuristics carried from prototype should not become permanent engine policy
2. Slice 3 should tighten these decisions against real engine recoverability inputs
## Decision 8: Slice 4 is the engine integration closure slice
Next focus:
1. real assignment/control intent entry path
2. engine observability / debug surface
3. focused integration tests for V2-boundary cases
4. validation against selected real failure classes from `learn/projects/sw-block/` and `weed/storage/block*`
## Decision 9: Slice 4 is accepted
Accepted scope:
1. real orchestrator entry path
2. assignment/update-driven recovery through that path
3. engine observability / causal recovery logging
4. diagnosable V2-boundary integration tests
## Decision 10: Phase 05 is complete
Reason:
1. ownership core is accepted
2. recovery execution core is accepted
3. data / recoverability core is accepted
4. integration closure is accepted
Next:
- `Phase 06` broader engine implementation stage

78  sw-block/.private/phase/phase-05-log.md

@@ -0,0 +1,78 @@
# Phase 05 Log
## 2026-03-29
### Opened
`Phase 05` opened as:
- V2 engine planning + Slice 1 ownership core
### Accepted
1. engine module location
- `sw-block/engine/replication/`
2. Slice 1 ownership core
- stable per-replica sender identity
- stable recovery-session identity
- sender/session fencing
- endpoint / epoch invalidation
- ownership registry
3. Slice 1 identity correction
- registry now keyed by stable `ReplicaID`
- mutable `Endpoint` separated from identity
- real changed-`DataAddr` preservation covered by test
4. Slice 1 encapsulation
- mutable sender/session authority state no longer exposed directly
- snapshot/read-only inspection path in place
5. Slice 2 recovery execution core
- connect / handshake / catch-up flow
- explicit zero-gap / catch-up / needs-rebuild branching
- stale execution rejection during active recovery
- bounded catch-up semantics
- rebuild execution shell
6. Slice 2 validation
- corrected tester summary accepted
- `12` ownership tests + `18` recovery tests = `30` total
- Slice 2 accepted for progression to Slice 3 planning
7. Slice 3 data / recoverability core
- `RetainedHistory` introduced as engine-level recoverability input
- history-driven sender APIs added for handshake and rebuild-source selection
- trusted-base decision now requires both checkpoint trust and replayable tail
- truncation remains a completion gate / protocol boundary
8. Slice 3 validation
- corrected tester summary accepted
- `12` ownership tests + `18` recovery tests + `18` recoverability tests = `48` total
- accepted boundary:
- engine proves historical-correctness prerequisites
- simulator retains stronger historical reconstruction proof
- Slice 3 accepted for progression to Slice 4 planning
9. Slice 4 integration closure
- `RecoveryOrchestrator` added as integrated engine entry path
- assignment/update-driven recovery is exercised through orchestrator
- observability surface added:
- `RegistryStatus`
- `SenderStatus`
- `SessionSnapshot`
- `RecoveryLog`
- causal recovery logging now covers invalidation, escalation, truncation, completion, rebuild transitions
10. Slice 4 validation
- corrected tester summary accepted
- `12` ownership tests + `18` recovery tests + `18` recoverability tests + `11` integration tests = `59` total
- Slice 4 accepted
- `Phase 05` accepted as complete
### Next
1. `Phase 06` planning
2. broader engine implementation stage
3. real-engine integration against selected `weed/storage/block*` constraints and failure classes

356  sw-block/.private/phase/phase-05.md

@@ -0,0 +1,356 @@
# Phase 05
Date: 2026-03-29
Status: complete
Purpose: begin the real V2 engine track under `sw-block/` by moving from prototype proof to the first engine slice
## Why This Phase Exists
The project has now completed:
1. V2 design/FSM closure
2. V2 protocol/simulator validation
3. Phase 04 prototype closure
4. Phase 4.5 evidence hardening
So the next step is no longer:
- extend prototype breadth
The next step is:
- start disciplined real V2 engine work
## Phase Goal
Start the real V2 engine line under `sw-block/` with:
1. explicit engine module location
2. Slice 1 ownership-core boundaries
3. first engine ownership-core implementation
4. engine-side validation tied back to accepted prototype invariants
## Relationship To Previous Phases
`Phase 05` is built on:
- `sw-block/design/v2-engine-readiness-review.md`
- `sw-block/design/v2-engine-slicing-plan.md`
- `sw-block/.private/phase/phase-04.md`
- `sw-block/.private/phase/phase-4.5.md`
This is a new implementation phase.
It is not:
1. more prototype expansion
2. V1 integration
3. backend redesign
## Scope
### In scope
1. choose real V2 engine module location under `sw-block/`
2. define Slice 1 file/module boundaries
3. write short engine ownership-core spec
4. start Slice 1 implementation:
- stable per-replica sender object
- stable recovery-session object
- session identity fencing
- endpoint / epoch invalidation
- ownership registry / sender-group equivalent
5. add focused engine-side ownership/fencing tests
### Out of scope
1. Smart WAL expansion
2. full storage/backend redesign
3. full rebuild-source decision logic
4. V1 production integration
5. performance work
6. full product integration
## Planned Slices
### P0: Engine Planning Setup
1. choose real V2 engine module location under `sw-block/`
2. define Slice 1 file/module boundaries
3. write ownership-core spec
4. map 3-5 acceptance scenarios to Slice 1 expectations
Status:
- accepted
- engine module location chosen: `sw-block/engine/replication/`
- Slice 1 boundaries are explicit enough to start implementation
### P1: Slice 1 Ownership Core
1. implement stable per-replica sender object
2. implement stable recovery-session object
3. implement sender/session identity fencing
4. implement endpoint / epoch invalidation
5. implement ownership registry
Status:
- accepted
- stable `ReplicaID` is now explicit and separate from mutable `Endpoint`
- engine registry is keyed by stable identity, not address-shaped strings
- real changed-`DataAddr` preservation is covered by test
### P2: Slice 1 Validation
1. engine-side tests for ownership/fencing
2. changed-address case
3. stale-session rejection case
4. epoch-bump invalidation case
5. traceability back to accepted prototype behavior
Status:
- accepted
- Slice 1 ownership/fencing tests are in place and passing
- acceptance/gate mapping is strong enough to move to Slice 2
### P3: Slice 2 Planning Setup
1. define Slice 2 boundaries explicitly
2. distinguish Slice 2 core from carried-forward prototype support
3. map Slice 2 engine expectations from accepted prototype evidence
4. prepare Slice 2 validation targets
Status:
- accepted
- Slice 2 recovery execution core is implemented and validated
- corrected tester summary accepted:
- `12` ownership tests
- `18` recovery tests
- `30` total
### P4: Slice 3 Planning Setup
1. define Slice 3 boundaries explicitly
2. connect recovery decisions to real engine recoverability inputs
3. make trusted-base / rebuild-source decision use real engine data inputs
4. prepare Slice 3 validation targets
Status:
- accepted
- Slice 3 data / recoverability core is implemented and validated
- corrected tester summary accepted:
- `12` ownership tests
- `18` recovery tests
- `18` recoverability tests
- `48` total
- important boundary preserved:
- engine proves historical-correctness prerequisites
- full historical reconstruction proof remains simulator-side
## Slice 3 Guardrails
Slice 3 is the point where V2 must move from:
- recovery automaton is coherent
to:
- recovery basis is provable
So Slice 3 must stay tight.
### Guardrail 1: No optimistic watermark in place of recoverability proof
Do not accept:
- loose head/tail watermarks
- "looks retained enough"
- heuristic recoverability
Slice 3 should prove:
1. why a gap is recoverable
2. why a gap is unrecoverable
### Guardrail 2: No current extent state pretending to be historical correctness
Do not accept:
- current extent image as substitute for target-LSN truth
- checkpoint/base state that leaks newer state into older historical queries
Slice 3 should prove historical correctness at the actual recovery target.
### Guardrail 3: No `snapshot + tail` without trusted-base proof
Do not accept:
- "snapshot exists" as sufficient
Require:
1. trusted base exists
2. trusted base covers the required base state
3. retained tail can be replayed continuously from that base to the target
If not, recovery must use:
- `FullBase`
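Guardrail 3's three requirements fold into one decision function. This is a hypothetical sketch of the rule, not the engine's trusted-base API; the field names and the `+1` contiguity convention are assumptions:

```go
package main

import "fmt"

// Checkpoint is a candidate base snapshot.
type Checkpoint struct {
	Trusted bool
	BaseLSN uint64 // state covered up to and including this LSN
}

// RetainedTail is the replayable WAL window.
type RetainedTail struct {
	Start uint64 // first retained entry
	End   uint64 // last retained entry
}

type Source string

const (
	SnapshotPlusTail Source = "SnapshotPlusTail"
	FullBase         Source = "FullBase"
)

// chooseSource allows snapshot+tail only when (1) the base is trusted,
// (2) the tail can begin replay right after the base, and (3) the tail
// reaches the recovery target; otherwise it falls back to FullBase.
func chooseSource(cp Checkpoint, tail RetainedTail, target uint64) Source {
	coversBase := tail.Start <= cp.BaseLSN+1
	reachesTarget := tail.End >= target
	if cp.Trusted && coversBase && reachesTarget {
		return SnapshotPlusTail
	}
	return FullBase
}

func main() {
	cp := Checkpoint{Trusted: true, BaseLSN: 100}
	fmt.Println(chooseSource(cp, RetainedTail{Start: 90, End: 200}, 150))  // SnapshotPlusTail
	fmt.Println(chooseSource(cp, RetainedTail{Start: 120, End: 200}, 150)) // FullBase: gap after base
	fmt.Println(chooseSource(Checkpoint{BaseLSN: 100}, RetainedTail{Start: 90, End: 200}, 150)) // FullBase: untrusted
}
```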
### Guardrail 4: Truncation is protocol boundary, not cleanup policy
Do not treat truncation as:
- optional cleanup
- post-recovery tidying
Treat truncation as:
1. divergent tail removal
2. explicit safe-boundary restoration
3. prerequisite for safe `InSync` / recovery completion where applicable
### P5: Slice 4 Planning Setup
1. define Slice 4 boundaries explicitly
2. connect engine control/recovery core to real assignment/control intent entry path
3. add engine observability / debug surface for ownership and recovery failures
4. prepare integration validation against V2-boundary failure classes
Status:
- accepted
- Slice 4 integration closure is implemented and validated
- corrected tester summary accepted:
- `12` ownership tests
- `18` recovery tests
- `18` recoverability tests
- `11` integration tests
- `59` total
## Slice 4 Guardrails
Slice 4 should close integration, not just add an entry point and some logs.
### Guardrail 1: Entry path must actually drive recovery
Do not accept:
- tests that manually push sender/session state while only pretending to use integration entry points
Require:
1. real assignment/control intent entry path
2. session creation / invalidation / restart triggered through that path
3. recovery flow driven from that path, not only from unit-level helper calls
### Guardrail 2: Changed-address must survive the real entry path
Do not accept:
- changed-address correctness proven only at local object level
Require:
1. stable `ReplicaID` survives real assignment/update entry path
2. endpoint update invalidates old session correctly
3. new recovery session is created correctly on updated endpoint
### Guardrail 3: Observability must show protocol causality
Do not accept:
- only state snapshots
- only phase dumps
Require observability that can explain:
1. why recovery entered `NeedsRebuild`
2. why a session was superseded
3. why a completion or progress update was rejected
4. why endpoint / epoch change caused invalidation
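A causal log of the kind Guardrail 3 demands records the reason for each transition, not just the resulting state. The `RecoveryLog` name appears in the Slice 4 log earlier; the entry shape and `Explain` helper here are hypothetical:

```go
package main

import "fmt"

// RecoveryEvent captures protocol causality: which transition happened
// and why, not merely a state snapshot.
type RecoveryEvent struct {
	ReplicaID string
	Event     string // e.g. "SessionSuperseded", "EnterNeedsRebuild"
	Cause     string
}

type RecoveryLog struct{ events []RecoveryEvent }

func (l *RecoveryLog) Record(id, event, cause string) {
	l.events = append(l.events, RecoveryEvent{ReplicaID: id, Event: event, Cause: cause})
}

// Explain answers "why did this replica take this transition?" from the
// most recent matching entry.
func (l *RecoveryLog) Explain(id, event string) (string, bool) {
	for i := len(l.events) - 1; i >= 0; i-- {
		e := l.events[i]
		if e.ReplicaID == id && e.Event == event {
			return e.Cause, true
		}
	}
	return "", false
}

func main() {
	var rlog RecoveryLog
	rlog.Record("r1", "SessionSuperseded", "endpoint updated: old session invalidated")
	rlog.Record("r1", "EnterNeedsRebuild", "gap start precedes retained WAL window")
	cause, _ := rlog.Explain("r1", "EnterNeedsRebuild")
	fmt.Println(cause)
}
```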
### Guardrail 4: Failure replay must be explainable
Do not accept:
- a replay that reproduces failure but cannot explain the cause from engine observability
Require:
1. selected failure-class replays through the real entry path
2. observability sufficient to explain the control/recovery decision
3. reviewability against key V2-boundary failures
## Exit Criteria
Phase 05 Slice 1 is done when:
1. the real V2 engine module location is chosen
2. Slice 1 boundaries are explicit
3. engine ownership core exists under `sw-block/`
4. engine-side ownership/fencing tests pass
5. Slice 1 evidence is reviewable against prototype expectations
This bar is now met.
Phase 05 Slice 2 is done when:
1. engine-side recovery execution flow exists
2. zero-gap / catch-up / needs-rebuild branching is explicit
3. stale execution is rejected during active recovery
4. bounded catch-up semantics are enforced in engine path
5. rebuild execution shell is validated
This bar is now met.
Phase 05 Slice 3 is done when:
1. recoverable vs unrecoverable gap uses real engine recoverability inputs
2. trusted-base / rebuild-source decision uses real engine data inputs
3. truncation / safe-boundary handling is tied to real engine state
4. history-driven engine APIs exist for recovery decisions
5. Slice 3 validation is reviewable without overclaiming full historical reconstruction
This bar is now met.
Phase 05 Slice 4 is done when:
1. real assignment/control intent entry path exists
2. changed-address recovery works through the real entry path
3. observability explains protocol causality, not only state snapshots
4. selected V2-boundary failures are replayable and diagnosable through engine integration tests
This bar is now met.
## Assignment For `sw`
Phase 05 is now complete.
Next phase:
- `Phase 06` broader engine implementation stage
## Assignment For `tester`
Phase 05 validation is complete.
Next phase:
- `Phase 06` engine implementation validation against real-engine constraints and failure classes
## Management Rule
`Phase 05` should stay narrow.
It should start the engine line with:
1. ownership
2. fencing
3. validation
It should not try to absorb later slices early.

68  sw-block/.private/phase/phase-06-decisions.md

@@ -0,0 +1,68 @@
# Phase 06 Decisions
## Decision 1: Phase 06 is broader engine implementation, not new design
The protocol shape and engine core contracts were already accepted.
Phase 06 implemented around them.
## Decision 2: Phase 06 must connect to real constraints
This phase explicitly used:
1. `learn/projects/sw-block/` for failure gates and test lineage
2. `weed/storage/block*` for real implementation constraints
without importing V1 structure as the V2 design template.
## Decision 3: Phase 06 should replace key synchronous conveniences
The accepted Slice 4 convenience flows were sufficient for closure work, but broader engine work required real step boundaries.
This is now satisfied via planner/executor separation.
## Decision 4: Phase 06 ends with a runnable engine stage decision
Result:
- yes, the project now has a broader runnable engine stage that is ready to proceed to real-system integration / product-path work
## Decision 5: Phase 06 P0 is accepted
Accepted scope:
1. adapter/module boundaries
2. convenience-flow classification
3. initial real-engine stage framing
## Decision 6: Phase 06 P1 is accepted
Accepted scope:
1. storage/control adapter interfaces
2. `RecoveryDriver` planner/resource-acquisition layer
3. full-base and WAL retention resource contracts
4. fail-closed preconditions on planning paths
## Decision 7: Phase 06 P2 is accepted
Accepted scope:
1. explicit planner/executor split on top of `RecoveryPlan`
2. executor-owned cleanup symmetry on success/failure/cancellation
3. plan-bound rebuild execution with no policy re-derivation at execute time
4. synchronous orchestrator completion helpers remain test-only convenience
## Decision 8: Phase 06 P3 is accepted
Accepted scope:
1. selected real failure classes validated through the engine path
2. cross-layer engine/storage proof validation
3. diagnosable failure when proof or resource acquisition cannot be established
## Decision 9: Phase 06 is complete
Next step:
- `Phase 07` real-system integration / product-path decision

51  sw-block/.private/phase/phase-06-log.md

@@ -0,0 +1,51 @@
# Phase 06 Log
## 2026-03-30
### Opened
`Phase 06` opened as:
- broader engine implementation stage
### Starting basis
1. `Phase 05`: complete
2. engine core and integration closure accepted
3. next work moves from slice proof to broader runnable engine stage
### Accepted
1. Phase 06 P0
- adapter/module boundaries defined
- convenience flows explicitly classified
2. Phase 06 P1
- storage/control adapter surfaces defined
- `RecoveryDriver` added as planner/resource-acquisition layer
- full-base rebuild now has explicit resource contract
- WAL pin contract tied to actual recovery need
- driver preconditions fail closed
3. Phase 06 P2
- explicit planner/executor split accepted
- executor owns release symmetry on success, failure, and cancellation
- rebuild execution now consumes plan-bound source/target values
- tester final validation accepted with reduced-but-sufficient rebuild failure-path coverage
4. Phase 06 P3
- selected real failure classes validated through the engine path
- changed-address restart now uses plan cancellation and re-plan flow
- stale execution is caught through the executor-managed loop
- cross-layer trusted-base / replayable-tail proof path validated end-to-end
- rebuild planning failures now clean up sessions and remain diagnosable
### Closed
`Phase 06` closed as complete.
### Next
1. Phase 07 real-system integration / product-path decision
2. service-slice integration against real control/storage surroundings
3. first product-path gating decision

193  sw-block/.private/phase/phase-06.md

@@ -0,0 +1,193 @@
# Phase 06
Date: 2026-03-30
Status: complete
Purpose: move from validated engine slices to the first broader runnable V2 engine stage
## Why This Phase Exists
`Phase 05` established and validated:
1. ownership core
2. recovery execution core
3. recoverability/data gating core
4. integration closure
What still does not exist is a broader engine stage that can run with:
1. real control-plane inputs
2. real persistence/backing inputs
3. non-trivial execution loops instead of only synchronous convenience paths
So `Phase 06` exists to turn the accepted engine shape into the first broader runnable engine stage.
Phase 06 must connect the accepted engine core to real control and real storage truth, not just wrap current abstractions with adapters.
## Phase Goal
Build the first broader V2 engine stage without reopening protocol shape.
This phase should focus on:
1. real engine adapters around the accepted core
2. asynchronous or stepwise execution paths where Slice 4 used synchronous helpers
3. real retained-history / checkpoint input plumbing
4. validation against selected real failure classes and real implementation constraints
## Overall Roadmap
Completed:
1. Phase 01-03: design + simulator
2. Phase 04: prototype closure
3. Phase 4.5: evidence hardening
4. Phase 05: engine slice closure
5. Phase 06: broader engine implementation stage
Next:
1. Phase 07: real-system integration / product-path decision
This roadmap should stay strict:
- no return to broad prototype expansion
- no uncontrolled engine sprawl
## Scope
### In scope
1. control-plane adapter into `sw-block/engine/replication/`
2. retained-history / checkpoint adapter into engine recoverability APIs
3. replacement of synchronous convenience flows with explicit engine steps where needed
4. engine error taxonomy and observability tightening
5. validation against selected real failure classes from:
- `learn/projects/sw-block/`
- `weed/storage/block*`
### Out of scope
1. Smart WAL expansion
2. full backend redesign
3. performance optimization as primary goal
4. V1 replacement rollout
5. full product integration
## Phase 06 Items
### P0: Engine Stage Plan
Status:
- accepted
- module boundaries now explicit:
- `adapter.go`
- `driver.go`
- `orchestrator.go` classification
- convenience flows are now classified as:
- test-only convenience wrapper
- stepwise engine task
- planner/executor split
### P1: Control / History Adapters
Status:
- accepted
- `StorageAdapter` boundary exists and is exercised by tests
- full-base rebuild now has a real pin/release contract
- WAL pinning is tied to actual recovery contract, not loose watermark use
- planner fails closed on missing sender / missing session / wrong session kind
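The pin/release contract above can be sketched as a small refcounted retention hold. This is an illustrative sketch only, with hypothetical names (`Pinner`, `RetentionFloor`); the real adapter ties pins to the flusher's retention floor rather than an in-memory map.

```go
package main

import (
	"errors"
	"fmt"
)

// Pinner is a hypothetical retention hold: while at least one pin at or
// below an LSN exists, truncation must not advance past that LSN.
type Pinner struct {
	pins map[int64]int // pinned LSN -> refcount
}

func NewPinner() *Pinner { return &Pinner{pins: map[int64]int{}} }

// Pin places a hold at an LSN; pins are refcounted so concurrent
// recoveries can hold the same point independently.
func (p *Pinner) Pin(lsn int64) { p.pins[lsn]++ }

// Release fails closed: releasing without a matching pin is an error,
// not a silent no-op, so lifecycle bugs stay diagnosable.
func (p *Pinner) Release(lsn int64) error {
	if p.pins[lsn] == 0 {
		return errors.New("release without matching pin")
	}
	p.pins[lsn]--
	if p.pins[lsn] == 0 {
		delete(p.pins, lsn)
	}
	return nil
}

// RetentionFloor reports the lowest pinned LSN; the flusher's
// truncation watermark must stop there.
func (p *Pinner) RetentionFloor(def int64) int64 {
	floor := def
	for lsn := range p.pins {
		if lsn < floor {
			floor = lsn
		}
	}
	return floor
}

func main() {
	p := NewPinner()
	p.Pin(120)
	p.Pin(80)
	fmt.Println(p.RetentionFloor(1000)) // 80
	p.Release(80)
	fmt.Println(p.RetentionFloor(1000)) // 120
}
```

The key design point is that the floor is derived from live pins, not from a loose watermark the caller maintains separately.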
### P2: Execution Driver
Status:
- accepted
- executor now owns resource lifecycle on success / failure / cancellation
- catch-up execution is stepwise and budget-checked per progress step
- rebuild execution consumes plan-bound source/target values and does not re-derive policy at execute time
- `CompleteCatchUp` / `CompleteRebuild` remain test-only convenience wrappers
- tester validation accepted with reduced-but-sufficient rebuild failure-path coverage
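A stepwise, budget-checked catch-up loop like the one accepted above can be sketched as follows. All names (`CatchUpPlan`, `Step`, `StepBudget`) are hypothetical; the real executor ships WAL records per step instead of advancing a counter.

```go
package main

import "fmt"

// CatchUpPlan is a hypothetical plan: each progress step ships a bounded
// batch and re-checks the budget, so cancellation or invalidation can
// interrupt work instead of a synchronous helper running to completion.
type CatchUpPlan struct {
	NextLSN, TargetLSN int64
	StepBudget         int64 // max records per progress step
}

// Step advances at most StepBudget records and reports whether the
// frozen target has been reached.
func Step(p *CatchUpPlan) (done bool) {
	remain := p.TargetLSN - p.NextLSN
	if remain <= 0 {
		return true
	}
	n := remain
	if n > p.StepBudget {
		n = p.StepBudget
	}
	p.NextLSN += n // in the real executor: ship a batch of WAL records
	return p.NextLSN >= p.TargetLSN
}

func main() {
	p := &CatchUpPlan{NextLSN: 0, TargetLSN: 10, StepBudget: 4}
	steps := 1
	for !Step(p) {
		steps++
	}
	fmt.Println(steps) // 3 steps: 4+4+2
}
```

Because the executor owns the loop, each step boundary is also where stale-epoch or stale-session invalidation can cancel the plan.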
### P3: Validation Against Real Failure Classes
Status:
- accepted
- changed-address restart now validated through planner/executor path with plan cancellation
- stale epoch/session during active execution now validated through the executor-managed loop
- cross-layer trusted-base / replayable-tail proof path validated end-to-end
- rebuild fallback and pin-failure cleanup now fail closed and are diagnosable
## Guardrails
### Guardrail 1: Do not reopen protocol shape
Phase 06 implemented around accepted engine slices and did not reopen:
1. sender/session authority model
2. bounded catch-up contract
3. recoverability/truncation boundary
### Guardrail 2: Do not let adapters smuggle V1 structure back in
V1 code and docs remain:
1. constraints
2. failure gates
3. integration references
not the V2 architecture template.
### Guardrail 3: Prefer explicit engine steps over synchronous convenience
Key convenience helpers remain test-only. Real engine work now has explicit planner/executor boundaries.
### Guardrail 4: Keep evidence quality high
Phase 06 improved:
1. cross-layer traceability
2. diagnosability
3. real-failure validation
without growing protocol surface.
### Guardrail 5: Do not fake storage truth with metadata-only adapters
Phase 06 now requires:
1. trusted base to come from storage-side truth
2. replayable tail to be grounded in retention state
3. observable rejection when those proofs cannot be established
## Exit Criteria
Phase 06 is done when:
1. engine has real control/history adapters into the accepted core
2. engine has real storage/base adapters into the accepted core
3. key synchronous convenience paths are explicitly classified or replaced by real engine steps where necessary
4. selected real failure classes are validated against the engine stage
5. at least one cross-layer storage/engine proof path is validated end-to-end
6. engine observability remains good enough to explain recovery causality
Status:
- met
## Closeout
`Phase 06` is complete.
It established:
1. a broader runnable engine stage around the accepted Phase 05 core
2. real planner/executor/resource contracts
3. validated failure-class behavior through the engine path
4. diagnosable proof rejection and cleanup behavior
Next step:
- `Phase 07` real-system integration / product-path decision

119
sw-block/.private/phase/phase-07-decisions.md

@ -0,0 +1,119 @@
# Phase 07 Decisions
## Decision 1: Phase 07 is real-system integration, not protocol redesign
The V2 protocol shape, engine core, and broader runnable engine stage are already accepted.
Phase 07 should integrate them into a real-system service slice.
## Decision 2: Phase 07 should make the first product-path decision
This phase should not only integrate a service slice.
It should also decide:
1. what the first product path is
2. what remains before pre-production hardening
## Decision 3: Phase 07 must preserve accepted V2 boundaries
Phase 07 should preserve:
1. narrow catch-up semantics
2. rebuild as the formal recovery path
3. trusted-base / replayable-tail proof boundaries
4. stable identity / fenced execution / diagnosable failure handling
## Decision 4: Phase 07 P0 service-slice direction is set
Current direction:
1. first service slice = `RF=2` block volume primary + one replica
2. engine remains in `sw-block/engine/replication/`
3. current bridge work starts in `sw-block/bridge/blockvol/`
4. deferred real blockvol-side bridge target = `weed/storage/blockvol/v2bridge/`
5. stable identity mapping is explicit:
- `ReplicaID = <volume-name>/<server-id>`
6. `blockvol` executes I/O but does not own recovery policy
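The identity mapping in this decision can be sketched minimally. The helper name `replicaID` is hypothetical; the point it illustrates is that identity comes from registry server ID, never from the endpoint address, and a missing server ID fails closed rather than falling back to the address.

```go
package main

import (
	"errors"
	"fmt"
)

// replicaID builds the stable identity <volume-name>/<server-id>.
// An empty server ID is rejected outright: falling back to the current
// address would silently re-create address-derived identity.
func replicaID(volume, serverID string) (string, error) {
	if serverID == "" {
		return "", errors.New("missing server id: refusing address-derived identity")
	}
	return volume + "/" + serverID, nil
}

func main() {
	id, _ := replicaID("vol-042", "srv-7")
	fmt.Println(id) // vol-042/srv-7

	// An address change never touches the identity, because the
	// identity never contained the address in the first place.
	if _, err := replicaID("vol-042", ""); err != nil {
		fmt.Println("fail closed:", err)
	}
}
```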
## Decision 5: Phase 07 P1 is accepted with explicit scope limits
Accepted `P1` coverage is:
1. real reader mapping from `BlockVol` state
2. real retention hold / release wiring into the flusher retention floor
3. one real WAL catch-up scan path through `v2bridge`
4. direct real-adapter tests under `weed/storage/blockvol/v2bridge/`
This acceptance means:
1. the real bridge path is now integrated and evidenced
2. `P1` is not yet acceptance proof of general post-checkpoint catch-up viability
Not accepted as part of `P1`:
1. snapshot transfer execution
2. full-base transfer execution
3. WAL truncation execution
4. master-side confirmed failover / control-intent integration
## Decision 6: Interim committed-truth limitation remains active
`Phase 07 P1` is accepted with an explicit carry-forward limitation:
1. interim `CommittedLSN = CheckpointLSN` is a service-slice mapping, not final V2 protocol truth
2. post-checkpoint catch-up semantics are therefore narrower than final V2 intent
3. later `Phase 07` work must not overclaim this limitation as solved until commit truth is separated from checkpoint truth
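The interim mapping can be made concrete with a small sketch. Field names (`VolState`, `AppliedLSN`) are assumptions for illustration; what matters is that committed truth is deliberately collapsed onto checkpoint truth.

```go
package main

import "fmt"

// VolState models the interim service-slice view: real progress may run
// past the checkpoint, but committed truth is not tracked separately.
type VolState struct {
	CheckpointLSN int64
	AppliedLSN    int64 // actual progress past the checkpoint
}

// committedLSN under the interim model is the checkpoint, by definition.
// Records applied after the checkpoint are therefore not claimable as
// committed, which is why post-checkpoint catch-up semantics stay
// narrower than final V2 intent.
func committedLSN(s VolState) int64 { return s.CheckpointLSN }

func main() {
	s := VolState{CheckpointLSN: 100, AppliedLSN: 130}
	fmt.Println(committedLSN(s)) // 100, not 130
}
```

Separating these two fields is exactly the carry-forward that later phases must close before commit truth can be claimed past the checkpoint.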
## Decision 7: Phase 07 P2 is accepted with scoped replay claims
Accepted `P2` coverage is:
1. real service-path replay for changed-address restart
2. stale epoch / stale session invalidation through the integrated path
3. unrecoverable-gap / needs-rebuild replay with diagnosable proof
4. explicit replay of the post-checkpoint boundary under the interim model
Not accepted as part of `P2`:
1. general integrated engine-driven post-checkpoint catch-up semantics
2. real control-plane delivery from master heartbeat into the bridge
3. rebuild execution beyond the already-deferred executor stubs
## Decision 8: Phase 07 now moves to product-path choice, not more bridge-shape proof
With `P0`, `P1`, and `P2` accepted, the next step is:
1. choose the first product path from accepted service-slice evidence
2. define what remains before pre-production hardening
3. keep unresolved limits explicit rather than hiding them behind broader claims
## Decision 9: Phase 07 P2 must replay the interim limitation explicitly
`Phase 07 P2` should not only replay happy-path or ordinary failure-path integration.
It should also include one explicit replay where:
1. the live bridge path is exercised after checkpoint truth has advanced
2. the observed catch-up limitation is diagnosed as a consequence of the interim mapping
3. the result is not overclaimed as proof of final V2 post-checkpoint catch-up semantics
## Decision 10: Phase 07 P3 is accepted and Phase 07 is complete
The first V2 product path is now explicitly chosen as:
1. `RF=2`
2. `sync_all`
3. existing master / volume-server heartbeat path
4. V2 engine owns recovery policy
5. `v2bridge` provides real storage truth
This decision is accepted with explicit non-claims:
1. not production-ready
2. no real master-side control delivery proof yet
3. no full rebuild execution proof yet
4. no general post-checkpoint catch-up proof yet
5. no full integrated engine -> executor -> `v2bridge` catch-up proof yet
Phase 07 is therefore complete, and the next phase is pre-production hardening.

63
sw-block/.private/phase/phase-07-log.md

@ -0,0 +1,63 @@
# Phase 07 Log
## 2026-03-30
### Opened
`Phase 07` opened as:
- real-system integration / product-path decision
### Starting basis
1. `Phase 06`: complete
2. broader runnable engine stage accepted
3. next work moves from engine-stage validation to real-system service-slice integration
### Delivered
1. Phase 07 P0
- service-slice plan defined
- implementation slice proposal delivered
- bridge layer introduced as:
- `sw-block/bridge/blockvol/` for current bridge work
- `weed/storage/blockvol/v2bridge/` as the deferred real integration target
- stable identity mapping made explicit:
- `ReplicaID = <volume-name>/<server-id>`
- engine / blockvol policy boundary made explicit
- initial bridge tests delivered (`8`)
2. Phase 07 P1
- real blockvol reader integrated via `weed/storage/blockvol/v2bridge/reader.go`
- real pinner integrated via `weed/storage/blockvol/v2bridge/pinner.go`
- one real catch-up executor path integrated via `weed/storage/blockvol/v2bridge/executor.go`
- direct real-adapter tests delivered in:
- `weed/storage/blockvol/v2bridge/bridge_test.go`
- accepted with explicit carry-forward:
- interim `CommittedLSN = CheckpointLSN` limits post-checkpoint catch-up semantics and is not final V2 commit truth
- acceptance is for the real integrated bridge path, not for general post-checkpoint catch-up viability
3. Phase 07 P2
- real service-path failure replay accepted
- accepted replay set includes:
- changed-address restart
- stale epoch / stale session invalidation
- unrecoverable-gap / needs-rebuild replay
- explicit post-checkpoint boundary replay
- evidence kept explicitly scoped:
- real `v2bridge` WAL-scan execution proven
- general integrated post-checkpoint catch-up semantics not overclaimed under the interim model
4. Phase 07 P3
- product-path decision accepted
- first product path chosen as:
- `RF=2`
- `sync_all`
- existing master / volume-server heartbeat path
- V2 engine recovery ownership with `v2bridge` real storage truth
- pre-hardening prerequisites made explicit
- intentional deferrals and non-claims recorded
- `Phase 07` completed
### Next
1. Phase 08 pre-production hardening
2. real master/control delivery integration
3. integrated catch-up / rebuild execution closure

220
sw-block/.private/phase/phase-07.md

@ -0,0 +1,220 @@
# Phase 07
Date: 2026-03-30
Status: complete
Purpose: connect the broader runnable V2 engine stage to a real-system service slice and decide the first product path
## Why This Phase Exists
`Phase 06` completed the broader runnable engine stage:
1. planner/executor/resource contracts are real
2. selected real failure classes are validated through the engine path
3. cross-layer trusted-base / replayable-tail proof path is validated
What still does not exist is a real-system slice where the engine runs inside actual service boundaries with real control/storage environments.
So `Phase 07` exists to answer:
1. how the engine runs as a real subsystem
2. what the first product path should be
3. what integration risks remain before pre-production hardening
## Phase Goal
Establish a real-system integration slice for the V2 engine and make the first product-path decision without reopening protocol shape.
## Scope
### In scope
1. service-slice integration around `sw-block/engine/replication/`
2. real control-plane / lifecycle entry path into the engine
3. real storage-side adapter hookup into existing system boundaries
4. selected real-system failure replay and diagnosis
5. explicit product-path decision framing
### Out of scope
1. broad performance optimization
2. Smart WAL expansion
3. full V1 replacement rollout
4. broad backend redesign
5. production rollout itself
## Phase 07 Items
### P0: Service-Slice Plan
1. define the first real-system service slice that will host the engine
2. define adapter/module boundaries at the service boundary
3. choose the concrete integration path to exercise first
4. identify which current adapters are still mock/test-only and must be replaced first
5. make the first-slice identity/epoch mapping explicit
6. treat `blockvol` as execution backend only, not recovery-policy owner
Status:
- delivered
- planning artifact:
- `sw-block/design/phase-07-service-slice-plan.md`
- implementation slice proposal:
- engine core: `sw-block/engine/replication/`
- bridge adapters: `sw-block/bridge/blockvol/`
- real blockvol integration target: `weed/storage/blockvol/v2bridge/` (`P1`)
- adapter replacement order:
- `control_adapter.go` (`P0`) done
- `storage_adapter.go` (`P0`) done
- `executor_bridge.go` (`P1`) deferred
- `observe_adapter.go` (`P1`) deferred
- first-slice identity mapping is explicit:
- `ReplicaID = <volume-name>/<server-id>`
- not derived from any address field
- engine / blockvol boundary is explicit:
- bridge maps intent and state
- `blockvol` executes I/O
- `blockvol` does not own recovery policy
- service-slice validation gaps called out for `P1`:
- real blockvol field mapping
- real pin/release lifecycle against reclaim/GC
- assignment timing vs engine session lifecycle
- executor bridge into real WAL/snapshot work
### P1: Real Entry-Path Integration
1. connect real control/lifecycle events into the engine entry path
2. connect real storage/base/recoverability signals into the engine adapters
3. preserve accepted engine authority/execution/recoverability contracts
Status:
- accepted
- real integration now established for:
- reader via `weed/storage/blockvol/v2bridge/reader.go`
- pinner via `weed/storage/blockvol/v2bridge/pinner.go`
- catch-up executor path via `weed/storage/blockvol/v2bridge/executor.go`
- direct real-adapter tests now exist in:
- `weed/storage/blockvol/v2bridge/bridge_test.go`
- accepted scope is explicit:
- real reader
- real retention hold / release
- real WAL catch-up scan path
- direct real bridge evidence for the integrated path
- still deferred:
- `TransferSnapshot`
- `TransferFullBase`
- `TruncateWAL`
- control intent from confirmed failover / master-side integration
- carry-forward limitation:
- under interim `CommittedLSN = CheckpointLSN`, this slice proves a real bridge path, not general post-checkpoint catch-up viability
- post-checkpoint catch-up semantics therefore remain narrower than final V2 intent and do not represent final V2 commit semantics
### P2: Real-System Failure Replay
1. replay selected real failure classes against the integrated service slice
2. confirm diagnosability from logs/status
3. identify any remaining mismatch between engine-stage assumptions and real system behavior
Status:
- accepted
- real service-path replay now accepted for:
- changed-address restart
- stale epoch / stale session invalidation
- unrecoverable-gap / needs-rebuild replay
- explicit post-checkpoint boundary replay under the interim model
- accepted with scoped limitation:
- real `v2bridge` WAL-scan execution is proven
- full integrated engine-driven catch-up semantics are not overclaimed under interim `CommittedLSN = CheckpointLSN`
- control-plane delivery remains simulated via direct `AssignmentIntent` construction
- carry-forward remains explicit:
- post-checkpoint catch-up semantics are still narrower than final V2 intent
### P3: Product-Path Decision
1. choose the first product path for V2
2. define what remains before pre-production hardening
3. record what is still intentionally deferred
Status:
- accepted
- first product path chosen:
- `RF=2`
- `sync_all`
- existing master / volume-server heartbeat path
- V2 engine owns recovery policy
- `v2bridge` provides real storage truth
- proposal is evidence-grounded and explicitly bounded by accepted `P0/P1/P2` evidence
- pre-hardening prerequisites are explicit:
- real master control delivery
- full integrated engine -> executor -> `v2bridge` catch-up chain
- separation of committed truth from checkpoint truth
- rebuild execution (`snapshot` / `full-base` / `truncation`)
- pinner / flusher behavior under concurrent load
- intentionally deferred:
- `RF>2`
- Smart WAL optimizations
- `best_effort` background recovery
- performance tuning
- full V1 replacement
- non-claims remain explicit:
- not production-ready
- no end-to-end rebuild proof yet
- no general post-checkpoint catch-up proof
- no real master heartbeat/control delivery proof yet
- no full integrated engine -> executor -> `v2bridge` catch-up proof yet
## Guardrails
### Guardrail 1: Do not re-import V1 structure as the design owner
Use `weed/storage/block*` and `learn/projects/sw-block/` as constraints and validation sources, not as the architecture template.
### Guardrail 2: Keep catch-up narrow and rebuild explicit
Do not use integration work as an excuse to widen catch-up semantics or blur rebuild as the formal recovery path.
### Guardrail 3: Prefer real entry paths over test-only wrappers
The integrated slice should exercise real service boundaries, not only internal engine helpers.
### Guardrail 4: Observability must explain causality
Integrated logs/status must explain:
1. why rebuild was required
2. why proof was rejected
3. why execution was cancelled or invalidated
4. why a product-path integration failed
### Guardrail 5: Stable identity must not collapse back to address shape
For the first slice, `ReplicaID` must be derived from master/block-registry identity, not current endpoint addresses.
### Guardrail 6: `blockvol` executes I/O but does not own recovery policy
The service bridge may translate engine decisions into concrete blockvol actions, but it must not re-decide:
1. zero-gap / catch-up / rebuild
2. trusted-base validity
3. replayable-tail sufficiency
4. rebuild fallback requirement
## Exit Criteria
Phase 07 is done when:
1. one real-system service slice is integrated with the engine
2. selected real-system failure classes are replayed through that slice
3. diagnosability is sufficient for service-slice debugging
4. the first product path is explicitly chosen
5. the remaining work to pre-production hardening is clear
## Assignment For `sw`
Next tasks move to `Phase 08`.
## Assignment For `tester`
Next tasks move to `Phase 08`.

78
sw-block/.private/phase/phase-08-decisions.md

@ -0,0 +1,78 @@
# Phase 08 Decisions
## Decision 1: Phase 08 is pre-production hardening, not protocol rediscovery
The accepted V2 product path from `Phase 07` is the basis.
`Phase 08` should harden that path rather than reopen accepted protocol shape.
## Decision 2: The first hardening priorities are control delivery and execution closure
The most important remaining gaps are:
1. real master/control delivery into the bridge/engine path
2. integrated engine -> executor -> `v2bridge` catch-up execution closure
3. first rebuild execution path for the chosen product path
## Decision 3: Carry-forward limitations remain explicit until closed
Phase 08 must keep explicit:
1. committed truth is still not separated from checkpoint truth
2. rebuild execution is still incomplete
3. current control delivery is still simulated
## Decision 4: Phase 08 P0 is accepted
The hardening plan is sufficiently specified to begin implementation work.
In particular, `P0` now fixes:
1. the committed-truth gate decision requirement
2. the unified replay requirement after control and execution closure
3. the need for at least one real failover / reassignment validation target
## Decision 5: The committed-truth limitation must become a hardening gate
Phase 08 must explicitly decide one of:
1. `CommittedLSN != CheckpointLSN` separation is mandatory before a production-candidate phase
2. the first candidate path is intentionally bounded to the currently proven pre-checkpoint replay behavior
It must not remain only a documented carry-forward.
## Decision 6: Unified-path replay is required after control and execution closure
Once real control delivery and integrated execution closure land, `Phase 08` must replay the accepted failure-class set again on the unified live path.
This prevents independent closure of:
1. control delivery
2. execution closure
without proving that they behave correctly together.
## Decision 7: Real failover / reassignment validation is mandatory for the chosen path
Because the chosen product path depends on the existing master / volume-server heartbeat path, at least one real failover / promotion / reassignment cycle must be a named hardening target in `Phase 08`.
## Decision 8: Phase 08 should reuse the existing Seaweed control/runtime path, not invent a new one
For the first hardening path, implementation should preferentially reuse:
1. existing master / heartbeat / assignment delivery
2. existing volume-server assignment receive/apply path
3. existing `blockvol` runtime and `v2bridge` storage/runtime hooks
This reuse is about:
1. control-plane reality
2. storage/runtime reality
3. execution-path reality
It is not permission to inherit old policy semantics as V2 truth.
The hard rule remains:
1. engine owns recovery policy
2. bridge translates confirmed control/storage truth
3. `blockvol` executes I/O

21
sw-block/.private/phase/phase-08-log.md

@ -0,0 +1,21 @@
# Phase 08 Log
## 2026-03-31
### Opened
`Phase 08` opened as:
- pre-production hardening
### Starting basis
1. `Phase 07`: complete
2. first V2 product path chosen
3. remaining gaps are integration and hardening gaps, not protocol-discovery gaps
### Next
1. Phase 08 P0 accepted
2. Phase 08 P1 real master/control delivery integration
3. Phase 08 P2 integrated execution closure

254
sw-block/.private/phase/phase-08.md

@ -0,0 +1,254 @@
# Phase 08
Date: 2026-03-31
Status: active
Purpose: convert the accepted Phase 07 product path into a pre-production-hardening program without reopening accepted V2 protocol shape
## Why This Phase Exists
`Phase 07` completed:
1. a real service-slice integration around the V2 engine
2. real storage-truth bridge evidence through `v2bridge`
3. selected real-system failure replay
4. the first explicit product-path decision
What still does not exist is a pre-production-ready system path. The remaining work is no longer protocol discovery. It is closing the operational and integration gaps between the accepted product path and a hardened deployment candidate.
## Phase Goal
Harden the first accepted V2 product path until the remaining gap to a production candidate is explicit, bounded, and implementation-driven.
## Scope
### In scope
1. real master/control delivery into the engine service path
2. integrated engine -> executor -> `v2bridge` execution closure
3. rebuild execution closure for the accepted product path
4. operational/debuggability hardening
5. concurrency/load validation around retention and recovery
### Out of scope
1. new protocol redesign
2. `RF>2` coordination
3. Smart WAL optimization work
4. broad performance tuning beyond validation needed for hardening
5. full V1 replacement rollout
## Phase 08 Items
### P0: Hardening Plan
1. convert the accepted `Phase 07` product path into a hardening plan
2. define the minimum pre-production gates
3. order the remaining integration closures by risk
4. make an explicit gate decision on committed truth vs checkpoint truth:
- either separate `CommittedLSN` from `CheckpointLSN` before a production-candidate phase
- or explicitly bound the first candidate path to the currently proven pre-checkpoint replay behavior
Status:
- planning package accepted in this phase doc
- first hardening priorities are fixed as:
- real master/control delivery
- integrated engine -> executor -> `v2bridge` catch-up execution chain
- first rebuild execution path
- the committed-truth carry-forward is now a required hardening gate, not just a note:
- either separate `CommittedLSN` from `CheckpointLSN` before a production-candidate phase
- or explicitly bound the first candidate path to the currently proven pre-checkpoint replay behavior
- at least one real failover / promotion / reassignment cycle is a required hardening target
- once `P1` and `P2` land, the accepted failure-class set must be replayed again on the newly unified live path
- the validation oracle for `Phase 08` is expected to reject overclaiming around:
- catch-up semantics
- rebuild execution
- master/control delivery
- candidate-path readiness vs production readiness
- accepted
### P1: Real Control Delivery
1. connect real master/heartbeat assignment delivery into the bridge
2. replace direct `AssignmentIntent` construction for the first live path
3. preserve stable identity and fenced authority through the real control path
4. include at least one real failover / promotion / reassignment validation target on the chosen `sync_all` path
Technical focus:
- keep the control-path split explicit:
- master confirms assignment / epoch / role
- bridge translates confirmed control truth into engine intent
- engine owns sender/session/recovery policy
- `blockvol` does not re-decide recovery policy
- preserve the identity rule through the live path:
    - `ReplicaID = <volume-name>/<server-id>`
- endpoint change updates location but must not recreate logical identity
- preserve the fencing rule through the live path:
- stale epoch must invalidate old authority
- stale session must not mutate current lineage
- address change must invalidate the old live session before the new path proceeds
- treat failover / promotion / reassignment as control-truth events first, not storage-side heuristics
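The fencing rules above can be sketched as one guard at the intent-apply boundary. The type and method names here (`Replica`, `Apply`) are hypothetical; the real path fences inside the bridge/engine session machinery.

```go
package main

import "fmt"

// Replica holds the current fenced authority: an epoch and the live
// session bound to that epoch.
type Replica struct {
	Epoch   int64
	Session string
}

// Apply enforces the fence: an intent carrying an older epoch is
// rejected so stale authority cannot mutate the current lineage, and a
// newer epoch invalidates the old session before the new path proceeds.
func (r *Replica) Apply(intentEpoch int64, session string) error {
	if intentEpoch < r.Epoch {
		return fmt.Errorf("fenced: intent epoch %d < current epoch %d", intentEpoch, r.Epoch)
	}
	if intentEpoch > r.Epoch {
		// New control truth: replace authority and session atomically.
		r.Epoch = intentEpoch
		r.Session = session
	}
	return nil
}

func main() {
	r := &Replica{Epoch: 3, Session: "s3"}
	fmt.Println(r.Apply(4, "s4")) // accepted: new epoch, session replaced
	fmt.Println(r.Apply(3, "s3")) // fenced: stale epoch rejected
}
```

Note the asymmetry: a higher epoch is a control-truth event that replaces the session, while a lower epoch is always an error, never a silent no-op.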
Implementation route (`reuse map`):
- reuse directly as the first hardening carrier:
- `weed/server/master_grpc_server.go`
- `weed/server/volume_grpc_client_to_master.go`
- `weed/server/volume_server_block.go`
- `weed/server/master_block_registry.go`
- `weed/server/master_block_failover.go`
- reuse as storage/runtime execution reality:
- `weed/storage/blockvol/blockvol.go`
- `weed/storage/blockvol/replica_apply.go`
- `weed/storage/blockvol/replica_barrier.go`
- `weed/storage/blockvol/v2bridge/`
- preserve the V2 boundary while reusing these files:
- reuse transport/control/runtime reality
- do not inherit old policy semantics as V2 truth
- keep engine as the recovery-policy owner
- keep `blockvol` as the I/O executor
Expectation note:
- the `P1` tester expectation is already embedded in this phase doc under:
- `P1 / Validation focus`
- `P1 / Reject if`
- do not grow a separate long template unless `P1` scope expands materially
Validation focus:
- prove live assignment delivery into the bridge/engine path
- prove stable `ReplicaID` across address refresh on the live path
- prove stale epoch / stale session invalidation through the live path
- prove at least one real failover / promotion / reassignment cycle on the chosen `sync_all` path
- prove the resulting logs explain:
- why reassignment happened
- why a session was invalidated
- which epoch / identity / endpoint drove the transition
Reject if:
- address-shaped identity reappears anywhere in the control path
- bridge starts re-deriving catch-up vs rebuild policy from convenience inputs
- old epoch or old session can still mutate after the new control truth arrives
- failover / reassignment is claimed without a real replay target
- delivery claims general production readiness rather than control-path closure
### P2: Execution Closure
1. close the live engine -> executor -> `v2bridge` execution chain
2. make catch-up execution evidence integrated rather than split across layers
3. close the first rebuild execution path required by the product path
### P3: Hardening Validation
1. validate diagnosability under the live integrated path
2. validate retention/pinner behavior under concurrent load
3. replay the accepted failure-class set again on the newly unified live path after `P1` and `P2` land
4. confirm the remaining gap to a production candidate
Validation focus:
- prove the chosen path through a real control-delivery path
- prove the live engine -> executor -> `v2bridge` execution chain as one path, not split evidence
- prove the first rebuild execution path required by the chosen product path
- prove at least one real failover / promotion / reassignment cycle
- prove concurrent retention/pinner behavior does not break recovery guarantees
Reject if:
- catch-up semantics are overclaimed beyond the currently proven boundary
- rebuild is claimed as supported without real execution closure
- master/control delivery is claimed as real without the live path in place
- `CommittedLSN` vs `CheckpointLSN` remains an unclassified note instead of a gate decision
- `P1` and `P2` land independently but the accepted failure-class set is not replayed again on the unified live path
## Guardrails
### Guardrail 1: Do not reopen accepted V2 protocol truths casually
`Phase 08` is a hardening phase. New work should preserve the accepted protocol truth set unless a real contradiction is demonstrated.
### Guardrail 2: Keep product-path claims evidence-bound
Do not claim more than the hardened path actually proves. Distinguish:
1. live integrated path
2. hardened product path
3. production candidate
### Guardrail 3: Identity and policy boundaries remain hard rules
1. `ReplicaID` must remain stable and never collapse to address shape
2. engine decides recovery policy
3. bridge translates intent/state
4. `blockvol` executes I/O only
### Guardrail 4: Carry-forward limitations must remain explicit until closed
Especially:
1. committed truth vs checkpoint truth
2. rebuild execution coverage
3. real master/control delivery coverage
### Guardrail 5: The committed-truth carry-forward must become a gate, not a note
Before the next phase, `Phase 08` must decide one of:
1. committed-truth separation is mandatory before a production-candidate phase
2. the first candidate path is intentionally bounded to the currently proven pre-checkpoint replay behavior
It must not remain an unclassified carry-forward.
## Exit Criteria
Phase 08 is done when:
1. the first product path runs through a real control delivery path
2. the critical execution chain is integrated and validated
3. rebuild execution for the chosen path is no longer just detected but executed
4. at least one real failover / reassignment cycle is replayed through the live control path
5. the accepted failure-class set is replayed again on the unified live path
6. operational/debug evidence is sufficient for pre-production use
7. the remaining gap to a production candidate is small and explicit
## Assignment For `sw`
Next tasks:
1. drive `Phase 08 P1` as real master/control delivery integration
2. replace direct `AssignmentIntent` construction for the first live path
3. preserve through the real control path:
- stable `ReplicaID`
- epoch fencing
- address-change invalidation
4. include at least one real failover / promotion / reassignment validation target
5. keep acceptance claims scoped:
- real control delivery path
- not yet general production readiness
6. keep explicit carry-forwards:
- `CommittedLSN != CheckpointLSN` still unresolved
- integrated catch-up execution chain still incomplete
- rebuild execution still incomplete
## Assignment For `tester`
Next tasks:
1. use the accepted `Phase 08` plan framing as the `P1` validation oracle
2. validate real control delivery for:
- live assignment delivery
- stable identity through the control path
- stale epoch/session invalidation
- at least one real failover / reassignment cycle
3. keep the no-overclaim rule active around:
- catch-up semantics
- rebuild execution
- master/control delivery
4. keep the committed-truth gate explicit:
- still unresolved in `P1`
5. prepare `P2` follow-up expectations for:
- integrated engine -> executor -> `v2bridge` execution closure
- unified replay after `P1` and `P2`

59
sw-block/.private/phase/phase-4.5-decisions.md

@ -0,0 +1,59 @@
# Phase 4.5 Decisions
## Decision 1: Phase 4.5 remains a bounded hardening phase
It is not a new architecture line and must not expand into broad feature work.
Purpose:
1. tighten recovery boundaries
2. strengthen crash-consistency / recoverability proof
3. clear the path for engine planning
## Decision 2: `sw` Phase 4.5 P0 is accepted
Accepted basis:
1. bounded `CatchUp` now changes prototype behavior
2. `FrozenTargetLSN` is intrinsic to the session contract
3. `Rebuild` is a first-class sender-owned execution path
4. rebuild and catch-up are execution-path exclusive
## Decision 3: `tester` crash-consistency simulator strengthening is accepted
Accepted basis:
1. checkpoint semantics are explicit
2. recoverability after restart is no longer collapsed into a single loose watermark
3. crash-consistency invariants are executable and passing
## Decision 4: Remaining Phase 4.5 work is evidence hardening, not primitive-building
Completed focus:
1. `A5-A8` prototype + simulator double evidence
2. predicate exploration for dangerous states
3. adversarial search over crash-consistency / liveness states
Remaining optional work:
4. any low-priority cleanup that improves clarity without reopening design
## Decision 5: After Phase 4.5, the project should move to engine-planning readiness review
Unless new blocking flaws appear, the next major decision after `4.5` should be:
1. real V2 engine planning
2. engine slicing plan
not another broad prototype phase
## Decision 6: Phase 4.5 is complete
Reason:
1. bounded `CatchUp` is semantic in the prototype
2. `Rebuild` is first-class in the prototype
3. crash-consistency / restart-recoverability are materially stronger in the simulator
4. `A5-A8` evidence is materially stronger on both prototype and simulator sides
5. adversarial search found and helped fix a real correctness bug, validating the proof style

33
sw-block/.private/phase/phase-4.5-log.md

@ -0,0 +1,33 @@
# Phase 4.5 Log
## 2026-03-29
### Accepted
1. `sw` `Phase 4.5 P0`
- bounded `CatchUp` budget is semantic in `enginev2`
- `FrozenTargetLSN` is a real session invariant
- `Rebuild` is wired into sender execution and is exclusive from catch-up
- rebuild completion goes through `CompleteRebuild`, not generic session completion
2. `tester` crash-consistency simulator strengthening
- storage-state split introduced and accepted
- checkpoint/restart boundary made explicit
- recoverability upgraded from watermark-style logic to checkpoint + contiguous WAL replayability proof
- core invariant tests for crash consistency now pass
3. `tester` evidence hardening and adversarial exploration
- grouped simulator evidence for `A5-A8`
- danger predicates added
- adversarial search added and passing
- adversarial search found a real `StateAt(lsn)` historical-state bug
- `StateAt(lsn)` corrected so newer checkpoint/base state does not leak into older historical queries
4. `Phase 4.5` closeout judgment
- prototype and simulator evidence are now strong enough to stop expanding `4.5`
- next major step should move to engine-readiness review and engine slicing
### Remaining open work
1. low-priority cleanup
- remove or consolidate redundant frozen-target bookkeeping if no longer needed

397
sw-block/.private/phase/phase-4.5-reason.md

@ -0,0 +1,397 @@
# Phase 4.5 Reason
Date: 2026-03-27
Status: proposal for dev manager decision
Purpose: explain why a narrow V2 fine-tuning step should follow the main Phase 04 slice, without reopening the core ownership/fencing direction
## 1. Why This Note Exists
`Phase 04` has already produced strong progress on the first standalone V2 slice:
- per-replica sender identity
- one active recovery session per replica per epoch
- endpoint / epoch invalidation
- sender-owned execution APIs
- explicit recovery outcome branching
- minimal historical-data prototype
This is good progress and should continue.
However, recent review and discussion show that the next risk is no longer:
- ownership ambiguity
- stale completion acceptance
- scattered local recovery authority
The next risk is different:
- `CatchUp` may become too broad, too long-lived, and too resource-heavy
- simulator proof is still weaker than desired on crash-consistency and recoverability boundaries
- the project may accidentally carry V1.5-style "keep trying to catch up" assumptions into V2 engine work
So this note proposes:
- **do not interrupt the main Phase 04 work**
- **do not reopen core V2 ownership/fencing architecture**
- **add a narrow fine-tuning step immediately after Phase 04 main closure**
This note is for the dev manager to decide implementation sequencing.
## 2. Current Basis
This proposal is grounded in the following current documents:
- `sw-block/.private/phase/phase-04.md`
- `sw-block/design/v2-prototype-roadmap-and-gates.md`
- `sw-block/design/v2-acceptance-criteria.md`
- `sw-block/design/v2-detailed-algorithm.zh.md`
In particular:
- `phase-04.md` shows that Phase 04 is correctly centered on sender/session ownership and recovery execution authority
- `v2-prototype-roadmap-and-gates.md` shows that design proof is high, but data/recovery proof and prototype end-to-end proof are still low
- `v2-acceptance-criteria.md` already requires stronger proof for:
- `A5` non-convergent catch-up escalation
- `A6` explicit recoverability boundary
- `A7` historical correctness
- `A8` durability-mode correctness
- `v2-detailed-algorithm.zh.md` Section 17 now argues for a direction tightening:
- keep the V2 core
- narrow `CatchUp`
- elevate `Rebuild`
- defer higher-complexity expansion
## 3. Main Judgment
### 3.1 What should NOT change
The following V2 core should remain stable:
- `CommittedLSN` as the external safe boundary
- durable progress as sync truth
- one sender per replica
- one active recovery session per replica per epoch
- stale epoch / stale endpoint / stale session fencing
- explicit `ZeroGap / CatchUp / NeedsRebuild`
This is the architecture that most clearly separates V2 from V1.5.
### 3.2 What SHOULD be fine-tuned
The following should be tightened before engine planning:
1. `CatchUp` should be narrowed to a short-gap, bounded, budgeted path
2. `Rebuild` should be treated as a formal primary recovery path, not merely a shameful fallback
3. `recover -> keepup` handoff should be made more explicit
4. simulator should prove recoverability and crash-consistency more directly
## 4. Algorithm Thinking Behind The Fine-Tune
This section summarizes the reasoning already captured in:
- `sw-block/design/v2-detailed-algorithm.zh.md`
Especially Section 17:
- `V2` is still the right direction
- but V2 should be tightened from:
- "make WAL recovery increasingly smart"
- to:
- "make block truth boundaries hard, keep `CatchUp` cheap and bounded, and use formal `Rebuild` when recovery becomes too complex"
### 4.1 First-principles view
From block first principles, the hardest truths are:
1. when `write` becomes real
2. what `flush/fsync ACK` truly promises
3. whether acknowledged boundaries survive failover
4. how replicas rejoin without corrupting lineage
These are more fundamental than:
- volume product shape
- control-plane surface
- recovery cleverness for its own sake
So the project should optimize for:
- clearer truth boundaries
- not for maximal catch-up cleverness
### 4.2 Mayastor-style product insight
The useful first-principles lesson from Mayastor-like product thinking is:
- not every lagging replica is worth indefinite low-cost chase
- `Rebuild` can be a formal product path, not a shameful fallback
- block products benefit from explicit lifecycle objects and formal rebuild flow
This does NOT replace the V2 core concerns:
- `flush ACK` truth
- committed-prefix failover safety
- stale authority fencing
But it does suggest a correction:
- do not let `CatchUp` become an over-smart general answer to all recovery
### 4.3 Proposed V2 fine-tuned interpretation
The fine-tuned interpretation of V2 should be:
- `CatchUp` is for short-gap, clearly recoverable, bounded recovery
- `Rebuild` is for long-gap, high-cost, unstable, or non-convergent recovery
- recovery session is a bounded contract, not a long-running rescue thread
- `> H0` live WAL must not silently turn one recovery session into an endless chase
## 5. Specific Fine-Tune Adjustments
### 5.1 Narrow `CatchUp`
`CatchUp` should explicitly require:
- short outage
- bounded target `H0`
- clear recoverability
- bounded reservation
- bounded time
- bounded resource cost
- bounded convergence expectation
`CatchUp` should explicitly stop when:
- target drifts too long without convergence
- replay progress stalls
- recoverability proof is lost
- retention cost becomes unreasonable
- session budget expires
### 5.2 Elevate `Rebuild`
`Rebuild` should be treated as a first-class path when:
- lag is too large
- catch-up does not converge
- recoverability is no longer stable
- complexity of continued catch-up exceeds its product value
The intended model becomes:
- short gap -> `CatchUp`
- long gap / unstable / non-convergent -> `Rebuild`
This should be interpreted more strictly than a simple routing rule:
- `CatchUp` is not a general recovery framework
- `CatchUp` is a relaxed form of `KeepUp`
- it should stay limited to short-gap, bounded, clearly recoverable WAL replay
- it only makes sense while the replica's current base is still trustworthy enough to continue from
By contrast:
- `Rebuild` is the more general recovery framework
- it restores the replica from a trusted base toward a frozen target boundary
- `full rebuild` and `partial rebuild` are not different protocols; they are different base/transfer choices under the same rebuild contract
So the intended product shape is:
- use `CatchUp` when replay debt is small and clearly cheaper than rebuild
- use `Rebuild` when correctness, boundedness, or product simplicity would otherwise be compromised
And the correctness anchor for both `full` and `partial` rebuild should remain explicit:
- freeze `TargetLSN`
- pin the snapshot/base used for recovery
- only then optimize transfer volume using `snapshot + tail`, `bitmap`, or similar mechanisms
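That ordering (freeze the target, pin the base, only then optimize transfer) can be expressed as a small contract constructor. The type is a sketch under assumed names, not the prototype's real rebuild object:

```go
package main

import "fmt"

// Illustrative rebuild contract; real enginev2 types differ.
type RebuildContract struct {
	FrozenTargetLSN uint64 // frozen first: the boundary this rebuild must reach
	PinnedBaseLSN   uint64 // pinned second: the snapshot/base recovery starts from
	Transfer        string // chosen last: "snapshot+tail", "bitmap", "full", ...
}

// newRebuildContract enforces the ordering: the frozen target and the
// pinned base are fixed before any transfer optimization is selected,
// and the base can never sit ahead of the target.
func newRebuildContract(targetLSN, baseLSN uint64, transfer string) (*RebuildContract, error) {
	if baseLSN > targetLSN {
		return nil, fmt.Errorf("pinned base %d ahead of frozen target %d", baseLSN, targetLSN)
	}
	return &RebuildContract{
		FrozenTargetLSN: targetLSN,
		PinnedBaseLSN:   baseLSN,
		Transfer:        transfer,
	}, nil
}

func main() {
	c, err := newRebuildContract(100, 80, "snapshot+tail")
	fmt.Println(c, err)
}
```

Under this contract, `full` and `partial` rebuild differ only in the `PinnedBaseLSN`/`Transfer` choice, which matches the claim above that they are not different protocols.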
### 5.3 Clarify `recover -> keepup` handoff
Phase 04 already aims to prove a clean handoff between normal sender and recovery session.
The fine-tune should make the next step more explicit:
- one recovery session only owns `(R, H0]`
- session completion releases recovery debt
- replica should not silently stay in "quasi-recovery"
- re-entry to `KeepUp` / `InSync` should remain explicit, ideally with `PromotionHold` or equivalent stabilization logic
### 5.4 Keep Smart WAL deferred
No fine-tune should broaden Smart WAL scope at this point.
Reason:
- Smart WAL multiplies recoverability, GC, payload-availability, and reservation complexity
- the current priority is to harden the simpler V2 replication contract first
So the rule remains:
- no Smart WAL expansion beyond what minimal proof work might later require
## 6. Simulation Strengthening Requirements
This is the highest-value part of the fine-tune.
Current simulator strength is already good on:
- epoch fencing
- stale traffic rejection
- promotion candidate rules
- ownership / session invalidation
- basic `CatchUp / NeedsRebuild` classification
Current simulator weakness is still significant on:
- crash-consistency around extent / checkpoint / replay boundaries
- `ACK` boundary versus recoverable boundary
- `CatchUp` liveness / convergence
### 6.1 Required new modeling direction
The simulator should stop collapsing these states together:
- received but not durable
- WAL durable but not yet fully materialized
- extent-visible but not yet checkpoint-safe
- checkpoint-safe base image
- restart-recoverable read state
Suggested explicit storage-state split:
- `ReceivedLSN`
- `WALDurableLSN`
- `ExtentAppliedLSN`
- `CheckpointLSN`
- `RecoverableLSNAfterRestart`
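A minimal sketch of that split, with the ordering constraints that make it meaningful (field names mirror the proposal; the simulator's real types may differ):

```go
package main

import "fmt"

// One replica's storage truth kept as separate watermarks, so ACK
// truth, visible-state truth, and crash-restart truth cannot collapse
// into one number.
type StorageState struct {
	ReceivedLSN                uint64 // arrived at the replica, not necessarily durable
	WALDurableLSN              uint64 // fsync'd into the WAL
	ExtentAppliedLSN           uint64 // materialized into extents (may run ahead of checkpoint)
	CheckpointLSN              uint64 // covered by a checkpoint-safe base image
	RecoverableLSNAfterRestart uint64 // provable replay boundary after a crash+restart
}

// wellFormed checks the orderings that must always hold: durability
// cannot exceed receipt, checkpoints cannot exceed applied state, and
// restart recoverability sits between the checkpoint base and the
// durable WAL frontier.
func (s StorageState) wellFormed() bool {
	return s.WALDurableLSN <= s.ReceivedLSN &&
		s.CheckpointLSN <= s.ExtentAppliedLSN &&
		s.CheckpointLSN <= s.RecoverableLSNAfterRestart &&
		s.RecoverableLSNAfterRestart <= s.WALDurableLSN
}

func main() {
	s := StorageState{
		ReceivedLSN:                10,
		WALDurableLSN:              8,
		ExtentAppliedLSN:           7,
		CheckpointLSN:              5,
		RecoverableLSNAfterRestart: 8,
	}
	fmt.Println(s.wellFormed())
}
```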
### 6.2 Required new invariants
The simulator should explicitly check at least:
1. `AckedFlushLSN <= RecoverableLSNAfterRestart`
2. visible state must have recoverable backing
3. `CatchUp` cannot remain non-convergent indefinitely
4. promotion candidate must still possess recoverable committed prefix
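The four invariants above can be stated as pure predicates over the relevant watermarks; the parameter names below are illustrative, not the simulator's API:

```go
package main

import (
	"errors"
	"fmt"
)

// checkCrashInvariants returns the first violated invariant, or nil.
func checkCrashInvariants(
	ackedFlushLSN, visibleLSN, committedPrefixLSN, recoverableAfterRestart uint64,
	catchUpRoundsWithoutProgress, maxStalledRounds int,
) error {
	// 1. an ACK'd flush must survive a crash+restart
	if ackedFlushLSN > recoverableAfterRestart {
		return errors.New("acked flush beyond restart-recoverable boundary")
	}
	// 2. visible state must have recoverable backing
	if visibleLSN > recoverableAfterRestart {
		return errors.New("visible state lacks recoverable backing")
	}
	// 3. catch-up cannot remain non-convergent indefinitely
	if catchUpRoundsWithoutProgress > maxStalledRounds {
		return errors.New("catch-up stalled past budget without escalation")
	}
	// 4. a promotion candidate must still possess the recoverable committed prefix
	if committedPrefixLSN > recoverableAfterRestart {
		return errors.New("promotion candidate lost recoverable committed prefix")
	}
	return nil
}

func main() {
	fmt.Println(checkCrashInvariants(8, 8, 8, 10, 0, 100))
}
```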
### 6.3 Required new scenario classes
Priority scenarios to add:
1. `ExtentAheadOfCheckpoint_CrashRestart_ReadBoundary`
2. `AckedFlush_MustBeRecoverableAfterCrash`
3. `UnackedVisibleExtent_MustNotSurviveAsCommittedTruth`
4. `CatchUpChasingMovingHead_EscalatesOrConverges`
5. `CheckpointGCBreaksRecoveryProof`
### 6.4 Required simulator style upgrade
The simulator should move beyond only hand-authored examples and also support:
- dangerous-state predicates
- adversarial random exploration guided by those predicates
Examples:
- `acked_flush_lost`
- `extent_exposes_unrecoverable_state`
- `catchup_livelock`
- `rebuild_required_but_not_escalated`
## 7. Relationship To Acceptance Criteria
This fine-tune is not a separate architecture line.
It is mainly intended to make the project satisfy the existing acceptance set more convincingly:
- `A5` explicit escalation from non-convergent catch-up
- `A6` recoverability boundary as a real rule, not hopeful policy
- `A7` historical correctness against snapshot + tail rebuild
- `A8` strict durability mode semantics
So this fine-tune is a strengthening of the current V2 proof path, not a new branch.
## 8. Recommended Sequencing
### Option A: pause Phase 04 and reopen design now
Not recommended.
Why:
- Phase 04 has strong momentum
- its core ownership/fencing work is correct
- pausing it now would blur scope and waste recent closure
### Option B: finish Phase 04, then add a narrow `4.5`
Recommended.
Why:
- Phase 04 can finish its intended ownership / orchestration / minimal-history closure
- `4.5` can then tighten recovery strategy without destabilizing the slice
- the project avoids carrying "too-smart catch-up" assumptions into later engine planning
Recommended sequence:
1. finish Phase 04 main closure
2. immediately start `Phase 4.5`
3. use `4.5` to tighten:
- bounded `CatchUp`
- formal `Rebuild`
- crash-consistency and recoverability simulator proof
4. then re-evaluate Gate 4 / Gate 5
## 9. Scope Of A Possible Phase 4.5
If the dev manager chooses to implement a `4.5` step, its scope should be:
### In scope
- tighten algorithm wording and boundaries from `v2-detailed-algorithm.zh.md`
- formalize bounded `CatchUp`
- formalize `Rebuild` as first-class path
- strengthen simulator state model and invariants
- add targeted crash-consistency and liveness scenarios
- improve prototype traceability against `A5-A8`
### Out of scope
- Smart WAL expansion
- real storage engine redesign
- V1 production integration
- frontend/wire protocol
- performance optimization as primary goal
## 10. Decision Requested From Dev Manager
Please decide:
1. whether `Phase 04` should continue to normal closure without interruption
2. whether a narrow `Phase 4.5` should immediately follow
3. whether the simulator strengthening work should be treated as mandatory for Gate 4 / Gate 5 credibility
Recommended decision:
- **Yes**: finish `Phase 04`
- **Yes**: add `Phase 4.5` as a bounded fine-tuning step
- **Yes**: treat crash-consistency / recoverability / liveness simulator strengthening as required, not optional
## 11. Bottom Line
The project does not need a new direction.
It needs:
- a slightly tighter interpretation of V2
- a stronger recoverability/crash-consistency simulator
- a clearer willingness to use formal `Rebuild` instead of over-extending `CatchUp`
So the practical recommendation is:
- **keep the V2 core**
- **finish Phase 04**
- **add a narrow Phase 4.5**
- **strengthen simulator proof before engine planning**

356
sw-block/.private/phase/phase-4.5.md

@ -0,0 +1,356 @@
# Phase 4.5
Date: 2026-03-29
Status: complete
Purpose: harden Gate 4 / Gate 5 credibility after Phase 04 by tightening bounded `CatchUp`, elevating `Rebuild` as a first-class path, and strengthening crash-consistency / recoverability proof
## Related Plan
Strategic phase:
- `sw-block/.private/phase/phase-4.5.md`
Simulator implementation plan:
- `learn/projects/sw-block/design/phase-05-crash-consistency-simulation.md`
Use them together:
- `Phase 4.5` defines the gate-hardening purpose and priorities
- `phase-05-crash-consistency-simulation.md` is the detailed simulator implementation plan
## Why This Phase Exists
Phase 04 has already established:
1. per-replica sender identity
2. one active recovery session per replica per epoch
3. stale authority fencing
4. sender-owned execution APIs
5. assignment-intent orchestration
6. minimal historical-data prototype
7. prototype scenario closure
The next risk is no longer ownership structure.
The next risk is:
1. `CatchUp` becoming too broad, too long-lived, or too optimistic
2. `Rebuild` remaining underspecified even though it will likely become a common path
3. simulator proof still being weaker than desired on crash-consistency and restart-recoverability
So `Phase 4.5` exists to harden the decision gate before real engine planning.
## Relationship To Phase 04
`Phase 4.5` is not a new architecture line.
It is a narrow hardening step after normal Phase 04 closure.
It should:
- keep the V2 core
- not reopen sender/session ownership architecture
- strengthen recovery boundaries and proof quality
## Main Questions
1. how narrow should `CatchUp` be?
2. when must recovery escalate to `Rebuild`?
3. what exactly is the `Rebuild` source of truth?
4. what does restart-recoverable / crash-consistent state mean in the simulator?
## Core Decisions To Drive
### 1. Bounded CatchUp
`CatchUp` should be explicitly bounded by:
1. target range
2. retention proof
3. time budget
4. progress budget
5. resource budget
It should stop and escalate when:
1. target drifts too long
2. progress stalls
3. recoverability proof is lost
4. retention cost becomes unreasonable
5. session budget expires
### 2. Rebuild Is First-Class
`Rebuild` is not an embarrassing fallback path.
It is the formal path for:
1. long gap
2. unstable recoverability
3. non-convergent catch-up
4. excessive replay cost
5. restart-recoverability uncertainty
### 3. Rebuild Source Model
To address the concern that tightening `CatchUp` makes `Rebuild` too dominant:
`Rebuild` should be split conceptually into two modes:
1. **Snapshot + Tail**
- preferred path
- use a dated but internally consistent base snapshot/checkpoint
- then apply the retained WAL tail up to the committed recovery boundary
2. **Full Base Rebuild**
- fallback path
- used when no acceptable snapshot/base image exists
- more expensive and slower
Decision boundary:
- use `Snapshot + Tail` when a trusted snapshot/checkpoint/base exists that covers the required base state
- use `Full Base Rebuild` when no such trusted base exists
So "rebuild" should not mean only:
- copy everything from scratch
It should usually mean:
- re-establish a trustworthy base image
- then catch up from that base to the committed boundary
This keeps `Rebuild` practical even if `CatchUp` becomes narrower.
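The decision boundary between the two modes can be sketched as a single function: snapshot + tail is viable only when a trusted base exists and the retained WAL bridges from that base to the frozen target. Parameter names are illustrative:

```go
package main

import "fmt"

type RebuildMode int

const (
	SnapshotPlusTail RebuildMode = iota
	FullBaseRebuild
)

func (m RebuildMode) String() string {
	if m == SnapshotPlusTail {
		return "snapshot+tail"
	}
	return "full-base"
}

// chooseRebuildMode prefers snapshot+tail when a trusted base exists
// and retained WAL covers the gap from that base to the frozen target;
// otherwise it falls back to a full base rebuild.
func chooseRebuildMode(haveTrustedBase bool, baseLSN, oldestRetainedWAL, frozenTargetLSN uint64) RebuildMode {
	if !haveTrustedBase {
		return FullBaseRebuild
	}
	tailCovered := oldestRetainedWAL <= baseLSN+1 // WAL reaches back to the base
	if tailCovered && baseLSN <= frozenTargetLSN {
		return SnapshotPlusTail
	}
	return FullBaseRebuild
}

func main() {
	fmt.Println(chooseRebuildMode(true, 80, 60, 100)) // tail covers (80,100]
	fmt.Println(chooseRebuildMode(true, 40, 60, 100)) // WAL gap after base
	fmt.Println(chooseRebuildMode(false, 0, 0, 100))  // no trusted base
}
```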
### 4. Safe Recovery Truth
The simulator should explicitly separate:
1. `ReceivedLSN`
2. `WALDurableLSN`
3. `ExtentAppliedLSN`
4. `CheckpointLSN`
5. `RecoverableLSNAfterRestart`
This is needed so that:
- `ACK` truth
- visible-state truth
- crash-restart truth
do not collapse into one number.
## Priority
### P0
1. document bounded `CatchUp` rule
2. document `Rebuild` modes:
- snapshot + tail
- full base rebuild
3. define escalation conditions from `CatchUp` to `Rebuild`
Status:
- accepted on both prototype and simulator sides
- prototype: bounded `CatchUp` is semantic, target-frozen, budget-enforced, and rebuild is a sender-owned exclusive path
- simulator: crash-consistency state split, checkpoint-safe restart boundary, and core invariants are in place
### P1
4. strengthen simulator state model with crash-consistency split:
- `ReceivedLSN`
- `WALDurableLSN`
- `ExtentAppliedLSN`
- `CheckpointLSN`
- `RecoverableLSNAfterRestart`
5. add explicit invariants:
- `AckedFlushLSN <= RecoverableLSNAfterRestart`
- visible state must have recoverable backing
- promotion candidate must possess recoverable committed prefix
Status:
- accepted on the simulator side
- remaining work is no longer basic state split; it is stronger traceability and adversarial exploration
### P2
6. add targeted scenarios:
- `ExtentAheadOfCheckpoint_CrashRestart_ReadBoundary`
- `AckedFlush_MustBeRecoverableAfterCrash`
- `UnackedVisibleExtent_MustNotSurviveAsCommittedTruth`
- `CatchUpChasingMovingHead_EscalatesOrConverges`
- `CheckpointGCBreaksRecoveryProof`
Status:
- baseline targeted scenarios accepted
- predicate-guided/adversarial exploration remains open
### P3
7. make prototype traceability stronger for:
- `A5`
- `A6`
- `A7`
- `A8`
8. decide whether Gate 4 / Gate 5 are now credible enough for engine planning
Status:
- partially complete
- Gate 4 / Gate 5 are materially stronger
- remaining work is to make `A5-A8` double evidence more explicit and reviewable
## Scope
### In scope
1. bounded `CatchUp`
2. first-class `Rebuild`
3. snapshot + tail rebuild model
4. crash-consistency simulator state split
5. targeted liveness / recoverability scenarios
### Out of scope
1. Smart WAL expansion
2. V1 production integration
3. backend/storage engine redesign
4. performance optimization as primary goal
5. frontend/wire protocol work
## Exit Criteria
`Phase 4.5` is done when:
1. `CatchUp` budget / escalation rule is explicit in docs and simulator
2. `Rebuild` is explicitly modeled as:
- snapshot + tail preferred
- full base rebuild fallback
3. simulator has explicit crash-consistency state split
4. simulator has targeted crash / liveness scenarios for the listed risks
5. acceptance items `A5-A8` have stronger executable proof, ideally with explicit prototype + simulator evidence pairs
6. we can make a more credible decision on:
- real V2 engine planning
- or `V2.5` correction
## Review Gates
These are explicit review gates for `Phase 4.5`.
### Gate 1: Bounded CatchUp Must Be Semantic
It is not enough to add budget fields in docs or structs.
To count as complete:
1. timeout / budget exceed must force exit
2. moving-head chase must not continue indefinitely
3. escalation to `NeedsRebuild` must be explicit
4. tests must prove those behaviors
### Gate 2: State Split Must Change Decisions
It is not enough to add more state names.
To count as complete, the new crash-consistency state split must materially change:
1. `ACK` legality
2. restart recoverability judgment
3. visible-state legality
4. promotion-candidate legality
### Gate 3: A5-A8 Need Double Evidence
It is not enough for only prototype or only simulator to cover them.
To count as complete, each of:
- `A5`
- `A6`
- `A7`
- `A8`
should have:
1. one prototype-side evidence path
2. one simulator-side evidence path
## Scope Discipline
`Phase 4.5` must remain a bounded gate-hardening phase.
It should stay focused on:
1. tightening boundaries
2. strengthening proof
3. clearing the path for engine planning
It should not turn into a broad new feature-expansion phase.
## Current Status Summary
Accepted now:
1. `sw` `Phase 4.5 P0`
- bounded `CatchUp` is semantic, not documentary
- `FrozenTargetLSN` is a real session invariant
- `Rebuild` is an exclusive sender-owned execution path
2. `tester` crash-consistency simulator strengthening
- checkpoint/restart boundary is explicit
- recoverability is no longer a single collapsed watermark
- core crash-consistency invariants are executable
Open now:
1. low-priority cleanup such as redundant frozen-target bookkeeping fields
Completed since initial approval:
1. `A5-A8` explicit double-evidence traceability materially strengthened
2. predicate exploration / adversarial search added on simulator side
3. crash-consistency random/adversarial search found and helped fix a real `StateAt(lsn)` historical-state bug
## Assignment For `sw`
Focus: prototype/control-path formalization
Completed work:
1. updated prototype traceability for:
- `A5`
- `A6`
- `A7`
- `A8`
2. made rebuild-source decision evidence explicit in prototype tests:
- snapshot + tail chosen only when trusted base exists
- full base chosen when it does not
3. added focused prototype evidence grouping for engine-planning review
Remaining optional cleanup:
4. optionally clean low-priority redundancy:
- `TargetLSNAtStart` if superseded by `FrozenTargetLSN`
## Assignment For `tester`
Focus: simulator/crash-consistency proof
Completed work:
1. wired simulator-side evidence explicitly into acceptance traceability for:
- `A5`
- `A6`
- `A7`
- `A8`
2. added predicate exploration / adversarial search around the new crash-consistency model
3. added danger predicates for major failure classes:
- acked flush lost
- visible unrecoverable state
- catch-up livelock / rebuild-required-but-not-escalated

18
sw-block/design/README.md

@ -1,6 +1,9 @@
# V2 Design
Current WAL V2 design set:
- `v2-algorithm-overview.md`
- `v2-algorithm-overview.zh.md`
- `v2-detailed-algorithm.zh.md`
- `wal-replication-v2.md`
- `wal-replication-v2-state-machine.md`
- `wal-replication-v2-orchestrator.md`
@ -15,12 +18,25 @@ Current WAL V2 design set:
- `v2-open-questions.md`
- `v2-first-slice-session-ownership.md`
- `v2-prototype-roadmap-and-gates.md`
- `v2-engine-readiness-review.md`
- `v2-engine-slicing-plan.md`
- `v2-protocol-truths.md`
- `v2-production-roadmap.md`
- `phase-07-service-slice-plan.md`
- `agent_dev_process.md`
These documents are the working design home for the V2 line.
The original project-level copies under `learn/projects/sw-block/design/` remain as shared references for now.
Execution note:
- active development tracking lives under `../.private/phase/`
- key completed/current phase docs include:
- `../.private/phase/phase-01.md`
- `../.private/phase/phase-02.md`
- `../.private/phase/phase-03.md`
- `../.private/phase/phase-04.md`
- `../.private/phase/phase-4.5.md`
- `../.private/phase/phase-05.md`
- `../.private/phase/phase-06.md`
- `../.private/phase/phase-07.md`

117
sw-block/design/a5-a8-traceability.md

@ -0,0 +1,117 @@
# A5-A8 Acceptance Traceability
Date: 2026-03-29
Status: Phase 4.5 evidence-hardening
## Purpose
Map each acceptance criterion to specific executable evidence.
Two evidence layers:
- **Simulator** (distsim): protocol-level proof
- **Prototype** (enginev2): ownership/session-level proof
---
## A5: Non-Convergent Catch-Up Escalates Explicitly
**Must prove**: tail-chasing or failed catch-up does not pretend success.
**Pass condition**: explicit `CatchingUp → NeedsRebuild` transition.
| Evidence | Test | File | Layer | Status |
|----------|------|------|-------|--------|
| Tail-chasing converges or aborts | `TestS6_TailChasing_ConvergesOrAborts` | `cluster_test.go` | distsim | PASS |
| Tail-chasing non-convergent → NeedsRebuild | `TestS6_TailChasing_NonConvergent_EscalatesToNeedsRebuild` | `phase02_advanced_test.go` | distsim | PASS |
| Catch-up timeout → NeedsRebuild | `TestP03_CatchupTimeout_EscalatesToNeedsRebuild` | `phase03_timeout_test.go` | distsim | PASS |
| Reservation expiry aborts catch-up | `TestReservationExpiryAbortsCatchup` | `cluster_test.go` | distsim | PASS |
| Flapping budget exceeded → NeedsRebuild | `TestP02_S5_FlappingExceedsBudget_EscalatesToNeedsRebuild` | `phase02_advanced_test.go` | distsim | PASS |
| Catch-up converges or escalates (I3) | `TestI3_CatchUpConvergesOrEscalates` | `phase045_crash_test.go` | distsim | PASS |
| Catch-up timeout in enginev2 | `TestE2E_NeedsRebuild_Escalation` | `p2_test.go` | enginev2 | PASS |
**Verdict**: A5 is well-covered. Both simulator and prototype prove explicit escalation. No pretend-success path exists.
---
## A6: Recoverability Boundary Is Explicit
**Must prove**: recoverable vs unrecoverable gap is decided explicitly.
**Pass condition**: recovery aborts when reservation/payload availability is lost; rebuild is explicit fallback.
| Evidence | Test | File | Layer | Status |
|----------|------|------|-------|--------|
| Reservation expiry aborts catch-up | `TestReservationExpiryAbortsCatchup` | `cluster_test.go` | distsim | PASS |
| WAL GC beyond replica → NeedsRebuild | `TestI5_CheckpointGC_PreservesAckedBoundary` | `phase045_crash_test.go` | distsim | PASS |
| Rebuild from snapshot + tail | `TestReplicaRebuildFromSnapshotAndTail` | `cluster_test.go` | distsim | PASS |
| Smart WAL: resolvable → unresolvable | `TestP02_SmartWAL_RecoverableThenUnrecoverable` | `phase02_advanced_test.go` | distsim | PASS |
| Time-varying payload availability | `TestP02_SmartWAL_TimeVaryingAvailability` | `phase02_advanced_test.go` | distsim | PASS |
| RecoverableLSN is replayability proof | `RecoverableLSN()` in `storage.go` | `storage.go` | distsim | Implemented |
| Handshake outcome: NeedsRebuild | `TestExec_HandshakeOutcome_NeedsRebuild_InvalidatesSession` | `execution_test.go` | enginev2 | PASS |
**Verdict**: A6 is covered. Recovery boundary is decided by explicit reservation + recoverability check, not by optimistic assumption. `RecoverableLSN()` verifies contiguous WAL coverage.
---
## A7: Historical Data Correctness Holds
**Must prove**: recovered data for target LSN is historically correct; current extent cannot fake old history.
**Pass condition**: snapshot + tail rebuild matches reference; current-extent reconstruction of old LSN fails correctness.
| Evidence | Test | File | Layer | Status |
|----------|------|------|-------|--------|
| Snapshot + tail matches reference | `TestReplicaRebuildFromSnapshotAndTail` | `cluster_test.go` | distsim | PASS |
| Historical state not reconstructable after GC | `TestA7_HistoricalState_NotReconstructableAfterGC` | `phase045_crash_test.go` | distsim | PASS |
| `CanReconstructAt()` rejects faked history | `CanReconstructAt()` in `storage.go` | `storage.go` | distsim | Implemented |
| Checkpoint does not leak applied state | `TestI2_CheckpointDoesNotLeakAppliedState` | `phase045_crash_test.go` | distsim | PASS |
| Extent-referenced resolvable records | `TestExtentReferencedResolvableRecordsAreRecoverable` | `cluster_test.go` | distsim | PASS |
| Extent-referenced unresolvable → rebuild | `TestExtentReferencedUnresolvableForcesRebuild` | `cluster_test.go` | distsim | PASS |
| ACK'd flush recoverable after crash (I1) | `TestI1_AckedFlush_RecoverableAfterPrimaryCrash` | `phase045_crash_test.go` | distsim | PASS |
**Verdict**: A7 is now covered with the Phase 4.5 crash-consistency additions. The critical gap ("current extent cannot fake old history") is proven by `CanReconstructAt()` + `TestA7_HistoricalState_NotReconstructableAfterGC`.
---
## A8: Durability Mode Semantics Are Correct
**Must prove**: best_effort, sync_all, sync_quorum behave as intended under mixed replica states.
**Pass condition**: sync_all strict, sync_quorum commits only with true durable quorum, invalid topology rejected.
| Evidence | Test | File | Layer | Status |
|----------|------|------|-------|--------|
| sync_quorum continues with one lagging | `TestSyncQuorumContinuesWithOneLaggingReplica` | `cluster_test.go` | distsim | PASS |
| sync_all blocks with one lagging | `TestSyncAllBlocksWithOneLaggingReplica` | `cluster_test.go` | distsim | PASS |
| sync_quorum mixed states | `TestSyncQuorumWithMixedReplicaStates` | `cluster_test.go` | distsim | PASS |
| sync_all mixed states | `TestSyncAllBlocksWithMixedReplicaStates` | `cluster_test.go` | distsim | PASS |
| Barrier timeout: sync_all blocked | `TestP03_BarrierTimeout_SyncAll_Blocked` | `phase03_timeout_test.go` | distsim | PASS |
| Barrier timeout: sync_quorum commits | `TestP03_BarrierTimeout_SyncQuorum_StillCommits` | `phase03_timeout_test.go` | distsim | PASS |
| Promotion uses RecoverableLSN | `EvaluateCandidateEligibility()` | `cluster.go` | distsim | Implemented |
| Promoted replica has committed prefix (I4) | `TestI4_PromotedReplica_HasCommittedPrefix` | `phase045_crash_test.go` | distsim | PASS |
**Verdict**: A8 is well-covered. sync_all is strict (blocks on lagging), sync_quorum uses true durable quorum (not connection count). Promotion now uses `RecoverableLSN()` for committed-prefix check.
---
## Summary
| Criterion | Simulator Evidence | Prototype Evidence | Status |
|-----------|-------------------|-------------------|--------|
| A5 (catch-up escalation) | 6 tests | 1 test | **Strong** |
| A6 (recoverability boundary) | 6 tests + RecoverableLSN() | 1 test | **Strong** |
| A7 (historical correctness) | 7 tests + CanReconstructAt() | — | **Strong** (new in Phase 4.5) |
| A8 (durability modes) | 7 tests + RecoverableLSN() | — | **Strong** |
**Total executable evidence**: 26 simulator tests + 2 prototype tests + 2 new storage methods.
All A5-A8 acceptance criteria have direct test evidence. No criterion depends solely on design-doc claims.
---
## Still Open (Not Blocking)
| Item | Priority | Why not blocking |
|------|----------|-----------------|
| Predicate exploration / adversarial search | P2 | Manual scenarios already cover known failure classes |
| Catch-up convergence under sustained load | P2 | I3 proves escalation; load-rate modeling is optimization |
| A5-A8 in a single grouped runner view | P3 | Traceability doc serves as grouped evidence for now |

---
`sw-block/design/agent_dev_process.md` (new file, 304 lines)
# Agent Development Process
Date: 2026-03-30
Status: active
Purpose: define the working split between `sw`, `tester`, and review/management roles so each phase and slice has a clear delivery path
## Why This Exists
The project is now beyond pure exploration.
The expensive part is no longer only writing code.
The expensive part is:
1. delivery
2. review
3. fixes
4. re-review
So the process must reduce repeated full-stack review and make each role responsible for a distinct layer.
## Roles
### `manager`
Primary role:
- phase/plan owner
Responsibilities:
1. define the phase/slice direction and scope
2. accept the planning package before coding starts
3. decide whether a carry-forward is acceptable or must become a gate
4. perform the final round review for overall logic, omissions, and product-path fit
### `architect`
Primary role:
- plan and technical reviewer
Responsibilities:
1. review the plan before implementation starts
2. tighten algorithm wording, scope edges, and expectation framing
3. review technical correctness during implementation
4. review API/state/resource/fail-closed behavior
5. catch semantic drift, scope drift, and V1/V1.5 leakage
### `sw`
Primary role:
- implementation owner
Responsibilities:
1. implement the accepted slice
2. state changed contracts
3. state fail-closed handling
4. state resources acquired/released
5. state carry-forward items
6. add or update tests
### `tester`
Primary role:
- evidence owner
Responsibilities:
1. define what the slice must prove before implementation starts
2. maintain the failure-class checklist
3. define reject conditions and required test level
4. confirm that implementation claims are actually covered by evidence
## Default Routine
Each slice should follow this order:
1. `manager` defines the plan direction
2. `architect` reviews and tightens the plan / algorithm / expectation framing
3. `tester` writes the expectation template
4. `manager` accepts the package and records it in the phase docs
5. `sw` implements and submits with the delivery template
6. `architect` reviews the technical layer until clean enough
7. `tester` performs validation and evidence closure
8. `manager` performs round-two review for overall logic and omissions
Urgent exception:
- if early work already shows major scope drift, protocol contradiction, or V1/V1.5 leakage, architecture review may short-circuit before implementation grows further
## Delivery Template For `sw`
Each delivery should include:
1. changed contracts
2. fail-closed handling added
3. resources acquired/released
4. test inventory
5. known carry-forward notes
This template is required between implementation and the implementation/fail-closed review.
It should accompany the delivery before reviewers start detailed review.
Suggested format:
```md
Changed contracts:
- ...
Fail-closed handling:
- ...
Resources acquired/released:
- ...
Test inventory:
- ...
Carry-forward notes:
- ...
```
## Phase Doc Usage
Use the three phase documents differently:
### `phase-xx.md`
Use for:
1. current execution direction
2. current scope
3. current guardrails
4. current accepted status
5. current assignments
Keep it short and execution-oriented.
### `phase-xx-log.md`
Use for:
1. detailed planning evolution
2. review feedback
3. carry-forward discussion
4. open observations
5. why wording or scope changed
This document may be longer and more detailed.
### `phase-xx-decisions.md`
Use for:
1. durable phase-level decisions
2. accepted boundaries that later rounds should inherit
3. gate decisions
4. decisions that should not be re-argued without new evidence
This document should stay compact and hold only the more important global decisions.
## Expectation Template For `tester`
Before or at slice start, `tester` should define:
1. must-pass expectations
2. failure-class checklist
3. required test level for each behavior
4. reject conditions
`tester` should re-engage after technical review is mostly clean, to confirm final evidence closure before the manager's second-round review.
Suggested format:
```md
Expectation:
- ...
Required level:
- entry path / engine / unit
Reject if:
- ...
Failure classes covered:
- ...
```
## Review Checklist For `architect`
Review these first:
1. nil handling
2. missing-resource handling
3. wrong-state / wrong-kind rejection
4. stale ID / stale authority rejection
5. resource pin / release symmetry
6. plan/execute/complete argument correctness
7. fail-closed cleanup on partial failure
## Failure-Class Checklist
This checklist should be kept active across phases.
Minimum recurring classes:
1. changed-address restart
2. stale epoch / stale session
3. missing resource pin
4. cleanup after failed plan
5. replay range mis-derived
6. false trusted-base selection
7. truncation missing but completion attempted
8. bounded catch-up not escalating
## Process Rules
### Rule 1: Do not wait until the end to define proof
Each slice should begin with a statement of:
1. what must be proven
2. which failure classes must stay closed
### Rule 2: Do not let convenience wrappers silently become model truth
Any convenience flow must be explicitly classified as:
1. test-only convenience
2. stepwise engine task
3. planner/executor split
### Rule 3: Prefer evidence quality over object growth
New work should preferentially improve:
1. traceability
2. diagnosability
3. failure-class closure
4. adapter contracts
not just add:
1. more structs
2. more states
3. more helper APIs
### Rule 4: Use V1 as validation source, not architecture template
Use:
1. `learn/projects/sw-block/`
2. `weed/storage/block*`
for:
1. constraints
2. failure gates
3. implementation reality
Do not use them as the default V2 architecture template.
### Rule 5: Reuse reality, not inherited semantics
When later implementation reuses existing `Seaweed` / `V1` paths:
1. reuse control-plane reality
2. reuse storage/runtime reality
3. reuse execution mechanisms
but do not silently inherit:
1. address-shaped identity
2. old recovery classification semantics
3. old committed-truth assumptions
4. old failover authority assumptions
Any such reuse should be reviewed explicitly as:
1. safe reuse
2. reuse with explicit boundary
3. temporary carry-forward
4. hard gate before later phases
## Current Direction
The project has moved from exploration-heavy work to evidence-first engine work.
From `Phase 06` onward, the default is:
1. plan first
2. review plan before coding
3. implement
4. review technical layer
5. close evidence
6. do final manager review

---
`sw-block/design/phase-07-service-slice-plan.md` (new file, 403 lines)
# Phase 07 Service-Slice Plan
Date: 2026-03-30
Status: draft
Scope: `Phase 07 P0`
## Purpose
Define the first real-system service slice that will host the V2 engine, choose the first concrete integration path in the existing codebase, and map engine adapters onto real modules.
This is a planning document. It does not claim the integration already works.
## Decision
The first service slice should be:
- a single `blockvol` primary on a real volume server
- with one replica target (`RF=2` path)
- driven by the existing master heartbeat / assignment loop
- using the V2 engine only for replication recovery ownership / planning / execution
This is the narrowest real-system slice that still exercises:
1. real assignment delivery
2. real epoch and failover signals
3. real volume-server lifecycle
4. real WAL/checkpoint/base-image truth
5. real changed-address / reconnect behavior
It is narrow enough to avoid reopening the whole system, but real enough to stop hiding behind engine-local mocks.
## Why This Slice
This slice is the right first integration target because:
1. `weed/server/master_grpc_server.go` already delivers block-volume assignments over heartbeat
2. `weed/server/master_block_failover.go` already owns failover / promotion / pending rebuild decisions
3. `weed/storage/blockvol/blockvol.go` already owns the current replication runtime (`shipperGroup`, receiver, WAL retention, checkpoint state)
4. the existing V1/V1.5 failure history is concentrated in exactly this master <-> volume-server <-> blockvol path
So this slice gives maximum validation value with minimum new surface.
## First Concrete Integration Path
The first integration path should be:
1. master receives volume-server heartbeat
2. master updates block registry and emits `BlockVolumeAssignment`
3. volume server receives assignment
4. block volume adapter converts assignment + local storage state into V2 engine inputs
5. V2 engine drives sender/session/recovery state
6. existing block-volume runtime executes the actual data-path work under engine decisions
In code, that path starts here:
- master side:
- `weed/server/master_grpc_server.go`
- `weed/server/master_block_failover.go`
- `weed/server/master_block_registry.go`
- volume / storage side:
- `weed/storage/blockvol/blockvol.go`
- `weed/storage/blockvol/recovery.go`
- `weed/storage/blockvol/wal_shipper.go`
- assignment-handling code under `weed/storage/blockvol/`
- V2 engine side:
- `sw-block/engine/replication/`
## Service-Slice Boundaries
### In-process placement
The V2 engine should initially live:
- in-process with the volume server / `blockvol` runtime
- not in master
- not as a separate service yet
Reason:
- the engine needs local access to storage truth and local recovery execution
- master should remain control-plane authority, not recovery executor
### Control-plane boundary
Master remains authoritative for:
1. epoch
2. role / assignment
3. promotion / failover decision
4. replica membership
The engine consumes these as control inputs. It does not replace master failover policy in `Phase 07`.
### Control-Over-Heartbeat Upgrade Path
For the first V2 product path, the recommended direction is:
- reuse the existing master <-> volume-server heartbeat path as the control carrier
- upgrade the block-specific control semantics carried on that path
- do not immediately invent a separate control service or assignment channel
Why:
1. this is the real Seaweed path already carrying block assignments and confirmations today
2. this gives the fastest route to a real integrated control path
3. it preserves compatibility with existing Seaweed master/volume-server semantics while V2 hardens its own control truth
Concretely, the current V1 path already provides:
1. block assignments delivered in heartbeat responses from `weed/server/master_grpc_server.go`
2. assignment application on the volume server in `weed/server/volume_grpc_client_to_master.go` and `weed/server/volume_server_block.go`
3. assignment confirmation and address-change refresh driven by later heartbeats in `weed/server/master_grpc_server.go` and `weed/server/master_block_registry.go`
4. immediate block heartbeat on selected shipper state changes in `weed/server/volume_grpc_client_to_master.go`
What should be upgraded for V2 is not mainly the transport, but the control contract carried on it:
1. stable `ReplicaID`
2. explicit `Epoch`
3. explicit role / assignment authority
4. explicit apply/confirm semantics
5. explicit stale assignment rejection
6. explicit address-change refresh as endpoint change, not identity change
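As a sketch, the upgraded contract could be carried as one explicit assignment message over the existing heartbeat transport. All field and type names here are illustrative assumptions, not the current proto shape:

```go
package main

// BlockControlContract is a hypothetical sketch of the V2 control
// semantics to be carried over the existing heartbeat transport.
// Field names are illustrative, not real proto definitions.
type BlockControlContract struct {
	ReplicaID string // stable identity: "<volume-name>/<replica-server-id>"
	Epoch     uint64 // explicit authority epoch from the master
	Role      string // explicit role: "primary" or "replica"
	DataAddr  string // endpoint only; may change without changing identity
	CtrlAddr  string // endpoint only
}

// Accept applies fail-closed stale-assignment rejection: an assignment
// carrying an older epoch than the last applied one is refused.
// Equal-epoch re-application is allowed (idempotent address refresh).
func Accept(lastAppliedEpoch uint64, a BlockControlContract) bool {
	return a.Epoch >= lastAppliedEpoch
}
```

The point of the sketch is the contract split: identity and epoch are authority fields, addresses are refreshable endpoint fields, and staleness is rejected at the boundary rather than absorbed downstream.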
Current cadence note:
- the block volume heartbeat is periodic (`5 * sleepInterval`) with some immediate state-change heartbeats
- this is acceptable as the first hardening carrier
- it should not be assumed to be the final control responsiveness model
Deferred design decision:
- whether block control should eventually move beyond heartbeat-only carriage into a more explicit control/assignment channel should be decided only after the `Phase 08 P1` real control-delivery path exists and can be measured
That later decision should be based on:
1. failover / reassignment responsiveness
2. assignment confirmation precision
3. operational complexity
4. whether heartbeat carriage remains too coarse for the block-control path
Until then, the preferred direction is:
- strengthen block control semantics over the existing heartbeat path
- do not prematurely create a second control plane
### Storage boundary
`blockvol` remains authoritative for:
1. WAL head / retention reality
2. checkpoint/base-image reality
3. actual catch-up streaming
4. actual rebuild transfer / restore operations
The engine consumes these as storage truth and recovery execution capabilities. It does not replace the storage backend in `Phase 07`.
## First-Slice Identity Mapping
This must be explicit in the first integration slice.
For `RF=2` on the existing master / block registry path:
- stable engine `ReplicaID` should be derived from:
- `<volume-name>/<replica-server-id>`
- not from:
- `DataAddr`
- `CtrlAddr`
- heartbeat transport endpoint
For this slice, the adapter should map:
1. `ReplicaID`
- from master/block-registry identity for the replica host entry
2. `Endpoint`
- from the current replica receiver/data/control addresses reported by the real runtime
3. `Epoch`
- from the confirmed master assignment for the volume
4. `SessionKind`
- from master-driven recovery intent / role transition outcome
This is a hard first-slice requirement because address refresh must not collapse identity back into endpoint-shaped keys.
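A minimal sketch of the intended derivation, including the fail-closed rule when the registry supplies no stable server identity (function and error names are illustrative):

```go
package main

import (
	"errors"
	"fmt"
)

// ErrMissingServerID signals that the registry did not supply a stable
// server identity; the replica must be skipped (fail closed), never
// keyed by its current address instead.
var ErrMissingServerID = errors.New("replica has no stable ServerID: fail closed")

// DeriveReplicaID builds the stable engine identity from registry truth.
// It never falls back to DataAddr, CtrlAddr, or the heartbeat endpoint.
func DeriveReplicaID(volumeName, serverID string) (string, error) {
	if serverID == "" {
		return "", ErrMissingServerID
	}
	return fmt.Sprintf("%s/%s", volumeName, serverID), nil
}
```

Under this rule an address change produces a new endpoint for the same `ReplicaID`, while a missing `ServerID` produces no replica at all.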
## Adapter Mapping
### 1. ControlPlaneAdapter
Engine interface today:
- `HandleHeartbeat(serverID, volumes)`
- `HandleFailover(deadServerID)`
Real mapping should be:
- master-side source:
- `weed/server/master_grpc_server.go`
- `weed/server/master_block_failover.go`
- `weed/server/master_block_registry.go`
- volume-server side sink:
- assignment receive/apply path in `weed/storage/blockvol/`
Recommended real shape:
- do not literally push raw heartbeat messages into the engine
- instead introduce a thin adapter that converts confirmed master assignment state into:
- stable `ReplicaID`
- endpoint set
- epoch
- recovery target kind
That keeps master as control owner and the engine as execution owner.
Important note:
- the adapter should treat heartbeat as the transport carrier, not as the final protocol shape
- block-control semantics should be made explicit over that carrier
- if a later phase concludes that heartbeat-only carriage is too coarse, that should be a separate design decision after the real hardening path is measured
### 2. StorageAdapter
Engine interface today:
- `GetRetainedHistory()`
- `PinSnapshot(lsn)` / `ReleaseSnapshot(pin)`
- `PinWALRetention(startLSN)` / `ReleaseWALRetention(pin)`
- `PinFullBase(committedLSN)` / `ReleaseFullBase(pin)`
Real mapping should be:
- retained history source:
- current WAL head/tail/checkpoint state from `weed/storage/blockvol/blockvol.go`
- recovery helpers in `weed/storage/blockvol/recovery.go`
- WAL retention pin:
- existing retention-floor / replica-aware WAL retention machinery around `shipperGroup`
- snapshot pin:
- existing snapshot/checkpoint artifacts in `blockvol`
- full-base pin:
- explicit pinned full-extent export or equivalent consistent base handle from `blockvol`
Important constraint:
- `Phase 07` must not fake this by reconstructing `RetainedHistory` from tests or metadata alone
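The pin/release surface implies a symmetry obligation on the bridge: a plan that fails partway through must release what it already pinned. A minimal in-memory sketch (the stub adapter and `PlanRebuild` are illustrative, not the real engine interface):

```go
package main

import "errors"

// Pin is an opaque handle to a held storage resource.
type Pin struct{ id int }

// StorageAdapter mirrors the engine-side pin surface described above;
// this stub only tracks the count of held pins for illustration.
type StorageAdapter struct{ held int }

func (s *StorageAdapter) PinWALRetention(startLSN uint64) (Pin, error) {
	s.held++
	return Pin{id: s.held}, nil
}

func (s *StorageAdapter) PinFullBase(committedLSN uint64) (Pin, error) {
	s.held++
	return Pin{id: s.held}, nil
}

func (s *StorageAdapter) Release(p Pin) { s.held-- }

// PlanRebuild shows the symmetry rule: if acquiring the second resource
// fails, the first pin is released before the error is returned, so a
// failed plan leaves no dangling retention floor.
func PlanRebuild(s *StorageAdapter, baseLSN, tailStart uint64, baseFails bool) error {
	walPin, err := s.PinWALRetention(tailStart)
	if err != nil {
		return err
	}
	if baseFails { // simulate PinFullBase failing on the real backend
		s.Release(walPin) // fail-closed cleanup
		return errors.New("full base unavailable")
	}
	_, _ = s.PinFullBase(baseLSN)
	return nil
}
```

The real bridge would hold the pins for the lifetime of the recovery plan; the stub only demonstrates the acquire/release accounting.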
### 3. Execution Driver / Executor hookup
Engine side already has:
- planner/executor split in `sw-block/engine/replication/driver.go`
- stepwise executors in `sw-block/engine/replication/executor.go`
Real mapping should be:
- engine planner decides:
- zero-gap / catch-up / rebuild
- trusted-base requirement
- replayable-tail requirement
- blockvol runtime performs:
- actual WAL catch-up transport
- actual snapshot/base transfer
- actual truncation / apply operations
Recommended split:
- engine owns contract and state transitions
- blockvol adapter owns concrete I/O work
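The planner's core decision can be sketched as a pure classification over the retained WAL range and the replica's durable progress (a simplified model with illustrative names; the real planner also weighs trusted-base validity):

```go
package main

// RecoveryKind is the planner's explicit classification; the choice is
// a protocol object, not a timeout side effect.
type RecoveryKind int

const (
	ZeroGap RecoveryKind = iota // replica already at the retained head
	CatchUp                     // replayable tail exists in retained WAL
	Rebuild                     // gap precedes retained WAL: base + tail
)

// Classify sketches the replayable-tail decision.
// retainedStart..retainedEnd is the WAL range the primary still holds;
// replicaLSN is the replica's durable progress.
func Classify(replicaLSN, retainedStart, retainedEnd uint64) RecoveryKind {
	switch {
	case replicaLSN >= retainedEnd:
		return ZeroGap
	case replicaLSN+1 >= retainedStart:
		return CatchUp // every record from replicaLSN+1 is still retained
	default:
		return Rebuild // needed history fell off the retention floor
	}
}
```

Keeping this function on the engine side is exactly the acceptance rule that follows: `blockvol` may ship the tail or the base image, but it never computes this classification itself.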
## First-Slice Acceptance Rule
For the first integration slice, this is a hard rule:
- `blockvol` may execute recovery I/O
- `blockvol` must not own recovery policy
Concretely, `blockvol` must not decide:
1. zero-gap vs catch-up vs rebuild
2. trusted-base validity
3. replayable-tail sufficiency
4. whether rebuild fallback is required
Those decisions must remain in the V2 engine.
The bridge may translate engine decisions into concrete blockvol actions, but it must not re-decide recovery policy underneath the engine.
## First Product Path
The first product path should be:
- `RF=2` block volume replication on the existing heartbeat/assignment loop
- primary + one replica
- failover / reconnect / changed-address handling
- rebuild as the formal non-catch-up recovery path
This is the right first path because it exercises the core correctness boundary without introducing N-replica coordination complexity too early.
## What Must Be Replaced First
Current engine-stage pieces that are still mock/test-only or too abstract:
### Replace first
1. `mockStorage` in engine tests
- replace with a real `blockvol`-backed `StorageAdapter`
2. synthetic control events in engine tests
- replace with assignment-driven events from the real master/volume-server path
3. convenience recovery completion wrappers
- keep them test-only
- real integration should use planner + executor + storage work loop
### Can remain temporarily abstract in Phase 07 P0/P1
1. `ControlPlaneAdapter` exact public shape
- can remain thin while the integration path is being chosen
2. async production scheduler details
- executor can still be driven by a service loop before full background-task architecture is finalized
## Recommended Concrete Modules
### Engine stays here
- `sw-block/engine/replication/`
### First real adapter package should be added near blockvol
Recommended initial location:
- `weed/storage/blockvol/v2bridge/`
Reason:
- keeps V2 engine independent under `sw-block/`
- keeps real-system glue close to blockvol storage truth
- avoids copying engine logic into `weed/`
Suggested contents:
1. `control_adapter.go`
- convert master assignment / local apply path into engine intents
2. `storage_adapter.go`
- expose retained history, pin/release, trusted-base export handles from real blockvol state
3. `executor_bridge.go`
- translate engine executor steps into actual blockvol recovery actions
4. `observe_adapter.go`
- map engine status/logs into service-visible diagnostics
## First Failure Replay Set For Phase 07
The first real-system replay set should be:
1. changed-address restart
- current risk: old identity/address coupling reappears in service glue
2. stale epoch / stale result after failover
- current risk: master and engine disagree on authority timing
3. unreplayable-tail rebuild fallback
- current risk: service glue over-trusts checkpoint/base availability
4. plan/execution cleanup after resource failure
- current risk: blockvol-side resource failures leave engine or service state dangling
5. primary failover to replica with rebuild pending on old primary reconnect
- current risk: old V1/V1.5 semantics leak back into reconnect handling
## Non-Goals For This Slice
Do not use `Phase 07` to:
1. widen catch-up semantics
2. add smart rebuild optimizations
3. redesign all blockvol internals
4. replace the full V1 runtime in one move
5. claim production readiness
## Deliverables For Phase 07 P0
A good `P0` delivery should include:
1. chosen service slice
2. chosen integration path in the current repo
3. adapter-to-module mapping
4. list of test-only adapters to replace first
5. first failure replay set
6. explicit note of what remains outside this first slice
## Short Form
`Phase 07 P0` should start with:
- engine in `sw-block/engine/replication/`
- bridge in `weed/storage/blockvol/v2bridge/`
- first real slice = blockvol primary + one replica on the existing master heartbeat / assignment path
- `ReplicaID = <volume-name>/<replica-server-id>` for the first slice
- `blockvol` executes I/O but does not own recovery policy
- first product path = `RF=2` failover/reconnect/rebuild correctness

---
`sw-block/design/v2-algorithm-overview.md` (new file, 686 lines)
# V2 Algorithm Overview
Date: 2026-03-27
Status: strategic design overview
Audience: CEO / owner / technical leadership
## Purpose
This document explains the current V2 direction for `sw-block`:
- what V2 is trying to solve
- why V1 and V1.5 are not enough as the long-term architecture
- why a WAL-based design is still worth pursuing
- how V2 compares with major market and paper directions
- how simulation and the real test runner systematically build confidence
This is not a phase report and not a production-commitment document.
It is the high-level technical rationale for the V2 line.
## Relationship To Other Documents
| Document | Role |
|----------|------|
| `v1-v15-v2-comparison.md` | Detailed comparison of the three lines |
| `v2-acceptance-criteria.md` | Protocol validation bar |
| `v2_scenarios.md` | Scenario backlog and simulator mapping |
| `v2-open-questions.md` | Remaining algorithmic questions |
| `protocol-development-process.md` | Method for protocol work |
| `learn/projects/sw-block/algorithm_overview.md` | Current V1/V1.5 system review |
| `learn/projects/sw-block/design/algorithm_survey.md` | Paper and vendor survey |
| `learn/projects/sw-block/test/README.md` | Real test runner overview |
| `learn/projects/sw-block/test/test-platform-review.md` | Test platform maturity and standalone direction |
## 1. Executive Summary
The current judgment is:
- `V1` proved that the basic WAL-based replicated block model can work.
- `V1.5` materially improved real recovery behavior and now has stronger operational evidence on real hardware.
- `V2` exists because the next correctness problems should not be solved by incremental local fixes. They should be made explicit in the protocol itself.
The central V2 idea is simple:
- short-gap recovery should be explicit
- stale authority should be explicitly fenced
- catch-up vs rebuild should be an explicit decision
- recovery ownership should be a protocol object, not an implementation accident
`V2` is not yet a production engine. But it is already the stronger architectural direction.
The correct strategic posture today is:
- continue `V1.5` as the production line
- continue `V2` as the long-term architecture line
- continue WAL investigation because we now have a serious validation framework
- if prototype evidence later shows a structural flaw, evolve to `V2.5` before heavy implementation
## 2. The Real Problem V2 Tries To Solve
At the frontend, a block service looks simple:
- `write`
- `flush` / `sync`
- failover
- recovery
But the real difficulty is not the frontend verb set. The real difficulty is the asynchronous distributed boundary between:
- local WAL append on the primary
- durable progress on replicas
- client-visible commit / sync truth
- failover and promotion safety
- recovery after lag, restart, endpoint change, or timeout
This is the root reason V2 exists.
The project has already learned that correctness problems in block storage do not usually come from the happy path. They come from:
- a replica going briefly down and coming back
- a replica coming back on a new address
- a delayed stale barrier or stale reconnect result
- a lagging node that is almost, but not quite, recoverable
- a failover decision made on insufficient lineage information
V2 is the attempt to make those cases first-class protocol behavior instead of post-hoc patching.
## 3. Why V1 And V1.5 Are Not Enough
This overview does not need a long retelling of `V1` and `V1.5`.
What matters is their architectural limit.
### What `V1` got right
`V1` proved the basic shape:
- ordered WAL
- primary-replica replication
- extent-backed storage
- epoch and lease as the first fencing model
### Why `V1` is not enough
Its main shortcomings were:
- short-gap recovery was too weak and too implicit
- lagging replicas too easily fell into rebuild or long degraded states
- changed-address restart was fragile
- stale authority and stale results were not modeled as first-class protocol objects
- the system did not cleanly separate:
- current WAL head
- committed prefix
- recoverable retained range
- stale or divergent replica tail
### Why `V1.5` is still not enough
`V1.5` fixed several real operational problems:
- retained-WAL catch-up
- same-address reconnect
- `sync_all` correctness on real tests
- rebuild fallback after unrecoverable gap
- control-plane refresh after changed-address restart
Those fixes matter, and they are why `V1.5` is the stronger production line today.
But `V1.5` is still not the long-term architecture because its recovery model remains too incremental:
- reconnect logic is still layered onto an older shipper model
- recovery ownership was discovered as a bug class before it became a protocol object
- catch-up vs rebuild became clearer, but still not clean enough as a top-level protocol contract
- the system still looks too much like "repair V1" rather than "define the next replication model"
### What `V2` changes
`V2` is not trying to invent a completely different storage model.
It is trying to make the critical parts explicit:
- recovery ownership
- lineage-safe recovery boundary
- catch-up vs rebuild classification
- per-replica sender authority
- stale-result rejection
- explicit recovery orchestration
So the correct comparison is still:
- `V1.5` is stronger operationally today
- `V2` is stronger architecturally today
That is not a contradiction. It is the right split between a current production line and the next architecture line.
```mermaid
flowchart TD
V1[V1]
V15[V1_5]
V2[V2]
realFailures[RealFailures]
realTests[RealHardwareEvidence]
simAndProto[SimulationAndPrototype]
V1 --> V15
V15 --> V2
realFailures --> V15
realFailures --> V2
V15 --> realTests
V2 --> simAndProto
```
## 4. How V2 Solves WAL And Extent Synchronization
The core V2 question is not simply "do we keep WAL?"
The real question is:
**how do WAL and extent stay synchronized across primary and replica while preserving both stability and performance?**
This is the center of the V2 design.
### 4.1 The basic separation of roles
V2 treats the storage path as two different but coordinated layers:
- **WAL** is the ordered truth for recent history
- **extent** is the stable materialized image
WAL is used for:
- strict write ordering
- local crash recovery
- short-gap replica catch-up
- durable progress accounting through `LSN`
Extent is used for:
- stable read image
- long-lived storage
- checkpoint and base-image creation
- long-gap recovery only through a real checkpoint/snapshot base, not through guessing from the current live extent
This separation is the first stability rule:
- do not ask current extent to behave like historical state
- do not ask WAL to be the only long-range recovery mechanism forever
### 4.2 Primary-replica synchronization model
The intended V2 steady-state model is:
1. primary allocates monotonic `LSN`
2. primary appends ordered WAL locally
3. primary enqueues the record to per-replica sender loops
4. replicas receive in order and advance explicit progress
5. barrier/sync uses **durable replica progress**, not optimistic send progress
6. flusher later materializes WAL-backed dirty state into extent
The local WAL-to-extent lifecycle can be understood as:
```mermaid
stateDiagram-v2
[*] --> WalAppended
WalAppended --> SenderQueued
SenderQueued --> ReplicaReceived
ReplicaReceived --> ReplicaDurable
ReplicaDurable --> SyncEligible
SyncEligible --> ExtentMaterialized
ExtentMaterialized --> CheckpointAdvanced
note right of WalAppended
Ordered local WAL exists
and defines the write LSN
end note
note right of ReplicaDurable
Replica durable progress
is now explicit
end note
note right of ExtentMaterialized
Flusher moves stable data
from WAL-backed dirty state
into extent
end note
```
The critical synchronization rule is:
- **client-visible sync truth must follow durable replica progress**
- not local send progress
- not local WAL head
- not "replica probably received it"
This is why V2 uses a lineage-safe recovery target such as `CommittedLSN` instead of a looser notion like "current primary head."
### 4.2.1 Sync mode and result model
V2 also makes the sync-result logic more explicit.
- `best_effort` should succeed after the primary has reached its local durability point, even if replicas are degraded.
- `sync_all` should succeed only when all required replicas are durable through the target boundary.
- `sync_quorum` should succeed only when a true durable quorum exists through the target boundary.
This decision path can be presented as:
```mermaid
flowchart TD
writeReq[WriteAndSyncRequest]
localDurable[PrimaryLocalDurable]
barrierEval[EvaluateReplicaDurableProgress]
bestEffortAck[best_effortAck]
syncAllAck[sync_allAck]
syncQuorumAck[sync_quorumAck]
rejectOrBlock[RejectOrBlock]
writeReq --> localDurable
localDurable --> bestEffortAck
localDurable --> barrierEval
barrierEval -->|"allRequiredReplicasDurable"| syncAllAck
barrierEval -->|"durableQuorumExists"| syncQuorumAck
barrierEval -->|"notEnoughDurableReplicas"| rejectOrBlock
```
The key point is that sync success is no longer inferred from send progress or socket health.
It is derived from explicit durable progress at the right safety boundary.
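The decision path above can be sketched as a pure function over durable replica progress (a simplified model with illustrative names; the real evaluation also involves barrier timeouts and the block/reject choice):

```go
package main

// SyncMode mirrors the three durability modes discussed above.
type SyncMode int

const (
	BestEffort SyncMode = iota
	SyncAll
	SyncQuorum
)

// SyncSatisfied reports whether a sync at targetLSN may acknowledge,
// given the primary's local durability and each replica's durable LSN.
// Simplifications: no barrier timeout, and the primary counts as one
// member of the quorum. Note the inputs are durable LSNs, never send
// progress or socket health.
func SyncSatisfied(mode SyncMode, primaryDurable bool, targetLSN uint64, replicaDurableLSNs []uint64) bool {
	if !primaryDurable {
		return false // nothing acknowledges before local durability
	}
	durable := 1 // the primary itself
	for _, lsn := range replicaDurableLSNs {
		if lsn >= targetLSN {
			durable++
		}
	}
	total := 1 + len(replicaDurableLSNs)
	switch mode {
	case BestEffort:
		return true
	case SyncAll:
		return durable == total // every required replica is durable
	default: // SyncQuorum
		return durable > total/2 // a true durable majority
	}
}
```

With one replica lagging behind the target, this model acknowledges `best_effort` and `sync_quorum` but blocks `sync_all`, matching the distsim evidence cited for A8.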
### 4.3 Why this should be stable
This model is designed to be stable because the dangerous ambiguities are separated:
- **write ordering** is carried by WAL and `LSN`
- **durability truth** is carried by barrier / flushed progress
- **recovery ownership** is carried by sender + recovery attempt identity
- **catch-up vs rebuild** is an explicit classification, not an accidental timeout side effect
- **promotion safety** depends on committed prefix and lineage, not on whichever node looks newest
In other words, V2 stability comes from reducing hidden coupling.
The design tries to remove cases where one piece of state silently stands in for another.
### 4.4 Why this can still be high-performance
The performance argument is not that V2 is magically faster in all cases.
The argument is narrower and more realistic:
- keep the primary write path simple:
- ordered local WAL append
- enqueue to per-replica sender loops
- no heavy inline recovery logic in foreground writes
- keep most complexity off the healthy hot path: sender ownership, reconnect classification, catch-up / rebuild decisions, and timeout / stale-result fencing live mostly in recovery/control paths
- use WAL for what it is good at:
- recent ordered delta
- short-gap replay
- stop using WAL as the answer to every lag problem:
- long-gap recovery should move toward checkpoint/snapshot base plus tail replay
So the V2 performance thesis is:
- **healthy steady-state should remain close to V1.5**
- **degraded/recovery behavior should become much cleaner**
- **short-gap recovery should be cheaper than rebuild**
- **long-gap recovery should stop forcing an unbounded WAL-retention tax**
That is a much stronger and more believable claim than saying "V2 will just be faster."
### 4.5 Why WAL is still worth choosing
The reason to keep the WAL-based direction is that it gives the best foundation for this exact synchronization problem:
- explicit order
- explicit history
- explicit committed prefix
- explicit short-gap replay
- explicit failover reasoning
WAL is risky only if the design blurs:
- local write acceptance
- replica durable progress
- committed boundary
- recoverable retained history
V2 exists precisely to stop blurring those things.
So the current project position is:
- WAL is not automatically safe
- but WAL is still the most promising base for this block service
- because the project now has enough real evidence, simulator coverage, and prototype work to investigate it rigorously
## 5. Comparison With Market And Papers
The current V2 direction is not chosen because other vendors are wrong. It is chosen because other directions solve different problems and carry different costs.
### Ceph / RBD style systems
Ceph-style block systems avoid this exact per-volume replicated WAL shape. They gain:
- deep integration with object-backed distributed storage
- mature placement and recovery machinery
- strong cluster-scale distribution logic
But they pay elsewhere:
- more system layers
- more object-store and peering complexity
- a heavier operational and conceptual model
This is not a free simplification. It is a different complexity trade.
For `sw-block`, the design choice is to keep a narrower software block service with more explicit per-volume replication semantics instead of inheriting the full distributed object-backed block complexity stack.
### PolarFS / ParallelRaft style work
These systems explore more aggressive ordering and apply strategies:
- out-of-order or conflict-aware work
- deeper parallelism
- more sophisticated log handling
They are valuable references, especially for:
- LBA conflict reasoning
- recovery and replay cost thinking
- future flusher parallelization ideas
But they also introduce a much heavier correctness surface.
The project does not currently want to buy that complexity before fully proving the simpler strict-order path.
### AWS chain replication / EBS-style lessons
Chain replication and related work are attractive because they address real bandwidth and recovery concerns:
- Primary NIC pressure
- forwarding topology
- cleaner scaling for RF=3
This is one of the more plausible borrowable directions later.
But it changes:
- latency profile
- failure handling
- barrier semantics
- operational topology
So it belongs to a later architecture stage, not to the current V2 core proof.
### The actual strategic choice
The project is deliberately choosing:
- a narrower software-first block design
- explicit per-volume correctness
- strict reasoning before performance heroics
- validation before feature expansion
That is not conservatism for its own sake. It is how to build a block product that can later be trusted.
## 6. Why This Direction Fits SeaweedFS And Future Standalone sw-block
`sw-block` started inside SeaweedFS, but V2 is already being shaped as the next standalone block service line.
That means the architecture should preserve two things at once:
### What should remain compatible
- placement and topology concepts where they remain useful
- explainable control-plane contracts
- operational continuity with the SeaweedFS ecosystem
### What should become more block-specific
- replication correctness
- recovery ownership
- recoverability classification
- block-specific test and evidence story
So the current direction is:
- use SeaweedFS as the practical ecosystem and experience base
- but shape V2 as a true block-service architecture, not as a minor sub-feature of `weed/`
This is why the V2 line belongs under `sw-block/` rather than as a direct patch path inside the existing production tree.
## 7. The Systematic Validation Method
The second major reason the current direction is rational is the validation method.
The project is no longer relying on:
- implement first
- discover behavior later
- patch after failure
Instead, the intended ladder is:
- contract and invariants
- scenario backlog
- simulator
- timer/race simulator
- standalone prototype
- real engine test runner
```mermaid
flowchart TD
contract[ContractAndInvariants]
scenarios[ScenarioBacklog]
distsim[distsim]
eventsim[eventsim]
prototype[enginev2Prototype]
runner[RealTestRunner]
confidence[SystemAndProductConfidence]
contract --> scenarios
scenarios --> distsim
scenarios --> eventsim
distsim --> prototype
eventsim --> prototype
prototype --> runner
runner --> confidence
```
This is the right shape for a risky block-storage algorithm:
- simulation for protocol truth
- prototype for executable truth
- real runner for product/system truth
## 8. What The Simulation System Proves
The simulation system exists to answer:
- what should happen
- what must never happen
- which V1/V1.5 shapes fail
- why the V2 shape is better
### `distsim`
`distsim` is the main protocol simulator.
It is used for:
- protocol correctness
- state transitions
- stale authority fencing
- promotion and lineage safety
- catch-up vs rebuild
- changed-address restart
- candidate safety
- reference-state checking
### `eventsim`
`eventsim` is the timing/race layer.
It is used for:
- barrier timeout behavior
- catch-up timeout behavior
- reservation timeout behavior
- same-tick and delayed event ordering
- stale timeout effects
### What the simulator is good at
It is especially strong for proving:
- stale traffic rejection
- explicit recovery boundaries
- timeout/race semantics
- failover correctness at committed prefix
- why old authority must not mutate current lineage
### What the simulator does not prove
It does not prove:
- real TCP behavior
- real OS scheduling behavior
- disk timing
- real `WALShipper` integration
- real frontend behavior under iSCSI or NVMe
So the simulator is not the whole truth.
It is the algorithm/protocol truth layer.
## 9. What The Real Test Runner Proves
The real test runner under `learn/projects/sw-block/test/` is the system and product validation layer.
It is not merely QA support. It is a core part of whether the design can be trusted.
### What it covers
The runner and surrounding test system already span:
- unit tests
- component tests
- integration tests
- distributed scenarios
- real hardware workflows
The environment already includes:
- real nodes
- real block targets
- real fault injection
- benchmark and result capture
- run bundles and scenario traceability
### Why it matters
The runner is what tells us whether:
- the implemented engine behaves like the design says
- the product works under real restart/failover/rejoin conditions
- the operator workflows are credible
- benchmark claims are real rather than accidental
This is why the runner is best thought of as:
- implementation truth
- system truth
- product truth
not just test automation.
## 10. How Simulation And Test Runner Progress Systematically
The intended feedback loop is:
1. V1/V1.5 real failures happen
2. those failures are turned into design requirements
3. scenarios are distilled for simulator use
4. the simulator closes protocol ambiguity
5. the standalone prototype closes execution ambiguity
6. the real test runner validates system behavior on real environments
7. new failures or mismatches feed back into design again
This gives the project two different but complementary truths:
- `simulation -> algorithm / protocol correctness`
- `test runner -> implementation / system / product correctness`
That separation is healthy.
It prevents two common mistakes:
- trusting design without real behavior
- trusting green system tests without understanding the protocol deeply enough
## 11. Current Status And Honest Limits
### What is already strong
- `V1.5` has materially better recovery behavior than `V1` and stronger operational evidence
- `V2` has stronger architectural structure than `V1.5`
- the simulator has serious acceptance coverage
- the prototype line has already started closing ownership and orchestration risk
- the real test runner is large enough to support serious system validation
### What is not yet done
- `V2` is not a production engine
- prototype work is still in early-to-mid stages
- historical-data / recovery-boundary prototype work is not complete
- steady-state performance of `V2` is not yet proven
- real hardware validation of `V2` does not yet exist
So the correct statement is not:
- "V2 is already better in production"
The correct statement is:
- "V2 is the better long-term architecture, but not yet the stronger deployed engine"
## 12. Why The Current Direction Is Rational
The current direction is rational because it keeps the right split:
- `V1.5` continues as the production line today
- `V2` continues as the next architecture line
This lets the project:
- keep shipping and hardening what already works
- explore the better architecture without destabilizing the current engine
- use simulation, prototype work, and the real runner to decide whether V2 should become the next real engine
The final strategic rule should remain:
- continue WAL investigation because the project now has a credible validation framework
- continue V2 because the architectural evidence is strong
- if prototype evidence later reveals a structural flaw, redesign to `V2.5` before heavy implementation
That is the disciplined path for a block-storage algorithm.
## Bottom Line
If choosing based on current production proof:
- use `V1.5`
If choosing based on long-term protocol quality:
- choose `V2`
If choosing based on whether WAL should still be investigated:
- yes, because the project now has the right validation stack to investigate it responsibly
That is the current strategic answer.

660
sw-block/design/v2-algorithm-overview.zh.md

@ -0,0 +1,660 @@
# V2 Algorithm Overview
Date: 2026-03-27
Status: strategic design overview
Audience: CEO / owner / engineering leadership
## Purpose Of This Document
This document explains the core judgments behind the current `V2` direction of `sw-block`:
- what problem `V2` is actually trying to solve
- why `V1` / `V1.5` are not sufficient as the long-term architecture
- why we still believe the `WAL`-based direction is worth continuing
- what the trade-offs are versus the main market solutions and paper lines
- how `simulation` and the real `test runner` form a systematic validation loop
This is not a phase report, nor a commitment to production readiness.
It is a high-level technical explanation of the `V2` architecture line.
## Relationship To Other Documents
| Document | Role |
|------|------|
| `v1-v15-v2-comparison.md` | detailed comparison of the three technical lines |
| `v2-acceptance-criteria.md` | minimum bar for V2 protocol validation |
| `v2_scenarios.md` | scenario inventory and simulator coverage |
| `v2-open-questions.md` | algorithm questions still open |
| `protocol-development-process.md` | protocol development methodology |
| `learn/projects/sw-block/algorithm_overview.md` | current V1/V1.5 system-level algorithm overview |
| `learn/projects/sw-block/design/algorithm_survey.md` | paper / vendor survey and borrowable items |
| `learn/projects/sw-block/test/README.md` | entry point for the real test system |
| `learn/projects/sw-block/test/test-platform-review.md` | platform direction for the test runner |
## 1. Executive Summary
The most accurate conclusions today are:
- `V1` proved that the basic path of WAL-based replicated block storage is feasible.
- `V1.5` is already clearly stronger than `V1` in real recovery scenarios, with running evidence on real hardware.
- The point of `V2` is not to keep patching the existing logic, but to elevate the most critical recovery and consistency problems into first-class protocol objects.
The core ideas of `V2` can be summarized as:
- short-gap recovery must be explicit
- stale authority must be explicitly fenced
- the boundary between `catch-up` and `rebuild` must be explicit
- recovery ownership must be part of the protocol, not accidental behavior buried in implementation details
So the correct strategy today is:
- keep `V1.5` as the current production line
- keep `V2` as the long-term architecture line
- keep investigating the `WAL` direction seriously, because we now have a credible validation framework
- if later prototype work proves `V2` has a structural flaw, evolve to `V2.5` first instead of pushing straight into implementation
## 2. The Problem V2 Actually Solves
From the frontend, block storage appears to involve only a few simple actions:
- `write`
- `flush` / `sync`
- failover
- recovery
But the real difficulty is not these frontend actions themselves. It is the asynchronous distributed boundaries:
- primary local WAL append
- replica-side durable progress
- the sync / commit truth visible to the client
- the data boundary at failover / promote
- recovery correctness after lag, restart, address change, and timeout
This is the root reason `V2` exists.
The project has verified repeatedly that real block-storage bugs usually appear not on the happy path but when:
- a replica drops briefly and comes back
- a replica restarts with a changed address
- a stale barrier / stale reconnect result arrives late
- a lagging replica looks "almost close enough" to recover
- failover promotes based on the wrong lineage
`V2` exists to make these situations first-class citizens of the protocol, rather than things to be patched reactively after going live.
## 3. Why V1 / V1.5 Are Not Enough
This overview does not need a long recap of every `V1` and `V1.5` detail.
It only needs to make clear why they are not sufficient as the long-term architecture.
### What `V1` got right
`V1` established the most important foundations:
- a strictly ordered `WAL`
- primary-replica replication
- initial fencing based on `epoch + lease`
- `extent` as the stable data plane, instead of going fully log-structured from the start
### Where `V1` falls short
Its key weaknesses are mainly in recovery and degraded scenarios:
- a short outage easily escalates into a rebuild or a long degraded period
- the recovery structure is too implicit
- changed-address restart is fragile
- stale authority / stale results are not yet explicit protocol-level objects
- the system does not distinguish clearly enough between:
  - the current head
  - the committed prefix
  - the recoverable retained range
  - the stale / divergent tail
### Where `V1.5` falls short
`V1.5` has already solved many real problems:
- retained-WAL catch-up
- same-address reconnect
- the real behavior of `sync_all`
- rebuild fallback after catch-up failure
- control-plane refresh after changed-address restart
So it is the stronger production line today.
But it is still not the long-term architecture, because it remains fundamentally incremental repair:
- reconnect logic is still attached to the old shipper model
- recovery ownership surfaced first as bugs and was only gradually abstracted afterward
- `catch-up` vs `rebuild` is clearer, but not yet a top-level protocol contract
- overall it still feels like "continuing to fix V1" rather than "defining the next-generation replication protocol"
### What `V2` changes
`V2` is not a reinvention of a completely different storage model.
Its goal is to make the most critical things explicit:
- recovery ownership
- lineage-safe recovery boundary
- `catch-up` / `rebuild` classification
- per-replica sender authority
- stale-result rejection
- explicit recovery orchestration
So the most honest comparison is:
- `V1.5` is stronger today in running evidence
- `V2` is stronger today in architectural quality
That is not a contradiction. It is the proper division between "current production line" and "next-generation architecture line".
```mermaid
flowchart TD
V1[V1]
V15[V1_5]
V2[V2]
realFailures[RealFailures]
realTests[RealHardwareValidation]
simAndProto[SimulationAndPrototype]
V1 --> V15
V15 --> V2
realFailures --> V15
realFailures --> V2
V15 --> realTests
V2 --> simAndProto
```
## 4. How V2 Keeps WAL And Extent In Sync
The core question for `V2` is not "should we still use a WAL".
The real question is:
**how do WAL and extent stay synchronized between primary and replica, while preserving both stability and performance.**
That is the center of `V2`.
### 4.1 Basic division of labor
`V2` splits the data path into two layers that are separate but cooperating:
- **WAL**: the ordered truth of recent history
- **extent**: the stable materialized data image
The WAL is responsible for:
- strict write ordering
- local crash recovery
- short-gap replica catch-up
- `LSN`-based durable progress accounting
The extent is responsible for:
- a stable read image
- long-term storage
- checkpoint / base image generation
- serving as the source of the real base image for long-gap recovery
The first stability principle is:
- do not let the current extent impersonate historical state
- do not let the WAL carry all long-range recovery responsibility forever
### 4.2 Primary-replica synchronization model
The ideal steady-state synchronization model in `V2` is:
1. the primary assigns monotonically increasing `LSN`s
2. the primary appends to the local `WAL` in order
3. the primary hands records to per-replica sender loops
4. replicas receive in order and advance explicit progress
5. `barrier/sync` depends on replica durable progress, not on optimistic send progress
6. the flusher then materializes WAL-backed dirty state into the extent
The local `WAL -> extent` lifecycle can be understood as:
```mermaid
stateDiagram-v2
[*] --> WalAppended
WalAppended --> SenderQueued
SenderQueued --> ReplicaReceived
ReplicaReceived --> ReplicaDurable
ReplicaDurable --> SyncEligible
SyncEligible --> ExtentMaterialized
ExtentMaterialized --> CheckpointAdvanced
```
The most critical rules here are:
- **the client-visible sync truth must follow durable replica progress**
- it must not follow send progress
- it must not follow the local WAL head
- it must not follow "the replica has probably received it by now"
This is also why `V2` uses a lineage-safe boundary like `CommittedLSN` rather than a loose "current primary head".
### 4.2.1 How each sync mode decides its result
`V2` makes the success condition of each sync mode explicit:
- `best_effort`: succeeds once the primary reaches its local durability point; replicas may recover in the background
- `sync_all`: every required replica must be durable at the target boundary
- `sync_quorum`: a real durable quorum must exist
The decision path can be expressed as:
```mermaid
flowchart TD
writeReq[WriteAndSyncRequest]
localDurable[PrimaryLocalDurable]
barrierEval[EvaluateReplicaDurableProgress]
bestEffortAck[BestEffortSuccess]
syncAllAck[SyncAllSuccess]
syncQuorumAck[SyncQuorumSuccess]
rejectOrBlock[BlockOrFail]
writeReq --> localDurable
localDurable --> bestEffortAck
localDurable --> barrierEval
barrierEval -->|"allRequiredReplicasDurable"| syncAllAck
barrierEval -->|"durableQuorumExists"| syncQuorumAck
barrierEval -->|"notEnoughDurableReplicas"| rejectOrBlock
```
This means the sync result no longer depends on:
- the socket looking alive
- the sender appearing to still be sending
- the replica seeming to have "mostly received" the data
It depends on explicit durable progress.
### 4.3 Why this design should be more stable
It tries to split apart the most dangerous ambiguities:
- **write ordering** is expressed by `WAL + LSN`
- **durability truth** is expressed by barrier / flushed progress
- **recovery ownership** is expressed by sender + recovery attempt identity
- **catch-up vs rebuild** is expressed by explicit classification
- **promotion safety** is expressed by the committed prefix and lineage
In other words, `V2`'s stability comes from reducing implicit coupling.
### 4.4 Why it can still be high-performance
We should not overclaim that `V2` is faster in every case.
The more accurate performance argument is:
- keep the primary foreground write path simple:
  - local ordered `WAL append`
  - handoff to per-replica sender loops
  - no complex recovery logic stuffed into the foreground write path
- keep the complexity mostly outside the healthy hot path:
  - sender ownership
  - reconnect classification
  - catch-up / rebuild decisions
  - timeout and stale-result fencing
  all of which live mainly in recovery / control paths
- let the WAL do only what it is good at:
  - recent ordered delta
  - short-gap replay
- stop making the WAL carry all long-range recovery:
  - long-gap recovery moves toward checkpoint/snapshot base + tail replay
So the `V2` performance thesis should be:
- **healthy steady-state should stay as close to `V1.5` as possible**
- **degraded and recovery paths become much cleaner**
- **short-gap recovery is cheaper than rebuild**
- **long-gap recovery no longer forces the system to pay an unbounded WAL-retention tax**
This is far more credible than "V2 is naturally faster".
### 4.5 Why WAL is still the choice
The reason to keep going with the WAL is that it remains the strongest foundation for this exact synchronization problem:
- explicit order
- explicit history
- explicit committed prefix
- explicit short-gap replay
- explicit failover reasoning
WAL becomes dangerous only when the design blurs:
- local write acceptance
- replica durable progress
- the committed boundary
- recoverable retained history
`V2` exists precisely to stop blurring these things.
## 5. Comparison With Market And Paper Directions
Choosing the `V2` direction is not because other vendors are wrong, but because they solve different problems and carry different complexity.
### The Ceph / RBD direction
Ceph/RBD avoids this per-volume replicated WAL shape.
What it gains:
- deep integration with object storage
- a mature placement and recovery system
- stronger cluster-scale distribution
But the costs are:
- more system layers
- heavier object-store / peering complexity
- a heavier operational and conceptual model
So this is not "simpler"; it moves the complexity elsewhere.
For `sw-block`, the current choice is:
- keep a narrower software block-service model
- trade more explicit per-volume correctness for more controllable complexity
### The PolarFS / ParallelRaft direction
These systems explore more aggressive ordering and parallelism strategies:
- conflict-aware or out-of-order parallelism
- deeper log parallelism
- more sophisticated apply / replay mechanics
They remain worth borrowing from in the future:
- LBA conflict reasoning
- replay cost and recovery cost
- flusher parallelization
But they also clearly enlarge the correctness surface.
At the current stage, the project should not buy into that complexity before the strict-order model has been fully proven.
### AWS chain replication / EBS-style lessons
Chain-replication-style directions are attractive because they solve real problems:
- Primary NIC pressure
- forwarding topology
- better scaling at RF=3
This is one of the more promising directions to borrow from later.
But it changes:
- the latency profile
- failure handling
- barrier semantics
- operational topology
So it belongs to a later architecture stage, not to the current V2 core proof.
### The actual choice
What the project is choosing now:
- a narrower software-first block design
- explicit per-volume correctness
- explaining the logic clearly before performance heroics
- building the validation loop before feature expansion
This is not conservatism; it is how this block product becomes genuinely trustworthy later.
## 6. Why This Direction Fits SeaweedFS And A Future Standalone sw-block
`sw-block` started inside SeaweedFS, but `V2` is already taking shape as the next-generation standalone block service.
That means the architecture must preserve two kinds of things at once:
### What should stay compatible
- concepts such as placement / topology
- explainable control-plane contracts
- operational continuity with the SeaweedFS ecosystem
### What should become more block-specific
- replication correctness
- recovery ownership
- recoverability classification
- a block-specific test / evidence system
So the current direction is not "keep treating V2 as a patch inside weed", but:
- use SeaweedFS as the experience and ecosystem base
- while gradually shaping `V2` into a truly standalone block-service architecture
## 7. The Systematic Validation Method
Another important reason the current direction is sound is that the validation method itself has become systematic.
The project no longer relies on:
- implement first
- observe later
- fix after bugs appear
Instead it relies on the following layers:
- contract / invariants
- scenario backlog
- simulator
- timer/race simulator
- standalone prototype
- real engine test runner
```mermaid
flowchart TD
contract[ContractAndInvariants]
scenarios[ScenarioBacklog]
distsim[distsim]
eventsim[eventsim]
prototype[enginev2Prototype]
runner[RealTestRunner]
confidence[SystemAndProductConfidence]
contract --> scenarios
scenarios --> distsim
scenarios --> eventsim
distsim --> prototype
eventsim --> prototype
prototype --> runner
runner --> confidence
```
This is exactly the right structure for a high-risk block-storage algorithm:
- simulation proves the protocol logic
- the prototype proves the execution semantics
- the real runner proves system and product behavior
## 8. What The Simulation System Proves
The goal of the simulation system is to answer:
- what should happen
- what must never happen
- why the old designs fail
- why V2 is better
### `distsim`
`distsim` is the main protocol simulator, used mainly for:
- protocol correctness
- state transitions
- stale authority fencing
- promotion / lineage safety
- catch-up vs rebuild
- changed-address restart
- candidate safety
- reference-state checking
### `eventsim`
`eventsim` is the timing / race layer, used mainly for:
- barrier timeouts
- catch-up timeouts
- reservation timeouts
- same-tick / delayed event ordering
- the effects of stale timeouts
### What simulation is good at proving
It is especially strong at proving:
- stale traffic rejection
- the explicitness of recovery boundaries
- timeout/race semantics
- failover correctness at the committed prefix
- that old authority cannot mutate new lineage
### What simulation does not prove
It does not prove:
- real TCP behavior
- real OS scheduling
- disk timing
- the real `WALShipper` integration
- real iSCSI / NVMe frontend behavior
So simulation is not the whole truth.
It is the **algorithm / protocol truth layer**.
## 9. What The Real Test Runner Proves
The real test runner under `learn/projects/sw-block/test/` is the system and product validation layer.
It is not just a QA tool; it is a key part of whether the design can be trusted.
### What it covers
The current runner and surrounding test system already cover:
- unit
- component
- integration
- distributed scenarios
- real hardware workflows
And the environment already includes:
- real nodes
- real block targets
- real fault injection
- benchmark and result capture
- run bundles and scenario traceability
### Why it matters
It helps us judge whether:
- the actual engine runs as designed
- the product is reliable under real restart / failover / rejoin scenarios
- the operator workflows are credible
- benchmark results are valid rather than accidental
So the test runner is best understood as:
- implementation truth
- system truth
- product truth
not just a "test script framework".
## 10. How Simulation And The Test Runner Advance Systematically
The ideal feedback loop is:
1. real failures occur in `V1` / `V1.5`
2. those failures are turned into design requirements
3. which are then distilled into simulator scenarios
4. the simulator closes protocol ambiguity
5. the standalone prototype closes execution ambiguity
6. the real test runner validates system behavior on hardware and in distributed environments
7. new failures or deviations feed back into the design
This produces two complementary kinds of truth:
- `simulation -> algorithm / protocol correctness`
- `test runner -> implementation / system / product correctness`
This layering is healthy because it avoids two common mistakes:
- trusting design derivation alone, without real behavior
- trusting all-green system tests alone, without truly understanding the protocol
## 11. Current Status And Honest Limits
### What is already fairly strong
- `V1.5` recovery is clearly stronger than `V1`, with real running evidence
- `V2` architectural clarity is clearly stronger than `V1.5`
- the simulator already has solid acceptance coverage
- the prototype has begun closing ownership and orchestration risk
- the real test runner is already large enough to support serious system validation
### What is not yet done
- `V2` is not yet a production engine
- the prototype is still in early-to-mid stages
- the historical-data / recovery-boundary prototype is not yet closed
- `V2` steady-state performance is not yet proven in reality
- `V2` has no running validation on real hardware yet
So the most accurate statement is not:
- "V2 is already stronger in production"
It is:
- "V2 is the better long-term architecture, but not yet the stronger deployed engine today"
## 12. Why The Current Direction Is Rational
The current direction is rational because it keeps the right division of labor:
- `V1.5` continues as today's production line
- `V2` continues as the next-generation architecture line
This lets the project:
- keep delivering and hardening on a system that already runs
- validate the stronger architecture seriously without disturbing the production line
- use simulation, the prototype, and the real runner to decide whether `V2` can truly become the next-generation engine
The final strategic rules should stay unchanged:
- keep investigating WAL, because we now have a credible validation framework
- keep pushing V2, because the architectural evidence is already strong
- if the prototype proves V2 has a structural flaw, evolve to `V2.5` first; do not rush to reimplement
## Bottom Line
Choosing by current production evidence:
- choose `V1.5`
Choosing by long-term protocol quality:
- choose `V2`
Asking whether WAL is still worth investigating:
- yes, because the project now has a serious enough validation system to continue responsibly
That is the most reasonable technical and strategic judgment today.

1068
sw-block/design/v2-detailed-algorithm.zh.md
File diff suppressed because it is too large

170
sw-block/design/v2-engine-readiness-review.md

@ -0,0 +1,170 @@
# V2 Engine Readiness Review
Date: 2026-03-29
Status: active
Purpose: record the decision on whether the current V2 design + prototype + simulator stack is strong enough to begin real V2 engine slicing
## Decision
Current judgment:
- proceed to real V2 engine planning
- do not open a `V2.5` redesign track at this time
This is a planning-readiness decision, not a production-readiness claim.
## Why This Review Exists
The project has now completed:
1. design/FSM closure for the V2 line
2. protocol simulation closure for:
- V1 / V1.5 / V2 comparison
- timeout/race behavior
- ownership/session semantics
3. standalone prototype closure for:
- sender/session ownership
- execution authority
- recovery branching
- minimal historical-data proof
- prototype scenario closure
4. `Phase 4.5` hardening for:
- bounded `CatchUp`
- first-class `Rebuild`
- crash-consistency / restart-recoverability
- `A5-A8` stronger evidence
So the question is no longer:
- "can the prototype be made richer?"
The question is:
- "is the evidence now strong enough to begin real engine slicing?"
## Evidence Summary
### 1. Design / Protocol
Primary docs:
- `sw-block/design/v2-acceptance-criteria.md`
- `sw-block/design/v2-open-questions.md`
- `sw-block/design/v2_scenarios.md`
- `sw-block/design/v1-v15-v2-comparison.md`
- `sw-block/design/v2-prototype-roadmap-and-gates.md`
Judgment:
- protocol story is coherent
- acceptance set exists
- major V1 / V1.5 failures are mapped into V2 scenarios
### 2. Simulator
Primary code/tests:
- `sw-block/prototype/distsim/`
- `sw-block/prototype/distsim/eventsim.go`
- `learn/projects/sw-block/test/results/v2-simulation-review.md`
Judgment:
- strong enough for protocol/design validation
- strong enough to challenge crash-consistency and liveness assumptions
- not a substitute for real engine / hardware proof
### 3. Prototype
Primary code/tests:
- `sw-block/prototype/enginev2/`
- `sw-block/prototype/enginev2/acceptance_test.go`
Judgment:
- ownership is explicit and fenced
- execution authority is explicit and fenced
- bounded `CatchUp` is semantic, not documentary
- `Rebuild` is a first-class sender-owned path
- historical-data and recoverability reasoning are executable
### 4. `A5-A8` Double Evidence
Prototype-side grouped evidence:
- `sw-block/prototype/enginev2/acceptance_test.go`
Simulator-side grouped evidence:
- `sw-block/design/a5-a8-traceability.md`
- `sw-block/prototype/distsim/`
Judgment:
- the critical acceptance items that most affect engine risk now have materially stronger proof on both sides
## What Is Good Enough Now
The following are good enough to begin engine slicing:
1. sender/session ownership model
2. stale authority fencing
3. recovery orchestration shape
4. bounded `CatchUp` contract
5. `Rebuild` as formal path
6. committed/recoverable boundary thinking
7. crash-consistency / restart-recoverability proof style
## What Is Still Not Proven
The following still require real engine work and later real-system validation:
1. actual engine lifecycle integration
2. real storage/backend implementation
3. real control-plane integration
4. real durability / fsync behavior under the actual engine
5. real hardware timing / performance
6. final production observability and failure handling
These are expected gaps. They do not block engine planning.
## Open Risks To Carry Forward
These are not blockers, but they should remain explicit:
1. prototype and simulator are still reduced models
2. rebuild-source quality in the real engine will depend on actual checkpoint/base-image mechanics
3. durability truth in the real engine must still be re-proven against actual persistence behavior
4. predicate exploration can still grow, but should not block engine slicing
## Engine-Planning Decision
Decision:
- start real V2 engine planning
Reason:
1. no current evidence points to a structural flaw requiring `V2.5`
2. the remaining gaps are implementation/system gaps, not prototype ambiguity
3. continuing to extend prototype/simulator breadth would have diminishing returns
## Required Outputs After This Review
1. `sw-block/design/v2-engine-slicing-plan.md`
2. first real engine slice definition
3. explicit non-goals for first engine stage
4. explicit validation plan for engine slices
## Non-Goals Of This Review
This review does not claim:
1. V2 is production-ready
2. V2 should replace V1 immediately
3. all design questions are forever closed
It only claims:
- the project now has enough evidence to begin disciplined real engine slicing

191
sw-block/design/v2-engine-slicing-plan.md

@ -0,0 +1,191 @@
# V2 Engine Slicing Plan
Date: 2026-03-29
Status: active
Purpose: define the first real V2 engine slices after prototype and `Phase 4.5` closure
## Goal
Move from:
- standalone design/prototype truth under `sw-block/prototype/`
to:
- a real V2 engine core under `sw-block/`
without dragging V1.5 lifecycle assumptions into the implementation.
## Planning Rules
1. reuse V1 ideas and tests selectively, not structurally
2. prefer narrow vertical slices over broad skeletons
3. each slice must preserve the accepted V2 ownership/fencing model
4. keep simulator/prototype as validation support, not as the implementation itself
5. do not mix V2 engine work into `weed/storage/blockvol/`
## First Engine Stage
The first engine stage should build the control/recovery core, not the full storage engine.
That means:
1. per-replica sender identity
2. one active recovery session per replica per epoch
3. sender-owned execution authority
4. explicit recovery outcomes:
- zero gap
- bounded catch-up
- rebuild
5. rebuild execution shell only
- do not hard-code final snapshot + tail vs full base decision logic yet
- keep real rebuild-source choice tied to Slice 3 recoverability inputs
## Recommended Slice Order
### Slice 1: Engine Ownership Core
Purpose:
- carry the accepted `enginev2` ownership/fencing model into the real engine core
Scope:
1. stable per-replica sender object
2. stable recovery-session object
3. session identity fencing
4. endpoint / epoch invalidation
5. sender-group or equivalent ownership registry
Acceptance:
1. stale session results cannot mutate current authority
2. changed-address and epoch-bump invalidation work in engine code
3. the 4 V2-boundary ownership themes remain provable
### Slice 2: Engine Recovery Execution Core
Purpose:
- move the prototype execution APIs into real engine behavior
Scope:
1. connect / handshake / catch-up flow
2. bounded `CatchUp`
3. explicit `NeedsRebuild`
4. sender-owned rebuild execution path
5. rebuild execution shell without final trusted-base selection policy
Acceptance:
1. bounded catch-up does not chase indefinitely
2. rebuild is exclusive from catch-up
3. session completion rules are explicit and fenced
### Slice 3: Engine Data / Recoverability Core
Purpose:
- connect recovery behavior to real retained-history / checkpoint mechanics
Scope:
1. real recoverability decision inputs
2. trusted-base decision for rebuild source
3. minimal real checkpoint/base-image integration
4. real truncation / safe-boundary handling
This is the first slice that should decide, from real engine inputs, between:
1. `snapshot + tail`
2. `full base`
Acceptance:
1. engine can explain why recovery is allowed
2. rebuild-source choice is explicit and testable
3. historical correctness and truncation rules remain intact
### Slice 4: Engine Integration Closure
Purpose:
- bind engine control/recovery core to real orchestration and validation surfaces
Scope:
1. real assignment/control intent entry path
2. engine-facing observability
3. focused real-engine tests for V2-boundary cases
4. first integration review against real failure classes
Acceptance:
1. key V2-boundary failures are reproduced and closed in engine tests
2. engine observability is good enough to debug ownership/recovery failures
3. remaining gaps are system/performance gaps, not control-model ambiguity
## What To Reuse
Good reuse candidates:
1. tests and failure cases from V1 / V1.5
2. narrow utility/data helpers where not coupled to V1 lifecycle
3. selected WAL/history concepts if they fit V2 ownership boundaries
Do not structurally reuse:
1. V1/V1.5 shipper lifecycle
2. address-based identity assumptions
3. `SetReplicaAddrs`-style behavior
4. old recovery control structure
## Where The Work Should Live
Real V2 engine work should continue under:
- `sw-block/`
Recommended next area:
- `sw-block/core/`
or
- `sw-block/engine/`
Exact path can be chosen later, but it should remain separate from:
- `sw-block/prototype/`
- `weed/storage/blockvol/`
## Validation Plan For Engine Slices
Each engine slice should be validated at three levels:
1. prototype alignment
- does engine behavior preserve the accepted prototype invariant?
2. focused engine tests
- does the real engine slice enforce the same contract?
3. scenario mapping
- does at least one important V1/V1.5 failure class remain closed?
## Non-Goals For First Engine Stage
Do not try to do these immediately:
1. full Smart WAL expansion
2. performance optimization
3. V1 replacement/migration plan
4. full product integration
5. all storage/backend redesign at once
## Immediate Next Assignment
The first concrete engine-planning task should be:
1. choose the real V2 engine module location under `sw-block/`
2. define Slice 1 file/module boundaries
3. write a short engine ownership-core spec
4. map 3-5 acceptance scenarios directly onto Slice 1 expectations

199
sw-block/design/v2-production-roadmap.md

@ -0,0 +1,199 @@
# V2 Production Roadmap
Date: 2026-03-30
Status: active
Purpose: define the path from the accepted V2 engine core to a production candidate
## Current Position
Completed:
1. design / FSM closure
2. simulator / protocol validation
3. prototype closure
4. evidence hardening
5. engine core slices:
- Slice 1 ownership core
- Slice 2 recovery execution core
- Slice 3 data / recoverability core
- Slice 4 integration closure
Current stage:
- entering broader engine implementation
This means the main risk is no longer:
- whether the V2 idea stands up
The main risk is:
- whether the accepted engine core can be turned into a real system without reintroducing V1/V1.5 structure and semantics
## Roadmap Summary
1. Phase 06: broader engine implementation stage
2. Phase 07: real-system integration / product-path decision
3. Phase 08: pre-production hardening
4. Phase 09: performance / scale / soak validation
5. Phase 10: production candidate and rollout gate
## Phase 06
### Goal
Connect the accepted engine core to:
1. real control truth
2. real storage truth
3. explicit engine execution steps
### Outputs
1. control-plane adapter into the engine core
2. storage/base/recoverability adapters
3. explicit execution-driver model where synchronous helpers are no longer sufficient
4. validation against selected real failure classes
### Gate
At the end of Phase 06, the project should be able to say:
- the engine core can live inside a real system shape
## Phase 07
### Goal
Move from engine-local correctness to a real runnable subsystem.
### Outputs
1. service-style runnable engine slice
2. integration with real control and storage surfaces
3. crash/failover/restart integration tests
4. decision on the first viable product path
### Gate
At the end of Phase 07, the project should be able to say:
- the engine can run as a real subsystem, not only as an isolated core
## Phase 08
### Goal
Turn correctness into operational safety.
### Outputs
1. observability hardening
2. operator/debug flows
3. recovery/runbook procedures
4. config surface cleanup
5. realistic durability/restart validation
### Gate
At the end of Phase 08, the project should be able to say:
- operators can run, debug, and recover the system safely
## Phase 09
### Goal
Prove viability under load and over time.
### Outputs
1. throughput / latency baselines
2. rebuild / catch-up cost characterization
3. steady-state overhead measurement
4. soak testing
5. scale and failure-under-load validation
### Gate
At the end of Phase 09, the project should be able to say:
- the design is not only correct, but viable at useful scale and duration
## Phase 10
### Goal
Produce a controlled production candidate.
### Outputs
1. feature-gated production candidate
2. rollback strategy
3. migration/coexistence plan with V1
4. staged rollout plan
5. production acceptance checklist
### Gate
At the end of Phase 10, the project should be able to say:
- the system is ready for a controlled production rollout
## Cross-Phase Rules
### Rule 1: Do not reopen protocol shape casually
The accepted core should remain stable unless new implementation evidence forces a change.
### Rule 2: Use V1 as validation source, not design template
Use:
1. `learn/projects/sw-block/`
2. `weed/storage/block*`
for:
1. failure gates
2. constraints
3. integration references
Do not use them as the default V2 architecture template.
### Rule 3: Keep `CatchUp` narrow
Do not let later implementation phases re-expand `CatchUp` into a broad, optimistic, long-lived recovery mode.
### Rule 4: Keep evidence quality ahead of object growth
New work should preferentially improve:
1. traceability
2. diagnosability
3. real-failure validation
4. operational confidence
not simply add new objects, states, or mechanisms.
## Production Readiness Ladder
The project should move through this ladder explicitly:
1. proof-of-design
2. proof-of-engine-shape
3. proof-of-runnable-engine-stage
4. proof-of-operable-system
5. proof-of-viable-production-candidate
Current ladder position:
- between `2` and `3`
- engine core accepted; broader runnable engine stage underway
## Next Documents To Maintain
1. `sw-block/.private/phase/phase-06.md`
2. `sw-block/design/v2-engine-readiness-review.md`
3. `sw-block/design/v2-engine-slicing-plan.md`
4. this roadmap

561
sw-block/design/v2-protocol-truths.md

@@ -0,0 +1,561 @@
# V2 Protocol Truths
Date: 2026-03-30
Status: active
Purpose: record the compact, stable truths that later phases must preserve, and provide a conformance reference for implementation reviews
## Why This Document Exists
`FSM`, `simulator`, `prototype`, and `engine` are not a code-production pipeline.
They are an evidence ladder.
So the most important output to carry forward is not only code, but:
1. accepted semantics
2. must-hold boundaries
3. failure classes that must stay closed
4. explicit places where later phases may improve or drift
This document is the compact truth table for the V2 line.
## How To Use It
For each later phase or slice, ask:
1. does the new implementation remain aligned with these truths?
2. if not, is the deviation constructive or risky?
3. which truth is newly strengthened by this phase?
Deviation labels:
- `Aligned`: implementation preserves the truth
- `Constructive deviation`: implementation changes shape but strengthens the truth
- `Risky deviation`: implementation weakens or blurs the truth
## Core Truths
### T1. `CommittedLSN` is the external truth boundary
Short form:
- external promises are anchored at `CommittedLSN`, not `HeadLSN`
Meaning:
- recovery targets
- promotion safety
- flush/visibility reasoning
must all be phrased against `CommittedLSN`.
Prevents:
- using optimistic WAL head as committed truth
- acknowledging lineage that failover cannot preserve
Evidence anchor:
- strong in design
- strong in simulator
- strong in prototype
- strong in engine
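As a predicate, the short form is: an external promise at `lsn` is safe only when `lsn <= CommittedLSN`, no matter how far `HeadLSN` has advanced. A minimal Go sketch with illustrative field names (not the engine's real types):

```go
package main

import "fmt"

// Watermarks holds the two log positions the truth boundary separates.
// Field names are illustrative, not the engine's actual types.
type Watermarks struct {
	HeadLSN      uint64 // optimistic WAL head; may not be replicated yet
	CommittedLSN uint64 // replicated, durable external truth boundary
}

// CanAck reports whether an external promise at lsn is safe: it must be
// anchored at or below CommittedLSN, never at the optimistic head.
func CanAck(w Watermarks, lsn uint64) bool {
	return lsn <= w.CommittedLSN
}

func main() {
	w := Watermarks{HeadLSN: 10, CommittedLSN: 7}
	fmt.Println(CanAck(w, 7))  // true: at the committed boundary
	fmt.Println(CanAck(w, 10)) // false: head is not committed truth
}
```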
### T2. `ZeroGap <=> ReplicaFlushedLSN == CommittedLSN`
Short form:
- zero-gap requires exact equality with committed truth
Meaning:
- replica ahead is not zero-gap
- replica behind is not zero-gap
Prevents:
- unsafe fast-path completion
- replica-ahead being mistaken for in-sync
Evidence anchor:
- strong in prototype
- strong in engine
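The equality is deliberately exact; ahead and behind both fail. A one-line sketch (illustrative names):

```go
package main

import "fmt"

// ZeroGap holds only on exact equality with committed truth: a replica
// that is ahead is not zero-gap, and a replica that is behind is not
// zero-gap either.
func ZeroGap(replicaFlushedLSN, committedLSN uint64) bool {
	return replicaFlushedLSN == committedLSN
}

func main() {
	fmt.Println(ZeroGap(7, 7)) // true: exact equality
	fmt.Println(ZeroGap(8, 7)) // false: ahead is not in sync
	fmt.Println(ZeroGap(6, 7)) // false: behind is not in sync
}
```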
### T3. `CatchUp` is bounded replay on a still-trusted base
Short form:
- `CatchUp = KeepUp with bounded debt`
Meaning:
- catch-up is a short-gap, low-cost, bounded replay path
- it only makes sense while the replica base is still trustworthy enough to continue from
Prevents:
- turning catch-up into indefinite moving-head chase
- hiding broad recovery complexity in replay logic
Evidence anchor:
- strong in design
- strong in simulator
- strong in prototype
- strong in engine
### T4. `NeedsRebuild` is explicit when replay is not the right answer
Short form:
- `NeedsRebuild <=> replay is unrecoverable, unstable, or no longer worth bounded replay`
Meaning:
- long-gap
- lost recoverability
- no trusted base
- budget violation
must escalate explicitly.
Prevents:
- pretending catch-up will eventually succeed
- carrying V1/V1.5-style unbounded degraded chase forward
Evidence anchor:
- strong in simulator
- strong in prototype
- strong in engine
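Together, T3 and T4 reduce to one explicit classification: bounded catch-up only while the base is trusted, the replay range is recoverable, and the gap fits the budget; everything else escalates explicitly. A sketch with hypothetical inputs:

```go
package main

import "fmt"

type RecoveryClass string

const (
	ClassCatchUp      RecoveryClass = "CatchUp"
	ClassNeedsRebuild RecoveryClass = "NeedsRebuild"
)

// Classify never pretends catch-up will eventually succeed: any failed
// precondition is an explicit escalation, not a silent retry.
func Classify(gap, budget uint64, baseTrusted, rangeRecoverable bool) RecoveryClass {
	if baseTrusted && rangeRecoverable && gap <= budget {
		return ClassCatchUp
	}
	return ClassNeedsRebuild
}

func main() {
	fmt.Println(Classify(3, 10, true, true))  // CatchUp: short, bounded, trusted
	fmt.Println(Classify(50, 10, true, true)) // NeedsRebuild: budget violation
	fmt.Println(Classify(3, 10, true, false)) // NeedsRebuild: lost recoverability
	fmt.Println(Classify(3, 10, false, true)) // NeedsRebuild: no trusted base
}
```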
### T5. `Rebuild` is the formal primary recovery path
Short form:
- `Rebuild = frozen TargetLSN + trusted base + optional tail + barrier`
Meaning:
- rebuild is not a shameful fallback
- it is the general recovery framework
Prevents:
- overloading catch-up with broad recovery semantics
- treating full/partial rebuild as unrelated protocols
Evidence anchor:
- strong in design
- strong in prototype
- strong in engine
### T6. Full and partial rebuild share one correctness contract
Short form:
- `full rebuild` and `partial rebuild` differ in transfer choice, not in truth model
Meaning:
- both require frozen `TargetLSN`
- both require trusted pinned base
- both require explicit durable completion
Prevents:
- optimization layers redefining protocol truth
- bitmap/range paths bypassing trusted-base rules
Evidence anchor:
- strong in design
- partial in engine
- stronger real-system proof still deferred
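The shared contract can be sketched as a single plan shape in which full and partial rebuild differ only in the base/transfer choice, never in the truth requirements. Fields are illustrative:

```go
package main

import (
	"errors"
	"fmt"
)

type BaseKind string

const (
	FullBase     BaseKind = "full_base"
	SnapshotTail BaseKind = "snapshot_tail"
)

// RebuildPlan carries the shared correctness contract: frozen TargetLSN,
// a trusted pinned base, an optional tail, and an explicit durable
// completion barrier. BaseKind is the only place full and partial differ.
type RebuildPlan struct {
	TargetLSN uint64   // frozen at plan time; never chases a moving head
	Base      BaseKind // trusted pinned base choice
	TailFrom  uint64   // 0 when no tail replay is needed
	Barrier   bool     // completion must be an explicit durable step
}

// Validate enforces the contract regardless of transfer optimization.
func (p RebuildPlan) Validate() error {
	if p.TargetLSN == 0 {
		return errors.New("rebuild requires a frozen TargetLSN")
	}
	if !p.Barrier {
		return errors.New("rebuild requires an explicit durable completion barrier")
	}
	return nil
}

func main() {
	full := RebuildPlan{TargetLSN: 100, Base: FullBase, Barrier: true}
	partial := RebuildPlan{TargetLSN: 100, Base: SnapshotTail, TailFrom: 71, Barrier: true}
	fmt.Println(full.Validate(), partial.Validate()) // both satisfy the same contract
}
```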
### T7. No recovery result may outlive its authority
Short form:
- `ValidMutation <=> sender exists && sessionID matches && epoch current && endpoint current`
Meaning:
- stale session
- stale epoch
- stale endpoint
- stale sender
must all fail closed.
Prevents:
- late results mutating current lineage
- changed-address stale completion bugs
Evidence anchor:
- strong in simulator
- strong in prototype
- strong in engine
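The four-way conjunction fails closed by construction: any stale dimension rejects the mutation. A sketch with hypothetical authority fields:

```go
package main

import "fmt"

// Authority is an illustrative bundle of the four dimensions a late
// recovery result must still hold to mutate current lineage.
type Authority struct {
	SenderKnown bool
	SessionID   string
	Epoch       uint64
	Endpoint    string
}

// ValidMutation fails closed: every dimension must match the current
// view, or the late result is rejected outright.
func ValidMutation(msg, current Authority) bool {
	return msg.SenderKnown &&
		msg.SessionID == current.SessionID &&
		msg.Epoch == current.Epoch &&
		msg.Endpoint == current.Endpoint
}

func main() {
	cur := Authority{SenderKnown: true, SessionID: "s2", Epoch: 5, Endpoint: "10.0.0.2:9000"}
	stale := Authority{SenderKnown: true, SessionID: "s1", Epoch: 4, Endpoint: "10.0.0.1:9000"}
	fmt.Println(ValidMutation(cur, cur))   // true: authority current
	fmt.Println(ValidMutation(stale, cur)) // false: stale session/epoch/endpoint
}
```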
### T8. `ReplicaID` is stable identity; `Endpoint` is mutable location
Short form:
- `ReplicaID != address`
Meaning:
- address changes may invalidate sessions
- address changes must not destroy sender identity
Prevents:
- reintroducing address-shaped identity
- changed-address restarting as logical removal + add
Evidence anchor:
- strong in prototype
- strong in engine
- strong in bridge P0
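A sketch of the boundary shape, loosely following the `ReplicaAddr.ServerID` direction described for the bridge P0 (field names illustrative): identity comes from the registry, never from the address, and a missing identity is an explicit fail-closed error rather than a derived fallback.

```go
package main

import "fmt"

// ReplicaAddr separates stable identity from mutable location.
type ReplicaAddr struct {
	ServerID string // stable registry identity; empty => fail closed
	Endpoint string // mutable location (host:port); may change on restart
}

// ReplicaIDFor derives identity from ServerID only, never from the
// address. A missing identity skips the replica instead of inventing one.
func ReplicaIDFor(a ReplicaAddr) (string, error) {
	if a.ServerID == "" {
		return "", fmt.Errorf("missing ServerID for endpoint %s: replica skipped (fail closed)", a.Endpoint)
	}
	return a.ServerID, nil
}

func main() {
	before := ReplicaAddr{ServerID: "srv-42", Endpoint: "10.0.0.1:9000"}
	after := ReplicaAddr{ServerID: "srv-42", Endpoint: "10.0.0.9:9000"}
	idA, _ := ReplicaIDFor(before)
	idB, _ := ReplicaIDFor(after)
	fmt.Println(idA == idB) // true: address change preserves identity
	_, err := ReplicaIDFor(ReplicaAddr{Endpoint: "10.0.0.5:9000"})
	fmt.Println(err != nil) // true: missing identity fails closed
}
```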
### T9. Truncation is a protocol boundary, not cleanup
Short form:
- replica-ahead cannot complete until divergent tail is explicitly truncated
Meaning:
- truncation is part of recovery contract
- not a side-effect or best-effort cleanup
Prevents:
- completing recovery while replica still contains newer divergent writes
Evidence anchor:
- strong in design
- strong in engine
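A sketch of the completion gate for the replica-ahead case (illustrative names): recovery cannot complete while a divergent tail above committed truth still exists untruncated.

```go
package main

import "fmt"

// CanCompleteRecovery treats truncation as a protocol step: a replica
// holding writes above CommittedLSN may only complete recovery once the
// divergent tail has been explicitly truncated.
func CanCompleteRecovery(replicaHighLSN, committedLSN uint64, tailTruncated bool) bool {
	if replicaHighLSN > committedLSN {
		return tailTruncated
	}
	return true
}

func main() {
	fmt.Println(CanCompleteRecovery(12, 10, false)) // false: divergent tail remains
	fmt.Println(CanCompleteRecovery(12, 10, true))  // true: tail explicitly truncated
	fmt.Println(CanCompleteRecovery(10, 10, false)) // true: nothing divergent to truncate
}
```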
### T10. Recoverability must be proven from real retained history
Short form:
- `CatchUp allowed <=> required replay range is recoverable from retained history`
Meaning:
- the engine should consume storage truth
- not test-reconstructed optimism
Prevents:
- replay on missing WAL
- fake recoverability based only on watermarks
Evidence anchor:
- strong in simulator
- strong in engine
- strengthened in driver/adapter phases
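A sketch of the gate, assuming a hypothetical retained-WAL range as the storage truth input: the required replay range `(replicaLSN, committedLSN]` must lie entirely inside what is actually still on disk.

```go
package main

import "fmt"

// RetainedWAL is the storage truth: the contiguous replay range that
// actually survives checkpoint + WAL GC, not a watermark approximation.
type RetainedWAL struct {
	LowLSN  uint64 // oldest retained entry
	HighLSN uint64 // newest durable entry
}

// CatchUpAllowed requires the full replay range (replicaLSN, committedLSN]
// to be present in retained history; watermark optimism is not enough.
func CatchUpAllowed(replicaLSN, committedLSN uint64, wal RetainedWAL) bool {
	if replicaLSN >= committedLSN {
		return true // nothing to replay
	}
	return replicaLSN+1 >= wal.LowLSN && committedLSN <= wal.HighLSN
}

func main() {
	wal := RetainedWAL{LowLSN: 8, HighLSN: 20}
	fmt.Println(CatchUpAllowed(10, 20, wal)) // true: (10,20] fully retained
	fmt.Println(CatchUpAllowed(3, 20, wal))  // false: entries 4-7 are GC'd
}
```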
### T11. Trusted-base choice must be explicit and causal
Short form:
- `snapshot_tail` requires both trusted checkpoint and replayable tail
Meaning:
- snapshot existence alone is insufficient
- fallback to full-base must be explainable
Prevents:
- over-trusting old checkpoints
- silently choosing an invalid rebuild source
Evidence anchor:
- strong in simulator
- strong in engine
- strengthened by Phase 06
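A sketch of the explicit, explainable choice (hypothetical watermark inputs); note that the fallback to full-base always carries a reason instead of failing silently:

```go
package main

import "fmt"

// ChooseBase picks snapshot_tail only when the checkpoint is trusted AND
// the tail (checkpointLSN, targetLSN] is replayable from retained WAL;
// otherwise it falls back to full_base with an explainable reason.
func ChooseBase(checkpointTrusted bool, checkpointLSN, targetLSN, walLow, walHigh uint64) (base, reason string) {
	if !checkpointTrusted {
		return "full_base", "checkpoint not trusted"
	}
	if targetLSN > checkpointLSN && (walLow > checkpointLSN+1 || walHigh < targetLSN) {
		return "full_base", "tail not replayable from retained WAL"
	}
	return "snapshot_tail", "trusted checkpoint with replayable tail"
}

func main() {
	fmt.Println(ChooseBase(true, 70, 100, 71, 120))  // snapshot_tail
	fmt.Println(ChooseBase(true, 70, 100, 90, 120))  // full_base: 71-89 GC'd
	fmt.Println(ChooseBase(false, 70, 100, 71, 120)) // full_base: untrusted checkpoint
}
```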
### T12. Current extent cannot fake old history
Short form:
- historical correctness requires reconstructable history, not current-state approximation
Meaning:
- live extent state is not sufficient proof of an old target point
- historical reconstruction must be justified by checkpoint + retained history
Prevents:
- using current extent as fake proof of older state
Evidence anchor:
- strongest in simulator
- engine currently proves prerequisites, not full reconstruction proof
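This mirrors what the simulator's `CanReconstructAt` enforces; a sketch under assumed watermark names, where a pre-checkpoint point is simply not provable once its WAL history is gone:

```go
package main

import "fmt"

// CanReconstructAt: an old target point is provable only from
// checkpoint + retained WAL; current live extent state is not admissible
// as proof of an older target.
func CanReconstructAt(target, checkpointLSN, walLow, walHigh uint64) bool {
	switch {
	case target == checkpointLSN:
		return true // checkpoint covers the point exactly
	case target > checkpointLSN:
		// needs the full tail (checkpointLSN, target] in retained WAL
		return walLow <= checkpointLSN+1 && target <= walHigh
	default:
		return false // pre-checkpoint history has been GC'd
	}
}

func main() {
	// Checkpoint at 7, WAL retains 8-10: matches the A7 test scenario.
	fmt.Println(CanReconstructAt(7, 7, 8, 10))  // true: checkpoint covers it
	fmt.Println(CanReconstructAt(10, 7, 8, 10)) // true: checkpoint + WAL 8-10
	fmt.Println(CanReconstructAt(3, 7, 8, 10))  // false: history lost to GC
}
```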
### T13. Promotion requires recoverable committed prefix
Short form:
- promoted replica must be able to recover committed truth, not merely advertise a high watermark
Meaning:
- candidate selection is about recoverable lineage, not optimistic flush visibility
Prevents:
- promoting a replica that cannot reconstruct committed prefix after crash/restart
Evidence anchor:
- strong in simulator
- partially carried into engine semantics
- real-system validation still needed
### T14. `blockvol` executes I/O; engine owns recovery policy
Short form:
- adapters may translate engine decisions into concrete work
- they must not silently re-decide recovery classification or source choice
Meaning:
- master remains control authority
- engine remains recovery authority
- storage remains truth source
Prevents:
- V1/V1.5 policy leakage back into service glue
Evidence anchor:
- strong in Phase 07 service-slice planning
- initial bridge P0 aligns
- real-system proof still pending
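The boundary can be sketched as a one-way contract: the engine emits a decision, the adapter executes it verbatim. Types here are illustrative of the split, not real APIs:

```go
package main

import "fmt"

// Decision is produced by the engine; adapters never rewrite it.
type Decision struct {
	Class     string // "catch_up" or "rebuild": classified by the engine only
	TargetLSN uint64
}

// Executor is the adapter-side contract: perform I/O for a decision,
// report success or failure, never reclassify or re-choose the source.
type Executor interface {
	Execute(d Decision) error
}

// recordingExecutor is a stub that carries out the decision it was
// handed, unchanged; a real blockvol adapter would do the transfer here.
type recordingExecutor struct{ last Decision }

func (e *recordingExecutor) Execute(d Decision) error {
	e.last = d // record verbatim: no policy re-decision in the adapter
	return nil
}

func main() {
	ex := &recordingExecutor{}
	d := Decision{Class: "rebuild", TargetLSN: 100}
	_ = ex.Execute(d)
	fmt.Println(ex.last == d) // true: adapter carried the decision unchanged
}
```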
### T15. Reuse reality, not inherited semantics
Short form:
- V2 may reuse existing Seaweed control/runtime/storage paths
- it must not inherit old semantics as protocol truth
Meaning:
- reuse existing heartbeat, assignment, `blockvol`, receiver, shipper, retention, and runtime machinery when useful
- keep `ReplicaID`, epoch authority, recovery classification, committed truth, and rebuild boundaries anchored in accepted V2 semantics
Prevents:
- V1/V1.5 structure silently redefining V2 behavior
- convenience reuse turning old runtime assumptions into new protocol truth
Evidence anchor:
- strong in Phase 07/08 direction
- should remain active in later implementation phases
## Current Strongest Evidence By Layer
| Layer | Main value |
|------|------------|
| `FSM` / design | define truth and non-goals |
| simulator | prove protocol truth and failure-class closure cheaply |
| prototype | prove implementation-shape and authority semantics cheaply |
| engine | prove the accepted contracts survive real implementation structure |
| service slice / runner | prove truth survives real control/storage/system reality |
## Phase Conformance Notes
### Phase 04
- `Aligned`: T7, T8
- strengthened sender/session ownership and stale rejection
### Phase 4.5
- `Aligned`: T3, T4, T5, T10, T12
- major tightening:
- bounded catch-up
- first-class rebuild
- crash-consistency and recoverability proof style
### Phase 05
- `Aligned`: T1, T2, T3, T4, T5, T7, T8, T9, T10, T11
- engine core slices closed:
- ownership
- execution
- recoverability gating
- orchestrated entry path
### Phase 06
- `Aligned`: T10, T11, T14
- `Constructive deviation`: planner/executor split replaced convenience wrappers without changing protocol truth
- strengthened:
- real storage/resource contracts
- explicit release symmetry
- failure-class validation against engine path
### Phase 07 P0
- `Aligned`: T8, T10, T14
- bridge now makes stable `ReplicaID` explicit at service boundary
- bridge states the hard rule that engine decides policy and `blockvol` executes I/O
- real `weed/storage/blockvol/` integration still pending
## Current Carry-Forward Truths For Later Phases
Later phases must not regress these:
1. `CommittedLSN` remains the external truth boundary
2. `CatchUp` stays narrow and bounded
3. `Rebuild` remains the formal primary recovery path
4. stale authority must fail closed
5. stable identity must remain separate from mutable endpoint
6. trusted-base choice must remain explicit and causal
7. service glue must not silently re-decide recovery policy
8. reuse reality, but do not inherit old semantics as V2 truth
## Review Rule
Every later phase or slice should explicitly answer:
1. which truths are exercised?
2. which truths are strengthened?
3. does this phase introduce any constructive or risky deviation?
4. which evidence layer now carries the truth most strongly?
## Phase Alignment Rule
From `Phase 05` onward, every phase or slice should align explicitly against this document.
Minimum phase-alignment questions:
1. which truths are in scope?
2. which truths are strengthened?
3. which truths are merely carried forward?
4. does the phase introduce any constructive deviation?
5. does the phase introduce any risky deviation?
6. which evidence layer currently carries each in-scope truth most strongly?
Expected output shape for each later phase:
- `In-scope truths`
- `Strengthened truths`
- `Carry-forward truths`
- `Constructive deviations`
- `Risky deviations`
- `Evidence shift`
## Phase 05-07 Alignment
### Phase 05
Primary alignment focus:
- T1 `CommittedLSN` as external truth boundary
- T2 zero-gap exactness
- T3 bounded `CatchUp`
- T4 explicit `NeedsRebuild`
- T5/T6 rebuild correctness contract
- T7 stale authority must fail closed
- T8 stable `ReplicaID`
- T9 truncation as protocol boundary
- T10/T11 recoverability and trusted-base gating
Main strengthening:
- engine core adopted accepted protocol truths as real implementation structure
Main review risk:
- engine structure accidentally collapsing back to address identity or unfenced execution
### Phase 06
Primary alignment focus:
- T10 recoverability from real retained history
- T11 trusted-base choice remains explicit and causal
- T14 engine owns policy, adapters carry truth and execution contracts
Main strengthening:
- planner/executor/resource contracts
- fail-closed cleanup symmetry
- cross-layer proof path through engine execution
Main review risk:
- executor or adapters recomputing policy from convenience inputs
- storage/resource contracts becoming approximate instead of real
### Phase 07+
Primary alignment focus:
- T8 stable identity at the real service boundary
- T10 real storage truth into engine decisions
- T11 trusted-base proof remains explicit through service glue
- T14 `blockvol` executes I/O but does not own recovery policy
Main strengthening:
- real-system service-slice conformance
- real control-plane and storage-plane integration
- diagnosable failure replay through the integrated path
Main review risk:
- V1/V1.5 semantics leaking back in through service glue
- address-shaped identity reappearing at the boundary
- blockvol-side code silently re-deciding recovery policy
## Future Feature Rule
When a later feature expands the protocol surface (for example `SmartWAL` or a new rebuild optimization), the order should be:
1. `FSM / design`
- define the new semantics and non-goals
2. `Truth update`
- either attach the feature to an existing truth
- or add a new protocol truth if the feature creates a new long-lived invariant
3. `Phase alignment`
- define which later phases strengthen or validate that truth
4. `Evidence ladder`
- simulator, prototype, engine, service slice as needed
Do not start feature implementation by editing engine or service glue first and only later trying to explain what truth changed.
## Feature Review Rule
For any future feature, later reviews should ask:
1. did the feature create a new truth or just strengthen an existing one?
2. which phase first validates it?
3. which evidence layer proves it most strongly today?
4. does the feature weaken any existing truth?
This keeps feature growth aligned with protocol truth instead of letting implementation convenience define semantics.

13
sw-block/prototype/distsim/cluster.go

@@ -1066,9 +1066,10 @@ type CandidateEligibility struct {
}
// EvaluateCandidateEligibility checks all promotion prerequisites for a node.
// A candidate must have the full committed prefix (FlushedLSN >= CommittedLSN)
// to be eligible. Promoting a replica that is missing committed data would
// lose acknowledged writes.
// Phase 4.5: uses RecoverableLSN (not just FlushedLSN) to verify that the
// candidate can actually recover the committed prefix after a crash+restart,
// not just that it received durable WAL entries. RecoverableLSN accounts for
// checkpoint + WAL replay availability.
func (c *Cluster) EvaluateCandidateEligibility(candidateID string) CandidateEligibility {
n := c.Nodes[candidateID]
if n == nil {
@@ -1084,7 +1085,11 @@ func (c *Cluster) EvaluateCandidateEligibility(candidateID string) CandidateElig
if n.ReplicaState == NodeStateNeedsRebuild || n.ReplicaState == NodeStateRebuilding {
reasons = append(reasons, "state_ineligible")
}
if n.Storage.FlushedLSN < c.Coordinator.CommittedLSN {
// Phase 4.5: check recoverable committed prefix, not just durable watermark.
// RecoverableLSN = the highest LSN that would survive crash + restart.
// This is stronger than FlushedLSN when checkpoint + WAL GC may have
// created gaps in the replay path.
if n.Storage.RecoverableLSN() < c.Coordinator.CommittedLSN {
reasons = append(reasons, "insufficient_committed_prefix")
}
return CandidateEligibility{

2
sw-block/prototype/distsim/cluster_test.go

@@ -166,7 +166,7 @@ func TestZombieOldPrimaryWritesAreFenced(t *testing.T) {
if c.Coordinator.CommittedLSN != 1 {
t.Fatalf("stale message changed committed lsn: got=%d", c.Coordinator.CommittedLSN)
}
if got := c.Nodes["r1"].Storage.Extent[42]; got != 0 {
if got := c.Nodes["r1"].Storage.LiveExtent[42]; got != 0 {
t.Fatalf("stale message mutated new primary extent: block42=%d", got)
}
}

6
sw-block/prototype/distsim/phase02_candidate_test.go

@@ -353,6 +353,7 @@ func TestP02_CandidateEligibility_InsufficientCommittedPrefix(t *testing.T) {
// Manually set r1 behind committed prefix.
c.Nodes["r1"].Storage.FlushedLSN = 0
c.Nodes["r1"].Storage.WALDurableLSN = 0
e = c.EvaluateCandidateEligibility("r1")
if e.Eligible {
t.Fatal("FlushedLSN=0 with CommittedLSN=1 should not be eligible")
@@ -379,14 +380,19 @@ func TestP02_CandidateEligibility_InSyncButLagging_Rejected(t *testing.T) {
// r1: InSync, correct epoch, but FlushedLSN=1. Ineligible.
c.Nodes["r1"].ReplicaState = NodeStateInSync
c.Nodes["r1"].Storage.FlushedLSN = 1
c.Nodes["r1"].Storage.WALDurableLSN = 1
// r2: CatchingUp, correct epoch, FlushedLSN=100. Eligible.
c.Nodes["r2"].ReplicaState = NodeStateCatchingUp
c.Nodes["r2"].Storage.FlushedLSN = 100
c.Nodes["r2"].Storage.WALDurableLSN = 100
c.Nodes["r2"].Storage.CheckpointLSN = 100
// r3: InSync, correct epoch, FlushedLSN=100. Eligible.
c.Nodes["r3"].ReplicaState = NodeStateInSync
c.Nodes["r3"].Storage.FlushedLSN = 100
c.Nodes["r3"].Storage.WALDurableLSN = 100
c.Nodes["r3"].Storage.CheckpointLSN = 100
// r1 is ineligible despite being InSync.
e1 := c.EvaluateCandidateEligibility("r1")

219
sw-block/prototype/distsim/phase045_adversarial_test.go

@@ -0,0 +1,219 @@
package distsim
import (
"math/rand"
"testing"
)
// Phase 4.5: Adversarial predicate search.
// These tests run randomized/semi-randomized scenarios and check danger
// predicates after each step. The goal is to find protocol violations
// that handwritten scenarios might miss.
// TestAdversarial_RandomWritesAndCrashes runs random write + crash + restart
// sequences and checks all danger predicates after each step.
func TestAdversarial_RandomWritesAndCrashes(t *testing.T) {
rng := rand.New(rand.NewSource(42))
for trial := 0; trial < 50; trial++ {
c := NewCluster(CommitSyncAll, "p", "r1", "r2")
// Random sequence of operations.
for step := 0; step < 30; step++ {
op := rng.Intn(10)
switch {
case op < 5:
// Write.
block := uint64(rng.Intn(10) + 1)
c.CommitWrite(block)
case op < 7:
// Tick (advance time, deliver messages).
c.Tick()
case op < 8:
// Crash a random node.
nodes := []string{"p", "r1", "r2"}
target := nodes[rng.Intn(3)]
node := c.Nodes[target]
if node.Running {
node.Storage.Crash()
node.Running = false
node.ReplicaState = NodeStateLagging
}
case op < 9:
// Restart a crashed node.
nodes := []string{"p", "r1", "r2"}
target := nodes[rng.Intn(3)]
node := c.Nodes[target]
if !node.Running {
node.Storage.Restart()
node.Running = true
node.ReplicaState = NodeStateLagging // needs catch-up
}
default:
// Flusher tick on all running nodes.
for _, node := range c.Nodes {
if node.Running {
node.Storage.ApplyToExtent(node.Storage.WALDurableLSN)
node.Storage.AdvanceCheckpoint(node.Storage.WALDurableLSN)
}
}
}
// Check predicates after every step.
violations := CheckAllPredicates(c)
if len(violations) > 0 {
for name, detail := range violations {
t.Errorf("trial %d step %d: PREDICATE VIOLATED [%s]: %s", trial, step, name, detail)
}
t.FailNow()
}
}
}
}
// TestAdversarial_FailoverChainWithPredicates runs a sequence of
// failovers (promote, crash, promote) and checks predicates.
func TestAdversarial_FailoverChainWithPredicates(t *testing.T) {
c := NewCluster(CommitSyncAll, "p", "r1", "r2")
// Write some data and commit.
for i := 0; i < 5; i++ {
c.CommitWrite(uint64(i + 1))
}
c.TickN(5)
check := func(label string) {
violations := CheckAllPredicates(c)
for name, detail := range violations {
t.Fatalf("%s: PREDICATE VIOLATED [%s]: %s", label, name, detail)
}
}
check("after initial writes")
// Kill primary.
c.Nodes["p"].Running = false
c.Nodes["p"].Storage.Crash()
// Promote r1.
c.Promote("r1")
c.TickN(3)
check("after first promotion")
// Write more under new primary.
for i := 0; i < 3; i++ {
c.CommitWrite(uint64(i + 10))
}
c.TickN(5)
check("after writes on new primary")
// Kill new primary.
c.Nodes["r1"].Running = false
c.Nodes["r1"].Storage.Crash()
// Promote r2.
c.Promote("r2")
c.TickN(3)
check("after second promotion")
// Write more under third primary.
c.CommitWrite(99)
c.TickN(5)
check("after writes on third primary")
}
// TestAdversarial_CatchUpUnderLoad runs catch-up while the primary keeps
// writing, then checks predicates for livelock.
func TestAdversarial_CatchUpUnderLoad(t *testing.T) {
c := NewCluster(CommitSyncAll, "p", "r1")
// Write initial data.
for i := 0; i < 10; i++ {
c.CommitWrite(uint64(i + 1))
}
c.TickN(5)
// Disconnect r1.
c.Nodes["r1"].Running = false
// Write more while r1 is down.
for i := 0; i < 20; i++ {
c.CommitWrite(uint64(i + 100))
c.Tick()
}
// Reconnect r1 — needs catch-up.
c.Nodes["r1"].Running = true
c.Nodes["r1"].ReplicaState = NodeStateLagging
// Attempt catch-up while primary keeps writing.
for step := 0; step < 20; step++ {
// Primary writes more.
c.CommitWrite(uint64(step + 200))
c.Tick()
// Attempt catch-up progress.
c.CatchUpWithEscalation("r1", 5)
// Check predicates.
violations := CheckAllPredicates(c)
for name, detail := range violations {
t.Fatalf("step %d: PREDICATE VIOLATED [%s]: %s", step, name, detail)
}
}
// After the loop, r1 should be either InSync or NeedsRebuild.
state := c.Nodes["r1"].ReplicaState
if state != NodeStateInSync && state != NodeStateNeedsRebuild {
t.Fatalf("r1 should be InSync or NeedsRebuild after catch-up under load, got %s", state)
}
}
// TestAdversarial_CheckpointGCThenCrash runs checkpoint + WAL GC + crash
// sequences and verifies acked data is never lost.
func TestAdversarial_CheckpointGCThenCrash(t *testing.T) {
rng := rand.New(rand.NewSource(99))
for trial := 0; trial < 30; trial++ {
c := NewCluster(CommitSyncAll, "p", "r1")
// Write and commit data.
for i := 0; i < 15; i++ {
c.CommitWrite(uint64(rng.Intn(20) + 1))
}
c.TickN(10)
// Flusher + checkpoint at various points.
for _, node := range c.Nodes {
if node.Running {
flushTo := node.Storage.WALDurableLSN
node.Storage.ApplyToExtent(flushTo)
// Checkpoint at a random point up to flush.
cpLSN := uint64(rng.Int63n(int64(flushTo+1)))
node.Storage.AdvanceCheckpoint(cpLSN)
// GC WAL entries before checkpoint.
retained := make([]Write, 0)
for _, w := range node.Storage.WAL {
if w.LSN > node.Storage.CheckpointLSN {
retained = append(retained, w)
}
}
node.Storage.WAL = retained
}
}
// Crash primary.
primary := c.Primary()
if primary != nil {
primary.Storage.Crash()
primary.Storage.Restart()
}
// Check predicates — committed data must still be recoverable.
violations := CheckAllPredicates(c)
for name, detail := range violations {
t.Errorf("trial %d: PREDICATE VIOLATED [%s]: %s", trial, name, detail)
}
}
}

334
sw-block/prototype/distsim/phase045_crash_test.go

@@ -0,0 +1,334 @@
package distsim
import (
"testing"
)
// Phase 4.5: Crash-consistency and recoverability tests.
// These validate invariants I1-I5 from the crash-consistency simulation plan.
// --- Invariant I1: ACK'd flush is recoverable after any crash ---
func TestI1_AckedFlush_RecoverableAfterPrimaryCrash(t *testing.T) {
c := NewCluster(CommitSyncAll, "p", "r")
// Write 3 entries and commit (sync_all = durable on both nodes).
for i := 0; i < 3; i++ {
c.CommitWrite(uint64(i + 1))
}
c.Tick()
c.Tick()
c.Tick()
if c.Coordinator.CommittedLSN < 3 {
t.Fatalf("expected CommittedLSN>=3, got %d", c.Coordinator.CommittedLSN)
}
committedLSN := c.Coordinator.CommittedLSN
// Crash the primary.
primary := c.Nodes["p"]
primary.Storage.Crash()
// Restart: recover from checkpoint + durable WAL.
recoveredLSN := primary.Storage.Restart()
// I1: all committed data must be recoverable.
if recoveredLSN < committedLSN {
t.Fatalf("I1 VIOLATED: recoveredLSN=%d < committedLSN=%d — acked data lost",
recoveredLSN, committedLSN)
}
// Verify data correctness against reference.
refState := c.Reference.StateAt(committedLSN)
recState := primary.Storage.StateAt(committedLSN)
for block, expected := range refState {
if got := recState[block]; got != expected {
t.Fatalf("I1 VIOLATED: block %d: reference=%d recovered=%d", block, expected, got)
}
}
}
// --- Invariant I2: No ghost visible state after crash ---
func TestI2_ExtentAheadOfCheckpoint_CrashRestart(t *testing.T) {
s := NewStorage()
// Write 5 entries to WAL.
for i := uint64(1); i <= 5; i++ {
s.AppendWrite(Write{Block: 10 + i, Value: i * 100, LSN: i})
}
// Make all 5 durable.
s.AdvanceFlush(5)
// Flusher materializes entries 1-3 to live extent.
s.ApplyToExtent(3)
// Checkpoint at LSN 1 only.
s.AdvanceCheckpoint(1)
// Crash.
s.Crash()
if s.LiveExtent != nil {
t.Fatal("after crash, LiveExtent should be nil")
}
// Restart.
recoveredLSN := s.Restart()
if recoveredLSN != 5 {
t.Fatalf("recoveredLSN should be 5, got %d", recoveredLSN)
}
// I2: all durable data recovered from checkpoint + WAL replay.
for i := uint64(1); i <= 5; i++ {
block := 10 + i
expected := i * 100
if got := s.LiveExtent[block]; got != expected {
t.Fatalf("I2: block %d: expected %d, got %d", block, expected, got)
}
}
}
func TestI2_UnackedData_LostAfterCrash(t *testing.T) {
s := NewStorage()
for i := uint64(1); i <= 5; i++ {
s.AppendWrite(Write{Block: i, Value: i * 10, LSN: i})
}
// Only fsync 1-3. Entries 4-5 are NOT durable.
s.AdvanceFlush(3)
s.ApplyToExtent(5) // should clamp to 3
s.AdvanceCheckpoint(3)
if s.ExtentAppliedLSN != 3 {
t.Fatalf("ApplyToExtent should clamp to WALDurableLSN=3, got %d", s.ExtentAppliedLSN)
}
s.Crash()
s.Restart()
// Blocks 4,5 must NOT be in recovered extent.
if val, ok := s.LiveExtent[4]; ok && val != 0 {
t.Fatalf("I2 VIOLATED: block 4=%d survived crash — unfsynced data", val)
}
if val, ok := s.LiveExtent[5]; ok && val != 0 {
t.Fatalf("I2 VIOLATED: block 5=%d survived crash — unfsynced data", val)
}
// Blocks 1-3 must be there.
for i := uint64(1); i <= 3; i++ {
if got := s.LiveExtent[i]; got != i*10 {
t.Fatalf("block %d: expected %d, got %d", i, i*10, got)
}
}
}
// --- Invariant I3: CatchUp converges or escalates ---
func TestI3_CatchUpConvergesOrEscalates(t *testing.T) {
c := NewCluster(CommitSyncAll, "p", "r")
// Commit initial entry.
c.CommitWrite(1)
c.Tick()
c.Tick()
// Disconnect replica and write more.
c.Nodes["r"].Running = false
for i := uint64(2); i <= 10; i++ {
c.CommitWrite(i)
c.Tick()
}
// Reconnect.
c.Nodes["r"].Running = true
c.Nodes["r"].ReplicaState = NodeStateLagging
// Catch-up with escalation.
converged := c.CatchUpWithEscalation("r", 3)
// I3: must resolve — either converged or escalated to NeedsRebuild.
state := c.Nodes["r"].ReplicaState
if !converged && state != NodeStateNeedsRebuild {
t.Fatalf("I3 VIOLATED: catchup did not converge and state=%s (not NeedsRebuild)", state)
}
}
// --- Invariant I4: Promoted replica has committed prefix ---
func TestI4_PromotedReplica_HasCommittedPrefix(t *testing.T) {
c := NewCluster(CommitSyncAll, "p", "r")
for i := uint64(1); i <= 5; i++ {
c.CommitWrite(i)
}
c.Tick()
c.Tick()
c.Tick()
committedLSN := c.Coordinator.CommittedLSN
if committedLSN < 5 {
t.Fatalf("expected CommittedLSN>=5, got %d", committedLSN)
}
// Promote replica.
if err := c.Promote("r"); err != nil {
t.Fatalf("promote: %v", err)
}
// I4: new primary must have recoverable committed prefix.
newPrimary := c.Nodes["r"]
recoverableLSN := newPrimary.Storage.RecoverableLSN()
if recoverableLSN < committedLSN {
t.Fatalf("I4 VIOLATED: promoted recoverableLSN=%d < committedLSN=%d",
recoverableLSN, committedLSN)
}
// Verify data matches reference.
refState := c.Reference.StateAt(committedLSN)
recState := newPrimary.Storage.StateAt(committedLSN)
for block, expected := range refState {
if got := recState[block]; got != expected {
t.Fatalf("I4 VIOLATED: block %d: ref=%d got=%d", block, expected, got)
}
}
}
// --- Direct test: checkpoint must not leak applied-but-uncheckpointed state ---
func TestI2_CheckpointDoesNotLeakAppliedState(t *testing.T) {
s := NewStorage()
// Write 5 entries, all durable.
for i := uint64(1); i <= 5; i++ {
s.AppendWrite(Write{Block: i, Value: i * 10, LSN: i})
}
s.AdvanceFlush(5)
// Flusher applies all 5 to LiveExtent.
s.ApplyToExtent(5)
// But checkpoint only at LSN 2.
s.AdvanceCheckpoint(2)
// CheckpointExtent must contain ONLY blocks 1-2, not 3-5.
for i := uint64(3); i <= 5; i++ {
if val, ok := s.CheckpointExtent[i]; ok && val != 0 {
t.Fatalf("CHECKPOINT LEAK: block %d=%d in checkpoint but CheckpointLSN=2", i, val)
}
}
// Blocks 1-2 must be in checkpoint.
for i := uint64(1); i <= 2; i++ {
expected := i * 10
if got := s.CheckpointExtent[i]; got != expected {
t.Fatalf("block %d: checkpoint should have %d, got %d", i, expected, got)
}
}
// Now crash: LiveExtent lost, entries 3-5 only in WAL.
s.Crash()
recoveredLSN := s.Restart()
if recoveredLSN != 5 {
t.Fatalf("recoveredLSN should be 5, got %d", recoveredLSN)
}
// All 5 blocks must be recovered: 1-2 from checkpoint, 3-5 from WAL replay.
for i := uint64(1); i <= 5; i++ {
expected := i * 10
if got := s.LiveExtent[i]; got != expected {
t.Fatalf("block %d: expected %d after crash+restart, got %d", i, expected, got)
}
}
}
// --- A7: Historical state before checkpoint is not fakeable ---
func TestA7_HistoricalState_NotReconstructableAfterGC(t *testing.T) {
s := NewStorage()
// Write 10 entries, all durable.
for i := uint64(1); i <= 10; i++ {
s.AppendWrite(Write{Block: i, Value: i * 10, LSN: i})
}
s.AdvanceFlush(10)
s.ApplyToExtent(10)
// Checkpoint at LSN 7.
s.AdvanceCheckpoint(7)
// GC WAL entries before checkpoint.
retained := make([]Write, 0)
for _, w := range s.WAL {
if w.LSN > s.CheckpointLSN {
retained = append(retained, w)
}
}
s.WAL = retained
// Can reconstruct at LSN 7 (checkpoint covers it).
if !s.CanReconstructAt(7) {
t.Fatal("should be reconstructable at checkpoint LSN")
}
// Can reconstruct at LSN 10 (checkpoint + WAL 8-10).
if !s.CanReconstructAt(10) {
t.Fatal("should be reconstructable at LSN 10 (checkpoint + WAL)")
}
// CANNOT accurately reconstruct at LSN 3 (WAL 1-6 has been GC'd).
// The state at LSN 3 required WAL entries 1-3 which are gone.
if s.CanReconstructAt(3) {
t.Fatal("A7: should NOT be reconstructable at LSN 3 after WAL GC — history is lost")
}
// StateAt(3) returns checkpoint state (best-effort approximation, not exact).
// This is fine for display but must NOT be treated as authoritative.
state3 := s.StateAt(3)
// The returned state includes blocks 1-7 (from checkpoint), which is MORE
// than what was actually committed at LSN 3. This is the "current extent
// cannot fake old history" problem from A7.
if len(state3) == 3 {
t.Fatal("StateAt(3) after GC should return checkpoint state (7 blocks), not exact 3-block state")
}
}
// --- Invariant I5: Checkpoint GC preserves recovery proof ---
func TestI5_CheckpointGC_PreservesAckedBoundary(t *testing.T) {
s := NewStorage()
for i := uint64(1); i <= 10; i++ {
s.AppendWrite(Write{Block: i, Value: i * 10, LSN: i})
}
s.AdvanceFlush(10)
s.ApplyToExtent(7)
s.AdvanceCheckpoint(7)
// GC: remove WAL entries before checkpoint.
retained := make([]Write, 0)
for _, w := range s.WAL {
if w.LSN > s.CheckpointLSN {
retained = append(retained, w)
}
}
s.WAL = retained
// Crash + restart.
s.Crash()
recoveredLSN := s.Restart()
if recoveredLSN != 10 {
t.Fatalf("I5: recoveredLSN should be 10, got %d", recoveredLSN)
}
// All 10 blocks recoverable: 1-7 from checkpoint, 8-10 from WAL.
for i := uint64(1); i <= 10; i++ {
expected := i * 10
if got := s.LiveExtent[i]; got != expected {
t.Fatalf("I5 VIOLATED: block %d: expected %d, got %d", i, expected, got)
}
}
}
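The A7 and I5 tests above pivot on one property: after pre-checkpoint WAL entries are GC'd, exact reconstruction is only possible at or after the checkpoint LSN (ignoring explicit snapshots, which the real `CanReconstructAt` also consults). A minimal standalone sketch of that boundary check — a hypothetical condensation, not the prototype's implementation:

```go
package main

import "fmt"

// canReconstructAt: with a checkpoint at checkpointLSN and contiguous durable
// WAL retained for (checkpointLSN, walHead], exact state is reconstructable
// only for LSNs in [checkpointLSN, walHead]. Earlier history left with the
// GC'd WAL. (Toy version; the real check also accepts snapshots taken at lsn.)
func canReconstructAt(lsn, checkpointLSN, walHead uint64) bool {
	if lsn == 0 {
		return true // empty state is always reconstructable
	}
	return lsn >= checkpointLSN && lsn <= walHead
}

func main() {
	// Checkpoint at 7, durable WAL retained for (7, 10] — mirrors TestA7.
	fmt.Println(canReconstructAt(7, 7, 10))  // true: checkpoint is exact at 7
	fmt.Println(canReconstructAt(10, 7, 10)) // true: checkpoint + WAL replay
	fmt.Println(canReconstructAt(3, 7, 10))  // false: WAL 1-6 was GC'd
}
```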

160
sw-block/prototype/distsim/predicates.go

@@ -0,0 +1,160 @@
package distsim
import "fmt"
// DangerPredicate checks for a protocol-violating or dangerous state.
// Returns (violated bool, detail string).
type DangerPredicate func(c *Cluster) (bool, string)
// PredicateAckedFlushLost checks if any committed (ACK'd) write has become
// unrecoverable on ANY node that is supposed to have it.
// This is the most dangerous protocol violation: data loss after ACK.
func PredicateAckedFlushLost(c *Cluster) (bool, string) {
committedLSN := c.Coordinator.CommittedLSN
if committedLSN == 0 {
return false, ""
}
refState := c.Reference.StateAt(committedLSN)
// Check primary.
primary := c.Primary()
if primary != nil && primary.Running {
recLSN := primary.Storage.RecoverableLSN()
if recLSN < committedLSN {
return true, fmt.Sprintf("primary %s: recoverableLSN=%d < committedLSN=%d",
primary.ID, recLSN, committedLSN)
}
// Verify committed state correctness using StateAt (not LiveExtent).
// LiveExtent may contain uncommitted-but-durable writes beyond committedLSN.
// Only check if we can reconstruct the exact committed state.
if primary.Storage.CanReconstructAt(committedLSN) {
nodeState := primary.Storage.StateAt(committedLSN)
for block, expected := range refState {
if got := nodeState[block]; got != expected {
return true, fmt.Sprintf("primary %s: block %d = %d, reference = %d at committedLSN=%d",
primary.ID, block, got, expected, committedLSN)
}
}
}
}
// Check replicas that should have committed data (InSync replicas).
for _, node := range c.Nodes {
if node.ID == c.Coordinator.PrimaryID {
continue
}
if !node.Running || node.ReplicaState != NodeStateInSync {
continue
}
recLSN := node.Storage.RecoverableLSN()
if recLSN < committedLSN {
return true, fmt.Sprintf("InSync replica %s: recoverableLSN=%d < committedLSN=%d",
node.ID, recLSN, committedLSN)
}
}
return false, ""
}
// PredicateVisibleUnrecoverableState checks if any running node has extent
// state that would NOT survive a crash+restart. This detects ghost visible
// state — data that is readable now but would be lost on crash.
func PredicateVisibleUnrecoverableState(c *Cluster) (bool, string) {
for _, node := range c.Nodes {
if !node.Running || node.Storage.LiveExtent == nil {
continue
}
// Simulate what would happen on crash+restart.
recoverableLSN := node.Storage.RecoverableLSN()
// Check each block in LiveExtent: is its value backed by
// a write at LSN <= recoverableLSN?
for block, value := range node.Storage.LiveExtent {
// Find which LSN wrote this value.
writtenAtLSN := uint64(0)
for _, w := range node.Storage.WAL {
if w.Block == block && w.Value == value {
writtenAtLSN = w.LSN
}
}
if writtenAtLSN > recoverableLSN {
return true, fmt.Sprintf("node %s: block %d has value %d (written at LSN %d) but recoverableLSN=%d — ghost state",
node.ID, block, value, writtenAtLSN, recoverableLSN)
}
}
}
return false, ""
}
// PredicateCatchUpLivelockOrMissingEscalation checks if any replica is stuck
// in CatchingUp without making progress and without being escalated to
// NeedsRebuild. Also checks if a replica needs rebuild but hasn't been
// escalated.
func PredicateCatchUpLivelockOrMissingEscalation(c *Cluster) (bool, string) {
for _, node := range c.Nodes {
if !node.Running {
continue
}
if node.ReplicaState == NodeStateCatchingUp {
// A node in CatchingUp is suspicious if it has been there for
// many ticks without approaching the target. We check if its
// FlushedLSN is far behind the primary's head.
primary := c.Primary()
if primary == nil {
continue
}
primaryHead := primary.Storage.WALDurableLSN
replicaFlushed := node.Storage.FlushedLSN
gap := primaryHead - replicaFlushed
// If the gap is larger than what the WAL can reasonably hold
// and the node hasn't been escalated, that's a livelock risk.
// Use a simple heuristic: if gap > 2x what we've seen committed, flag it.
if gap > c.Coordinator.CommittedLSN*2 && c.Coordinator.CommittedLSN > 5 {
return true, fmt.Sprintf("node %s: CatchingUp with gap=%d (primary head=%d, replica flushed=%d) — potential livelock",
node.ID, gap, primaryHead, replicaFlushed)
}
}
// Check if a node is Lagging for a long time without being moved to
// CatchingUp or NeedsRebuild.
// Note: Lagging is a transient state that the control plane should resolve.
// In adversarial random tests without explicit recovery triggers, a node
// staying Lagging is expected. We only flag truly excessive lag (> 3x committed)
// as potential livelock — anything smaller is normal recovery latency.
if node.ReplicaState == NodeStateLagging {
primary := c.Primary()
if primary != nil {
gap := primary.Storage.WALDurableLSN - node.Storage.FlushedLSN
if gap > c.Coordinator.CommittedLSN*3 && c.Coordinator.CommittedLSN > 10 {
return true, fmt.Sprintf("node %s: Lagging with gap=%d without escalation to CatchingUp or NeedsRebuild",
node.ID, gap)
}
}
}
}
return false, ""
}
// AllDangerPredicates returns the standard set of danger predicates.
func AllDangerPredicates() map[string]DangerPredicate {
return map[string]DangerPredicate{
"acked_flush_lost": PredicateAckedFlushLost,
"visible_unrecoverable": PredicateVisibleUnrecoverableState,
"catchup_livelock_or_no_esc": PredicateCatchUpLivelockOrMissingEscalation,
}
}
// CheckAllPredicates runs all danger predicates against a cluster state.
// Returns a map of violated predicate names → detail messages.
func CheckAllPredicates(c *Cluster) map[string]string {
violations := map[string]string{}
for name, pred := range AllDangerPredicates() {
violated, detail := pred(c)
if violated {
violations[name] = detail
}
}
return violations
}
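The predicate set is meant to be evaluated after every simulator event, with any non-empty violation map failing the run. A compressed standalone sketch of that pattern, using a stand-in `miniCluster` type (hypothetical; the real `Cluster` carries full node and storage state):

```go
package main

import "fmt"

// Stand-in for distsim.Cluster: just enough state for one danger predicate.
type miniCluster struct {
	CommittedLSN   uint64
	RecoverableLSN uint64
}

type dangerPredicate func(c *miniCluster) (bool, string)

// ackedFlushLost mirrors PredicateAckedFlushLost's core check: an ACK'd
// write that would not survive crash+restart is a protocol violation.
func ackedFlushLost(c *miniCluster) (bool, string) {
	if c.RecoverableLSN < c.CommittedLSN {
		return true, fmt.Sprintf("recoverableLSN=%d < committedLSN=%d",
			c.RecoverableLSN, c.CommittedLSN)
	}
	return false, ""
}

// checkAll mirrors CheckAllPredicates: run every predicate, collect violations.
func checkAll(c *miniCluster, preds map[string]dangerPredicate) map[string]string {
	violations := map[string]string{}
	for name, p := range preds {
		if bad, detail := p(c); bad {
			violations[name] = detail
		}
	}
	return violations
}

func main() {
	preds := map[string]dangerPredicate{"acked_flush_lost": ackedFlushLost}
	healthy := &miniCluster{CommittedLSN: 10, RecoverableLSN: 10}
	broken := &miniCluster{CommittedLSN: 10, RecoverableLSN: 7}
	fmt.Println(len(checkAll(healthy, preds))) // 0
	fmt.Println(len(checkAll(broken, preds)))  // 1
}
```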

7
sw-block/prototype/distsim/simulator.go

@@ -300,8 +300,11 @@ func (s *Simulator) execute(e Event) {
case EvFlusherTick:
if node != nil && node.Running {
node.Storage.AdvanceCheckpoint(node.Storage.FlushedLSN)
s.record(e, fmt.Sprintf("flusher tick %s checkpoint=%d", e.NodeID, node.Storage.CheckpointLSN))
// Phase 4.5: flusher first materializes WAL to extent, then checkpoints.
node.Storage.ApplyToExtent(node.Storage.WALDurableLSN)
node.Storage.AdvanceCheckpoint(node.Storage.WALDurableLSN)
s.record(e, fmt.Sprintf("flusher tick %s applied=%d checkpoint=%d",
e.NodeID, node.Storage.ExtentAppliedLSN, node.Storage.CheckpointLSN))
}
case EvPromote:

242
sw-block/prototype/distsim/storage.go

@@ -8,23 +8,42 @@ type SnapshotState struct {
State map[uint64]uint64
}
// Storage models the per-node storage state with explicit crash-consistency
// boundaries. Phase 4.5: split into 5 distinct LSN boundaries.
//
// State progression:
// Write arrives → ReceivedLSN (not yet durable)
// WAL fsync → WALDurableLSN (survives crash)
// Flusher → ExtentAppliedLSN (materialized to live extent, volatile)
// Checkpoint → CheckpointLSN (durable base image)
//
// After crash + restart:
// RecoverableState = CheckpointExtent + WAL[CheckpointLSN+1 .. WALDurableLSN]
type Storage struct {
WAL []Write
Extent map[uint64]uint64
ReceivedLSN uint64
FlushedLSN uint64
CheckpointLSN uint64
Snapshots map[string]SnapshotState
BaseSnapshot *SnapshotState
WAL []Write
LiveExtent map[uint64]uint64 // runtime view (volatile — lost on crash)
CheckpointExtent map[uint64]uint64 // crash-safe base image (survives crash)
ReceivedLSN uint64 // highest LSN received (may not be durable)
WALDurableLSN uint64 // highest LSN guaranteed to survive crash (= FlushedLSN)
ExtentAppliedLSN uint64 // highest LSN materialized into LiveExtent
CheckpointLSN uint64 // highest LSN in the durable base image
Snapshots map[string]SnapshotState
BaseSnapshot *SnapshotState
// Backward compat alias.
FlushedLSN uint64 // = WALDurableLSN
}
func NewStorage() *Storage {
return &Storage{
Extent: map[uint64]uint64{},
Snapshots: map[string]SnapshotState{},
LiveExtent: map[uint64]uint64{},
CheckpointExtent: map[uint64]uint64{},
Snapshots: map[string]SnapshotState{},
}
}
// AppendWrite adds a WAL entry. Does NOT update LiveExtent — that's the flusher's job.
// Does NOT advance WALDurableLSN — that requires explicit AdvanceFlush (WAL fsync).
func (s *Storage) AppendWrite(w Write) {
// Insert in LSN order (handles out-of-order delivery from jitter).
inserted := false
@@ -41,43 +60,162 @@ func (s *Storage) AppendWrite(w Write) {
if !inserted {
s.WAL = append(s.WAL, w)
}
s.Extent[w.Block] = w.Value
if w.LSN > s.ReceivedLSN {
s.ReceivedLSN = w.LSN
}
}
// AdvanceFlush simulates WAL fdatasync completing. Entries up to lsn are now
// durable and will survive crash. This is the authoritative progress for sync_all.
func (s *Storage) AdvanceFlush(lsn uint64) {
if lsn > s.ReceivedLSN {
lsn = s.ReceivedLSN
}
if lsn > s.FlushedLSN {
s.FlushedLSN = lsn
if lsn > s.WALDurableLSN {
s.WALDurableLSN = lsn
s.FlushedLSN = lsn // backward compat alias
}
}
// ApplyToExtent simulates the flusher materializing WAL entries into the live extent.
// Entries from (ExtentAppliedLSN, targetLSN] are applied. This is a volatile operation —
// LiveExtent is lost on crash.
func (s *Storage) ApplyToExtent(targetLSN uint64) {
if targetLSN > s.WALDurableLSN {
targetLSN = s.WALDurableLSN // can't materialize un-durable entries
}
for _, w := range s.WAL {
if w.LSN <= s.ExtentAppliedLSN {
continue
}
if w.LSN > targetLSN {
break
}
s.LiveExtent[w.Block] = w.Value
}
if targetLSN > s.ExtentAppliedLSN {
s.ExtentAppliedLSN = targetLSN
}
}
// AdvanceCheckpoint creates a crash-safe base image at exactly the given LSN.
// The checkpoint image contains state ONLY through lsn — not the full LiveExtent.
// This is critical: LiveExtent may contain applied entries beyond lsn that are
// NOT part of the checkpoint and must NOT survive a crash.
func (s *Storage) AdvanceCheckpoint(lsn uint64) {
if lsn > s.FlushedLSN {
lsn = s.FlushedLSN
if lsn > s.ExtentAppliedLSN {
lsn = s.ExtentAppliedLSN
}
if lsn > s.CheckpointLSN {
s.CheckpointLSN = lsn
// Build checkpoint image from base + WAL replay through exactly lsn.
// Do NOT clone LiveExtent — it may contain entries beyond checkpoint.
s.CheckpointExtent = s.StateAt(lsn)
// Set BaseSnapshot so StateAt() can use it after WAL GC.
s.BaseSnapshot = &SnapshotState{
ID: "checkpoint",
LSN: lsn,
State: cloneMap(s.CheckpointExtent),
}
}
}
// Crash simulates a node crash: LiveExtent is lost, only CheckpointExtent
// and durable WAL entries survive.
func (s *Storage) Crash() {
s.LiveExtent = nil
s.ExtentAppliedLSN = 0
// ReceivedLSN drops to WALDurableLSN (un-fsynced entries lost)
s.ReceivedLSN = s.WALDurableLSN
// Remove non-durable WAL entries
durable := make([]Write, 0, len(s.WAL))
for _, w := range s.WAL {
if w.LSN <= s.WALDurableLSN {
durable = append(durable, w)
}
}
s.WAL = durable
}
// Restart recovers state from CheckpointExtent + durable WAL replay.
// Sets BaseSnapshot from checkpoint so StateAt() works after WAL GC.
// Returns the RecoverableLSN (highest LSN in the recovered view).
func (s *Storage) Restart() uint64 {
// Start from checkpoint base image.
s.LiveExtent = cloneMap(s.CheckpointExtent)
// Set BaseSnapshot so StateAt() can reconstruct from checkpoint after WAL GC.
s.BaseSnapshot = &SnapshotState{
ID: "checkpoint",
LSN: s.CheckpointLSN,
State: cloneMap(s.CheckpointExtent),
}
// Replay durable WAL entries past checkpoint.
for _, w := range s.WAL {
if w.LSN <= s.CheckpointLSN {
continue
}
if w.LSN > s.WALDurableLSN {
break
}
s.LiveExtent[w.Block] = w.Value
}
s.ExtentAppliedLSN = s.WALDurableLSN
return s.WALDurableLSN
}
// RecoverableLSN returns the highest LSN that would be recoverable after
// crash + restart. This is a replayability proof, not just a watermark:
// - CheckpointExtent covers [0, CheckpointLSN]
// - WAL entries (CheckpointLSN, WALDurableLSN] must exist contiguously
// - If any gap exists in the WAL between CheckpointLSN and WALDurableLSN,
// recovery would be incomplete
//
// Returns the highest contiguously recoverable LSN from checkpoint + WAL.
func (s *Storage) RecoverableLSN() uint64 {
// Start from checkpoint — everything through CheckpointLSN is safe.
recoverable := s.CheckpointLSN
// Walk durable WAL entries past checkpoint and verify contiguity.
for _, w := range s.WAL {
if w.LSN <= s.CheckpointLSN {
continue // already covered by checkpoint
}
if w.LSN > s.WALDurableLSN {
break // not durable
}
if w.LSN == recoverable+1 {
recoverable = w.LSN // contiguous — extend
} else {
break // gap — stop here
}
}
return recoverable
}
// StateAt computes the block state by replaying WAL entries up to the given LSN.
// Used for correctness assertions against the reference model.
//
// Phase 4.5: for lsn < CheckpointLSN (after WAL GC), the WAL entries needed
// to reconstruct historical state may no longer exist. In that case, we return
// the checkpoint state (best available), but callers should use
// CanReconstructAt(lsn) to check if the result is authoritative.
func (s *Storage) StateAt(lsn uint64) map[uint64]uint64 {
state := map[uint64]uint64{}
usedSnapshot := false
if s.BaseSnapshot != nil {
if s.BaseSnapshot.LSN > lsn {
return cloneMap(s.BaseSnapshot.State)
// Snapshot is NEWER than requested — cannot use it.
// Fall through to WAL-only replay.
} else {
state = cloneMap(s.BaseSnapshot.State)
usedSnapshot = true
}
state = cloneMap(s.BaseSnapshot.State)
}
for _, w := range s.WAL {
if w.LSN > lsn {
break
}
if s.BaseSnapshot != nil && w.LSN <= s.BaseSnapshot.LSN {
if usedSnapshot && w.LSN <= s.BaseSnapshot.LSN {
continue
}
state[w.Block] = w.Value
@@ -85,6 +223,62 @@ func (s *Storage) StateAt(lsn uint64) map[uint64]uint64 {
return state
}
// CanReconstructAt returns true if the storage has enough information to
// accurately reconstruct state at the given LSN. False means the WAL entries
// needed for historical reconstruction have been GC'd and StateAt(lsn) may
// return an approximation (checkpoint state) rather than exact history.
//
// A7 (Historical Data Correctness): this should be checked before trusting
// StateAt() results for old LSNs. Current extent cannot fake old history.
func (s *Storage) CanReconstructAt(lsn uint64) bool {
if lsn == 0 {
return true // empty state is always reconstructable
}
// To reconstruct state at exactly lsn, we need a contiguous chain of
// evidence from LSN 0 (or a snapshot taken AT lsn) through lsn.
//
// A checkpoint at LSN C contains state through C. If lsn < C, the
// checkpoint has MORE data than existed at lsn — it cannot reconstruct
// the exact historical state at lsn. We would need WAL entries [1, lsn]
// to rebuild from scratch, which are gone after GC.
//
// A checkpoint at LSN C where C == lsn is exact.
// A checkpoint at LSN C where C > lsn cannot help with exact lsn state.
// Check if any snapshot was taken exactly at this LSN.
for _, snap := range s.Snapshots {
if snap.LSN == lsn {
return true
}
}
// Find the best base: a snapshot/checkpoint at or before lsn.
baseLSN := uint64(0)
if s.BaseSnapshot != nil && s.BaseSnapshot.LSN <= lsn {
baseLSN = s.BaseSnapshot.LSN
}
// If baseLSN > 0, we have a snapshot that provides state through baseLSN.
// We need contiguous WAL from baseLSN+1 through lsn.
// If baseLSN == 0, we need contiguous WAL from 1 through lsn.
expected := baseLSN + 1
for _, w := range s.WAL {
if w.LSN <= baseLSN {
continue
}
if w.LSN > lsn {
break
}
if w.LSN != expected {
return false // gap — history is incomplete
}
expected = w.LSN + 1
}
return expected > lsn
}
func (s *Storage) TakeSnapshot(id string, lsn uint64) SnapshotState {
snap := SnapshotState{
ID: id,
@@ -96,10 +290,13 @@ func (s *Storage) TakeSnapshot(id string, lsn uint64) SnapshotState {
}
func (s *Storage) LoadSnapshot(snap SnapshotState) {
s.Extent = cloneMap(snap.State)
s.LiveExtent = cloneMap(snap.State)
s.CheckpointExtent = cloneMap(snap.State)
s.WALDurableLSN = snap.LSN
s.FlushedLSN = snap.LSN
s.ReceivedLSN = snap.LSN
s.CheckpointLSN = snap.LSN
s.ExtentAppliedLSN = snap.LSN
s.BaseSnapshot = &SnapshotState{
ID: snap.ID,
LSN: snap.LSN,
@@ -111,7 +308,14 @@ func (s *Storage) LoadSnapshot(snap SnapshotState) {
func (s *Storage) ReplaceWAL(writes []Write) {
s.WAL = append([]Write(nil), writes...)
sort.Slice(s.WAL, func(i, j int) bool { return s.WAL[i].LSN < s.WAL[j].LSN })
s.Extent = s.StateAt(s.ReceivedLSN)
// Recompute LiveExtent from base + WAL
s.LiveExtent = s.StateAt(s.ReceivedLSN)
}
// Extent returns the current live extent for backward compatibility.
// Callers should migrate to LiveExtent.
func (s *Storage) Extent() map[uint64]uint64 {
return s.LiveExtent
}
func writesInRange(writes []Write, startExclusive, endInclusive uint64) []Write {

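The five-boundary lifecycle documented in the `Storage` header (write → WAL fsync → extent apply → checkpoint → crash/restart) can be exercised end to end in a few lines. A compressed mini-model reimplementing just the boundaries, not the real `distsim.Storage`:

```go
package main

import "fmt"

// write and store are hypothetical minimal types mirroring the diff above.
type write struct{ block, value, lsn uint64 }

type store struct {
	wal           []write
	live          map[uint64]uint64 // volatile: lost on crash
	checkpoint    map[uint64]uint64 // durable base image
	walDurable    uint64            // highest fsynced LSN, survives crash
	checkpointLSN uint64
}

func (s *store) appendWrite(w write) { s.wal = append(s.wal, w) } // not durable yet
func (s *store) fsync(lsn uint64)    { s.walDurable = lsn }       // WAL durable to lsn
func (s *store) apply(lsn uint64) { // flusher: materialize durable entries
	for _, w := range s.wal {
		if w.lsn <= lsn && w.lsn <= s.walDurable {
			s.live[w.block] = w.value
		}
	}
}
func (s *store) checkpointAt(lsn uint64) { // durable image through exactly lsn
	s.checkpoint = map[uint64]uint64{}
	for _, w := range s.wal {
		if w.lsn <= lsn {
			s.checkpoint[w.block] = w.value
		}
	}
	s.checkpointLSN = lsn
}
func (s *store) crashRestart() uint64 { // live extent lost; checkpoint + replay
	s.live = map[uint64]uint64{}
	for b, v := range s.checkpoint {
		s.live[b] = v
	}
	for _, w := range s.wal {
		if w.lsn > s.checkpointLSN && w.lsn <= s.walDurable {
			s.live[w.block] = w.value
		}
	}
	return s.walDurable
}

func main() {
	s := &store{live: map[uint64]uint64{}, checkpoint: map[uint64]uint64{}}
	for i := uint64(1); i <= 5; i++ {
		s.appendWrite(write{block: i, value: i * 10, lsn: i})
	}
	s.fsync(5)
	s.apply(5)
	s.checkpointAt(2)       // blocks 1-2 in the base image
	lsn := s.crashRestart() // 3-5 replayed from durable WAL
	fmt.Println(lsn, s.live[1], s.live[5]) // 5 10 50
}
```

This is the same shape TestCheckpoint_CrashRecovery asserts: a checkpoint behind the durable WAL head still recovers everything, because replay covers the (checkpoint, durable] gap.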
118
weed/server/master_block_failover.go

@@ -10,10 +10,12 @@ import (
// pendingRebuild records a volume that needs rebuild when a dead VS reconnects.
type pendingRebuild struct {
VolumeName string
OldPath string // path on dead server
NewPrimary string // promoted replica server
Epoch uint64
VolumeName string
OldPath string // path on dead server
NewPrimary string // promoted replica server
Epoch uint64
ReplicaDataAddr string // CP13-8: saved from before death for catch-up-first recovery
ReplicaCtrlAddr string // CP13-8: saved from before death for catch-up-first recovery
}
// blockFailoverState holds failover and rebuild state on the master.
@@ -88,6 +90,8 @@ func (ms *MasterServer) failoverBlockVolumes(deadServer string) {
ri := entry.ReplicaByServer(deadServer)
if ri != nil {
replicaPath := ri.Path
replicaDataAddr := ri.DataAddr // CP13-8: save before removal
replicaCtrlAddr := ri.CtrlAddr
// Remove dead replica from registry.
if err := ms.blockRegistry.RemoveReplica(entry.Name, deadServer); err != nil {
glog.Warningf("failover: RemoveReplica %q on %s: %v", entry.Name, deadServer, err)
@@ -95,10 +99,12 @@ }
}
// Record pending rebuild for when dead server reconnects.
ms.recordPendingRebuild(deadServer, pendingRebuild{
VolumeName: entry.Name,
OldPath: replicaPath,
NewPrimary: entry.VolumeServer, // current primary (unchanged)
Epoch: entry.Epoch,
VolumeName: entry.Name,
OldPath: replicaPath,
NewPrimary: entry.VolumeServer,
Epoch: entry.Epoch,
ReplicaDataAddr: replicaDataAddr,
ReplicaCtrlAddr: replicaCtrlAddr,
})
glog.V(0).Infof("failover: removed dead replica %s for %q, pending rebuild",
deadServer, entry.Name)
@@ -238,20 +244,73 @@ func (ms *MasterServer) recoverBlockVolumes(reconnectedServer string) {
continue
}
// Update registry: reconnected server becomes a replica (via AddReplica for RF≥2 support).
// CP13-8: Use replica addresses saved before death for catch-up-first recovery.
// These are deterministic (derived from volume path hash in ReplicationPorts),
// so they should be the same after VS restart. If the VS somehow gets different
// ports (e.g., port conflict), the catch-up attempt will fail at the TCP level
// and fall through to the shipper's NeedsRebuild → master rebuild path.
// This is an optimization, not a source of truth — the master remains the
// authority for topology/assignment changes.
dataAddr := rb.ReplicaDataAddr
ctrlAddr := rb.ReplicaCtrlAddr
// Update registry: reconnected server becomes a replica.
ms.blockRegistry.AddReplica(rb.VolumeName, ReplicaInfo{
Server: reconnectedServer,
Path: rb.OldPath,
Server: reconnectedServer,
Path: rb.OldPath,
DataAddr: dataAddr,
CtrlAddr: ctrlAddr,
})
// T4: Warn if RebuildListenAddr is empty (new primary hasn't heartbeated yet).
// CP13-8: Try catch-up first (Replica assignment), fall back to rebuild.
// If the replica can catch up from the primary's retained WAL, this is
// much faster than a full rebuild. The shipper's reconnect handshake
// (CP13-5) determines whether catch-up or rebuild is actually needed.
// If catch-up fails, the shipper marks NeedsRebuild, and the master
// sends a Rebuilding assignment on the next heartbeat cycle.
if dataAddr != "" {
leaseTTLMs := blockvol.LeaseTTLToWire(30 * time.Second)
// Send Replica assignment to the reconnected server.
ms.blockAssignmentQueue.Enqueue(reconnectedServer, blockvol.BlockVolumeAssignment{
Path: rb.OldPath,
Epoch: entry.Epoch,
Role: blockvol.RoleToWire(blockvol.RoleReplica),
LeaseTtlMs: leaseTTLMs,
ReplicaDataAddr: dataAddr,
ReplicaCtrlAddr: ctrlAddr,
})
// Also re-send Primary assignment so the primary gets fresh replica addresses.
primaryAssignment := blockvol.BlockVolumeAssignment{
Path: entry.Path,
Epoch: entry.Epoch,
Role: blockvol.RoleToWire(blockvol.RolePrimary),
LeaseTtlMs: leaseTTLMs,
}
// Include all replica addresses.
for _, ri := range entry.Replicas {
primaryAssignment.ReplicaAddrs = append(primaryAssignment.ReplicaAddrs, blockvol.ReplicaAddr{
DataAddr: ri.DataAddr,
CtrlAddr: ri.CtrlAddr,
})
}
if len(entry.Replicas) == 1 {
primaryAssignment.ReplicaDataAddr = entry.Replicas[0].DataAddr
primaryAssignment.ReplicaCtrlAddr = entry.Replicas[0].CtrlAddr
}
ms.blockAssignmentQueue.Enqueue(entry.VolumeServer, primaryAssignment)
glog.V(0).Infof("recover: enqueued catch-up (Replica) for %q on %s (epoch=%d, data=%s) + Primary refresh on %s",
rb.VolumeName, reconnectedServer, entry.Epoch, dataAddr, entry.VolumeServer)
continue
}
// Fallback: no known addresses — use rebuild path.
rebuildAddr := entry.RebuildListenAddr
if rebuildAddr == "" {
glog.Warningf("rebuild: %q RebuildListenAddr is empty (new primary %s may not have heartbeated yet), "+
"queuing rebuild anyway — VS should retry on empty addr", rb.VolumeName, entry.VolumeServer)
}
// Enqueue rebuild assignment for the reconnected server.
ms.blockAssignmentQueue.Enqueue(reconnectedServer, blockvol.BlockVolumeAssignment{
Path: rb.OldPath,
Epoch: entry.Epoch,
@@ -268,6 +327,39 @@ func (ms *MasterServer) recoverBlockVolumes(reconnectedServer string) {
// reevaluateOrphanedPrimaries checks if the given server is a replica for any
// volumes whose primary is dead (not block-capable). If so, promotes the best
// available replica — but only after the old primary's lease has expired, to
// refreshPrimaryForAddrChange sends a fresh Primary assignment when a replica's
// receiver address changed (e.g., restart with port conflict). This ensures the
// primary's shipper gets the new address without waiting for the next heartbeat cycle.
func (ms *MasterServer) refreshPrimaryForAddrChange(ac ReplicaAddrChange) {
entry, ok := ms.blockRegistry.Lookup(ac.VolumeName)
if !ok {
return
}
leaseTTLMs := blockvol.LeaseTTLToWire(30 * time.Second)
assignment := blockvol.BlockVolumeAssignment{
Path: entry.Path,
Epoch: entry.Epoch,
Role: blockvol.RoleToWire(blockvol.RolePrimary),
LeaseTtlMs: leaseTTLMs,
}
for _, ri := range entry.Replicas {
assignment.ReplicaAddrs = append(assignment.ReplicaAddrs, blockvol.ReplicaAddr{
DataAddr: ri.DataAddr,
CtrlAddr: ri.CtrlAddr,
})
}
if len(entry.Replicas) == 1 {
assignment.ReplicaDataAddr = entry.Replicas[0].DataAddr
assignment.ReplicaCtrlAddr = entry.Replicas[0].CtrlAddr
}
// Use current registry primary (not stale ac.PrimaryServer) in case
// failover happened between address-change detection and this refresh.
currentPrimary := entry.VolumeServer
ms.blockAssignmentQueue.Enqueue(currentPrimary, assignment)
glog.V(0).Infof("recover: replica addr changed for %q (data: %s→%s, ctrl: %s→%s), refreshed Primary on %s",
ac.VolumeName, ac.OldDataAddr, ac.NewDataAddr, ac.OldCtrlAddr, ac.NewCtrlAddr, currentPrimary)
}
// maintain the same split-brain protection as failoverBlockVolumes().
// This fixes B-06 (orphaned primary after replica re-register)
// and partially B-08 (fast reconnect skips failover window).
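The recover path above encodes a simple two-tier policy: prefer a cheap catch-up when the pre-death replica addresses were saved, otherwise fall straight to a full rebuild. A hypothetical standalone condensation of that decision (types and names are illustrative, not the real assignment structs):

```go
package main

import "fmt"

// pending mirrors the saved pendingRebuild fields used by the decision.
type pending struct {
	volume, dataAddr, ctrlAddr string
}

// plan is a toy stand-in for the enqueued BlockVolumeAssignment.
type plan struct {
	volume, role, dataAddr string
}

// planRecovery: catch-up first when addresses survived; rebuild otherwise.
// The shipper's reconnect handshake decides whether the retained WAL actually
// covers the gap; if not, it escalates to NeedsRebuild and the master issues
// a Rebuilding assignment on the next heartbeat cycle.
func planRecovery(rb pending) plan {
	if rb.dataAddr != "" {
		return plan{volume: rb.volume, role: "Replica", dataAddr: rb.dataAddr}
	}
	// Fail-safe fallback: no known addresses, go straight to rebuild.
	return plan{volume: rb.volume, role: "Rebuilding"}
}

func main() {
	fmt.Println(planRecovery(pending{volume: "vol1", dataAddr: "10.0.0.2:9001"}).role) // Replica
	fmt.Println(planRecovery(pending{volume: "vol2"}).role)                            // Rebuilding
}
```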

56
weed/server/master_block_registry.go

@@ -353,7 +353,21 @@ func (r *BlockVolumeRegistry) ListByServer(server string) []BlockVolumeEntry {
// Called on the first heartbeat from a volume server.
// Marks reported volumes as Active, removes entries for this server
// that are not reported (stale).
func (r *BlockVolumeRegistry) UpdateFullHeartbeat(server string, infos []*master_pb.BlockVolumeInfoMessage, nvmeAddr string) {
// ReplicaAddrChange records a replica whose advertised address changed,
// requiring a Primary assignment refresh so the shipper gets the new address.
// Detected only in the full heartbeat path (UpdateFullHeartbeat). Delta
// heartbeats do not carry replica addresses and cannot trigger this.
type ReplicaAddrChange struct {
VolumeName string
PrimaryServer string
OldDataAddr string
OldCtrlAddr string
NewDataAddr string
NewCtrlAddr string
}
func (r *BlockVolumeRegistry) UpdateFullHeartbeat(server string, infos []*master_pb.BlockVolumeInfoMessage, nvmeAddr string) []ReplicaAddrChange {
var addrChanges []ReplicaAddrChange
r.mu.Lock()
defer r.mu.Unlock()
@@ -495,6 +509,31 @@ func (r *BlockVolumeRegistry) UpdateFullHeartbeat(server string, infos []*master
} else {
existing.Replicas[i].WALLag = 0
}
// CP13-8: detect address change on replica restart.
// If either the data or control address changed, the primary's
// shipper has a stale endpoint. Queue a Primary refresh.
if info.ReplicaDataAddr != "" || info.ReplicaCtrlAddr != "" {
oldData := existing.Replicas[i].DataAddr
oldCtrl := existing.Replicas[i].CtrlAddr
dataChanged := info.ReplicaDataAddr != "" && oldData != "" && oldData != info.ReplicaDataAddr
ctrlChanged := info.ReplicaCtrlAddr != "" && oldCtrl != "" && oldCtrl != info.ReplicaCtrlAddr
if dataChanged || ctrlChanged {
addrChanges = append(addrChanges, ReplicaAddrChange{
VolumeName: existingName,
PrimaryServer: existing.VolumeServer,
OldDataAddr: oldData,
OldCtrlAddr: oldCtrl,
NewDataAddr: info.ReplicaDataAddr,
NewCtrlAddr: info.ReplicaCtrlAddr,
})
}
if info.ReplicaDataAddr != "" {
existing.Replicas[i].DataAddr = info.ReplicaDataAddr
}
if info.ReplicaCtrlAddr != "" {
existing.Replicas[i].CtrlAddr = info.ReplicaCtrlAddr
}
}
break
}
}
@@ -511,6 +550,14 @@ func (r *BlockVolumeRegistry) UpdateFullHeartbeat(server string, infos []*master
if name == "" {
continue
}
// Skip auto-register if a create is in progress for this volume.
// Without this gate, the replica VS heartbeat can race ahead of
// CreateBlockVolume.Register and create a bare entry that lacks
// replica info, causing the real Register to hit "already registered"
// and fall back to the incomplete auto-registered entry.
if r.IsInflight(name) {
continue
}
existing, dup := r.volumes[name]
if !dup {
entry := &BlockVolumeEntry{
@ -545,6 +592,7 @@ func (r *BlockVolumeRegistry) UpdateFullHeartbeat(server string, infos []*master
}
}
}
return addrChanges
}
// reconcileOnRestart handles the case where a second server reports a volume
@@ -769,6 +817,12 @@ func (r *BlockVolumeRegistry) ReleaseInflight(name string) {
r.inflight.Delete(name)
}
// IsInflight returns true if a create is in progress for the given volume name.
func (r *BlockVolumeRegistry) IsInflight(name string) bool {
_, ok := r.inflight.Load(name)
return ok
}
// countForServer returns the number of volumes on the given server.
// Caller must hold at least RLock.
func (r *BlockVolumeRegistry) countForServer(server string) int {

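The CP13-8 address-change detection above only fires when both the old and new values are non-empty and differ, so first-time reports and heartbeats that omit the field never trigger a spurious Primary refresh. A minimal sketch of that guard in isolation:

```go
package main

import "fmt"

// addrChanged mirrors the dataChanged/ctrlChanged guard in
// UpdateFullHeartbeat: flag a change only when an address was known before,
// is reported now, and the two disagree. (Standalone toy, not registry code.)
func addrChanged(oldAddr, newAddr string) bool {
	return newAddr != "" && oldAddr != "" && oldAddr != newAddr
}

func main() {
	fmt.Println(addrChanged("", "10.0.0.2:9001"))              // false: first report
	fmt.Println(addrChanged("10.0.0.2:9001", ""))              // false: field omitted
	fmt.Println(addrChanged("10.0.0.2:9001", "10.0.0.2:9001")) // false: unchanged
	fmt.Println(addrChanged("10.0.0.2:9001", "10.0.0.2:9002")) // true: port moved
}
```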
60
weed/server/master_block_registry_test.go

@@ -1900,3 +1900,63 @@ func TestUpdateEntry_NotFound(t *testing.T) {
t.Fatal("expected error for nonexistent volume")
}
}
// TestRegistry_InflightBlocksAutoRegister verifies that heartbeat auto-register
// is suppressed while a create is in-flight for the same volume. This prevents
// a race where the replica VS heartbeat arrives before CreateBlockVolume.Register
// completes, creating a bare entry that lacks replica info.
func TestRegistry_InflightBlocksAutoRegister(t *testing.T) {
r := NewBlockVolumeRegistry()
// Simulate CreateBlockVolume acquiring the inflight lock.
if !r.AcquireInflight("vol1") {
t.Fatal("AcquireInflight should succeed")
}
// Replica VS sends heartbeat reporting vol1 — while create is in-flight.
// This should be silently skipped (not auto-registered).
r.UpdateFullHeartbeat("replica-server:8080", []*master_pb.BlockVolumeInfoMessage{
{Path: "/blocks/vol1.blk", Epoch: 1, Role: 2, VolumeSize: 1 << 30},
}, "")
// vol1 should NOT be in the registry (auto-register was blocked).
if _, ok := r.Lookup("vol1"); ok {
t.Fatal("vol1 should not be auto-registered while inflight lock is held")
}
// Now simulate CreateBlockVolume completing: register with replicas.
r.Register(&BlockVolumeEntry{
Name: "vol1",
VolumeServer: "primary-server:8080",
Path: "/blocks/vol1.blk",
SizeBytes: 1 << 30,
Epoch: 1,
Status: StatusActive,
Replicas: []ReplicaInfo{
{Server: "replica-server:8080", Path: "/blocks/vol1.blk"},
},
})
r.ReleaseInflight("vol1")
// Entry should have the replica.
entry, ok := r.Lookup("vol1")
if !ok {
t.Fatal("vol1 should exist after Register")
}
if len(entry.Replicas) != 1 {
t.Fatalf("replicas=%d, want 1", len(entry.Replicas))
}
if entry.Replicas[0].Server != "replica-server:8080" {
t.Fatalf("replica server=%s", entry.Replicas[0].Server)
}
// After inflight released, subsequent heartbeats should update normally.
r.UpdateFullHeartbeat("replica-server:8080", []*master_pb.BlockVolumeInfoMessage{
{Path: "/blocks/vol1.blk", Epoch: 2, Role: 2, VolumeSize: 1 << 30, HealthScore: 0.9},
}, "")
entry, _ = r.Lookup("vol1")
if entry.Replicas[0].HealthScore != 0.9 {
t.Fatalf("replica health not updated after inflight released: %f", entry.Replicas[0].HealthScore)
}
}
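The inflight gate exercised by this test is essentially a `sync.Map` used as a named mutex set: acquire wins only for the first caller, and heartbeat auto-register skips any volume currently held. A minimal standalone sketch of that pattern (hypothetical type; the real registry exposes it as `AcquireInflight`/`ReleaseInflight`/`IsInflight`):

```go
package main

import (
	"fmt"
	"sync"
)

// inflightGate suppresses concurrent work on the same volume name.
type inflightGate struct{ m sync.Map }

// Acquire returns true only for the first caller for a given name;
// LoadOrStore makes the check-and-set atomic.
func (g *inflightGate) Acquire(name string) bool {
	_, loaded := g.m.LoadOrStore(name, struct{}{})
	return !loaded
}

func (g *inflightGate) Release(name string) { g.m.Delete(name) }

func (g *inflightGate) IsInflight(name string) bool {
	_, ok := g.m.Load(name)
	return ok
}

func main() {
	var g inflightGate
	fmt.Println(g.Acquire("vol1"))    // true: create starts
	fmt.Println(g.IsInflight("vol1")) // true: heartbeat skips auto-register
	g.Release("vol1")
	fmt.Println(g.IsInflight("vol1")) // false: heartbeats update normally again
}
```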

7
weed/server/master_grpc_server.go

@@ -277,7 +277,12 @@ func (ms *MasterServer) SendHeartbeat(stream master_pb.Seaweed_SendHeartbeatServ
// (BlockVolumeInfos on first heartbeat) or deltas (NewBlockVolumes/DeletedBlockVolumes
// on subsequent heartbeats), never both in the same message.
if len(heartbeat.BlockVolumeInfos) > 0 || heartbeat.HasNoBlockVolumes {
ms.blockRegistry.UpdateFullHeartbeat(dn.Url(), heartbeat.BlockVolumeInfos, heartbeat.BlockNvmeAddr)
addrChanges := ms.blockRegistry.UpdateFullHeartbeat(dn.Url(), heartbeat.BlockVolumeInfos, heartbeat.BlockNvmeAddr)
// CP13-8: If a replica's receiver address changed (e.g., restart with port conflict),
// immediately refresh the primary's assignment with the new addresses.
for _, ac := range addrChanges {
ms.refreshPrimaryForAddrChange(ac)
}
// T2 (B-06): After updating registry from heartbeat, check if this server
// is a replica for any volume whose primary is dead. If so, promote.
ms.reevaluateOrphanedPrimaries(dn.Url())

481
weed/server/qa_block_edge_cases_test.go

@@ -0,0 +1,481 @@
package weed_server
import (
"sync"
"testing"
"time"
"github.com/seaweedfs/seaweedfs/weed/storage/blockvol"
)
// ============================================================
// Edge Case Tests: RF, Promotion, Network, LSN
//
// Covers gaps identified in the testing framework review:
// 1. LSN-lagging replica skipped during promotion
// 2. Cascading double failover (RF=3, epoch chain 1→2→3)
// 3. Demotion/drain under concurrent promotion pressure
// 4. Promotion with mixed LSN + health scores
// 5. Network flap simulation (mark/unmark block capable rapidly)
// 6. RF=3 all-gate evaluation under pressure
// ============================================================
// --- Test 1: LSN-lagging replica skipped, fresher one promoted ---
func TestEdge_LSNLag_StaleReplicaSkipped(t *testing.T) {
ms := testMasterServerForFailover(t)
ms.blockRegistry.SetPromotionLSNTolerance(10)
ms.blockRegistry.MarkBlockCapable("primary")
ms.blockRegistry.MarkBlockCapable("stale-replica")
ms.blockRegistry.MarkBlockCapable("fresh-replica")
entry := &BlockVolumeEntry{
Name: "lsn-test", VolumeServer: "primary", Path: "/data/lsn-test.blk",
SizeBytes: 1 << 30, Epoch: 1, Role: blockvol.RoleToWire(blockvol.RolePrimary),
Status: StatusActive, LeaseTTL: 5 * time.Second,
LastLeaseGrant: time.Now().Add(-10 * time.Second), // expired
WALHeadLSN: 1000,
Replicas: []ReplicaInfo{
{
Server: "stale-replica", Path: "/data/lsn-test.blk",
HealthScore: 1.0, WALHeadLSN: 100, // lag=900, way beyond tolerance=10
Role: blockvol.RoleToWire(blockvol.RoleReplica), LastHeartbeat: time.Now(),
},
{
Server: "fresh-replica", Path: "/data/lsn-test.blk",
HealthScore: 0.9, WALHeadLSN: 995, // lag=5, within tolerance=10
Role: blockvol.RoleToWire(blockvol.RoleReplica), LastHeartbeat: time.Now(),
},
},
}
ms.blockRegistry.Register(entry)
// Kill primary.
ms.blockRegistry.UnmarkBlockCapable("primary")
ms.failoverBlockVolumes("primary")
// Verify: fresh-replica promoted (despite lower health score), stale skipped.
after, ok := ms.blockRegistry.Lookup("lsn-test")
if !ok {
t.Fatal("volume not found")
}
if after.VolumeServer != "fresh-replica" {
t.Fatalf("expected fresh-replica promoted, got %q (stale-replica with lag=900 should be skipped)", after.VolumeServer)
}
if after.Epoch != 2 {
t.Fatalf("epoch: got %d, want 2", after.Epoch)
}
}
// --- Test 2: Cascading double failover (RF=3, epoch 1→2→3) ---
func TestEdge_CascadeFailover_RF3_EpochChain(t *testing.T) {
ms := testMasterServerForFailover(t)
ms.blockRegistry.MarkBlockCapable("vs1")
ms.blockRegistry.MarkBlockCapable("vs2")
ms.blockRegistry.MarkBlockCapable("vs3")
entry := &BlockVolumeEntry{
Name: "cascade-test", VolumeServer: "vs1", Path: "/data/cascade.blk",
SizeBytes: 1 << 30, Epoch: 1, Role: blockvol.RoleToWire(blockvol.RolePrimary),
Status: StatusActive, LeaseTTL: 5 * time.Second,
LastLeaseGrant: time.Now().Add(-10 * time.Second),
ReplicaFactor: 3,
Replicas: []ReplicaInfo{
{Server: "vs2", Path: "/r2.blk", HealthScore: 1.0, WALHeadLSN: 100,
Role: blockvol.RoleToWire(blockvol.RoleReplica), LastHeartbeat: time.Now()},
{Server: "vs3", Path: "/r3.blk", HealthScore: 0.9, WALHeadLSN: 100,
Role: blockvol.RoleToWire(blockvol.RoleReplica), LastHeartbeat: time.Now()},
},
}
ms.blockRegistry.Register(entry)
// Failover 1: vs1 dies → vs2 promoted (higher health).
ms.blockRegistry.UnmarkBlockCapable("vs1")
ms.failoverBlockVolumes("vs1")
after1, _ := ms.blockRegistry.Lookup("cascade-test")
if after1.VolumeServer != "vs2" {
t.Fatalf("failover 1: expected vs2, got %q", after1.VolumeServer)
}
if after1.Epoch != 2 {
t.Fatalf("failover 1: epoch got %d, want 2", after1.Epoch)
}
// Failover 2: vs2 dies → vs3 promoted (only remaining).
// Update vs3's heartbeat and set lease expired for the new primary.
ms.blockRegistry.UpdateEntry("cascade-test", func(e *BlockVolumeEntry) {
e.LastLeaseGrant = time.Now().Add(-10 * time.Second)
for i := range e.Replicas {
if e.Replicas[i].Server == "vs3" {
e.Replicas[i].LastHeartbeat = time.Now()
}
}
})
ms.blockRegistry.UnmarkBlockCapable("vs2")
ms.failoverBlockVolumes("vs2")
after2, _ := ms.blockRegistry.Lookup("cascade-test")
if after2.VolumeServer != "vs3" {
t.Fatalf("failover 2: expected vs3, got %q", after2.VolumeServer)
}
if after2.Epoch != 3 {
t.Fatalf("failover 2: epoch got %d, want 3", after2.Epoch)
}
// No more replicas — third failover should fail silently.
ms.blockRegistry.UpdateEntry("cascade-test", func(e *BlockVolumeEntry) {
e.LastLeaseGrant = time.Now().Add(-10 * time.Second)
})
ms.blockRegistry.UnmarkBlockCapable("vs3")
ms.failoverBlockVolumes("vs3")
after3, _ := ms.blockRegistry.Lookup("cascade-test")
// Epoch should still be 3 — no eligible replicas.
if after3.Epoch != 3 {
t.Fatalf("failover 3: epoch should stay 3, got %d", after3.Epoch)
}
}
// --- Test 3: Concurrent failover + heartbeat + promotion (stress) ---
func TestEdge_ConcurrentFailoverAndHeartbeat_NoPanic(t *testing.T) {
ms := testMasterServerForFailover(t)
ms.blockRegistry.MarkBlockCapable("vs1")
ms.blockRegistry.MarkBlockCapable("vs2")
ms.blockRegistry.MarkBlockCapable("vs3")
setup := func() {
ms.blockRegistry.Unregister("stress-vol")
ms.blockRegistry.MarkBlockCapable("vs1")
ms.blockRegistry.MarkBlockCapable("vs2")
ms.blockRegistry.MarkBlockCapable("vs3")
ms.blockRegistry.Register(&BlockVolumeEntry{
Name: "stress-vol", VolumeServer: "vs1", Path: "/data/stress.blk",
SizeBytes: 1 << 30, Epoch: 1, Role: blockvol.RoleToWire(blockvol.RolePrimary),
Status: StatusActive, LeaseTTL: 5 * time.Second,
LastLeaseGrant: time.Now().Add(-10 * time.Second),
Replicas: []ReplicaInfo{
{Server: "vs2", Path: "/r2.blk", HealthScore: 1.0,
Role: blockvol.RoleToWire(blockvol.RoleReplica), LastHeartbeat: time.Now()},
{Server: "vs3", Path: "/r3.blk", HealthScore: 0.9,
Role: blockvol.RoleToWire(blockvol.RoleReplica), LastHeartbeat: time.Now()},
},
})
}
for round := 0; round < 30; round++ {
setup()
var wg sync.WaitGroup
wg.Add(4)
go func() { defer wg.Done(); ms.failoverBlockVolumes("vs1") }()
go func() { defer wg.Done(); ms.reevaluateOrphanedPrimaries("vs2") }()
go func() { defer wg.Done(); ms.blockRegistry.PromoteBestReplica("stress-vol") }()
go func() {
defer wg.Done()
ms.blockRegistry.ManualPromote("stress-vol", "", true)
}()
wg.Wait()
}
// No panic = pass.
}
// --- Test 4: LSN + health score interaction — health wins within tolerance ---
func TestEdge_LSNWithinTolerance_HealthWins(t *testing.T) {
ms := testMasterServerForFailover(t)
ms.blockRegistry.SetPromotionLSNTolerance(100)
ms.blockRegistry.MarkBlockCapable("primary")
ms.blockRegistry.MarkBlockCapable("high-health")
ms.blockRegistry.MarkBlockCapable("high-lsn")
ms.blockRegistry.Register(&BlockVolumeEntry{
Name: "health-vs-lsn", VolumeServer: "primary", Path: "/data/hvl.blk",
SizeBytes: 1 << 30, Epoch: 1, Role: blockvol.RoleToWire(blockvol.RolePrimary),
Status: StatusActive, LeaseTTL: 5 * time.Second,
LastLeaseGrant: time.Now().Add(-10 * time.Second),
WALHeadLSN: 1000,
Replicas: []ReplicaInfo{
{Server: "high-health", Path: "/r1.blk", HealthScore: 1.0, WALHeadLSN: 950,
Role: blockvol.RoleToWire(blockvol.RoleReplica), LastHeartbeat: time.Now()},
{Server: "high-lsn", Path: "/r2.blk", HealthScore: 0.5, WALHeadLSN: 999,
Role: blockvol.RoleToWire(blockvol.RoleReplica), LastHeartbeat: time.Now()},
},
})
ms.blockRegistry.UnmarkBlockCapable("primary")
ms.failoverBlockVolumes("primary")
after, _ := ms.blockRegistry.Lookup("health-vs-lsn")
// Both within tolerance (lag ≤ 100). Health wins: high-health (1.0) > high-lsn (0.5).
if after.VolumeServer != "high-health" {
t.Fatalf("expected high-health promoted (higher health, both within LSN tolerance), got %q", after.VolumeServer)
}
}
// --- Test 5: Network flap simulation — rapid mark/unmark block capable ---
func TestEdge_NetworkFlap_RapidMarkUnmark(t *testing.T) {
ms := testMasterServerForFailover(t)
ms.blockRegistry.MarkBlockCapable("flapper")
ms.blockRegistry.MarkBlockCapable("stable")
ms.blockRegistry.Register(&BlockVolumeEntry{
Name: "flap-test", VolumeServer: "stable", Path: "/data/flap.blk",
SizeBytes: 1 << 30, Epoch: 1, Role: blockvol.RoleToWire(blockvol.RolePrimary),
Status: StatusActive, LeaseTTL: 5 * time.Second,
LastLeaseGrant: time.Now(),
Replicas: []ReplicaInfo{
{Server: "flapper", Path: "/r.blk", HealthScore: 1.0,
Role: blockvol.RoleToWire(blockvol.RoleReplica), LastHeartbeat: time.Now()},
},
})
var wg sync.WaitGroup
// Goroutine 1: rapidly flap the "flapper" server.
wg.Add(1)
go func() {
defer wg.Done()
for i := 0; i < 100; i++ {
ms.blockRegistry.UnmarkBlockCapable("flapper")
ms.blockRegistry.MarkBlockCapable("flapper")
}
}()
// Goroutine 2: attempt promotions during flapping.
wg.Add(1)
go func() {
defer wg.Done()
for i := 0; i < 50; i++ {
ms.blockRegistry.EvaluatePromotion("flap-test")
}
}()
// Goroutine 3: concurrent heartbeat updates.
wg.Add(1)
go func() {
defer wg.Done()
for i := 0; i < 50; i++ {
ms.blockRegistry.UpdateFullHeartbeat("flapper", nil, "")
}
}()
wg.Wait()
// No panic, no corruption = pass.
// Volume should still be on stable primary.
after, ok := ms.blockRegistry.Lookup("flap-test")
if !ok {
t.Fatal("volume lost during flapping")
}
if after.VolumeServer != "stable" {
t.Fatalf("primary changed from stable to %q during flapping", after.VolumeServer)
}
}
// --- Test 6: RF=3 all gates — mixed rejection reasons ---
func TestEdge_RF3_MixedGates_BestEligiblePromoted(t *testing.T) {
ms := testMasterServerForFailover(t)
ms.blockRegistry.SetPromotionLSNTolerance(50)
ms.blockRegistry.MarkBlockCapable("primary")
// Note: "dead-server" NOT marked block capable.
ms.blockRegistry.MarkBlockCapable("stale-hb")
ms.blockRegistry.MarkBlockCapable("good")
ms.blockRegistry.Register(&BlockVolumeEntry{
Name: "mixed-gates", VolumeServer: "primary", Path: "/data/mixed.blk",
SizeBytes: 1 << 30, Epoch: 1, Role: blockvol.RoleToWire(blockvol.RolePrimary),
Status: StatusActive, LeaseTTL: 5 * time.Second,
LastLeaseGrant: time.Now().Add(-10 * time.Second),
WALHeadLSN: 500,
Replicas: []ReplicaInfo{
{Server: "dead-server", Path: "/r1.blk", HealthScore: 1.0, WALHeadLSN: 500,
Role: blockvol.RoleToWire(blockvol.RoleReplica), LastHeartbeat: time.Now()},
{Server: "stale-hb", Path: "/r2.blk", HealthScore: 1.0, WALHeadLSN: 500,
Role: blockvol.RoleToWire(blockvol.RoleReplica),
LastHeartbeat: time.Now().Add(-10 * time.Minute)}, // stale
{Server: "good", Path: "/r3.blk", HealthScore: 0.8, WALHeadLSN: 480,
Role: blockvol.RoleToWire(blockvol.RoleReplica), LastHeartbeat: time.Now()},
},
})
// Evaluate preflight first (read-only).
pf, err := ms.blockRegistry.EvaluatePromotion("mixed-gates")
if err != nil {
t.Fatalf("evaluate: %v", err)
}
if !pf.Promotable {
t.Fatalf("should be promotable, reason=%s, rejections=%v", pf.Reason, pf.Rejections)
}
// Should have 2 rejections: dead-server (server_dead) + stale-hb (stale_heartbeat).
if len(pf.Rejections) != 2 {
t.Fatalf("expected 2 rejections, got %d: %v", len(pf.Rejections), pf.Rejections)
}
reasons := map[string]string{}
for _, r := range pf.Rejections {
reasons[r.Server] = r.Reason
}
if reasons["dead-server"] != "server_dead" {
t.Fatalf("dead-server: got %q, want server_dead", reasons["dead-server"])
}
if reasons["stale-hb"] != "stale_heartbeat" {
t.Fatalf("stale-hb: got %q, want stale_heartbeat", reasons["stale-hb"])
}
// Now actually promote.
ms.blockRegistry.UnmarkBlockCapable("primary")
ms.failoverBlockVolumes("primary")
after, _ := ms.blockRegistry.Lookup("mixed-gates")
if after.VolumeServer != "good" {
t.Fatalf("expected 'good' promoted (only eligible), got %q", after.VolumeServer)
}
}
// --- Test 7: Promotion changes publication (ISCSIAddr, NvmeAddr) ---
func TestEdge_PromotionUpdatesPublication(t *testing.T) {
ms := testMasterServerForFailover(t)
ms.blockRegistry.MarkBlockCapable("primary")
ms.blockRegistry.MarkBlockCapable("replica")
ms.blockRegistry.Register(&BlockVolumeEntry{
Name: "pub-test", VolumeServer: "primary", Path: "/data/pub.blk",
ISCSIAddr: "primary:3260", NvmeAddr: "primary:4420", NQN: "nqn.primary",
SizeBytes: 1 << 30, Epoch: 1, Role: blockvol.RoleToWire(blockvol.RolePrimary),
Status: StatusActive, LeaseTTL: 5 * time.Second,
LastLeaseGrant: time.Now().Add(-10 * time.Second),
Replicas: []ReplicaInfo{
{Server: "replica", Path: "/r.blk", HealthScore: 1.0,
ISCSIAddr: "replica:3261", NvmeAddr: "replica:4421", NQN: "nqn.replica",
Role: blockvol.RoleToWire(blockvol.RoleReplica), LastHeartbeat: time.Now()},
},
})
ms.blockRegistry.UnmarkBlockCapable("primary")
ms.failoverBlockVolumes("primary")
after, _ := ms.blockRegistry.Lookup("pub-test")
if after.ISCSIAddr != "replica:3261" {
t.Fatalf("ISCSIAddr: got %q, want replica:3261", after.ISCSIAddr)
}
if after.NvmeAddr != "replica:4421" {
t.Fatalf("NvmeAddr: got %q, want replica:4421", after.NvmeAddr)
}
if after.NQN != "nqn.replica" {
t.Fatalf("NQN: got %q, want nqn.replica", after.NQN)
}
}
// --- Test 8: Orphaned primary re-evaluation with LSN lag ---
func TestEdge_OrphanReevaluation_LSNLag_StillPromotes(t *testing.T) {
ms := testMasterServerForFailover(t)
ms.blockRegistry.SetPromotionLSNTolerance(10)
// Primary is dead, replica is alive but lagging.
ms.blockRegistry.MarkBlockCapable("replica")
ms.blockRegistry.Register(&BlockVolumeEntry{
Name: "orphan-lag", VolumeServer: "dead-primary", Path: "/data/orphan.blk",
SizeBytes: 1 << 30, Epoch: 1, Role: blockvol.RoleToWire(blockvol.RolePrimary),
Status: StatusActive, LeaseTTL: 5 * time.Second,
LastLeaseGrant: time.Now().Add(-10 * time.Second), // expired
WALHeadLSN: 1000,
Replicas: []ReplicaInfo{
{Server: "replica", Path: "/r.blk", HealthScore: 1.0, WALHeadLSN: 500,
Role: blockvol.RoleToWire(blockvol.RoleReplica), LastHeartbeat: time.Now()},
},
})
// Orphan re-evaluation: replica reconnects.
ms.reevaluateOrphanedPrimaries("replica")
// The replica has WAL lag of 500 (way beyond tolerance=10).
// But it's the ONLY replica — should it promote or not?
// Current behavior: LSN gate rejects it. No promotion.
after, _ := ms.blockRegistry.Lookup("orphan-lag")
if after.Epoch != 1 {
// If epoch changed, the lagging replica was promoted.
// This may or may not be desired — document the behavior.
t.Logf("NOTE: lagging replica WAS promoted (epoch=%d). LSN lag=%d, tolerance=%d",
after.Epoch, 1000-500, 10)
} else {
t.Logf("NOTE: lagging replica was NOT promoted (epoch=1). Volume is stuck with dead primary.")
t.Logf("This is the current behavior: LSN gate blocks promotion even when it's the only option.")
}
// This test documents current behavior; it doesn't assert pass/fail.
// The question is: should a lagging-but-only replica be promoted to avoid downtime?
}
// --- Test 9: Rebuild addr cleared after promotion, then repopulated ---
func TestEdge_RebuildAddr_ClearedThenRepopulated(t *testing.T) {
ms := testMasterServerForFailover(t)
ms.blockRegistry.MarkBlockCapable("primary")
ms.blockRegistry.MarkBlockCapable("replica")
ms.blockRegistry.Register(&BlockVolumeEntry{
Name: "rebuild-addr", VolumeServer: "primary", Path: "/data/rebuild.blk",
SizeBytes: 1 << 30, Epoch: 1, Role: blockvol.RoleToWire(blockvol.RolePrimary),
Status: StatusActive, LeaseTTL: 5 * time.Second,
LastLeaseGrant: time.Now().Add(-10 * time.Second),
RebuildListenAddr: "primary:15000", // old primary's rebuild addr
Replicas: []ReplicaInfo{
{Server: "replica", Path: "/r.blk", HealthScore: 1.0,
Role: blockvol.RoleToWire(blockvol.RoleReplica), LastHeartbeat: time.Now()},
},
})
ms.blockRegistry.UnmarkBlockCapable("primary")
ms.failoverBlockVolumes("primary")
after, _ := ms.blockRegistry.Lookup("rebuild-addr")
// RebuildListenAddr should be cleared after promotion (B-11 fix).
if after.RebuildListenAddr != "" {
t.Fatalf("RebuildListenAddr should be cleared after promotion, got %q", after.RebuildListenAddr)
}
}
// --- Test 10: Multiple volumes on same server — all fail over ---
func TestEdge_MultipleVolumes_SameServer_AllFailover(t *testing.T) {
ms := testMasterServerForFailover(t)
ms.blockRegistry.MarkBlockCapable("vs1")
ms.blockRegistry.MarkBlockCapable("vs2")
// Register 5 volumes, all with primary on vs1.
for i := 0; i < 5; i++ {
name := "multi-" + string(rune('a'+i))
ms.blockRegistry.Register(&BlockVolumeEntry{
Name: name, VolumeServer: "vs1", Path: "/data/" + name + ".blk",
SizeBytes: 1 << 30, Epoch: 1, Role: blockvol.RoleToWire(blockvol.RolePrimary),
Status: StatusActive, LeaseTTL: 5 * time.Second,
LastLeaseGrant: time.Now().Add(-10 * time.Second),
Replicas: []ReplicaInfo{
{Server: "vs2", Path: "/r/" + name + ".blk", HealthScore: 1.0,
Role: blockvol.RoleToWire(blockvol.RoleReplica), LastHeartbeat: time.Now()},
},
})
}
// Kill vs1 — all 5 volumes should fail over.
ms.blockRegistry.UnmarkBlockCapable("vs1")
ms.failoverBlockVolumes("vs1")
for i := 0; i < 5; i++ {
name := "multi-" + string(rune('a'+i))
entry, ok := ms.blockRegistry.Lookup(name)
if !ok {
t.Fatalf("volume %s not found", name)
}
if entry.VolumeServer != "vs2" {
t.Fatalf("volume %s: expected vs2, got %q", name, entry.VolumeServer)
}
if entry.Epoch != 2 {
t.Fatalf("volume %s: epoch got %d, want 2", name, entry.Epoch)
}
}
}
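The promotion rule these tests exercise (LSN gate first, then health tie-break among the survivors) can be sketched in isolation. `pickBest` and `candidate` below are illustrative stand-ins, not the registry's actual API:

```go
package main

import "fmt"

type candidate struct {
	server string
	health float64
	lsn    uint64
}

// pickBest sketches the promotion rule the edge-case tests above exercise:
// candidates whose WAL lag exceeds the tolerance are skipped entirely;
// among the survivors, the highest health score wins.
func pickBest(primaryLSN, tolerance uint64, cands []candidate) (candidate, bool) {
	var best candidate
	found := false
	for _, c := range cands {
		if primaryLSN > c.lsn+tolerance { // LSN gate: lagging replica skipped
			continue
		}
		if !found || c.health > best.health {
			best, found = c, true
		}
	}
	return best, found
}

func main() {
	cands := []candidate{
		{"stale-replica", 1.0, 100}, // lag 900 > tolerance 10 → skipped
		{"fresh-replica", 0.9, 995}, // lag 5 ≤ tolerance → eligible
	}
	best, ok := pickBest(1000, 10, cands)
	fmt.Println(ok, best.server) // true fresh-replica
}
```

This mirrors TestEdge_LSNLag_StaleReplicaSkipped: the stale replica loses despite its higher health score because the gate runs before the tie-break.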

10
weed/server/volume_grpc_client_to_master.go

@@ -309,6 +309,16 @@ func (vs *VolumeServer) doHeartbeatWithRetry(masterAddress pb.ServerAddress, grp
glog.V(0).Infof("Volume Server Failed to update to master %s: %v", masterAddress, err)
return "", err
}
case <-vs.blockStateChangeChan:
// Immediate block heartbeat on shipper state change (degraded/recovered).
if vs.blockService == nil {
continue
}
glog.V(0).Infof("volume server %s:%d block state change → immediate heartbeat", vs.store.Ip, vs.store.Port)
if err = stream.Send(vs.collectBlockVolumeHeartbeat(ip, port, dataCenter, rack)); err != nil {
glog.V(0).Infof("Volume Server Failed to send block state-change heartbeat to master %s: %v", masterAddress, err)
return "", err
}
case <-blockVolTickChan.C:
if vs.blockService == nil {
continue

5
weed/server/volume_server.go

@@ -55,7 +55,8 @@ type VolumeServer struct {
isHeartbeating bool
stopChan chan bool
blockService *BlockService // block volume iSCSI service (nil if disabled)
blockService *BlockService // block volume iSCSI service (nil if disabled)
blockStateChangeChan chan bool // triggers immediate block heartbeat on shipper state change
}
func NewVolumeServer(adminMux, publicMux *http.ServeMux, ip string,
@@ -103,6 +104,7 @@ func NewVolumeServer(adminMux, publicMux *http.ServeMux, ip string,
fileSizeLimitBytes: int64(fileSizeLimitMB) * 1024 * 1024,
isHeartbeating: true,
stopChan: make(chan bool),
blockStateChangeChan: make(chan bool, 1),
inFlightUploadDataLimitCond: sync.NewCond(new(sync.Mutex)),
inFlightDownloadDataLimitCond: sync.NewCond(new(sync.Mutex)),
concurrentUploadLimit: concurrentUploadLimit,
@@ -135,6 +137,7 @@ func NewVolumeServer(adminMux, publicMux *http.ServeMux, ip string,
adminMux.HandleFunc("/stats/disk", vs.guard.WhiteList(vs.statsDiskHandler))
*/
}
adminMux.HandleFunc("/debug/block/shipper", vs.debugBlockShipperHandler)
adminMux.HandleFunc("/", requestIDMiddleware(vs.privateStoreHandler))
if publicMux != adminMux {
// separated admin and public port

32
weed/server/volume_server_block.go

@@ -14,6 +14,7 @@ import (
"github.com/seaweedfs/seaweedfs/weed/storage/blockvol"
"github.com/seaweedfs/seaweedfs/weed/storage/blockvol/iscsi"
"github.com/seaweedfs/seaweedfs/weed/storage/blockvol/nvme"
"github.com/seaweedfs/seaweedfs/weed/storage/blockvol/v2bridge"
)
// volReplState tracks active replication addresses per volume.
@@ -45,6 +46,23 @@ type BlockService struct {
// Replication state (CP6-3).
replMu sync.RWMutex
replStates map[string]*volReplState // keyed by volume path
// V2 engine bridge (Phase 08 P1).
v2Bridge *v2bridge.ControlBridge
}
// WireStateChangeNotify sets up shipper state change callbacks on all
// registered volumes so that degradation/recovery triggers an immediate
// heartbeat via the provided channel. Non-blocking send (buffered chan 1).
func (bs *BlockService) WireStateChangeNotify(ch chan bool) {
bs.blockStore.IterateBlockVolumes(func(path string, vol *blockvol.BlockVol) {
vol.SetOnShipperStateChange(func(from, to blockvol.ReplicaState) {
select {
case ch <- true:
default: // already pending
}
})
})
}
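The buffered-channel send in WireStateChangeNotify deliberately drops a notification when one is already pending, so a burst of shipper flaps produces at most one wakeup and never blocks the shipper goroutine. A minimal standalone sketch of that coalescing behavior:

```go
package main

import "fmt"

// notify performs a non-blocking send on a 1-buffered channel: if a wakeup
// is already pending, the new one is dropped (coalesced), so rapid state
// flaps cannot block the caller or queue unbounded heartbeats.
func notify(ch chan bool) {
	select {
	case ch <- true:
	default: // already pending
	}
}

func main() {
	ch := make(chan bool, 1)
	for i := 0; i < 100; i++ { // 100 flaps...
		notify(ch)
	}
	fmt.Println(len(ch)) // ...coalesce into a single pending wakeup: 1
}
```

This is the same pattern the heartbeat loop relies on: one drained wakeup is enough, because the next heartbeat reports current state regardless of how many transitions preceded it.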
// StartBlockService scans blockDir for .blk files, opens them as block volumes,
@@ -70,6 +88,7 @@ func StartBlockService(listenAddr, blockDir, iqnPrefix, portalAddr string, nvmeC
blockDir: blockDir,
listenAddr: listenAddr,
nvmeListenAddr: nvmeCfg.ListenAddr,
v2Bridge: v2bridge.NewControlBridge(),
}
// iSCSI target setup.
@@ -312,7 +331,18 @@ func (bs *BlockService) DeleteBlockVol(name string) error {
}
// ProcessAssignments applies assignments from master, including replication setup.
// V2 bridge: also delivers each assignment to the V2 engine for recovery ownership.
func (bs *BlockService) ProcessAssignments(assignments []blockvol.BlockVolumeAssignment) {
// V2 bridge: convert and deliver to engine (Phase 08 P1).
if bs.v2Bridge != nil {
for _, a := range assignments {
intent := bs.v2Bridge.ConvertAssignment(a, bs.listenAddr)
// TODO(P2): deliver intent to the engine orchestrator.
glog.V(1).Infof("v2bridge: converted assignment %s epoch=%d → %d replicas",
a.Path, a.Epoch, len(intent.Replicas))
}
}
for _, a := range assignments {
role := blockvol.RoleFromWire(a.Role)
ttl := blockvol.LeaseTTLFromWire(a.LeaseTtlMs)
@@ -645,6 +675,8 @@ func (bs *BlockService) Shutdown() {
// SetBlockService wires a BlockService into the VolumeServer so that
// heartbeats include block volume info and the server is marked block-capable.
// Also wires shipper state change callbacks for immediate heartbeat on degradation.
func (vs *VolumeServer) SetBlockService(bs *BlockService) {
vs.blockService = bs
bs.WireStateChangeNotify(vs.blockStateChangeChan)
}

77
weed/server/volume_server_block_debug.go

@@ -0,0 +1,77 @@
package weed_server
import (
"encoding/json"
"net/http"
"time"
"github.com/seaweedfs/seaweedfs/weed/storage/blockvol"
)
// ShipperDebugInfo is the real-time shipper state for one replica.
type ShipperDebugInfo struct {
DataAddr string `json:"data_addr"`
State string `json:"state"`
FlushedLSN uint64 `json:"flushed_lsn"`
}
// BlockVolumeDebugInfo is the real-time block volume state.
type BlockVolumeDebugInfo struct {
Path string `json:"path"`
Role string `json:"role"`
Epoch uint64 `json:"epoch"`
HeadLSN uint64 `json:"head_lsn"`
Degraded bool `json:"degraded"`
Shippers []ShipperDebugInfo `json:"shippers,omitempty"`
Timestamp string `json:"timestamp"`
}
// debugBlockShipperHandler returns real-time shipper state for all block volumes.
// Unlike the master's replica_degraded (heartbeat-lagged), this reads directly
// from the shipper's atomic state field — no heartbeat delay.
//
// GET /debug/block/shipper
func (vs *VolumeServer) debugBlockShipperHandler(w http.ResponseWriter, r *http.Request) {
if vs.blockService == nil {
w.Header().Set("Content-Type", "application/json")
json.NewEncoder(w).Encode([]BlockVolumeDebugInfo{})
return
}
store := vs.blockService.Store()
if store == nil {
w.Header().Set("Content-Type", "application/json")
json.NewEncoder(w).Encode([]BlockVolumeDebugInfo{})
return
}
var infos []BlockVolumeDebugInfo
store.IterateBlockVolumes(func(path string, vol *blockvol.BlockVol) {
status := vol.Status()
info := BlockVolumeDebugInfo{
Path: path,
Role: status.Role.String(),
Epoch: status.Epoch,
HeadLSN: status.WALHeadLSN,
Degraded: status.ReplicaDegraded,
Timestamp: time.Now().UTC().Format(time.RFC3339Nano),
}
// Get per-shipper state from ShipperGroup if available.
sg := vol.GetShipperGroup()
if sg != nil {
for _, ss := range sg.ShipperStates() {
info.Shippers = append(info.Shippers, ShipperDebugInfo{
DataAddr: ss.DataAddr,
State: ss.State,
FlushedLSN: ss.FlushedLSN,
})
}
}
infos = append(infos, info)
})
w.Header().Set("Content-Type", "application/json")
json.NewEncoder(w).Encode(infos)
}

1
weed/storage/blockvol/block_heartbeat.go

@@ -47,6 +47,7 @@ type BlockVolumeAssignment struct {
LeaseTtlMs uint32 // lease TTL in milliseconds (0 = no lease)
ReplicaDataAddr string // where primary ships WAL data (scalar, RF=2 compat)
ReplicaCtrlAddr string // where primary sends barriers (scalar, RF=2 compat)
ReplicaServerID string // V2: stable server identity for scalar replica (from registry)
RebuildAddr string // where rebuild server listens
ReplicaAddrs []ReplicaAddr // CP8-2: multi-replica addrs (precedence over scalar)
}

21
weed/storage/blockvol/blockvol.go

@@ -83,6 +83,9 @@ type BlockVol struct {
// Observability (CP8-4).
Metrics *EngineMetrics
// Shipper state change callback — triggers immediate heartbeat.
onShipperStateChange func(from, to ReplicaState)
// Snapshot fields (Phase 5 CP5-2).
snapMu sync.RWMutex
snapshots map[uint32]*activeSnapshot
@@ -782,6 +785,7 @@ func (v *BlockVol) SyncCache() error {
type ReplicaAddr struct {
DataAddr string
CtrlAddr string
ServerID string // V2: stable server identity from registry (not address-derived)
}
// WALAccess provides the shipper with the minimal WAL interface needed
@@ -824,6 +828,18 @@ func (a *walAccess) StreamEntries(fromLSN uint64, fn func(*WALEntry) error) erro
return a.vol.wal.ScanFrom(a.vol.fd, a.vol.super.WALOffset, checkpointLSN, fromLSN, fn)
}
// SetOnShipperStateChange registers a callback for shipper state transitions.
// Called by the volume server to trigger immediate heartbeat on degradation/recovery.
func (v *BlockVol) SetOnShipperStateChange(fn func(from, to ReplicaState)) {
v.onShipperStateChange = fn
}
// GetShipperGroup returns the shipper group for debug/observability.
// Returns nil if no replication is configured.
func (v *BlockVol) GetShipperGroup() *ShipperGroup {
return v.shipperGroup
}
// SetReplicaAddr configures a single replica endpoint. Backward-compatible wrapper
// around SetReplicaAddrs for RF=2 callers.
func (v *BlockVol) SetReplicaAddr(dataAddr, ctrlAddr string) {
@@ -842,6 +858,11 @@ func (v *BlockVol) SetReplicaAddrs(addrs []ReplicaAddr) {
}
v.shipperGroup = NewShipperGroup(shippers)
// Wire state change callback so shipper degradation triggers immediate heartbeat.
if v.onShipperStateChange != nil {
v.shipperGroup.SetOnStateChange(v.onShipperStateChange)
}
// Replace the group committer's sync function with a distributed version.
v.groupCommit.Stop()
v.groupCommit = NewGroupCommitter(GroupCommitterConfig{

11
weed/storage/blockvol/shipper_group.go

@@ -188,6 +188,17 @@ func (sg *ShipperGroup) EvaluateRetentionBudgets(timeout time.Duration) {
}
}
// SetOnStateChange registers a callback on all current shippers for state transitions.
// Used by the volume server to trigger an immediate block heartbeat when a shipper
// transitions to/from degraded.
func (sg *ShipperGroup) SetOnStateChange(fn func(from, to ReplicaState)) {
sg.mu.RLock()
defer sg.mu.RUnlock()
for _, s := range sg.shippers {
s.SetOnStateChange(fn)
}
}
// ShipperStates returns per-replica status for heartbeat reporting.
// Master uses this to identify which replicas need rebuild.
func (sg *ShipperGroup) ShipperStates() []ReplicaShipperStatus {

58
weed/storage/blockvol/testrunner/scenarios/internal/robust-slow-replica.yaml

@@ -104,28 +104,46 @@ phases:
iqn: "{{ vol_iqn }}"
save_as: device
- name: inject-partition
- name: inject-delay
actions:
- action: print
msg: "=== Blocking replication ports (3295) from primary to replica ==="
msg: "=== Blocking replication ports (4000-6000) from primary to replica ==="
# Block only replication port — SSH and master heartbeat still work.
- action: inject_partition
# Block the replication port range. Replication data/ctrl ports are
# basePort(3295) + 1000 + hash*3, landing in ~4295-5794 range.
# Blocking 4000-6000 covers all possible replication ports while
# leaving SSH (22) and master heartbeat (9433/18480) open.
- action: exec
node: m02
target_ip: "192.168.1.181"
ports: "3295"
cmd: "iptables -A OUTPUT -d 192.168.1.181 -p tcp --dport 4000:6000 -j REJECT --reject-with tcp-reset"
root: "true"
# Trigger a write so barrier fires and times out.
- action: exec
- action: print
msg: "=== Writing to trigger Ship failure + degradation ==="
# Write in background via fio (best_effort: writes succeed locally).
- action: fio_json
node: m01
cmd: "timeout 10 dd if=/dev/urandom of={{ device }} bs=4k count=1 oflag=direct 2>/dev/null; true"
root: "true"
device: "{{ device }}"
rw: randwrite
bs: 4k
iodepth: "1"
runtime: "10"
time_based: "true"
name: write-during-fault
save_as: fio_fault
ignore_error: true
# Wait for barrier timeout (5s) + degradation detection.
- action: sleep
duration: 10s
- action: fio_parse
json_var: fio_fault
metric: iops
save_as: iops_fault
ignore_error: true
- action: print
msg: "Write IOPS during fault: {{ iops_fault }}"
# Check degraded state after writes.
- action: assert_block_field
name: "{{ volume_name }}"
field: replica_degraded
@@ -134,16 +152,17 @@ phases:
ignore_error: true
- action: print
msg: "During partition: degraded={{ degraded_during }}"
msg: "During fault: degraded={{ degraded_during }}"
- name: clear-and-measure
actions:
- action: print
msg: "=== Clearing partition, measuring shipper recovery ==="
msg: "=== Clearing fault, measuring shipper recovery ==="
- action: clear_fault
- action: exec
node: m02
type: partition
cmd: "iptables -D OUTPUT -d 192.168.1.181 -p tcp --dport 4000:6000 -j REJECT --reject-with tcp-reset 2>/dev/null; true"
root: "true"
# Check at 5s — V1.5 background reconnect interval is 5s.
- action: sleep
@@ -221,9 +240,10 @@ phases:
- name: cleanup
always: true
actions:
- action: clear_fault
- action: exec
node: m02
type: netem
cmd: "iptables -D OUTPUT -d 192.168.1.181 -p tcp --dport 4000:6000 -j REJECT --reject-with tcp-reset 2>/dev/null; true"
root: "true"
ignore_error: true
- action: stop_weed
node: m01

77
weed/storage/blockvol/v2bridge/control.go

@@ -1,15 +1,15 @@
// control.go implements the real control-plane delivery bridge.
// It converts BlockVolumeAssignment (from master heartbeat) into
// V2 engine AssignmentIntent, using real master/registry identity.
// Converts BlockVolumeAssignment (from master heartbeat) into V2 engine
// AssignmentIntent using stable server identity from the master registry.
//
// Identity rule: ReplicaID = <volume-path>/<replica-server>
// The replica-server is the VS identity from the master registry,
// not a transport address. This survives address changes.
// Identity rule: ReplicaID = <volume-path>/<server-id>
// ServerID comes from BlockVolumeAssignment.ReplicaServerID or
// ReplicaAddr.ServerID — NOT derived from transport addresses.
package v2bridge
import (
"fmt"
"strings"
"log"
bridge "github.com/seaweedfs/seaweedfs/sw-block/bridge/blockvol"
engine "github.com/seaweedfs/seaweedfs/sw-block/engine/replication"
@@ -17,27 +17,16 @@ )
)
// ControlBridge converts real BlockVolumeAssignment into V2 engine intents.
// It is the live replacement for direct AssignmentIntent construction.
type ControlBridge struct {
adapter *bridge.ControlAdapter
}
// NewControlBridge creates a control bridge.
func NewControlBridge() *ControlBridge {
return &ControlBridge{
adapter: bridge.NewControlAdapter(),
}
return &ControlBridge{adapter: bridge.NewControlAdapter()}
}
// ConvertAssignment converts a real BlockVolumeAssignment from the master
// heartbeat response into a V2 engine AssignmentIntent.
//
// Identity mapping:
// - VolumeName = assignment.Path
// - For primary: ReplicaID per replica = <path>/<replica-server-id>
// - replica-server-id = extracted from ReplicaAddrs or scalar fields
// - Epoch from assignment
// - SessionKind from Role
// ConvertAssignment converts a real BlockVolumeAssignment into an engine intent.
// localServerID is the identity of the local volume server (for replica/rebuild roles).
func (cb *ControlBridge) ConvertAssignment(a blockvol.BlockVolumeAssignment, localServerID string) engine.AssignmentIntent {
role := blockvol.RoleFromWire(a.Role)
volumeName := a.Path
@@ -54,48 +43,48 @@ func (cb *ControlBridge) ConvertAssignment(a blockvol.BlockVolumeAssignment, loc
}
}
// convertPrimaryAssignment: primary receives assignment with replica targets.
func (cb *ControlBridge) convertPrimaryAssignment(a blockvol.BlockVolumeAssignment, volumeName string) engine.AssignmentIntent {
primary := bridge.MasterAssignment{
VolumeName: volumeName,
Epoch: a.Epoch,
Role: "primary",
PrimaryServerID: "", // primary doesn't need its own server ID in the assignment
VolumeName: volumeName,
Epoch: a.Epoch,
Role: "primary",
}
var replicas []bridge.MasterAssignment
if len(a.ReplicaAddrs) > 0 {
for _, ra := range a.ReplicaAddrs {
serverID := extractServerID(ra.DataAddr)
if ra.ServerID == "" {
log.Printf("v2bridge: skipping replica with empty ServerID (data=%s)", ra.DataAddr)
continue // fail closed: skip replicas without stable identity
}
replicas = append(replicas, bridge.MasterAssignment{
VolumeName: volumeName,
Epoch: a.Epoch,
Role: "replica",
ReplicaServerID: serverID,
ReplicaServerID: ra.ServerID,
DataAddr: ra.DataAddr,
CtrlAddr: ra.CtrlAddr,
AddrVersion: 0, // will be bumped on address change detection
})
}
} else if a.ReplicaServerID != "" && a.ReplicaDataAddr != "" {
// Scalar RF=2 path with explicit ServerID.
replicas = append(replicas, bridge.MasterAssignment{
VolumeName: volumeName,
Epoch: a.Epoch,
Role: "replica",
ReplicaServerID: a.ReplicaServerID,
DataAddr: a.ReplicaDataAddr,
CtrlAddr: a.ReplicaCtrlAddr,
})
} else if a.ReplicaDataAddr != "" {
log.Printf("v2bridge: scalar replica assignment without ServerID (data=%s) — skipping", a.ReplicaDataAddr)
// Fail closed: do not create address-derived identity.
}
return cb.adapter.ToAssignmentIntent(primary, replicas)
}
// convertReplicaAssignment: replica receives its own role assignment.
func (cb *ControlBridge) convertReplicaAssignment(a blockvol.BlockVolumeAssignment, volumeName, localServerID string) engine.AssignmentIntent {
// Replica doesn't manage other replicas — just acknowledges its role.
return engine.AssignmentIntent{
Epoch: a.Epoch,
Replicas: []engine.ReplicaAssignment{
@ -110,7 +99,6 @@ func (cb *ControlBridge) convertReplicaAssignment(a blockvol.BlockVolumeAssignme
}
}
// convertRebuildAssignment: rebuilding replica.
func (cb *ControlBridge) convertRebuildAssignment(a blockvol.BlockVolumeAssignment, volumeName, localServerID string) engine.AssignmentIntent {
replicaID := fmt.Sprintf("%s/%s", volumeName, localServerID)
return engine.AssignmentIntent{
@ -129,22 +117,3 @@ func (cb *ControlBridge) convertRebuildAssignment(a blockvol.BlockVolumeAssignme
},
}
}
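The fail-closed identity rule above (skip any replica whose ServerID is empty, never derive one from the address) can be sketched in isolation. The `ReplicaAddr` stand-in and `replicaIDs` helper below are illustrative only, not the real blockvol types:

```go
package main

import "fmt"

// ReplicaAddr is a simplified stand-in for the real struct.
type ReplicaAddr struct {
	ServerID string // stable identity from the master registry
	DataAddr string // transport endpoint; may change across restarts
}

// replicaIDs maps replica addrs to stable IDs "<path>/<ServerID>",
// skipping (fail closed) any replica that carries no ServerID.
func replicaIDs(path string, addrs []ReplicaAddr) []string {
	var ids []string
	for _, ra := range addrs {
		if ra.ServerID == "" {
			continue // no stable identity: never fall back to the address
		}
		ids = append(ids, fmt.Sprintf("%s/%s", path, ra.ServerID))
	}
	return ids
}

func main() {
	ids := replicaIDs("vol1", []ReplicaAddr{
		{ServerID: "vs2", DataAddr: "10.0.0.2:9333"},
		{ServerID: "", DataAddr: "10.0.0.3:9333"}, // skipped, fail closed
	})
	fmt.Println(ids) // [vol1/vs2]
}
```

Because the ID is keyed on ServerID, an address change leaves the identity intact; only the endpoint is updated.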

220
weed/storage/blockvol/v2bridge/control_test.go

@ -9,234 +9,184 @@ import (
// ============================================================
// Phase 08 P1: Real control delivery tests
// Validates real BlockVolumeAssignment → engine AssignmentIntent.
// Identity: ReplicaID = <path>/<ServerID> — NOT address-derived.
// ============================================================
// --- E1: Live assignment delivery → engine intent ---
func TestControl_PrimaryAssignment_StableServerID(t *testing.T) {
cb := NewControlBridge()
// Real assignment from master heartbeat.
a := blockvol.BlockVolumeAssignment{
Path: "pvc-data-1",
Epoch: 3,
Role: uint32(blockvol.RolePrimary),
ReplicaServerID: "vs2",
ReplicaDataAddr: "10.0.0.2:9333",
ReplicaCtrlAddr: "10.0.0.2:9334",
}
intent := cb.ConvertAssignment(a, "vs1")
if intent.Epoch != 3 {
t.Fatalf("epoch=%d", intent.Epoch)
}
if len(intent.Replicas) != 1 {
t.Fatalf("replicas=%d", len(intent.Replicas))
}
r := intent.Replicas[0]
// ReplicaID uses ServerID, not address.
if r.ReplicaID != "pvc-data-1/vs2" {
t.Fatalf("ReplicaID=%s, want pvc-data-1/vs2", r.ReplicaID)
}
// Endpoint is the transport address.
if r.Endpoint.DataAddr != "10.0.0.2:9333" {
t.Fatalf("DataAddr=%s", r.Endpoint.DataAddr)
}
if intent.RecoveryTargets["pvc-data-1/vs2"] != engine.SessionCatchUp {
t.Fatalf("recovery=%s", intent.RecoveryTargets["pvc-data-1/vs2"])
}
}
func TestControl_AddressChange_IdentityPreserved(t *testing.T) {
cb := NewControlBridge()
// Same ServerID, different address.
a1 := blockvol.BlockVolumeAssignment{
Path: "vol1", Epoch: 1, Role: uint32(blockvol.RolePrimary),
ReplicaServerID: "vs2",
ReplicaDataAddr: "10.0.0.2:9333", ReplicaCtrlAddr: "10.0.0.2:9334",
}
a2 := blockvol.BlockVolumeAssignment{
Path: "vol1", Epoch: 1, Role: uint32(blockvol.RolePrimary),
ReplicaServerID: "vs2",
ReplicaDataAddr: "10.0.0.5:9333", ReplicaCtrlAddr: "10.0.0.5:9334",
}
intent1 := cb.ConvertAssignment(a1, "vs1")
intent2 := cb.ConvertAssignment(a2, "vs1")
if intent1.Replicas[0].ReplicaID != intent2.Replicas[0].ReplicaID {
t.Fatalf("identity changed: %s → %s",
intent1.Replicas[0].ReplicaID, intent2.Replicas[0].ReplicaID)
}
if intent2.Replicas[0].Endpoint.DataAddr != "10.0.0.5:9333" {
t.Fatal("endpoint should be updated")
}
}
func TestControl_MultiReplica_StableServerIDs(t *testing.T) {
cb := NewControlBridge()
a := blockvol.BlockVolumeAssignment{
Path: "vol1", Epoch: 2, Role: uint32(blockvol.RolePrimary),
ReplicaAddrs: []blockvol.ReplicaAddr{
{DataAddr: "10.0.0.2:9333", CtrlAddr: "10.0.0.2:9334", ServerID: "vs2"},
{DataAddr: "10.0.0.3:9333", CtrlAddr: "10.0.0.3:9334", ServerID: "vs3"},
},
}
intent := cb.ConvertAssignment(a, "vs1")
if len(intent.Replicas) != 2 {
t.Fatalf("replicas=%d", len(intent.Replicas))
}
// Both replicas have stable identity.
ids := map[string]bool{}
for _, r := range intent.Replicas {
ids[r.ReplicaID] = true
}
if !ids["vol1/vs2"] || !ids["vol1/vs3"] {
t.Fatalf("IDs: %v (should use ServerID, not address)", ids)
}
}
func TestControl_MissingServerID_FailsClosed(t *testing.T) {
cb := NewControlBridge()
// Scalar: no ServerID → no replica created.
a1 := blockvol.BlockVolumeAssignment{
Path: "vol1", Epoch: 1, Role: uint32(blockvol.RolePrimary),
ReplicaDataAddr: "10.0.0.2:9333", ReplicaCtrlAddr: "10.0.0.2:9334",
// ReplicaServerID intentionally empty.
}
intent1 := cb.ConvertAssignment(a1, "vs1")
if len(intent1.Replicas) != 0 {
t.Fatalf("scalar without ServerID should produce 0 replicas, got %d", len(intent1.Replicas))
}
// Multi: one with ServerID, one without → only one replica.
a2 := blockvol.BlockVolumeAssignment{
Path: "vol1", Epoch: 1, Role: uint32(blockvol.RolePrimary),
ReplicaAddrs: []blockvol.ReplicaAddr{
{DataAddr: "10.0.0.2:9333", ServerID: "vs2"},
{DataAddr: "10.0.0.3:9333", ServerID: ""}, // empty → skipped
},
}
intent2 := cb.ConvertAssignment(a2, "vs1")
if len(intent2.Replicas) != 1 {
t.Fatalf("multi with 1 missing ServerID: replicas=%d, want 1", len(intent2.Replicas))
}
}
// --- E3: Epoch fencing through real assignment ---
func TestControl_EpochFencing_IntegratedPath(t *testing.T) {
cb := NewControlBridge()
driver := engine.NewRecoveryDriver(nil)
// Epoch 1 assignment.
a1 := blockvol.BlockVolumeAssignment{
Path: "vol1", Epoch: 1, Role: uint32(blockvol.RolePrimary),
ReplicaServerID: "vs2", ReplicaDataAddr: "10.0.0.2:9333", ReplicaCtrlAddr: "10.0.0.2:9334",
}
driver.Orchestrator.ProcessAssignment(cb.ConvertAssignment(a1, "vs1"))
s := driver.Orchestrator.Registry.Sender("vol1/vs2")
if s == nil || !s.HasActiveSession() {
t.Fatal("should have session at epoch 1")
}
// Epoch bump (failover).
driver.Orchestrator.InvalidateEpoch(2)
driver.Orchestrator.UpdateSenderEpoch("vol1/vs2", 2)
if s.HasActiveSession() {
t.Fatal("old session should be invalidated")
}
// Epoch 2 assignment.
a2 := blockvol.BlockVolumeAssignment{
Path: "vol1", Epoch: 2, Role: uint32(blockvol.RolePrimary),
ReplicaServerID: "vs2", ReplicaDataAddr: "10.0.0.2:9333", ReplicaCtrlAddr: "10.0.0.2:9334",
}
driver.Orchestrator.ProcessAssignment(cb.ConvertAssignment(a2, "vs1"))
if !s.HasActiveSession() {
t.Fatal("should have new session at epoch 2")
}
// Log shows invalidation.
hasInvalidation := false
for _, e := range driver.Orchestrator.Log.EventsFor("vol1/vs2") {
if e.Event == "session_invalidated" {
hasInvalidation = true
}
}
if !hasInvalidation {
t.Fatal("log must show invalidation")
}
}
// --- E4: Rebuild role mapping ---
func TestControl_RebuildAssignment(t *testing.T) {
cb := NewControlBridge()
a := blockvol.BlockVolumeAssignment{
Path: "vol1", Epoch: 3, Role: uint32(blockvol.RoleRebuilding),
ReplicaDataAddr: "10.0.0.2:9333", ReplicaCtrlAddr: "10.0.0.2:9334",
RebuildAddr: "10.0.0.1:15000",
}
intent := cb.ConvertAssignment(a, "vs2")
if intent.RecoveryTargets["vol1/vs2"] != engine.SessionRebuild {
t.Fatalf("recovery=%s", intent.RecoveryTargets["vol1/vs2"])
}
}
// --- E5: Replica assignment ---
func TestControl_ReplicaAssignment(t *testing.T) {
cb := NewControlBridge()
a := blockvol.BlockVolumeAssignment{
Path: "vol1", Epoch: 1, Role: uint32(blockvol.RoleReplica),
ReplicaDataAddr: "10.0.0.1:14260", ReplicaCtrlAddr: "10.0.0.1:14261",
}
intent := cb.ConvertAssignment(a, "vs2")
if intent.Replicas[0].ReplicaID != "vol1/vs2" {
t.Fatalf("ReplicaID=%s", intent.Replicas[0].ReplicaID)
}
}
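The epoch-fencing tests above rest on one invariant: a session opened under epoch N is dead once the current epoch moves past N. A minimal sketch of that invariant, using a hypothetical `Sender` type rather than the real engine orchestrator:

```go
package main

import "fmt"

// Sender is an illustrative stand-in, not the real engine sender.
type Sender struct {
	sessionEpoch int64 // epoch the current session was opened under; 0 = none
	currentEpoch int64
}

// OpenSession starts a session fenced to the given epoch.
func (s *Sender) OpenSession(epoch int64) { s.currentEpoch, s.sessionEpoch = epoch, epoch }

// BumpEpoch advances the fence; any session from an older epoch is implicitly invalidated.
func (s *Sender) BumpEpoch(epoch int64) {
	if epoch > s.currentEpoch {
		s.currentEpoch = epoch
	}
}

// HasActiveSession reports whether the session survives the current fence.
func (s *Sender) HasActiveSession() bool {
	return s.sessionEpoch != 0 && s.sessionEpoch == s.currentEpoch
}

func main() {
	s := &Sender{}
	s.OpenSession(1)
	fmt.Println(s.HasActiveSession()) // true
	s.BumpEpoch(2)                    // failover: old session fenced off
	fmt.Println(s.HasActiveSession()) // false
	s.OpenSession(2)                  // new assignment at epoch 2
	fmt.Println(s.HasActiveSession()) // true
}
```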

25
weed/storage/blockvol/wal_shipper.go

@ -71,6 +71,17 @@ type WALShipper struct {
catchupFailures int // consecutive catch-up failures; reset on success
lastContactTime atomic.Value // time.Time: last successful barrier/handshake/catch-up
stopped atomic.Bool
// onStateChange is called when the shipper transitions between states.
// Used to trigger immediate heartbeat on degradation/recovery.
// Set via SetOnStateChange. Nil = no callback.
onStateChange func(from, to ReplicaState)
}
// SetOnStateChange registers a callback for shipper state transitions.
// The callback is invoked synchronously from markDegraded/markInSync.
func (s *WALShipper) SetOnStateChange(fn func(from, to ReplicaState)) {
s.onStateChange = fn
}
const maxCatchupRetries = 3
@ -345,8 +356,11 @@ func (s *WALShipper) ensureCtrlConn() error {
}
func (s *WALShipper) markDegraded() {
prev := ReplicaState(s.state.Swap(uint32(ReplicaDegraded)))
log.Printf("wal_shipper: replica degraded (data=%s, ctrl=%s, prev=%s)", s.dataAddr, s.controlAddr, prev)
if prev != ReplicaDegraded && s.onStateChange != nil {
s.onStateChange(prev, ReplicaDegraded)
}
}
// resetConnections closes both data and control connections for a clean retry.
@ -404,10 +418,13 @@ func (s *WALShipper) doReconnectAndCatchUp() error {
}
func (s *WALShipper) markInSync() {
prev := ReplicaState(s.state.Swap(uint32(ReplicaInSync)))
s.catchupFailures = 0
s.touchContactTime()
log.Printf("wal_shipper: replica in-sync (data=%s, ctrl=%s, prev=%s)", s.dataAddr, s.controlAddr, prev)
if prev != ReplicaInSync && s.onStateChange != nil {
s.onStateChange(prev, ReplicaInSync)
}
}
const catchupTimeout = 30 * time.Second
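The Swap-then-compare pattern in `markDegraded`/`markInSync` is what makes the callback fire exactly once per transition rather than on every re-entry into the same state. A self-contained sketch of the pattern (the `shipper` type and its 0/1 state encoding are illustrative, not the real `WALShipper`):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// shipper is an illustrative stand-in; state: 0 = degraded, 1 = in-sync.
type shipper struct {
	state         atomic.Uint32
	onStateChange func(from, to uint32) // nil = no callback
}

// setState atomically swaps in the new state. Swap returns the previous
// value, so the callback fires only when the state actually changed.
func (s *shipper) setState(to uint32) {
	prev := s.state.Swap(to)
	if prev != to && s.onStateChange != nil {
		s.onStateChange(prev, to)
	}
}

func main() {
	s := &shipper{}
	s.onStateChange = func(from, to uint32) {
		fmt.Printf("transition %d -> %d\n", from, to)
	}
	s.setState(1) // fires: 0 -> 1
	s.setState(1) // no-op: already in-sync, callback not invoked
	s.setState(0) // fires: 1 -> 0
}
```

This is why the heartbeat trigger wired to the callback cannot be spammed by repeated `markInSync` calls on an already in-sync replica.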

9
weed/storage/store_blockvol.go

@ -84,6 +84,15 @@ func (bs *BlockVolumeStore) ListBlockVolumes() []string {
return paths
}
// IterateBlockVolumes calls fn for each registered block volume.
func (bs *BlockVolumeStore) IterateBlockVolumes(fn func(path string, vol *blockvol.BlockVol)) {
bs.mu.RLock()
defer bs.mu.RUnlock()
for path, vol := range bs.volumes {
fn(path, vol)
}
}
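`IterateBlockVolumes` holds the read lock for the entire walk, so `fn` should stay cheap and must not call back into methods that take the store's write lock, or it will deadlock. A minimal sketch of this read-locked iteration pattern (the `store` type is a stand-in for illustration, not the real `BlockVolumeStore`):

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// store is an illustrative stand-in with the same locking shape.
type store struct {
	mu      sync.RWMutex
	volumes map[string]int // stand-in for map[string]*blockvol.BlockVol
}

// Iterate calls fn for each entry while holding the read lock.
// fn must not acquire s.mu for writing.
func (s *store) Iterate(fn func(path string, vol int)) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	for p, v := range s.volumes {
		fn(p, v)
	}
}

func main() {
	s := &store{volumes: map[string]int{"pvc-a": 1, "pvc-b": 2}}
	var paths []string
	s.Iterate(func(p string, _ int) { paths = append(paths, p) })
	sort.Strings(paths) // map iteration order is unspecified
	fmt.Println(paths)  // [pvc-a pvc-b]
}
```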
// CollectBlockVolumeHeartbeat returns status for all registered
// block volumes, suitable for inclusion in a heartbeat message.
func (bs *BlockVolumeStore) CollectBlockVolumeHeartbeat() []blockvol.BlockVolumeInfoMessage {
