
chore: archive superseded V2 design docs

Copies of design docs removed in Phase 09, preserved in sw-block/docs/archive/
for historical reference.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Branch: feature/sw-block
Author: pingqiu, 10 hours ago
Commit: c0a805184f
10 files changed:

- sw-block/docs/archive/design/README.md (+28)
- sw-block/docs/archive/design/a5-a8-traceability.md (+117)
- sw-block/docs/archive/design/phase-07-service-slice-plan.md (+403)
- sw-block/docs/archive/design/phase-08-engine-skeleton-map.md (+301)
- sw-block/docs/archive/design/v2-engine-readiness-review.md (+170)
- sw-block/docs/archive/design/v2-engine-slicing-plan.md (+191)
- sw-block/docs/archive/design/v2-first-slice-sender-ownership.md (+159)
- sw-block/docs/archive/design/v2-first-slice-session-ownership.md (+194)
- sw-block/docs/archive/design/v2-production-roadmap.md (+199)
- sw-block/docs/archive/design/v2-prototype-roadmap-and-gates.md (+239)

sw-block/docs/archive/design/README.md
@@ -0,0 +1,28 @@
# Design Archive
This directory contains historical `sw-block` design/planning documents that are still worth keeping as references, but are no longer the main entrypoints for current work.
Use `sw-block/design/` for active design and process documents.
Use `sw-block/.private/phase/` for current phase contracts, logs, and slice-level execution packages.
## Archived Here
- `v2-production-roadmap.md`
- `v2-engine-readiness-review.md`
- `v2-engine-slicing-plan.md`
- `v2-prototype-roadmap-and-gates.md`
- `phase-07-service-slice-plan.md`
- `phase-08-engine-skeleton-map.md`
- `v2-first-slice-session-ownership.md`
- `v2-first-slice-sender-ownership.md`
- `a5-a8-traceability.md`
## Why Archived
These documents are useful for:
1. historical decision context
2. earlier slice/phase rationale
3. traceability for passed reviews and planning gates
They are not the canonical source for the current phase roadmap.

sw-block/docs/archive/design/a5-a8-traceability.md
@@ -0,0 +1,117 @@
# A5-A8 Acceptance Traceability
Date: 2026-03-29
Status: historical evidence traceability
## Purpose
Map each acceptance criterion to specific executable evidence.
Two evidence layers:
- **Simulator** (distsim): protocol-level proof
- **Prototype** (enginev2): ownership/session-level proof
---
## A5: Non-Convergent Catch-Up Escalates Explicitly
**Must prove**: tail-chasing or failed catch-up does not pretend success.
**Pass condition**: explicit `CatchingUp → NeedsRebuild` transition.
| Evidence | Test | File | Layer | Status |
|----------|------|------|-------|--------|
| Tail-chasing converges or aborts | `TestS6_TailChasing_ConvergesOrAborts` | `cluster_test.go` | distsim | PASS |
| Tail-chasing non-convergent → NeedsRebuild | `TestS6_TailChasing_NonConvergent_EscalatesToNeedsRebuild` | `phase02_advanced_test.go` | distsim | PASS |
| Catch-up timeout → NeedsRebuild | `TestP03_CatchupTimeout_EscalatesToNeedsRebuild` | `phase03_timeout_test.go` | distsim | PASS |
| Reservation expiry aborts catch-up | `TestReservationExpiryAbortsCatchup` | `cluster_test.go` | distsim | PASS |
| Flapping budget exceeded → NeedsRebuild | `TestP02_S5_FlappingExceedsBudget_EscalatesToNeedsRebuild` | `phase02_advanced_test.go` | distsim | PASS |
| Catch-up converges or escalates (I3) | `TestI3_CatchUpConvergesOrEscalates` | `phase045_crash_test.go` | distsim | PASS |
| Catch-up timeout in enginev2 | `TestE2E_NeedsRebuild_Escalation` | `p2_test.go` | enginev2 | PASS |
**Verdict**: A5 is well-covered. Both simulator and prototype prove explicit escalation. No pretend-success path exists.
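The escalation contract the tests above prove can be sketched as a small state machine. This is an illustrative model only (the names `SessionState` and `stepCatchUp` are invented, not from the codebase): catch-up is bounded by an explicit budget, and a non-convergent round escalates to `NeedsRebuild` rather than reporting success.

```go
// Sketch of the A5 contract: bounded catch-up either converges within
// budget or escalates explicitly to NeedsRebuild. Names are illustrative.
package main

import "fmt"

type SessionState int

const (
	CatchingUp SessionState = iota
	InSync
	NeedsRebuild
)

// stepCatchUp applies one catch-up round. lag is the remaining LSN gap,
// gain is how much the replica closed this round, budget is rounds left.
func stepCatchUp(lag, gain int64, budget int) (SessionState, int64, int) {
	lag -= gain
	if lag <= 0 {
		return InSync, 0, budget // converged: explicit success
	}
	budget--
	if budget <= 0 || gain <= 0 {
		// non-convergent: explicit escalation, never pretend-success
		return NeedsRebuild, lag, 0
	}
	return CatchingUp, lag, budget
}

func main() {
	// Tail-chasing: the replica gains less per round than the gap requires.
	state, lag, budget := CatchingUp, int64(100), 3
	for state == CatchingUp {
		state, lag, budget = stepCatchUp(lag, 10, budget)
	}
	fmt.Println(state == NeedsRebuild, lag, budget)
}
```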
---
## A6: Recoverability Boundary Is Explicit
**Must prove**: recoverable vs unrecoverable gap is decided explicitly.
**Pass condition**: recovery aborts when reservation/payload availability is lost; rebuild is explicit fallback.
| Evidence | Test | File | Layer | Status |
|----------|------|------|-------|--------|
| Reservation expiry aborts catch-up | `TestReservationExpiryAbortsCatchup` | `cluster_test.go` | distsim | PASS |
| WAL GC beyond replica → NeedsRebuild | `TestI5_CheckpointGC_PreservesAckedBoundary` | `phase045_crash_test.go` | distsim | PASS |
| Rebuild from snapshot + tail | `TestReplicaRebuildFromSnapshotAndTail` | `cluster_test.go` | distsim | PASS |
| Smart WAL: resolvable → unresolvable | `TestP02_SmartWAL_RecoverableThenUnrecoverable` | `phase02_advanced_test.go` | distsim | PASS |
| Time-varying payload availability | `TestP02_SmartWAL_TimeVaryingAvailability` | `phase02_advanced_test.go` | distsim | PASS |
| RecoverableLSN is replayability proof | `RecoverableLSN()` in `storage.go` | `storage.go` | distsim | Implemented |
| Handshake outcome: NeedsRebuild | `TestExec_HandshakeOutcome_NeedsRebuild_InvalidatesSession` | `execution_test.go` | enginev2 | PASS |
**Verdict**: A6 is covered. Recovery boundary is decided by explicit reservation + recoverability check, not by optimistic assumption. `RecoverableLSN()` verifies contiguous WAL coverage.
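The contiguous-coverage idea behind `RecoverableLSN()` can be sketched as follows. This is a hypothetical reduction (the `walRecord` shape and function signature are invented): the recoverable boundary is the end of the contiguous, resolvable WAL prefix, and any gap or unresolvable payload stops it.

```go
// Sketch of the RecoverableLSN idea: recoverability is proven by a
// contiguous replayable WAL prefix, not assumed optimistically.
package main

import "fmt"

type walRecord struct {
	LSN        int64
	Resolvable bool // payload still available (inline or via a live extent reference)
}

// recoverableLSN walks records in LSN order from start and returns the
// highest LSN up to which every record is present and resolvable. Any gap
// or unresolvable payload stops the walk: everything past it is not
// replayable and falls on the rebuild side of the boundary.
func recoverableLSN(start int64, records []walRecord) int64 {
	next := start + 1
	last := start
	for _, r := range records {
		if r.LSN != next || !r.Resolvable {
			break
		}
		last = r.LSN
		next++
	}
	return last
}

func main() {
	recs := []walRecord{{11, true}, {12, true}, {14, true}} // gap at 13
	fmt.Println(recoverableLSN(10, recs))                   // 12
}
```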
---
## A7: Historical Data Correctness Holds
**Must prove**: recovered data for target LSN is historically correct; current extent cannot fake old history.
**Pass condition**: snapshot + tail rebuild matches reference; current-extent reconstruction of old LSN fails correctness.
| Evidence | Test | File | Layer | Status |
|----------|------|------|-------|--------|
| Snapshot + tail matches reference | `TestReplicaRebuildFromSnapshotAndTail` | `cluster_test.go` | distsim | PASS |
| Historical state not reconstructable after GC | `TestA7_HistoricalState_NotReconstructableAfterGC` | `phase045_crash_test.go` | distsim | PASS |
| `CanReconstructAt()` rejects faked history | `CanReconstructAt()` in `storage.go` | `storage.go` | distsim | Implemented |
| Checkpoint does not leak applied state | `TestI2_CheckpointDoesNotLeakAppliedState` | `phase045_crash_test.go` | distsim | PASS |
| Extent-referenced resolvable records | `TestExtentReferencedResolvableRecordsAreRecoverable` | `cluster_test.go` | distsim | PASS |
| Extent-referenced unresolvable → rebuild | `TestExtentReferencedUnresolvableForcesRebuild` | `cluster_test.go` | distsim | PASS |
| ACK'd flush recoverable after crash (I1) | `TestI1_AckedFlush_RecoverableAfterPrimaryCrash` | `phase045_crash_test.go` | distsim | PASS |
**Verdict**: A7 is now covered with the Phase 4.5 crash-consistency additions. The critical gap ("current extent cannot fake old history") is proven by `CanReconstructAt()` + `TestA7_HistoricalState_NotReconstructableAfterGC`.
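The shape of the `CanReconstructAt()` check can be sketched like this. All types here are invented for illustration: a target LSN is reconstructable only if a retained snapshot at or before it still exists and the WAL tail from that snapshot through the target is still intact; the current extent alone can never vouch for an older LSN.

```go
// Sketch of the CanReconstructAt check: historical state needs a retained
// base plus an intact WAL tail, so current state cannot fake old history.
package main

import "fmt"

type retained struct {
	SnapshotLSNs []int64 // consistent snapshots still on disk
	WALFloor     int64   // oldest WAL LSN not yet GC'd
	WALHead      int64   // newest WAL LSN
}

// canReconstructAt reports whether target can be rebuilt honestly: some
// snapshot s <= target must survive, and the WAL range (s, target] must
// still be fully retained.
func canReconstructAt(target int64, r retained) bool {
	for _, s := range r.SnapshotLSNs {
		if s <= target && s+1 >= r.WALFloor && target <= r.WALHead {
			return true
		}
	}
	return false
}

func main() {
	r := retained{SnapshotLSNs: []int64{50}, WALFloor: 51, WALHead: 90}
	fmt.Println(canReconstructAt(70, r), canReconstructAt(40, r)) // true false
}
```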
---
## A8: Durability Mode Semantics Are Correct
**Must prove**: best_effort, sync_all, sync_quorum behave as intended under mixed replica states.
**Pass condition**: sync_all is strict, sync_quorum commits only with a true durable quorum, and invalid topologies are rejected.
| Evidence | Test | File | Layer | Status |
|----------|------|------|-------|--------|
| sync_quorum continues with one lagging | `TestSyncQuorumContinuesWithOneLaggingReplica` | `cluster_test.go` | distsim | PASS |
| sync_all blocks with one lagging | `TestSyncAllBlocksWithOneLaggingReplica` | `cluster_test.go` | distsim | PASS |
| sync_quorum mixed states | `TestSyncQuorumWithMixedReplicaStates` | `cluster_test.go` | distsim | PASS |
| sync_all mixed states | `TestSyncAllBlocksWithMixedReplicaStates` | `cluster_test.go` | distsim | PASS |
| Barrier timeout: sync_all blocked | `TestP03_BarrierTimeout_SyncAll_Blocked` | `phase03_timeout_test.go` | distsim | PASS |
| Barrier timeout: sync_quorum commits | `TestP03_BarrierTimeout_SyncQuorum_StillCommits` | `phase03_timeout_test.go` | distsim | PASS |
| Promotion uses RecoverableLSN | `EvaluateCandidateEligibility()` | `cluster.go` | distsim | Implemented |
| Promoted replica has committed prefix (I4) | `TestI4_PromotedReplica_HasCommittedPrefix` | `phase045_crash_test.go` | distsim | PASS |
**Verdict**: A8 is well-covered. sync_all is strict (blocks on lagging), sync_quorum uses true durable quorum (not connection count). Promotion now uses `RecoverableLSN()` for committed-prefix check.
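The commit decision the A8 tests exercise can be sketched as below. This is a hedged model, not the real implementation: the point is that sync_all blocks on any lagging replica, and sync_quorum counts durable acknowledgements at or past the write, never open connections.

```go
// Sketch of the A8 commit decision across durability modes.
package main

import "fmt"

type Mode int

const (
	BestEffort Mode = iota
	SyncQuorum
	SyncAll
)

// canCommit reports whether writeLSN is client-visible durable, given each
// member's durably acknowledged LSN (the primary's own durable LSN included).
func canCommit(mode Mode, writeLSN int64, durable []int64) bool {
	n := 0
	for _, d := range durable {
		if d >= writeLSN {
			n++ // durable ack, not merely a live connection
		}
	}
	switch mode {
	case SyncAll:
		return n == len(durable) // strict: one lagging replica blocks
	case SyncQuorum:
		return n > len(durable)/2 // true durable majority
	default:
		return true // best_effort: durability is not client-gated
	}
}

func main() {
	durable := []int64{100, 100, 40} // one lagging replica
	fmt.Println(canCommit(SyncAll, 100, durable), canCommit(SyncQuorum, 100, durable))
}
```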
---
## Summary
| Criterion | Simulator Evidence | Prototype Evidence | Status |
|-----------|-------------------|-------------------|--------|
| A5 (catch-up escalation) | 6 tests | 1 test | **Strong** |
| A6 (recoverability boundary) | 6 tests + RecoverableLSN() | 1 test | **Strong** |
| A7 (historical correctness) | 7 tests + CanReconstructAt() | — | **Strong** (new in Phase 4.5) |
| A8 (durability modes) | 7 tests + RecoverableLSN() | — | **Strong** |
**Total executable evidence**: 26 simulator tests + 2 prototype tests + 2 new storage methods.
All A5-A8 acceptance criteria have direct test evidence. No criterion depends solely on design-doc claims.
---
## Still Open (Not Blocking)
| Item | Priority | Why not blocking |
|------|----------|-----------------|
| Predicate exploration / adversarial search | P2 | Manual scenarios already cover known failure classes |
| Catch-up convergence under sustained load | P2 | I3 proves escalation; load-rate modeling is an optimization |
| A5-A8 in a single grouped runner view | P3 | Traceability doc serves as grouped evidence for now |

sw-block/docs/archive/design/phase-07-service-slice-plan.md
@@ -0,0 +1,403 @@
# Phase 07 Service-Slice Plan
Date: 2026-03-30
Status: historical phase-planning artifact
Scope: `Phase 07 P0`
## Purpose
Define the first real-system service slice that will host the V2 engine, choose the first concrete integration path in the existing codebase, and map engine adapters onto real modules.
This is a planning document. It does not claim the integration already works.
## Decision
The first service slice should be:
- a single `blockvol` primary on a real volume server
- with one replica target (`RF=2` path)
- driven by the existing master heartbeat / assignment loop
- using the V2 engine only for replication recovery ownership / planning / execution
This is the narrowest real-system slice that still exercises:
1. real assignment delivery
2. real epoch and failover signals
3. real volume-server lifecycle
4. real WAL/checkpoint/base-image truth
5. real changed-address / reconnect behavior
It is narrow enough to avoid reopening the whole system, but real enough to stop hiding behind engine-local mocks.
## Why This Slice
This slice is the right first integration target because:
1. `weed/server/master_grpc_server.go` already delivers block-volume assignments over heartbeat
2. `weed/server/master_block_failover.go` already owns failover / promotion / pending rebuild decisions
3. `weed/storage/blockvol/blockvol.go` already owns the current replication runtime (`shipperGroup`, receiver, WAL retention, checkpoint state)
4. the existing V1/V1.5 failure history is concentrated in exactly this master <-> volume-server <-> blockvol path
So this slice gives maximum validation value with minimum new surface.
## First Concrete Integration Path
The first integration path should be:
1. master receives volume-server heartbeat
2. master updates block registry and emits `BlockVolumeAssignment`
3. volume server receives assignment
4. block volume adapter converts assignment + local storage state into V2 engine inputs
5. V2 engine drives sender/session/recovery state
6. existing block-volume runtime executes the actual data-path work under engine decisions
In code, that path starts here:
- master side:
- `weed/server/master_grpc_server.go`
- `weed/server/master_block_failover.go`
- `weed/server/master_block_registry.go`
- volume / storage side:
- `weed/storage/blockvol/blockvol.go`
- `weed/storage/blockvol/recovery.go`
- `weed/storage/blockvol/wal_shipper.go`
- assignment-handling code under `weed/storage/blockvol/`
- V2 engine side:
- `sw-block/engine/replication/`
## Service-Slice Boundaries
### In-process placement
The V2 engine should initially live:
- in-process with the volume server / `blockvol` runtime
- not in master
- not as a separate service yet
Reason:
- the engine needs local access to storage truth and local recovery execution
- master should remain control-plane authority, not recovery executor
### Control-plane boundary
Master remains authoritative for:
1. epoch
2. role / assignment
3. promotion / failover decision
4. replica membership
The engine consumes these as control inputs. It does not replace master failover policy in `Phase 07`.
### Control-Over-Heartbeat Upgrade Path
For the first V2 product path, the recommended direction is:
- reuse the existing master <-> volume-server heartbeat path as the control carrier
- upgrade the block-specific control semantics carried on that path
- do not immediately invent a separate control service or assignment channel
Why:
1. this is the real Seaweed path already carrying block assignments and confirmations today
2. this gives the fastest route to a real integrated control path
3. it preserves compatibility with existing Seaweed master/volume-server semantics while V2 hardens its own control truth
Concretely, the current V1 path already provides:
1. block assignments delivered in heartbeat responses from `weed/server/master_grpc_server.go`
2. assignment application on the volume server in `weed/server/volume_grpc_client_to_master.go` and `weed/server/volume_server_block.go`
3. assignment confirmation and address-change refresh driven by later heartbeats in `weed/server/master_grpc_server.go` and `weed/server/master_block_registry.go`
4. immediate block heartbeat on selected shipper state changes in `weed/server/volume_grpc_client_to_master.go`
What should be upgraded for V2 is not mainly the transport, but the control contract carried on it:
1. stable `ReplicaID`
2. explicit `Epoch`
3. explicit role / assignment authority
4. explicit apply/confirm semantics
5. explicit stale assignment rejection
6. explicit address-change refresh as endpoint change, not identity change
Current cadence note:
- the block volume heartbeat is periodic (`5 * sleepInterval`) with some immediate state-change heartbeats
- this is acceptable as the first hardening carrier
- it should not be assumed to be the final control responsiveness model
Deferred design decision:
- whether block control should eventually move beyond heartbeat-only carriage into a more explicit control/assignment channel should be decided only after the `Phase 08 P1` real control-delivery path exists and can be measured
That later decision should be based on:
1. failover / reassignment responsiveness
2. assignment confirmation precision
3. operational complexity
4. whether heartbeat carriage remains too coarse for the block-control path
Until then, the preferred direction is:
- strengthen block control semantics over the existing heartbeat path
- do not prematurely create a second control plane
### Storage boundary
`blockvol` remains authoritative for:
1. WAL head / retention reality
2. checkpoint/base-image reality
3. actual catch-up streaming
4. actual rebuild transfer / restore operations
The engine consumes these as storage truth and recovery execution capabilities. It does not replace the storage backend in `Phase 07`.
## First-Slice Identity Mapping
This must be explicit in the first integration slice.
For `RF=2` on the existing master / block registry path:
- stable engine `ReplicaID` should be derived from:
- `<volume-name>/<replica-server-id>`
- not from:
- `DataAddr`
- `CtrlAddr`
- heartbeat transport endpoint
For this slice, the adapter should map:
1. `ReplicaID`
- from master/block-registry identity for the replica host entry
2. `Endpoint`
- from the current replica receiver/data/control addresses reported by the real runtime
3. `Epoch`
- from the confirmed master assignment for the volume
4. `SessionKind`
- from master-driven recovery intent / role transition outcome
This is a hard first-slice requirement because address refresh must not collapse identity back into endpoint-shaped keys.
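The identity rule above can be sketched in a few lines. Field and function names are illustrative, not from the codebase; the point is that `ReplicaID` is derived from volume name plus replica server ID, while endpoints travel separately and may be refreshed without changing identity.

```go
// Sketch of the first-slice identity rule: stable ReplicaID, refreshable
// endpoints. An address change must never look like a new replica.
package main

import "fmt"

type ReplicaIdentity struct {
	ReplicaID string // stable: survives address changes
	DataAddr  string // endpoint: may change on restart/reconnect
	CtrlAddr  string
}

func deriveReplicaID(volumeName, replicaServerID string) string {
	return volumeName + "/" + replicaServerID
}

func main() {
	r := ReplicaIdentity{
		ReplicaID: deriveReplicaID("vol-7", "srv-b"),
		DataAddr:  "10.0.0.2:9334",
	}
	// Address refresh: the endpoint changes, the identity must not.
	before := r.ReplicaID
	r.DataAddr = "10.0.0.9:9334"
	fmt.Println(r.ReplicaID == before, r.ReplicaID)
}
```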
## Adapter Mapping
### 1. ControlPlaneAdapter
Engine interface today:
- `HandleHeartbeat(serverID, volumes)`
- `HandleFailover(deadServerID)`
Real mapping should be:
- master-side source:
- `weed/server/master_grpc_server.go`
- `weed/server/master_block_failover.go`
- `weed/server/master_block_registry.go`
- volume-server side sink:
- assignment receive/apply path in `weed/storage/blockvol/`
Recommended real shape:
- do not literally push raw heartbeat messages into the engine
- instead introduce a thin adapter that converts confirmed master assignment state into:
- stable `ReplicaID`
- endpoint set
- epoch
- recovery target kind
That keeps master as control owner and the engine as execution owner.
Important note:
- the adapter should treat heartbeat as the transport carrier, not as the final protocol shape
- block-control semantics should be made explicit over that carrier
- if a later phase concludes that heartbeat-only carriage is too coarse, that should be a separate design decision after the real hardening path is measured
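The thin adapter recommended above might look roughly like this. All types here are invented for illustration: the adapter consumes confirmed master assignment state and emits an engine intent carrying stable identity, endpoint, epoch, and session kind, rather than forwarding raw heartbeat messages.

```go
// Sketch of the control adapter: confirmed assignment in, engine intent out.
package main

import "fmt"

// ConfirmedAssignment models what the master path has confirmed.
type ConfirmedAssignment struct {
	VolumeName      string
	ReplicaServerID string
	DataAddr        string
	Epoch           uint64
	Role            string // "primary" | "replica"
}

// EngineIntent models what the V2 engine consumes.
type EngineIntent struct {
	ReplicaID   string
	Endpoint    string
	Epoch       uint64
	SessionKind string
}

func toEngineIntent(a ConfirmedAssignment) EngineIntent {
	return EngineIntent{
		ReplicaID:   a.VolumeName + "/" + a.ReplicaServerID, // identity, not address
		Endpoint:    a.DataAddr,                             // endpoint, refreshable
		Epoch:       a.Epoch,
		SessionKind: a.Role,
	}
}

func main() {
	in := ConfirmedAssignment{
		VolumeName: "vol-7", ReplicaServerID: "srv-b",
		DataAddr: "10.0.0.2:9334", Epoch: 42, Role: "replica",
	}
	fmt.Printf("%+v\n", toEngineIntent(in))
}
```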
### 2. StorageAdapter
Engine interface today:
- `GetRetainedHistory()`
- `PinSnapshot(lsn)` / `ReleaseSnapshot(pin)`
- `PinWALRetention(startLSN)` / `ReleaseWALRetention(pin)`
- `PinFullBase(committedLSN)` / `ReleaseFullBase(pin)`
Real mapping should be:
- retained history source:
- current WAL head/tail/checkpoint state from `weed/storage/blockvol/blockvol.go`
- recovery helpers in `weed/storage/blockvol/recovery.go`
- WAL retention pin:
- existing retention-floor / replica-aware WAL retention machinery around `shipperGroup`
- snapshot pin:
- existing snapshot/checkpoint artifacts in `blockvol`
- full-base pin:
- explicit pinned full-extent export or equivalent consistent base handle from `blockvol`
Important constraint:
- `Phase 07` must not fake this by reconstructing `RetainedHistory` from tests or metadata alone
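Written as a Go interface, the engine-facing surface listed above could be sketched as follows. Method names follow the document; the `LSN`, `Pin`, and `RetainedHistory` types are invented here for illustration.

```go
// Sketch of the engine-facing StorageAdapter surface.
package main

type LSN int64

type Pin struct{ ID uint64 }

// RetainedHistory summarizes what the backend can still truthfully serve.
type RetainedHistory struct {
	WALFloor, WALHead LSN   // contiguous retained WAL range
	SnapshotLSNs      []LSN // consistent snapshots still available
	CheckpointLSN     LSN
}

// StorageAdapter exposes real blockvol storage truth to the engine. Pins
// hold resources (WAL tail, snapshot, full base) for the life of a plan;
// per the constraint above, Phase 07 must back this with real blockvol
// state, not metadata-only mocks.
type StorageAdapter interface {
	GetRetainedHistory() (RetainedHistory, error)
	PinSnapshot(lsn LSN) (Pin, error)
	ReleaseSnapshot(p Pin) error
	PinWALRetention(startLSN LSN) (Pin, error)
	ReleaseWALRetention(p Pin) error
	PinFullBase(committedLSN LSN) (Pin, error)
	ReleaseFullBase(p Pin) error
}

func main() {}
```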
### 3. Execution Driver / Executor hookup
Engine side already has:
- planner/executor split in `sw-block/engine/replication/driver.go`
- stepwise executors in `sw-block/engine/replication/executor.go`
Real mapping should be:
- engine planner decides:
- zero-gap / catch-up / rebuild
- trusted-base requirement
- replayable-tail requirement
- blockvol runtime performs:
- actual WAL catch-up transport
- actual snapshot/base transfer
- actual truncation / apply operations
Recommended split:
- engine owns contract and state transitions
- blockvol adapter owns concrete I/O work
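The planner side of that split can be sketched as a pure decision over retained history versus the replica's durable position. This is an illustrative reduction (names invented): the engine picks zero-gap, catch-up, or rebuild; the blockvol side only executes the chosen plan.

```go
// Sketch of the planner decision: policy stays in the engine.
package main

import "fmt"

type PlanKind int

const (
	ZeroGap PlanKind = iota // replica already at head
	CatchUp                 // replayable WAL tail covers the gap
	Rebuild                 // gap not replayable: trusted base + tail required
)

// planRecovery assumes walFloor..walHead is the contiguous replayable range.
func planRecovery(replicaLSN, walFloor, walHead int64) PlanKind {
	switch {
	case replicaLSN >= walHead:
		return ZeroGap
	case replicaLSN+1 >= walFloor:
		return CatchUp
	default:
		// The tail before walFloor was reclaimed; replaying from the
		// current extent would fake history, so rebuild is mandatory.
		return Rebuild
	}
}

func main() {
	fmt.Println(planRecovery(90, 50, 90), planRecovery(60, 50, 90), planRecovery(10, 50, 90))
}
```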
## First-Slice Acceptance Rule
For the first integration slice, this is a hard rule:
- `blockvol` may execute recovery I/O
- `blockvol` must not own recovery policy
Concretely, `blockvol` must not decide:
1. zero-gap vs catch-up vs rebuild
2. trusted-base validity
3. replayable-tail sufficiency
4. whether rebuild fallback is required
Those decisions must remain in the V2 engine.
The bridge may translate engine decisions into concrete blockvol actions, but it must not re-decide recovery policy underneath the engine.
## First Product Path
The first product path should be:
- `RF=2` block volume replication on the existing heartbeat/assignment loop
- primary + one replica
- failover / reconnect / changed-address handling
- rebuild as the formal non-catch-up recovery path
This is the right first path because it exercises the core correctness boundary without introducing N-replica coordination complexity too early.
## What Must Be Replaced First
Current engine-stage pieces that are still mock/test-only or too abstract:
### Replace first
1. `mockStorage` in engine tests
- replace with a real `blockvol`-backed `StorageAdapter`
2. synthetic control events in engine tests
- replace with assignment-driven events from the real master/volume-server path
3. convenience recovery completion wrappers
- keep them test-only
- real integration should use planner + executor + storage work loop
### Can remain temporarily abstract in Phase 07 P0/P1
1. `ControlPlaneAdapter` exact public shape
- can remain thin while the integration path is being chosen
2. async production scheduler details
- executor can still be driven by a service loop before full background-task architecture is finalized
## Recommended Concrete Modules
### Engine stays here
- `sw-block/engine/replication/`
### First real adapter package should be added near blockvol
Recommended initial location:
- `weed/storage/blockvol/v2bridge/`
Reason:
- keeps V2 engine independent under `sw-block/`
- keeps real-system glue close to blockvol storage truth
- avoids copying engine logic into `weed/`
Suggested contents:
1. `control_adapter.go`
- convert master assignment / local apply path into engine intents
2. `storage_adapter.go`
- expose retained history, pin/release, trusted-base export handles from real blockvol state
3. `executor_bridge.go`
- translate engine executor steps into actual blockvol recovery actions
4. `observe_adapter.go`
- map engine status/logs into service-visible diagnostics
## First Failure Replay Set For Phase 07
The first real-system replay set should be:
1. changed-address restart
- current risk: old identity/address coupling reappears in service glue
2. stale epoch / stale result after failover
- current risk: master and engine disagree on authority timing
3. unreplayable-tail rebuild fallback
- current risk: service glue over-trusts checkpoint/base availability
4. plan/execution cleanup after resource failure
- current risk: blockvol-side resource failures leave engine or service state dangling
5. primary failover to replica with rebuild pending on old primary reconnect
- current risk: old V1/V1.5 semantics leak back into reconnect handling
## Non-Goals For This Slice
Do not use `Phase 07` to:
1. widen catch-up semantics
2. add smart rebuild optimizations
3. redesign all blockvol internals
4. replace the full V1 runtime in one move
5. claim production readiness
## Deliverables For Phase 07 P0
A good `P0` delivery should include:
1. chosen service slice
2. chosen integration path in the current repo
3. adapter-to-module mapping
4. list of test-only adapters to replace first
5. first failure replay set
6. explicit note of what remains outside this first slice
## Short Form
`Phase 07 P0` should start with:
- engine in `sw-block/engine/replication/`
- bridge in `weed/storage/blockvol/v2bridge/`
- first real slice = blockvol primary + one replica on the existing master heartbeat / assignment path
- `ReplicaID = <volume-name>/<replica-server-id>` for the first slice
- `blockvol` executes I/O but does not own recovery policy
- first product path = `RF=2` failover/reconnect/rebuild correctness

sw-block/docs/archive/design/phase-08-engine-skeleton-map.md
@@ -0,0 +1,301 @@
# Phase 08 Engine Skeleton Map
Date: 2026-03-31
Status: historical phase map
Purpose: provide a short structural map for the `Phase 08` hardening path so implementation can move faster without reopening accepted V2 boundaries
## Scope
This is not the final standalone `sw-block` architecture.
It is the shortest useful engine skeleton for the accepted `Phase 08` hardening path:
- `RF=2`
- `sync_all`
- existing `Seaweed` master / volume-server heartbeat path
- V2 engine owns recovery policy
- `blockvol` remains the execution backend
## Module Map
### 1. Control plane
Role:
- authoritative control truth
Primary sources:
- `weed/server/master_grpc_server.go`
- `weed/server/master_block_registry.go`
- `weed/server/master_block_failover.go`
- `weed/server/volume_grpc_client_to_master.go`
What it produces:
- confirmed assignment
- `Epoch`
- target `Role`
- failover / promotion / reassignment result
- stable server identity
### 2. Control bridge
Role:
- translate real control truth into V2 engine intent
Primary files:
- `weed/storage/blockvol/v2bridge/control.go`
- `sw-block/bridge/blockvol/control_adapter.go`
- entry path in `weed/server/volume_server_block.go`
What it produces:
- `AssignmentIntent`
- stable `ReplicaID`
- `Endpoint`
- `SessionKind`
### 3. Engine runtime
Role:
- recovery-policy core
Primary files:
- `sw-block/engine/replication/orchestrator.go`
- `sw-block/engine/replication/driver.go`
- `sw-block/engine/replication/executor.go`
- `sw-block/engine/replication/sender.go`
- `sw-block/engine/replication/history.go`
What it decides:
- zero-gap / catch-up / needs-rebuild
- sender/session ownership
- stale authority rejection
- resource acquisition / release
- rebuild source selection
### 4. Storage bridge
Role:
- translate real blockvol storage truth and execution capability into engine-facing adapters
Primary files:
- `weed/storage/blockvol/v2bridge/reader.go`
- `weed/storage/blockvol/v2bridge/pinner.go`
- `weed/storage/blockvol/v2bridge/executor.go`
- `sw-block/bridge/blockvol/storage_adapter.go`
What it provides:
- `RetainedHistory`
- WAL retention pin / release
- snapshot pin / release
- full-base pin / release
- WAL scan execution
### 5. Block runtime
Role:
- execute real I/O
Primary files:
- `weed/storage/blockvol/blockvol.go`
- `weed/storage/blockvol/replica_apply.go`
- `weed/storage/blockvol/replica_barrier.go`
- `weed/storage/blockvol/recovery.go`
- `weed/storage/blockvol/rebuild.go`
- `weed/storage/blockvol/wal_shipper.go`
What it owns:
- WAL
- extent
- flusher
- checkpoint / superblock
- receiver / shipper
- rebuild server
## Execution Order
### Control path
```text
master heartbeat / failover truth
-> BlockVolumeAssignment
-> volume server ProcessAssignments
-> v2bridge control conversion
-> engine ProcessAssignment
-> sender/session state updated
```
### Catch-up path
```text
assignment accepted
-> engine reads retained history
-> engine plans catch-up
-> storage bridge pins WAL retention
-> engine executor drives v2bridge executor
-> blockvol scans WAL / ships entries
-> engine completes session
```
### Rebuild path
```text
assignment accepted
-> engine detects NeedsRebuild
-> engine selects rebuild source
-> storage bridge pins snapshot/full-base/tail
-> executor drives transfer path
-> blockvol performs restore / replay work
-> engine completes rebuild
```
### Local durability path
```text
WriteLBA / Trim
-> WAL append
-> shipping / barrier
-> client-visible durability decision
-> flusher writes extent
-> checkpoint advances
-> retention floor decides WAL reclaimability
```
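The last step of the local durability path, the retention floor, can be sketched as a minimum over the local checkpoint and any outstanding replica-driven retention pins. This is an assumed model, not the `RetentionFloorFn` implementation: WAL below the floor is reclaimable only once both the checkpoint and every pin have passed it.

```go
// Sketch of the retention-floor decision at the end of the durability path.
package main

import "fmt"

// retentionFloor returns the lowest LSN that must stay retained: the local
// checkpoint, lowered further by any replica catch-up pin that still needs
// older WAL.
func retentionFloor(checkpointLSN int64, pinnedStarts []int64) int64 {
	floor := checkpointLSN
	for _, p := range pinnedStarts {
		if p < floor {
			floor = p // a catch-up pin holds WAL older than the checkpoint
		}
	}
	return floor
}

func main() {
	// Checkpoint at 200, but one replica catch-up is still pinned from 120.
	fmt.Println(retentionFloor(200, []int64{120, 180})) // 120
}
```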
## Interim Fields
These are currently acceptable only as explicit hardening carry-forwards:
### `localServerID`
Current source:
- `BlockService.listenAddr`
Meaning:
- temporary local identity source for replica/rebuild-side assignment translation
Status:
- interim only
- should become registry-assigned stable server identity later
### `CommittedLSN = CheckpointLSN`
Current source:
- `v2bridge.Reader` / `BlockVol.StatusSnapshot()`
Meaning:
- current V1-style interim mapping where committed truth collapses to local checkpoint truth
Status:
- not final V2 truth
- must become a gate decision before a production-candidate phase
### heartbeat as control carrier
Current source:
- existing master <-> volume-server heartbeat path
Meaning:
- current transport for assignment/control delivery
Status:
- acceptable as current carrier
- not yet a final proof that no separate control channel will ever be needed
## Hard Gates
These should remain explicit in `Phase 08`:
### Gate 1: committed truth
Before production-candidate:
- either separate `CommittedLSN` from `CheckpointLSN`
- or explicitly bound the first candidate path to currently proven pre-checkpoint replay behavior
### Gate 2: live control delivery
Required:
- real assignment delivery must reach the engine on the live path
- not only converter-level proof
### Gate 3: integrated catch-up closure
Required:
- engine -> executor -> `v2bridge` -> blockvol must be proven as one live chain
- not planner proof plus direct WAL-scan proof as separate evidence
### Gate 4: first rebuild execution path
Required:
- rebuild must not remain only a detection outcome
- the chosen product path needs one real executable rebuild closure
### Gate 5: unified replay
Required:
- after control and execution closure land, rerun the accepted failure-class set on the unified live path
## Reuse Map
### Reuse directly
- `weed/server/master_grpc_server.go`
- `weed/server/volume_grpc_client_to_master.go`
- `weed/server/volume_server_block.go`
- `weed/server/master_block_registry.go`
- `weed/server/master_block_failover.go`
- `weed/storage/blockvol/blockvol.go`
- `weed/storage/blockvol/replica_apply.go`
- `weed/storage/blockvol/replica_barrier.go`
- `weed/storage/blockvol/v2bridge/`
### Reuse as implementation reality, not truth
- `shipperGroup`
- `RetentionFloorFn`
- `ReplicaReceiver`
- checkpoint/superblock machinery
- existing failover heuristics
### Do not inherit as V2 semantics
- address-shaped identity
- old degraded/catch-up intuition from V1/V1.5
- `CommittedLSN = CheckpointLSN` as final truth
- blockvol-side recovery policy decisions
## Short Rule
Use this skeleton as:
- a hardening map for the current product path
Do not mistake it for:
- the final standalone `sw-block` architecture

sw-block/docs/archive/design/v2-engine-readiness-review.md
@@ -0,0 +1,170 @@
# V2 Engine Readiness Review
Date: 2026-03-29
Status: historical readiness review
Purpose: record the decision on whether the current V2 design + prototype + simulator stack is strong enough to begin real V2 engine slicing
## Decision
Current judgment:
- proceed to real V2 engine planning
- do not open a `V2.5` redesign track at this time
This is a planning-readiness decision, not a production-readiness claim.
## Why This Review Exists
The project has now completed:
1. design/FSM closure for the V2 line
2. protocol simulation closure for:
- V1 / V1.5 / V2 comparison
- timeout/race behavior
- ownership/session semantics
3. standalone prototype closure for:
- sender/session ownership
- execution authority
- recovery branching
- minimal historical-data proof
- prototype scenario closure
4. `Phase 4.5` hardening for:
- bounded `CatchUp`
- first-class `Rebuild`
- crash-consistency / restart-recoverability
- `A5-A8` stronger evidence
So the question is no longer:
- "can the prototype be made richer?"
The question is:
- "is the evidence now strong enough to begin real engine slicing?"
## Evidence Summary
### 1. Design / Protocol
Primary docs:
- `sw-block/design/v2-acceptance-criteria.md`
- `sw-block/design/v2-open-questions.md`
- `sw-block/design/v2_scenarios.md`
- `sw-block/design/v1-v15-v2-comparison.md`
- `sw-block/docs/archive/design/v2-prototype-roadmap-and-gates.md`
Judgment:
- protocol story is coherent
- acceptance set exists
- major V1 / V1.5 failures are mapped into V2 scenarios
### 2. Simulator
Primary code/tests:
- `sw-block/prototype/distsim/`
- `sw-block/prototype/distsim/eventsim.go`
- `learn/projects/sw-block/test/results/v2-simulation-review.md`
Judgment:
- strong enough for protocol/design validation
- strong enough to challenge crash-consistency and liveness assumptions
- not a substitute for real engine / hardware proof
### 3. Prototype
Primary code/tests:
- `sw-block/prototype/enginev2/`
- `sw-block/prototype/enginev2/acceptance_test.go`
Judgment:
- ownership is explicit and fenced
- execution authority is explicit and fenced
- bounded `CatchUp` is semantic, not documentary
- `Rebuild` is a first-class sender-owned path
- historical-data and recoverability reasoning are executable
### 4. `A5-A8` Double Evidence
Prototype-side grouped evidence:
- `sw-block/prototype/enginev2/acceptance_test.go`
Simulator-side grouped evidence:
- `sw-block/docs/archive/design/a5-a8-traceability.md`
- `sw-block/prototype/distsim/`
Judgment:
- the critical acceptance items that most affect engine risk now have materially stronger proof on both sides
## What Is Good Enough Now
The following are good enough to begin engine slicing:
1. sender/session ownership model
2. stale authority fencing
3. recovery orchestration shape
4. bounded `CatchUp` contract
5. `Rebuild` as formal path
6. committed/recoverable boundary thinking
7. crash-consistency / restart-recoverability proof style
## What Is Still Not Proven
The following still require real engine work and later real-system validation:
1. actual engine lifecycle integration
2. real storage/backend implementation
3. real control-plane integration
4. real durability / fsync behavior under the actual engine
5. real hardware timing / performance
6. final production observability and failure handling
These are expected gaps. They do not block engine planning.
## Open Risks To Carry Forward
These are not blockers, but they should remain explicit:
1. prototype and simulator are still reduced models
2. rebuild-source quality in the real engine will depend on actual checkpoint/base-image mechanics
3. durability truth in the real engine must still be re-proven against actual persistence behavior
4. predicate exploration can still grow, but should not block engine slicing
## Engine-Planning Decision
Decision:
- start real V2 engine planning
Reason:
1. no current evidence points to a structural flaw requiring `V2.5`
2. the remaining gaps are implementation/system gaps, not prototype ambiguity
3. continuing to extend prototype/simulator breadth would have diminishing returns
## Required Outputs After This Review
1. `sw-block/docs/archive/design/v2-engine-slicing-plan.md`
2. first real engine slice definition
3. explicit non-goals for first engine stage
4. explicit validation plan for engine slices
## Non-Goals Of This Review
This review does not claim:
1. V2 is production-ready
2. V2 should replace V1 immediately
3. all design questions are forever closed
It only claims:
- the project now has enough evidence to begin disciplined real engine slicing

sw-block/docs/archive/design/v2-engine-slicing-plan.md
# V2 Engine Slicing Plan
Date: 2026-03-29
Status: historical slicing plan
Purpose: define the first real V2 engine slices after prototype and `Phase 4.5` closure
## Goal
Move from:
- standalone design/prototype truth under `sw-block/prototype/`
to:
- a real V2 engine core under `sw-block/`
without dragging V1.5 lifecycle assumptions into the implementation.
## Planning Rules
1. reuse V1 ideas and tests selectively, not structurally
2. prefer narrow vertical slices over broad skeletons
3. each slice must preserve the accepted V2 ownership/fencing model
4. keep simulator/prototype as validation support, not as the implementation itself
5. do not mix V2 engine work into `weed/storage/blockvol/`
## First Engine Stage
The first engine stage should build the control/recovery core, not the full storage engine.
That means:
1. per-replica sender identity
2. one active recovery session per replica per epoch
3. sender-owned execution authority
4. explicit recovery outcomes:
- zero gap
- bounded catch-up
- rebuild
5. rebuild execution shell only
- do not hard-code final snapshot + tail vs full base decision logic yet
- keep real rebuild-source choice tied to Slice 3 recoverability inputs
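The three explicit recovery outcomes above can be condensed into one pure decision over LSNs. This is a hedged sketch, not real engine code: the function name `DecideOutcome` and the `retainedFloor` parameter (standing in for the oldest retained history position) are illustrative assumptions.

```go
package main

import "fmt"

type Outcome int

const (
	ZeroGap Outcome = iota
	BoundedCatchUp
	Rebuild
)

// DecideOutcome classifies one replica's recovery need from its flushed LSN,
// the primary's target LSN, and the oldest retained history position.
// Pure illustration of the three explicit outcomes, not engine code.
func DecideOutcome(replicaLSN, targetLSN, retainedFloor uint64) Outcome {
	switch {
	case replicaLSN >= targetLSN:
		return ZeroGap
	case replicaLSN >= retainedFloor:
		return BoundedCatchUp
	default:
		return Rebuild
	}
}

func main() {
	fmt.Println(DecideOutcome(10, 10, 4)) // 0 = zero gap
	fmt.Println(DecideOutcome(6, 10, 4))  // 1 = bounded catch-up
	fmt.Println(DecideOutcome(2, 10, 4))  // 2 = rebuild
}
```

The point of keeping the decision pure is that the same classification can be exercised by prototype, simulator, and engine tests without any I/O.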
## Recommended Slice Order
### Slice 1: Engine Ownership Core
Purpose:
- carry the accepted `enginev2` ownership/fencing model into the real engine core
Scope:
1. stable per-replica sender object
2. stable recovery-session object
3. session identity fencing
4. endpoint / epoch invalidation
5. sender-group or equivalent ownership registry
Acceptance:
1. stale session results cannot mutate current authority
2. changed-address and epoch-bump invalidation work in engine code
3. the 4 V2-boundary ownership themes remain provable
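Acceptance item 1 (stale results cannot mutate current authority) reduces to a fencing check on every result the sender accepts. A minimal sketch, with illustrative names (`Sender`, `SessionResult`, `ApplyResult` are assumptions, not the real engine API):

```go
package main

import "fmt"

// SessionResult is what a (possibly stale) recovery session reports back.
type SessionResult struct {
	SessionID  uint64
	Epoch      uint64
	FlushedLSN uint64
}

// Sender holds current authority for one replica.
type Sender struct {
	epoch      uint64
	sessionID  uint64
	flushedLSN uint64
}

// ApplyResult mutates authority only when the result carries the current
// session identity and epoch; anything else is fenced out.
func (s *Sender) ApplyResult(r SessionResult) bool {
	if r.SessionID != s.sessionID || r.Epoch != s.epoch {
		return false
	}
	s.flushedLSN = r.FlushedLSN
	return true
}

func main() {
	s := &Sender{epoch: 3, sessionID: 7}
	fmt.Println(s.ApplyResult(SessionResult{SessionID: 6, Epoch: 3, FlushedLSN: 100})) // false: stale session
	fmt.Println(s.ApplyResult(SessionResult{SessionID: 7, Epoch: 2, FlushedLSN: 100})) // false: stale epoch
	fmt.Println(s.ApplyResult(SessionResult{SessionID: 7, Epoch: 3, FlushedLSN: 100})) // true
	fmt.Println(s.flushedLSN)                                                          // 100
}
```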
### Slice 2: Engine Recovery Execution Core
Purpose:
- move the prototype execution APIs into real engine behavior
Scope:
1. connect / handshake / catch-up flow
2. bounded `CatchUp`
3. explicit `NeedsRebuild`
4. sender-owned rebuild execution path
5. rebuild execution shell without final trusted-base selection policy
Acceptance:
1. bounded catch-up does not chase indefinitely
2. rebuild is exclusive from catch-up
3. session completion rules are explicit and fenced
### Slice 3: Engine Data / Recoverability Core
Purpose:
- connect recovery behavior to real retained-history / checkpoint mechanics
Scope:
1. real recoverability decision inputs
2. trusted-base decision for rebuild source
3. minimal real checkpoint/base-image integration
4. real truncation / safe-boundary handling
This is the first slice that should decide, from real engine inputs, between:
1. `snapshot + tail`
2. `full base`
Acceptance:
1. engine can explain why recovery is allowed
2. rebuild-source choice is explicit and testable
3. historical correctness and truncation rules remain intact
### Slice 4: Engine Integration Closure
Purpose:
- bind engine control/recovery core to real orchestration and validation surfaces
Scope:
1. real assignment/control intent entry path
2. engine-facing observability
3. focused real-engine tests for V2-boundary cases
4. first integration review against real failure classes
Acceptance:
1. key V2-boundary failures are reproduced and closed in engine tests
2. engine observability is good enough to debug ownership/recovery failures
3. remaining gaps are system/performance gaps, not control-model ambiguity
## What To Reuse
Good reuse candidates:
1. tests and failure cases from V1 / V1.5
2. narrow utility/data helpers where not coupled to V1 lifecycle
3. selected WAL/history concepts if they fit V2 ownership boundaries
Do not structurally reuse:
1. V1/V1.5 shipper lifecycle
2. address-based identity assumptions
3. `SetReplicaAddrs`-style behavior
4. old recovery control structure
## Where The Work Should Live
Real V2 engine work should continue under:
- `sw-block/`
Recommended next area:
- `sw-block/core/`
or
- `sw-block/engine/`
Exact path can be chosen later, but it should remain separate from:
- `sw-block/prototype/`
- `weed/storage/blockvol/`
## Validation Plan For Engine Slices
Each engine slice should be validated at three levels:
1. prototype alignment
- does engine behavior preserve the accepted prototype invariant?
2. focused engine tests
- does the real engine slice enforce the same contract?
3. scenario mapping
- does at least one important V1/V1.5 failure class remain closed?
## Non-Goals For First Engine Stage
Do not try to do these immediately:
1. full Smart WAL expansion
2. performance optimization
3. V1 replacement/migration plan
4. full product integration
5. all storage/backend redesign at once
## Immediate Next Assignment
The first concrete engine-planning task should be:
1. choose the real V2 engine module location under `sw-block/`
2. define Slice 1 file/module boundaries
3. write a short engine ownership-core spec
4. map 3-5 acceptance scenarios directly onto Slice 1 expectations

sw-block/docs/archive/design/v2-first-slice-sender-ownership.md
# V2 First Slice: Per-Replica Sender/Session Ownership
Date: 2026-03-27
Status: historical first-slice note
Depends-on: Q1 (recovery session), Q6 (orchestrator scope), Q7 (first slice)
## Problem
`SetReplicaAddrs()` replaces the entire `ShipperGroup` atomically. This causes:
1. **State loss on topology change.** All shippers are destroyed and recreated.
Recovery state (`replicaFlushedLSN`, `lastContactTime`, catch-up progress) is lost.
After a changed-address restart, the new shipper starts from scratch.
2. **No per-replica identity.** Shippers are identified by array index. The master
cannot target a specific replica for rebuild/catch-up — it must re-issue the
entire address set.
3. **Background reconnect races.** A reconnect cycle may be in progress when
`SetReplicaAddrs` replaces the group. The in-progress reconnect's connection
objects become orphaned.
## Design
### Per-replica sender identity
`ShipperGroup` changes from `[]*WALShipper` to `map[string]*WALShipper`, keyed by
the replica's canonical data address. Each shipper stores its own `ReplicaID`.
```go
type WALShipper struct {
	ReplicaID string // canonical data address — identity across reconnects
	// ... existing fields
}

type ShipperGroup struct {
	mu       sync.RWMutex
	shippers map[string]*WALShipper // keyed by ReplicaID
}
```
### ReconcileReplicas replaces SetReplicaAddrs
Instead of replacing the entire group, `ReconcileReplicas` diffs old vs new:
```
ReconcileReplicas(newAddrs []ReplicaAddr):
  for each existing shipper:
    if NOT in newAddrs → Stop and remove
  for each newAddr:
    if matching shipper exists → keep (preserve state)
    if no match → create new shipper
```
This preserves `replicaFlushedLSN`, `lastContactTime`, catch-up progress, and
background reconnect goroutines for replicas that stay in the set.
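Under these assumptions, the diff step can be sketched as runnable Go. Types are reduced stand-ins (`ReplicaAddr.DataAddr` and the `ShipperFactory` signature are illustrative, not the real ones):

```go
package main

import (
	"fmt"
	"sync"
)

// Reduced stand-ins for the real types.
type ReplicaAddr struct{ DataAddr string }

type WALShipper struct {
	ReplicaID string
	stopped   bool
}

func (s *WALShipper) Stop() { s.stopped = true }

type ShipperFactory func(ReplicaAddr) *WALShipper

type ShipperGroup struct {
	mu       sync.RWMutex
	shippers map[string]*WALShipper
}

// ReconcileReplicas diffs the desired set against existing shippers:
// shippers whose replica left the set are stopped and removed, shippers
// that remain keep their state, and new replicas get fresh shippers.
func (g *ShipperGroup) ReconcileReplicas(newAddrs []ReplicaAddr, factory ShipperFactory) {
	g.mu.Lock()
	defer g.mu.Unlock()
	want := make(map[string]ReplicaAddr, len(newAddrs))
	for _, a := range newAddrs {
		want[a.DataAddr] = a
	}
	for id, s := range g.shippers {
		if _, ok := want[id]; !ok {
			s.Stop()
			delete(g.shippers, id)
		}
	}
	for id, a := range want {
		if _, ok := g.shippers[id]; !ok {
			g.shippers[id] = factory(a)
		}
	}
}

func main() {
	g := &ShipperGroup{shippers: map[string]*WALShipper{}}
	mk := func(a ReplicaAddr) *WALShipper { return &WALShipper{ReplicaID: a.DataAddr} }
	g.ReconcileReplicas([]ReplicaAddr{{"r1"}, {"r2"}}, mk)
	kept := g.shippers["r1"]
	g.ReconcileReplicas([]ReplicaAddr{{"r1"}, {"r3"}}, mk) // r2 removed, r3 added
	fmt.Println(g.shippers["r1"] == kept, len(g.shippers)) // true 2
}
```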
`SetReplicaAddrs` becomes a wrapper:
```go
func (v *BlockVol) SetReplicaAddrs(addrs []ReplicaAddr) {
	if v.shipperGroup == nil {
		v.shipperGroup = NewShipperGroup(nil)
	}
	v.shipperGroup.ReconcileReplicas(addrs, v.makeShipperFactory())
}
```
### Changed-address restart flow
1. Replica restarts on new port. Heartbeat reports new address.
2. Master detects endpoint change (address differs, same volume).
3. Master sends assignment update to primary with new replica address.
4. Primary's `ReconcileReplicas` receives `[oldAddr1, newAddr2]`.
5. Old shipper for the changed replica is stopped (old address gone from set).
6. A new shipper is created for the new address; it starts with no preserved state.
7. New shipper bootstraps: Disconnected → Connecting → CatchingUp → InSync.
The improvement over V1.5: the **other** replicas in the set are NOT disturbed.
Only the changed replica gets a fresh shipper. Recovery state for stable replicas
is preserved.
### Recovery session
Each WALShipper already contains the recovery state machine:
- `state` (Disconnected → Connecting → CatchingUp → InSync → Degraded → NeedsRebuild)
- `replicaFlushedLSN` (authoritative progress)
- `lastContactTime` (retention budget)
- `catchupFailures` (escalation counter)
- Background reconnect goroutine
No separate `RecoverySession` object is needed. The WALShipper IS the per-replica
recovery session. The state machine already tracks the session lifecycle.
What changes: the session is no longer destroyed on topology change (unless the
replica itself is removed from the set).
### Coordinator vs primary responsibilities
| Responsibility | Owner |
|---------------|-------|
| Endpoint truth (canonical address) | Coordinator (master) |
| Assignment updates (add/remove replicas) | Coordinator |
| Epoch authority | Coordinator |
| Session creation trigger | Coordinator (via assignment) |
| Session execution (reconnect, catch-up, barrier) | Primary (via WALShipper) |
| Timeout enforcement | Primary |
| Ordered receive/apply | Replica |
| Barrier ack | Replica |
| Heartbeat reporting | Replica |
### Migration from current code
| Current | V2 |
|---------|-----|
| `ShipperGroup.shippers []*WALShipper` | `ShipperGroup.shippers map[string]*WALShipper` |
| `SetReplicaAddrs()` creates all new | `ReconcileReplicas()` diffs and preserves |
| `StopAll()` in demote | `StopAll()` unchanged (stops all) |
| `ShipAll(entry)` iterates slice | `ShipAll(entry)` iterates map values |
| `BarrierAll(lsn)` parallel slice | `BarrierAll(lsn)` parallel map values |
| `MinReplicaFlushedLSN()` iterates slice | Same, iterates map values |
| `ShipperStates()` iterates slice | Same, iterates map values |
| No per-shipper identity | `WALShipper.ReplicaID` = canonical data addr |
### Files changed
| File | Change |
|------|--------|
| `wal_shipper.go` | Add `ReplicaID` field, pass in constructor |
| `shipper_group.go` | `map[string]*WALShipper`, `ReconcileReplicas`, update iterators |
| `blockvol.go` | `SetReplicaAddrs` calls `ReconcileReplicas`, shipper factory |
| `promotion.go` | No change (StopAll unchanged) |
| `dist_group_commit.go` | No change (uses ShipperGroup API) |
| `block_heartbeat.go` | No change (uses ShipperStates) |
### Acceptance bar
The following existing tests must continue to pass:
- All CP13-1 through CP13-7 protocol tests (sync_all_protocol_test.go)
- All adversarial tests (sync_all_adversarial_test.go)
- All baseline tests (sync_all_bug_test.go)
- All rebuild tests (rebuild_v1_test.go)
The following CP13-8 tests validate the V2 improvement:
- `TestCP13_SyncAll_ReplicaRestart_Rejoin` — changed-address recovery
- `TestAdversarial_ReconnectUsesHandshakeNotBootstrap` — V2 reconnect protocol
- `TestAdversarial_CatchupMultipleDisconnects` — state preservation across reconnects
New tests to add:
- `TestReconcileReplicas_PreservesExistingShipper` — stable replica keeps state
- `TestReconcileReplicas_RemovesStaleShipper` — removed replica stopped
- `TestReconcileReplicas_AddsNewShipper` — new replica bootstraps
- `TestReconcileReplicas_MixedUpdate` — one kept, one removed, one added
## Non-goals for this slice
- Smart WAL payload classes
- Recovery reservation protocol
- Full coordinator orchestration
- New transport layer

sw-block/docs/archive/design/v2-first-slice-session-ownership.md
# V2 First Slice: Per-Replica Sender and Recovery Session Ownership
Date: 2026-03-27
Status: historical first-slice note
## Purpose
This document defines the first real V2 implementation slice.
The slice is intentionally narrow:
- per-replica sender ownership
- explicit recovery session ownership
- clear coordinator vs primary responsibility
This is the first step toward a standalone V2 block engine under `sw-block/`.
## Why This Slice First
It directly addresses the clearest V1.5 structural limits:
- sender identity loss when replica sets are refreshed
- changed-address restart recovery complexity
- repeated reconnect cycles without stable per-replica ownership
- adversarial Phase 13 boundary tests that V1.5 cannot cleanly satisfy
It also avoids jumping too early into:
- Smart WAL
- new backend storage layout
- full production transport redesign
## Core Decision
Use:
- **one sender owner per replica**
- **at most one active recovery session per replica per epoch**
Healthy replicas need only their steady sender object.
Degraded / reconnecting replicas gain an explicit recovery session owned by the primary.
## Ownership Split
### Coordinator
Owns:
- replica identity / endpoint truth
- assignment updates
- epoch authority
- session creation / destruction intent
Does not own:
- byte-by-byte catch-up execution
- local sender loop scheduling
### Primary
Owns:
- per-replica sender objects
- per-replica recovery session execution
- reconnect / catch-up progress
- timeout enforcement for active session
- transition from:
- normal sender
- to recovery session
- back to normal sender
### Replica
Owns:
- receive/apply path
- barrier ack
- heartbeat/reporting
Replica remains passive from the recovery-orchestration point of view.
## Data Model
### Sender Owner
Per replica, maintain a stable sender owner with:
- replica logical ID
- current endpoint
- current epoch view
- steady-state health/status
- optional active recovery session reference
### Recovery Session
Per replica, per epoch:
- `ReplicaID`
- `Epoch`
- `EndpointVersion` or equivalent endpoint truth
- `State`
- `connecting`
- `catching_up`
- `in_sync`
- `needs_rebuild`
- `StartLSN`
- `TargetLSN`
- timeout / deadline metadata
## Session Rules
1. only one active session per replica per epoch
2. new assignment for same replica:
- supersedes old session only if epoch/session generation is newer
3. stale session must not continue after:
- epoch bump
- endpoint truth change
- explicit coordinator replacement
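The session rules above condense into one comparison. This is a sketch under stated assumptions: the struct fields and the `SessionGen` generation counter are illustrative, not the engine's actual data model.

```go
package main

import "fmt"

type RecoverySession struct {
	ReplicaID  string
	Epoch      uint64
	SessionGen uint64 // generation within an epoch; illustrative
	State      string // connecting | catching_up | in_sync | needs_rebuild
}

// Supersede returns the session that remains active: a new assignment
// replaces the current session only if it is strictly newer by
// (epoch, generation); stale assignments are dropped.
func Supersede(cur, next *RecoverySession) *RecoverySession {
	if cur == nil {
		return next
	}
	if next.Epoch > cur.Epoch ||
		(next.Epoch == cur.Epoch && next.SessionGen > cur.SessionGen) {
		return next
	}
	return cur
}

func main() {
	cur := &RecoverySession{ReplicaID: "r1", Epoch: 2, SessionGen: 5}
	stale := &RecoverySession{ReplicaID: "r1", Epoch: 2, SessionGen: 4}
	newer := &RecoverySession{ReplicaID: "r1", Epoch: 3, SessionGen: 1}
	fmt.Println(Supersede(cur, stale) == cur)   // true: stale assignment dropped
	fmt.Println(Supersede(cur, newer) == newer) // true: epoch bump supersedes
}
```

Because the comparison is total and strict, "only one active session per replica per epoch" holds by construction: two assignments can never both survive.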
## Minimal State Transitions
### Healthy path
1. replica sender exists
2. sender ships normally
3. replica remains `InSync`
### Recovery path
1. sender detects or is told replica is not healthy
2. coordinator provides valid assignment/endpoint truth
3. primary creates recovery session
4. session connects
5. session catches up if recoverable
6. on success:
- session closes
- steady sender resumes normal state
### Rebuild path
1. session determines catch-up is not sufficient
2. session transitions to `needs_rebuild`
3. higher layer rebuild flow takes over
## What This Slice Does Not Include
Not in the first slice:
- Smart WAL payload classes in production
- snapshot pinning / GC logic
- new on-disk engine
- frontend publication changes
- full production event scheduler
## Proposed V2 Workspace Target
Do this under `sw-block/`, not `weed/storage/blockvol/`.
Suggested area:
- `sw-block/prototype/enginev2/`
Suggested first files:
- `sw-block/prototype/enginev2/session.go`
- `sw-block/prototype/enginev2/sender.go`
- `sw-block/prototype/enginev2/group.go`
- `sw-block/prototype/enginev2/session_test.go`
The first code does not need full storage I/O.
It should prove ownership and transition shape first.
## Acceptance For This Slice
The slice is good enough when:
1. sender identity is stable per replica
2. changed-address reassignment updates the right sender owner
3. multiple reconnect cycles do not lose recovery ownership
4. stale session does not survive epoch bump
5. the 4 Phase 13 V2-boundary tests have a clear path to become satisfiable
## Relationship To Existing Simulator
This slice should align with:
- `v2-acceptance-criteria.md`
- `v2-open-questions.md`
- `v1-v15-v2-comparison.md`
- `distsim` / `eventsim` behavior
The simulator remains the design oracle.
The first implementation slice should not contradict it.

sw-block/docs/archive/design/v2-production-roadmap.md
# V2 Production Roadmap
Date: 2026-03-30
Status: historical roadmap
Purpose: define the path from the accepted V2 engine core to a production candidate
## Current Position
Completed:
1. design / FSM closure
2. simulator / protocol validation
3. prototype closure
4. evidence hardening
5. engine core slices:
- Slice 1 ownership core
- Slice 2 recovery execution core
- Slice 3 data / recoverability core
- Slice 4 integration closure
Current stage:
- entering broader engine implementation
This means the main risk is no longer:
- whether the V2 idea stands up
The main risk is:
- whether the accepted engine core can be turned into a real system without reintroducing V1/V1.5 structure and semantics
## Roadmap Summary
1. Phase 06: broader engine implementation stage
2. Phase 07: real-system integration / product-path decision
3. Phase 08: pre-production hardening
4. Phase 09: performance / scale / soak validation
5. Phase 10: production candidate and rollout gate
## Phase 06
### Goal
Connect the accepted engine core to:
1. real control truth
2. real storage truth
3. explicit engine execution steps
### Outputs
1. control-plane adapter into the engine core
2. storage/base/recoverability adapters
3. explicit execution-driver model where synchronous helpers are no longer sufficient
4. validation against selected real failure classes
### Gate
At the end of Phase 06, the project should be able to say:
- the engine core can live inside a real system shape
## Phase 07
### Goal
Move from engine-local correctness to a real runnable subsystem.
### Outputs
1. service-style runnable engine slice
2. integration with real control and storage surfaces
3. crash/failover/restart integration tests
4. decision on the first viable product path
### Gate
At the end of Phase 07, the project should be able to say:
- the engine can run as a real subsystem, not only as an isolated core
## Phase 08
### Goal
Turn correctness into operational safety.
### Outputs
1. observability hardening
2. operator/debug flows
3. recovery/runbook procedures
4. config surface cleanup
5. realistic durability/restart validation
### Gate
At the end of Phase 08, the project should be able to say:
- operators can run, debug, and recover the system safely
## Phase 09
### Goal
Prove viability under load and over time.
### Outputs
1. throughput / latency baselines
2. rebuild / catch-up cost characterization
3. steady-state overhead measurement
4. soak testing
5. scale and failure-under-load validation
### Gate
At the end of Phase 09, the project should be able to say:
- the design is not only correct, but viable at useful scale and duration
## Phase 10
### Goal
Produce a controlled production candidate.
### Outputs
1. feature-gated production candidate
2. rollback strategy
3. migration/coexistence plan with V1
4. staged rollout plan
5. production acceptance checklist
### Gate
At the end of Phase 10, the project should be able to say:
- the system is ready for a controlled production rollout
## Cross-Phase Rules
### Rule 1: Do not reopen protocol shape casually
The accepted core should remain stable unless new implementation evidence forces a change.
### Rule 2: Use V1 as validation source, not design template
Use:
1. `learn/projects/sw-block/`
2. `weed/storage/block*`
for:
1. failure gates
2. constraints
3. integration references
Do not use them as the default V2 architecture template.
### Rule 3: Keep `CatchUp` narrow
Do not let later implementation phases re-expand `CatchUp` into a broad, optimistic, long-lived recovery mode.
### Rule 4: Keep evidence quality ahead of object growth
New work should preferentially improve:
1. traceability
2. diagnosability
3. real-failure validation
4. operational confidence
not simply add new objects, states, or mechanisms.
## Production Readiness Ladder
The project should move through this ladder explicitly:
1. proof-of-design
2. proof-of-engine-shape
3. proof-of-runnable-engine-stage
4. proof-of-operable-system
5. proof-of-viable-production-candidate
Current ladder position:
- between `2` and `3`
- engine core accepted; broader runnable engine stage underway
## Next Documents To Maintain
1. `sw-block/.private/phase/phase-06.md`
2. `sw-block/docs/archive/design/v2-engine-readiness-review.md`
3. `sw-block/docs/archive/design/v2-engine-slicing-plan.md`
4. this roadmap

sw-block/docs/archive/design/v2-prototype-roadmap-and-gates.md
# V2 Prototype Roadmap And Gates
Date: 2026-03-27
Status: historical prototype roadmap
Purpose: define the remaining prototype roadmap, the validation gates between stages, and the decision point between real V2 engine work and possible V2.5 redesign
## Current Position
V2 design/FSM/simulator work is sufficiently closed for serious prototyping, but not frozen against later `V2.5` adjustments.
Current state:
- design proof: high
- execution proof: medium
- data/recovery proof: low
- prototype end-to-end proof: low
Rough prototype progress:
- `25%` to `35%`
This is an early executable prototype, not an engine-ready prototype.
## Roadmap Goal
Answer these questions with prototype evidence:
- can V2 become a real engine path?
- or should it become `V2.5` before real implementation begins?
## Step 1: Execution Authority Closure
Purpose:
- finish the sender / recovery-session authority model so stale work is unambiguously rejected
Scope:
1. ownership-only `AttachSession()` / `SupersedeSession()`
2. execution begins only through execution APIs
3. stale handshake / progress / completion fenced by `sessionID`
4. endpoint bump / epoch bump invalidate execution authority
5. sender-group preserve-or-kill behavior is explicit
Done when:
1. all execution APIs are sender-gated and reject stale `sessionID`
2. session creation is separated from execution start
3. phase ordering is enforced
4. endpoint bump / epoch bump invalidate execution authority correctly
5. mixed add/remove/update reconciliation preserves or kills state exactly as intended
Main files:
- `sw-block/prototype/enginev2/`
- `sw-block/prototype/distsim/`
- `learn/projects/sw-block/phases/phase-13-v2-boundary-tests.md`
Key gate:
- old recovery work cannot mutate current sender state at any execution stage
## Step 2: Orchestrated Recovery Prototype
Purpose:
- move from good local sender APIs to an actual prototype recovery flow driven by assignment/update intent
Scope:
1. assignment/update intent creates or supersedes recovery attempts
2. reconnect / reassignment / catch-up / rebuild decision path
3. sender-group becomes orchestration entry point
4. explicit outcome branching:
- zero-gap fast completion
- positive-gap catch-up
- unrecoverable gap -> `NeedsRebuild`
Done when:
1. the prototype expresses a realistic recovery flow from topology/control intent
2. sender-group drives recovery creation, not only unit helpers
3. recovery outcomes are explicit and testable
4. orchestrator responsibility is clear enough to narrow `v2-open-questions.md` item 6
Key gate:
- recovery control is no longer scattered across helper calls; it has one clear orchestration path
## Step 3: Minimal Historical Data Prototype
Purpose:
- prove the recovery model against real data-history assumptions, not only control logic
Scope:
1. minimal WAL/history model, not full engine
2. enough to exercise:
- catch-up range
- retained prefix/window
- rebuild fallback
- historical correctness at target LSN
3. enough reservation/recoverability state to make recovery explicit
Done when:
1. the prototype can prove why a gap is recoverable or unrecoverable
2. catch-up and rebuild decisions are backed by minimal data/history state
3. `v2-open-questions.md` items 3, 4, 5 are closed or sharply narrowed
4. prototype evidence strengthens acceptance criteria `A5`, `A6`, and `A7`
Key gate:
- the prototype must explain why recovery is allowed, not just that policy says it is
## Step 4: Prototype Scenario Closure
Purpose:
- make the prototype itself demonstrate the V2 story end-to-end
Scope:
1. map key V2 scenarios onto the prototype
2. express the 4 V2-boundary cases against prototype behavior
3. add one small end-to-end harness inside `sw-block/prototype/`
4. align prototype evidence with acceptance criteria
Done when:
1. prototype behavior can be reviewed scenario-by-scenario
2. key V1/V1.5 failures have prototype equivalents
3. prototype outcomes match intended V2 design claims
4. remaining gaps are clearly real-engine gaps, not protocol/prototype ambiguity
Key gate:
- a reviewer can trace:
- acceptance criteria -> scenario -> prototype behavior
without hand-waving
## Gates
### Gate 1: Design Closed Enough
Status:
- mostly passed
Meaning:
1. acceptance criteria exist
2. core simulator exists
3. ownership gap from V1.5 is understood
### Gate 2: Execution Authority Closed
Passes after Step 1.
Meaning:
- stale execution results cannot mutate current authority
### Gate 3: Orchestrated Recovery Closed
Passes after Step 2.
Meaning:
- recovery flow is controlled by one coherent orchestration model
### Gate 4: Historical Data Model Closed
Passes after Step 3.
Meaning:
- catch-up vs rebuild is backed by executable data-history logic
### Gate 5: Prototype Convincing
Passes after Step 4.
Meaning:
- enough evidence exists to choose:
- real V2 engine path
- or `V2.5` redesign
## Decision Gate After Step 4
### Path A: Real V2 Engine Planning
Choose this if:
1. prototype control logic is coherent
2. recovery boundary is explicit
3. boundary cases are convincing
4. no major structural flaw remains
Outputs:
1. real engine slicing plan
2. migration/integration plan into future standalone `sw-block`
3. explicit non-goals for first production version
### Path B: V2.5 Redesign
Choose this if the prototype reveals:
1. ownership/orchestration still too fragile
2. recovery boundary still too implicit
3. historical correctness model too costly or too unclear
4. too much complexity leaks into the hot path
Output:
- write `V2.5` as a design/prototype correction before engine work
## What Not To Do Yet
1. no Smart WAL expansion beyond what Step 3 minimally needs
2. no backend/storage-engine redesign
3. no V1 production integration
4. no frontend/wire protocol work
5. no performance optimization as a primary goal
## Practical Summary
Current sequence:
1. finish execution authority
2. build orchestrated recovery
3. add minimal historical-data proof
4. close key scenarios against the prototype
5. decide:
- V2 engine
- or `V2.5`