Engine executors now have IO interfaces for real bridge I/O:
- CatchUpExecutor.IO (CatchUpIO): StreamWALEntries
- RebuildExecutor.IO (RebuildIO): TransferFullBase, TransferSnapshot,
StreamWALEntries (for tail replay)
When IO is set, executor calls real bridge I/O during execution.
When IO is nil, executor uses caller-supplied progress (test mode).
RecoveryPlan.CatchUpStartLSN: bound at plan time for IO bridge.
v2bridge.Executor now implements both interfaces:
- StreamWALEntries: real ScanFrom
- TransferFullBase: validates extent accessible
- TransferSnapshot: validates checkpoint accessible
Chain tests wire IO:
- CatchUpClosure: exec.IO = executor → real WAL scan through engine
- RebuildClosure: exec.IO = executor → real transfer through engine
This closes the engine → executor → v2bridge → blockvol chain.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Finding 1: Identity no longer address-derived
- ReplicaAddr.ServerID field added (stable server identity from registry)
- BlockVolumeAssignment.ReplicaServerID field added (scalar RF=2 path)
- ControlBridge uses ServerID, NOT address, for ReplicaID
- Missing ServerID → replica skipped (fail closed), logged
Finding 2: Wired into real ProcessAssignments
- BlockService.v2Bridge field initialized in StartBlockService
- ProcessAssignments converts each assignment via v2Bridge.ConvertAssignment
BEFORE existing V1 processing (parallel, not replacing yet)
- Logged at glog V(1)
Finding 3: Fail-closed on missing identity
- Empty ServerID in ReplicaAddrs → replica skipped with log
- Empty ReplicaServerID in scalar path → no replica created
- Test: MissingServerID_FailsClosed verifies both paths
7 tests: StableServerID, AddressChange_IdentityPreserved,
MultiReplica_StableServerIDs, MissingServerID_FailsClosed,
EpochFencing_IntegratedPath, RebuildAssignment, ReplicaAssignment
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
CatchUpExecutor.OnStep: optional callback fired between executor-managed
progress steps. Enables deterministic fault injection (epoch bump)
between steps without racing or manual sender calls.
E2_EpochBump_MidExecutorLoop:
- Executor runs 5 progress steps
- OnStep hook bumps epoch after step 1 (after 2 successful steps)
- Executor's own loop detects invalidation at step 2's check
- Resources released by executor's release path (not manual cancel)
- Log shows session_invalidated + exec_resources_released
This closes the remaining FC2 gap: invalidation is now detected
and cleaned up by the executor itself, not by external code.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Planner/executor contract:
- RebuildExecutor.Execute() takes no arguments — consumes plan-bound
RebuildSource, RebuildSnapshotLSN, RebuildTargetLSN
- RecoveryPlan binds all rebuild targets at plan time
- Executor cannot re-derive policy from caller-supplied history
Catch-up timing:
- Removed unused completeTick parameter from CatchUpExecutor.Execute
- Per-step ticks synthesized as startTick + stepIndex + 1
- API shape matches implementation
New test: PlanExecuteConsistency_RebuildCannotSwitchSource
- Plans snapshot+tail, then mutates storage history
- Executor succeeds using plan-bound values (not re-derived)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Full-base rebuild resource:
- StorageAdapter.PinFullBase/ReleaseFullBase for full-extent base image
- PlanRebuild full_base branch now acquires FullBasePin
- RecoveryPlan.FullBasePin field, released by ReleasePlan
Session cleanup on resource failure:
- PlanRecovery invalidates session when WAL pin fails
(no dangling live session after failed resource acquisition)
3 new tests:
- PlanRebuild_FullBase_PinsBaseImage: pin acquired + released
- PlanRebuild_FullBase_PinFailure: logged + error
- PlanRecovery_WALPinFailure_CleansUpSession: session invalidated,
sender disconnected (no dangling state)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ProcessAssignment now compares pre/post endpoint state before
logging session_invalidated with "endpoint_changed" reason.
Normal session supersede (same endpoint, assignment_intent) no
longer mislabeled as endpoint change.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Zero-gap completion:
- ExecuteRecovery auto-completes zero-gap sessions (no sender call needed)
- RecoveryResult.FinalState = StateInSync for zero-gap
Epoch transition:
- UpdateSenderEpoch: orchestrator-owned epoch advancement with auto-log
- InvalidateEpoch: per-replica session_invalidated events (not aggregate)
Endpoint-change invalidation:
- ProcessAssignment detects session ID change from endpoint update
- Logs per-replica session_invalidated with "endpoint_changed" reason
All integration tests now use orchestrator exclusively for core lifecycle.
No direct sender API calls for recovery execution in integration tests.
1 new test: EndpointChange_LogsInvalidation
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
RecordHandshakeFromHistory and SelectRebuildFromHistory now
return an error instead of panicking on nil history input.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New file: history.go — RetainedHistory connects recovery decisions
to actual WAL retention state:
- IsRecoverable: checks gap against tail/head boundaries
- MakeHandshakeResult: generates HandshakeResult from retention state
- RebuildSourceDecision: chooses snapshot+tail vs full base from
checkpoint state (trusted vs untrusted)
- ProveRecoverability: generates explicit proof explaining why
recovery is or is not allowed
14 new tests (recoverability_test.go):
- Recoverable/unrecoverable gap (exact boundary, beyond head)
- Trusted/untrusted/no checkpoint → rebuild source selection
- Handshake from retained history → outcome classification
- Recoverability proofs (zero-gap, ahead, within retention, beyond)
- E2E: two replicas driven by retained history (catch-up + rebuild)
- Truncation required for replica ahead of committed
Engine module at 44 tests (12 + 18 + 14).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Entry counting:
- Session.setRange now initializes recoveredTo = startLSN
- RecordCatchUpProgress delta counts only actual catch-up work
(recoveredTo - startLSN), not the replica's pre-existing prefix
Rebuild transfer gate:
- BeginTailReplay requires TransferredTo >= SnapshotLSN
- Prevents tail replay on incomplete base transfer
3 new regression tests:
- BudgetEntries_NonZeroStart_CountsOnlyDelta (30 entries within 50 budget)
- BudgetEntries_NonZeroStart_ExceedsBudget (30 entries exceeds 20 budget)
- Rebuild_PartialTransfer_BlocksTailReplay
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Registry is now keyed by stable ReplicaID, not by address.
DataAddr changes preserve sender identity — the core V2 invariant.
Changes:
- ReplicaAssignment{ReplicaID, Endpoint} replaces map[string]Endpoint
- AssignmentIntent.Replicas uses []ReplicaAssignment
- Registry.Reconcile takes []ReplicaAssignment
- Tests use stable IDs ("replica-1", "r1") independent of addresses
New test: ChangedDataAddr_PreservesSenderIdentity
- Same ReplicaID, different DataAddr (10.0.0.1 → 10.0.0.2)
- Sender pointer preserved, session invalidated, new session attached
- This is the exact V1/V1.5 regression that V2 must fix
doc.go: clarified Slice 1 core vs carried-forward files
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
All mutable state on Sender and Session is now unexported:
- Sender.state, .epoch, .endpoint, .session, .stopped → accessors
- Session.id, .phase, .kind, etc. → read-only accessors
- Session() replaced by SessionSnapshot() (returns disconnected copy)
- SessionID() and HasActiveSession() for common queries
- AttachSession returns (sessionID, error) not (*Session, error)
- SupersedeSession returns sessionID not *Session
Budget configuration via SessionOption:
- WithBudget(CatchUpBudget) passed to AttachSession
- No direct field mutation on session from external code
New test: Encapsulation_SnapshotIsReadOnly proves snapshot
mutation does not leak back to sender state.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Frozen target is now unconditional:
- FrozenTargetLSN field on RecoverySession, set by BeginCatchUp
- RecordCatchUpProgress enforces FrozenTargetLSN regardless of Budget
- Catch-up is always a bounded (R, H0] contract
Rebuild completion exclusivity:
- CompleteSessionByID explicitly rejects SessionRebuild by kind
- Rebuild sessions can ONLY complete via CompleteRebuild
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
IsRecoverable now verifies three conditions:
- startExclusive >= tailLSN (not recycled)
- endInclusive <= headLSN (within WAL)
- all LSNs in range exist contiguously (no holes)
StateAt now uses base snapshot captured during AdvanceTail:
- returns nil for LSNs before snapshot boundary (unreconstructable)
- correctly includes block state from recycled entries via snapshot
5 new tests: end-beyond-head, missing entries, state after tail
advance, nil before snapshot, block last written before tail.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>