HIGH: Changed-address test now requires OutcomeCatchUp and fails on any other outcome.
No more conditional execution: the test must go through the full catch-up chain.
MED: Overlapping retention is now true simultaneous overlap:
- Hold 1 at LSN T+1, Hold 2 at LSN T+2 — both coexist
- MinWALRetentionFloor = T+1 (minimum of two)
- Release hold 1 → floor moves to T+2
- Release hold 2 → ActiveHoldCount=0, no floor
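The overlap sequence above can be sketched as a small hold tracker. This is a minimal illustration, not the pinner's actual code: `holdTracker`, `Hold`, and `Release` are hypothetical names, and only `MinWALRetentionFloor` and `ActiveHoldCount` are taken from this message.

```go
package main

import "fmt"

// holdTracker is a hypothetical stand-in for the pinner's hold map.
type holdTracker struct {
	nextID uint64
	holds  map[uint64]uint64 // hold ID -> held start LSN
}

func newHoldTracker() *holdTracker {
	return &holdTracker{holds: make(map[uint64]uint64)}
}

// Hold registers a retention hold at startLSN and returns its ID.
func (h *holdTracker) Hold(startLSN uint64) uint64 {
	h.nextID++
	h.holds[h.nextID] = startLSN
	return h.nextID
}

// Release drops a hold; releasing an unknown ID is a no-op.
func (h *holdTracker) Release(id uint64) { delete(h.holds, id) }

// ActiveHoldCount reports how many holds are outstanding.
func (h *holdTracker) ActiveHoldCount() int { return len(h.holds) }

// MinWALRetentionFloor returns the lowest held LSN.
// ok is false when no hold is active, meaning there is no floor.
func (h *holdTracker) MinWALRetentionFloor() (floor uint64, ok bool) {
	for _, lsn := range h.holds {
		if !ok || lsn < floor {
			floor, ok = lsn, true
		}
	}
	return floor, ok
}

func main() {
	const T = 100 // arbitrary base LSN for the demo
	h := newHoldTracker()
	h1 := h.Hold(T + 1)
	h2 := h.Hold(T + 2)
	fmt.Println(h.MinWALRetentionFloor()) // both holds coexist
	h.Release(h1)
	fmt.Println(h.MinWALRetentionFloor()) // only hold 2 remains
	h.Release(h2)
	fmt.Println(h.ActiveHoldCount()) // no holds left, so no floor
}
```

Returning an explicit `ok` flag keeps "no floor" distinct from a floor of LSN 0, which is one possible way to model the last step of the sequence.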
MED: NeedsRebuild now asserts escalated event in logs.
PostCheckpoint now asserts handshake + catch-up execution events.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
HIGH: renamed TestP2_RebuildClosure_FullBase_OneChain → TestP2_RebuildClosure_OneChain.
Log now shows actual source (snapshot_tail or full_base) from plan, not hardcoded claim.
MED: catch-up test uses t.Skipf when V1 interim prevents OutcomeCatchUp.
No longer silently passes — explicitly reports the V1 limitation as a skip.
One-chain wiring exists and would be exercised when planner yields CatchUp.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Engine executors now have IO interfaces for real bridge I/O:
- CatchUpExecutor.IO (CatchUpIO): StreamWALEntries
- RebuildExecutor.IO (RebuildIO): TransferFullBase, TransferSnapshot,
StreamWALEntries (for tail replay)
When IO is set, executor calls real bridge I/O during execution.
When IO is nil, executor uses caller-supplied progress (test mode).
RecoveryPlan.CatchUpStartLSN: bound at plan time for IO bridge.
v2bridge.Executor now implements both interfaces:
- StreamWALEntries: real ScanFrom
- TransferFullBase: validates extent accessible
- TransferSnapshot: validates checkpoint accessible
Chain tests wire IO:
- CatchUpClosure: exec.IO = executor → real WAL scan through engine
- RebuildClosure: exec.IO = executor → real transfer through engine
This closes the engine → executor → v2bridge → blockvol chain.
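A rough sketch of the optional-IO pattern described above. The interface name `CatchUpIO` and the method name `StreamWALEntries` come from this message; the method signature, the `catchUpExecutor` fields, and `fakeBridge` are assumptions made for illustration only.

```go
package main

import (
	"errors"
	"fmt"
)

// CatchUpIO mirrors the interface named in this commit.
// The signature is assumed, not copied from the engine.
type CatchUpIO interface {
	StreamWALEntries(startLSN, endLSN uint64) (transferredTo uint64, err error)
}

// catchUpExecutor is a reduced, hypothetical executor.
type catchUpExecutor struct {
	IO        CatchUpIO // nil selects test mode
	StartLSN  uint64    // bound at plan time (cf. RecoveryPlan.CatchUpStartLSN)
	TargetLSN uint64
}

// Execute streams through IO when it is set; otherwise it trusts the
// caller-supplied progress value.
func (e *catchUpExecutor) Execute(callerProgress uint64) (uint64, error) {
	reached := callerProgress
	if e.IO != nil {
		var err error
		reached, err = e.IO.StreamWALEntries(e.StartLSN, e.TargetLSN)
		if err != nil {
			return 0, fmt.Errorf("stream WAL entries: %w", err)
		}
	}
	if reached < e.TargetLSN {
		return reached, errors.New("catch-up did not reach target")
	}
	return reached, nil
}

// fakeBridge stands in for the real bridge executor in this sketch.
type fakeBridge struct{ head uint64 }

func (b fakeBridge) StreamWALEntries(startLSN, endLSN uint64) (uint64, error) {
	if startLSN > endLSN || endLSN > b.head {
		return 0, errors.New("range not available in WAL")
	}
	return endLSN, nil
}

func main() {
	wired := &catchUpExecutor{IO: fakeBridge{head: 5}, StartLSN: 1, TargetLSN: 5}
	fmt.Println(wired.Execute(0)) // progress comes from the IO bridge

	testMode := &catchUpExecutor{TargetLSN: 5}
	fmt.Println(testMode.Execute(5)) // progress comes from the caller
}
```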
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Finding 1: ProcessAssignments now calls v2Orchestrator.ProcessAssignment
- BlockService.v2Orchestrator field (RecoveryOrchestrator)
- ProcessAssignment result logged at glog V(1)
- No more `_ = intent` — engine state actually changes
Finding 2: localServerID documented as interim
- BlockService.localServerID = listenAddr (transport-shaped)
- Field doc explicitly states: INTERIM, should be registry-assigned
- Used only for replica/rebuild local identity
3 integration tests (qa_block_v2bridge_test.go):
- CreatesEngineSender: ProcessAssignment → engine has sender + session
- EpochBump: epoch 1 → invalidate → epoch 2 → new session
- AddressChange: same ServerID, different IP → sender preserved,
endpoint updated
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Finding 1: Identity no longer address-derived
- ReplicaAddr.ServerID field added (stable server identity from registry)
- BlockVolumeAssignment.ReplicaServerID field added (scalar RF=2 path)
- ControlBridge uses ServerID, NOT address, for ReplicaID
- Missing ServerID → replica skipped (fail closed), logged
Finding 2: Wired into real ProcessAssignments
- BlockService.v2Bridge field initialized in StartBlockService
- ProcessAssignments converts each assignment via v2Bridge.ConvertAssignment
BEFORE existing V1 processing (parallel, not replacing yet)
- Logged at glog V(1)
Finding 3: Fail-closed on missing identity
- Empty ServerID in ReplicaAddrs → replica skipped with log
- Empty ReplicaServerID in scalar path → no replica created
- Test: MissingServerID_FailsClosed verifies both paths
7 tests: StableServerID, AddressChange_IdentityPreserved,
MultiReplica_StableServerIDs, MissingServerID_FailsClosed,
EpochFencing_IntegratedPath, RebuildAssignment, ReplicaAssignment
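A minimal sketch of the fail-closed conversion, assuming the `<volume-path>/<replica-server-id>` ReplicaID format described for ControlBridge elsewhere in this log. The types and the `convertReplicas` helper are hypothetical and much smaller than the real assignment structures.

```go
package main

import (
	"fmt"
	"log"
)

// replicaAddr is a reduced, hypothetical form of ReplicaAddr.
type replicaAddr struct {
	ServerID string // stable identity from the registry
	DataAddr string // transport address, allowed to change
}

// replicaAssignment is the engine-facing result in this sketch.
type replicaAssignment struct {
	ReplicaID string
	DataAddr  string
}

// convertReplicas derives ReplicaID from ServerID only, never from the
// address. Replicas without a ServerID are skipped and logged (fail closed).
func convertReplicas(volumePath string, addrs []replicaAddr) []replicaAssignment {
	out := make([]replicaAssignment, 0, len(addrs))
	for _, a := range addrs {
		if a.ServerID == "" {
			log.Printf("skipping replica at %s: missing ServerID", a.DataAddr)
			continue
		}
		out = append(out, replicaAssignment{
			ReplicaID: volumePath + "/" + a.ServerID,
			DataAddr:  a.DataAddr,
		})
	}
	return out
}

func main() {
	before := convertReplicas("/vol/a", []replicaAddr{
		{ServerID: "server-1", DataAddr: "10.0.0.1:9000"},
		{DataAddr: "10.0.0.9:9000"}, // no identity, so it is skipped
	})
	after := convertReplicas("/vol/a", []replicaAddr{
		{ServerID: "server-1", DataAddr: "10.0.0.2:9000"},
	})
	// Same ReplicaID before and after the address change.
	fmt.Println(before[0].ReplicaID == after[0].ReplicaID)
}
```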
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ControlBridge converts real BlockVolumeAssignment (from master heartbeat)
into V2 engine AssignmentIntent:
- Identity: ReplicaID = <volume-path>/<replica-server-id>
- Epoch from real assignment
- Role → SessionKind mapping (primary/replica/rebuilding)
- Multi-replica support (ReplicaAddrs) with scalar RF=2 fallback
Known limitation (documented in test):
- extractServerID currently uses address as server ID (matches
master registry ReplicaInfo.Server format)
- IP change = different server ID in current model
- Registry-backed stable server ID deferred
6 new tests:
- PrimaryAssignment_StableIdentity: real assignment → stable ID
- PrimaryAssignment_MultiReplica: RF=3 multi-replica mapping
- AddressChange_SameServerID: documents current identity boundary
- EpochFencing_IntegratedPath: epoch 1 → bump → epoch 2 through
real assignment conversion + engine
- RebuildAssignment: rebuilding role → SessionRebuild
- ReplicaAssignment: replica role with local server ID
Delivery template:
Changed contracts: real BlockVolumeAssignment → engine intent
Fail-closed: unknown role returns empty intent
Carry-forward: address-based server ID, not registry-backed
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
FC1: now asserts HasActiveSession() after address change AND
verifies session_created in log (not just plan_cancelled).
FC4: escalation event detail must be >15 chars (contains proof
reason with LSN values, not just "needs_rebuild").
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
P2 tests now force conditions instead of observing them:
FC3: Real WAL scan verified directly — StreamWALEntries transfers
real entries from disk (head=5, transferred=5). Engine planning also
verified (ZeroGap in V1 interim documented).
FC4: ForceFlush advances checkpoint/tail to 20. Replica at 0 is
below tail → NeedsRebuild with proof: "gap_beyond_retention: need
LSN 1 but tail=20". No early return.
FC5: ForceFlush advances checkpoint to 10. Assertive:
- replica at checkpoint=10 → ZeroGap (V1 interim)
- replica at 0 → NeedsRebuild (below tail, not CatchUp)
FC1/FC2: Labeled as integrated engine/storage (control simulated).
New: BlockVol.ForceFlush() — triggers synchronous flusher cycle for
test use. Advances checkpoint + WAL tail deterministically.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
5 failure-class replay tests against real file-backed BlockVol,
exercising the full integrated path:
bridge adapter → v2bridge reader/pinner → engine planner/executor
FC1: Changed-address restart — identity preserved, old plan cancelled,
new session created. Log shows plan_cancelled + session_created.
FC2: Stale epoch after failover — sessions invalidated at old epoch,
new assignment at epoch 2 creates fresh session. Log shows
per-replica invalidation.
FC3: Real catch-up (pre-checkpoint) — engine classifies from real
RetainedHistory, zero-gap in V1 interim (committed=0 before flush).
Documents the V1 limitation explicitly.
FC4: Unrecoverable gap — after flush, if checkpoint advances, replica
behind tail gets NeedsRebuild. Documents that V1 unit test may
not advance checkpoint (flusher timing).
FC5: Post-checkpoint boundary — replica at checkpoint = zero-gap in
V1 interim. Explicitly documents the catch-up collapse boundary.
go.mod: added replace directives for sw-block engine + bridge modules.
Carry-forward (explicit):
- CommittedLSN = CheckpointLSN (V1 interim)
- FC3/FC4/FC5 limited by flusher not advancing checkpoint in unit tests
- Executor snapshot/full-base/truncate still stubs
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
7 tests in weed/storage/blockvol/v2bridge/bridge_test.go:
Reader (2 tests):
- StatusSnapshot reads real nextLSN, WALCheckpointLSN, flusher state
- HeadLSN advances with real writes
Pinner (2 tests):
- HoldWALRetention: hold tracked, MinWALRetentionFloor reports position,
release clears hold
- HoldRejectsRecycled: validates against real WAL tail
Executor (2 tests):
- StreamWALEntries: real ScanFrom reads WAL entries from disk
- StreamPartialRange: partial range scan works
Stubs (1 test):
- TransferSnapshot/TransferFullBase/TruncateWAL return not-implemented
All tests use createTestVol (1MB file-backed BlockVol with 256KB WAL).
No mock/push adapters — direct real blockvol instances.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Finding 1: WALTailLSN semantic fix
- StatusSnapshot().WALTailLSN now reads super.WALCheckpointLSN (an LSN)
- Was: wal.Tail() which returns a physical byte offset
- Entries with LSN > WALTailLSN are guaranteed in the WAL
Finding 2: ScanWALEntries replay-source fix
- ScanWALEntries passes super.WALCheckpointLSN as the recycled boundary
- Was: flusher.CheckpointLSN() which in V1 equals CommittedLSN
- The flusher's live checkpoint may advance in memory, but entries above
the durable superblock checkpoint are still physically in the WAL
- Normal catch-up (replica at 70, committed at 100) now works because
fromLSN=71 > super.WALCheckpointLSN (which is the last persisted
checkpoint, not the live flusher state)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pinner (pinner.go):
- HoldWALRetention: validates startLSN >= current tail, tracks hold
- HoldSnapshot: validates checkpoint exists + trusted
- HoldFullBase: tracks hold by ID
- MinWALRetentionFloor: returns minimum held position across all
WAL/snapshot holds — designed for flusher RetentionFloorFn hookup
- Release functions remove holds from tracking map
Executor (executor.go):
- StreamWALEntries: validates range against real WAL tail/head
(actual ScanFrom integration deferred to network-layer wiring)
- TransferSnapshot/TransferFullBase/TruncateWAL: stubs for P1
Key integration points:
- Pinner reads real StatusSnapshot for validation
- Pinner.MinWALRetentionFloor can wire into flusher.RetentionFloorFn
- Executor validates WAL range availability from real state
Carry-forward:
- Real ScanFrom wiring needs WAL fd + offset (network layer)
- TransferSnapshot/TransferFullBase need extent I/O
- Control intent from confirmed failover (master-side)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
CatchUpExecutor.OnStep: optional callback fired between executor-managed
progress steps. Enables deterministic fault injection (epoch bump)
between steps without racing or manual sender calls.
E2_EpochBump_MidExecutorLoop:
- Executor runs 5 progress steps
- OnStep hook bumps epoch after step 1 (zero-based, i.e. after 2 successful steps)
- Executor's own loop detects invalidation at step 2's check
- Resources released by executor's release path (not manual cancel)
- Log shows session_invalidated + exec_resources_released
This closes the remaining FC2 gap: invalidation is now detected
and cleaned up by the executor itself, not by external code.
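The hook mechanism can be sketched roughly as follows. Only the `OnStep` name comes from this message; the `sender` type, the epoch comparison, and the release flag are assumptions chosen to make the control flow visible.

```go
package main

import (
	"errors"
	"fmt"
)

var errInvalidated = errors.New("session invalidated: epoch changed")

// sender holds the current epoch in this sketch.
type sender struct{ epoch uint64 }

// catchUpExecutor is a hypothetical reduction of the real executor.
type catchUpExecutor struct {
	s          *sender
	boundEpoch uint64
	OnStep     func(step int) // optional; fired after each completed step
	released   bool
}

// release models the executor's own resource release path.
func (e *catchUpExecutor) release() { e.released = true }

// Execute runs up to steps progress steps. Before each step it checks that
// the epoch it was bound to is still current, and it always releases
// resources on the way out.
func (e *catchUpExecutor) Execute(steps int) (completed int, err error) {
	defer e.release()
	for step := 0; step < steps; step++ {
		if e.s.epoch != e.boundEpoch {
			return completed, errInvalidated
		}
		completed++ // the real executor would record progress here
		if e.OnStep != nil {
			e.OnStep(step)
		}
	}
	return completed, nil
}

func main() {
	s := &sender{epoch: 1}
	e := &catchUpExecutor{s: s, boundEpoch: 1}
	e.OnStep = func(step int) {
		if step == 1 {
			s.epoch = 2 // deterministic fault injection between steps
		}
	}
	done, err := e.Execute(5)
	fmt.Println(done, err, e.released)
}
```

Because the hook runs on the executor's own goroutine between steps, the injected epoch bump is ordered deterministically relative to the next check, with no need for sleeps or channels in the test.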
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Planner/executor contract:
- RebuildExecutor.Execute() takes no arguments — consumes plan-bound
RebuildSource, RebuildSnapshotLSN, RebuildTargetLSN
- RecoveryPlan binds all rebuild targets at plan time
- Executor cannot re-derive policy from caller-supplied history
Catch-up timing:
- Removed unused completeTick parameter from CatchUpExecutor.Execute
- Per-step ticks synthesized as startTick + stepIndex + 1
- API shape matches implementation
New test: PlanExecuteConsistency_RebuildCannotSwitchSource
- Plans snapshot+tail, then mutates storage history
- Executor succeeds using plan-bound values (not re-derived)
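A small sketch of why plan-bound values make the executor immune to later history changes. The field names `RebuildSource`, `RebuildSnapshotLSN`, and `RebuildTargetLSN` and the source labels `snapshot_tail` / `full_base` appear in this log; the `history` type, the planning rule, and the choice of head LSN as target are assumptions.

```go
package main

import "fmt"

// history is a hypothetical view of storage state at plan time.
type history struct {
	checkpointLSN     uint64
	checkpointTrusted bool
	headLSN           uint64
}

// recoveryPlan carries every rebuild decision, bound at plan time.
type recoveryPlan struct {
	RebuildSource      string
	RebuildSnapshotLSN uint64
	RebuildTargetLSN   uint64
}

// planRebuild picks snapshot+tail when the checkpoint is trusted and falls
// back to a full base otherwise (assumed rule for this sketch).
func planRebuild(h history) recoveryPlan {
	p := recoveryPlan{RebuildSource: "full_base", RebuildTargetLSN: h.headLSN}
	if h.checkpointTrusted {
		p.RebuildSource = "snapshot_tail"
		p.RebuildSnapshotLSN = h.checkpointLSN
	}
	return p
}

// rebuildExecutor holds a copy of the plan and nothing else.
type rebuildExecutor struct{ plan recoveryPlan }

// Execute takes no arguments, so it cannot re-derive policy from
// caller-supplied history.
func (e rebuildExecutor) Execute() string {
	return fmt.Sprintf("%s snapshot=%d target=%d",
		e.plan.RebuildSource, e.plan.RebuildSnapshotLSN, e.plan.RebuildTargetLSN)
}

func main() {
	h := history{checkpointLSN: 10, checkpointTrusted: true, headLSN: 30}
	exec := rebuildExecutor{plan: planRebuild(h)}

	// Mutate history after planning; the executor still uses the plan.
	h.checkpointTrusted = false
	h.headLSN = 99

	fmt.Println(exec.Execute())
}
```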
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Full-base rebuild resource:
- StorageAdapter.PinFullBase/ReleaseFullBase for full-extent base image
- PlanRebuild full_base branch now acquires FullBasePin
- RecoveryPlan.FullBasePin field, released by ReleasePlan
Session cleanup on resource failure:
- PlanRecovery invalidates session when WAL pin fails
(no dangling live session after failed resource acquisition)
3 new tests:
- PlanRebuild_FullBase_PinsBaseImage: pin acquired + released
- PlanRebuild_FullBase_PinFailure: logged + error
- PlanRecovery_WALPinFailure_CleansUpSession: session invalidated,
sender disconnected (no dangling state)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ProcessAssignment now compares pre/post endpoint state before
logging session_invalidated with "endpoint_changed" reason.
Normal session supersede (same endpoint, assignment_intent) no
longer mislabeled as endpoint change.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Zero-gap completion:
- ExecuteRecovery auto-completes zero-gap sessions (no sender call needed)
- RecoveryResult.FinalState = StateInSync for zero-gap
Epoch transition:
- UpdateSenderEpoch: orchestrator-owned epoch advancement with auto-log
- InvalidateEpoch: per-replica session_invalidated events (not aggregate)
Endpoint-change invalidation:
- ProcessAssignment detects session ID change from endpoint update
- Logs per-replica session_invalidated with "endpoint_changed" reason
All integration tests now use orchestrator exclusively for core lifecycle.
No direct sender API calls for recovery execution in integration tests.
1 new test: EndpointChange_LogsInvalidation
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
RecordHandshakeFromHistory and SelectRebuildFromHistory now
return an error instead of panicking on nil history input.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New file: history.go — RetainedHistory connects recovery decisions
to actual WAL retention state:
- IsRecoverable: checks gap against tail/head boundaries
- MakeHandshakeResult: generates HandshakeResult from retention state
- RebuildSourceDecision: chooses snapshot+tail vs full base from
checkpoint state (trusted vs untrusted)
- ProveRecoverability: generates explicit proof explaining why
recovery is or is not allowed
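One way to picture the proof generation is the sketch below. It assumes that entries with LSN greater than the tail are still retained (as stated for WALTailLSN elsewhere in this log) and borrows the `gap_beyond_retention` wording from the FC4 notes; the other reason strings, the `proof` type, and the handling of a replica that is ahead are hypothetical.

```go
package main

import "fmt"

// retainedHistory is a reduced, hypothetical form of RetainedHistory.
type retainedHistory struct {
	TailLSN      uint64 // entries with LSN > TailLSN are still in the WAL
	CommittedLSN uint64
}

// proof explains a recoverability decision.
type proof struct {
	Recoverable bool
	Reason      string
}

// proveRecoverability classifies a replica by its last durable LSN.
func (h retainedHistory) proveRecoverability(replicaLSN uint64) proof {
	switch {
	case replicaLSN == h.CommittedLSN:
		return proof{true, "zero_gap"}
	case replicaLSN > h.CommittedLSN:
		// The log notes that truncation is required here; this sketch
		// only reports the condition and does not model truncation.
		return proof{false, fmt.Sprintf("replica_ahead: replica=%d committed=%d",
			replicaLSN, h.CommittedLSN)}
	case replicaLSN < h.TailLSN:
		return proof{false, fmt.Sprintf("gap_beyond_retention: need LSN %d but tail=%d",
			replicaLSN+1, h.TailLSN)}
	default:
		return proof{true, fmt.Sprintf("gap_within_retention: (%d, %d]",
			replicaLSN, h.CommittedLSN)}
	}
}

func main() {
	h := retainedHistory{TailLSN: 50, CommittedLSN: 100}
	for _, lsn := range []uint64{100, 70, 50, 49, 120} {
		fmt.Printf("replica=%d %+v\n", lsn, h.proveRecoverability(lsn))
	}
}
```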
14 new tests (recoverability_test.go):
- Recoverable/unrecoverable gap (exact boundary, beyond head)
- Trusted/untrusted/no checkpoint → rebuild source selection
- Handshake from retained history → outcome classification
- Recoverability proofs (zero-gap, ahead, within retention, beyond)
- E2E: two replicas driven by retained history (catch-up + rebuild)
- Truncation required for replica ahead of committed
Engine module at 44 tests (12 + 18 + 14).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Entry counting:
- Session.setRange now initializes recoveredTo = startLSN
- RecordCatchUpProgress delta counts only actual catch-up work
(recoveredTo - startLSN), not the replica's pre-existing prefix
Rebuild transfer gate:
- BeginTailReplay requires TransferredTo >= SnapshotLSN
- Prevents tail replay on incomplete base transfer
3 new regression tests:
- BudgetEntries_NonZeroStart_CountsOnlyDelta (30 entries within 50 budget)
- BudgetEntries_NonZeroStart_ExceedsBudget (30 entries exceeds 20 budget)
- Rebuild_PartialTransfer_BlocksTailReplay
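The two fixes can be illustrated with a reduced session type. The method names follow this message, but the struct layout, error values, and the start LSN of 70 used in the demo are assumptions for the sketch.

```go
package main

import (
	"errors"
	"fmt"
)

var (
	errBudgetExceeded = errors.New("catch-up entry budget exceeded")
	errBaseIncomplete = errors.New("base transfer incomplete")
)

// session is a hypothetical reduction of the recovery session.
type session struct {
	startLSN      uint64
	recoveredTo   uint64
	budgetEntries uint64 // 0 disables the entry budget
	transferredTo uint64
	snapshotLSN   uint64
}

// setRange initializes recoveredTo to startLSN so that the replica's
// pre-existing prefix is never counted as catch-up work.
func (s *session) setRange(startLSN uint64) {
	s.startLSN = startLSN
	s.recoveredTo = startLSN
}

// recordCatchUpProgress charges only the delta (toLSN - startLSN)
// against the entry budget.
func (s *session) recordCatchUpProgress(toLSN uint64) error {
	if toLSN < s.recoveredTo {
		return errors.New("progress moved backwards")
	}
	if used := toLSN - s.startLSN; s.budgetEntries != 0 && used > s.budgetEntries {
		return errBudgetExceeded
	}
	s.recoveredTo = toLSN
	return nil
}

// beginTailReplay refuses to start until the base transfer has reached
// the snapshot LSN.
func (s *session) beginTailReplay() error {
	if s.transferredTo < s.snapshotLSN {
		return errBaseIncomplete
	}
	return nil
}

func main() {
	within := &session{budgetEntries: 50}
	within.setRange(70)
	fmt.Println(within.recordCatchUpProgress(100)) // 30 entries, budget 50

	over := &session{budgetEntries: 20}
	over.setRange(70)
	fmt.Println(over.recordCatchUpProgress(100)) // 30 entries, budget 20

	partial := &session{transferredTo: 40, snapshotLSN: 50}
	fmt.Println(partial.beginTailReplay())
}
```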
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Registry is now keyed by stable ReplicaID, not by address.
DataAddr changes preserve sender identity — the core V2 invariant.
Changes:
- ReplicaAssignment{ReplicaID, Endpoint} replaces map[string]Endpoint
- AssignmentIntent.Replicas uses []ReplicaAssignment
- Registry.Reconcile takes []ReplicaAssignment
- Tests use stable IDs ("replica-1", "r1") independent of addresses
New test: ChangedDataAddr_PreservesSenderIdentity
- Same ReplicaID, different DataAddr (10.0.0.1 → 10.0.0.2)
- Sender pointer preserved, session invalidated, new session attached
- This is the exact V1/V1.5 regression that V2 must fix
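A rough model of reconciliation keyed by ReplicaID. `ReplicaAssignment`, `Endpoint`, and `Reconcile` are names from this message; the `sender` fields, the session counter, and the removal of senders that are no longer assigned are assumptions.

```go
package main

import "fmt"

// Endpoint is reduced to a single address for this sketch.
type Endpoint struct{ DataAddr string }

// ReplicaAssignment pairs a stable ID with its current endpoint.
type ReplicaAssignment struct {
	ReplicaID string
	Endpoint  Endpoint
}

// sender is a hypothetical per-replica sender.
type sender struct {
	endpoint  Endpoint
	sessionID uint64
}

// registry maps stable ReplicaID to sender, never address to sender.
type registry struct {
	senders     map[string]*sender
	nextSession uint64
}

func newRegistry() *registry {
	return &registry{senders: make(map[string]*sender)}
}

func (r *registry) newSessionID() uint64 {
	r.nextSession++
	return r.nextSession
}

// Reconcile keeps existing senders for known IDs. An endpoint change
// updates the sender in place and replaces its session.
func (r *registry) Reconcile(assignments []ReplicaAssignment) {
	wanted := make(map[string]bool, len(assignments))
	for _, a := range assignments {
		wanted[a.ReplicaID] = true
		s, ok := r.senders[a.ReplicaID]
		switch {
		case !ok:
			r.senders[a.ReplicaID] = &sender{endpoint: a.Endpoint, sessionID: r.newSessionID()}
		case s.endpoint != a.Endpoint:
			s.endpoint = a.Endpoint
			s.sessionID = r.newSessionID() // old session invalidated
		}
	}
	for id := range r.senders {
		if !wanted[id] {
			delete(r.senders, id) // assumed behavior for dropped replicas
		}
	}
}

func main() {
	r := newRegistry()
	r.Reconcile([]ReplicaAssignment{{ReplicaID: "replica-1", Endpoint: Endpoint{"10.0.0.1"}}})
	before := r.senders["replica-1"]
	oldSession := before.sessionID

	r.Reconcile([]ReplicaAssignment{{ReplicaID: "replica-1", Endpoint: Endpoint{"10.0.0.2"}}})
	after := r.senders["replica-1"]

	fmt.Println(before == after, oldSession != after.sessionID, after.endpoint.DataAddr)
}
```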
doc.go: clarified Slice 1 core vs carried-forward files
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
All mutable state on Sender and Session is now unexported:
- Sender.state, .epoch, .endpoint, .session, .stopped → accessors
- Session.id, .phase, .kind, etc. → read-only accessors
- Session() replaced by SessionSnapshot() (returns disconnected copy)
- SessionID() and HasActiveSession() for common queries
- AttachSession returns (sessionID, error) not (*Session, error)
- SupersedeSession returns sessionID not *Session
Budget configuration via SessionOption:
- WithBudget(CatchUpBudget) passed to AttachSession
- No direct field mutation on session from external code
New test: Encapsulation_SnapshotIsReadOnly proves snapshot
mutation does not leak back to sender state.
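The snapshot idea can be sketched as below. `SessionSnapshot` and `HasActiveSession` are names from this message; the snapshot's fields, the `(value, ok)` return shape, and the mutex are assumptions, not the engine's real layout.

```go
package main

import (
	"fmt"
	"sync"
)

// SessionSnapshot is a value copy with no pointer back to the sender.
type SessionSnapshot struct {
	ID    uint64
	Phase string
}

// session is the internal, unexported mutable state.
type session struct {
	id    uint64
	phase string
}

// Sender exposes session state only through accessors.
type Sender struct {
	mu      sync.Mutex
	session *session
}

// SessionSnapshot returns a disconnected copy of the current session.
// ok is false when no session is attached.
func (s *Sender) SessionSnapshot() (snap SessionSnapshot, ok bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.session == nil {
		return SessionSnapshot{}, false
	}
	return SessionSnapshot{ID: s.session.id, Phase: s.session.phase}, true
}

// HasActiveSession reports whether a session is attached.
func (s *Sender) HasActiveSession() bool {
	_, ok := s.SessionSnapshot()
	return ok
}

func main() {
	s := &Sender{session: &session{id: 7, phase: "catchup"}}

	snap, _ := s.SessionSnapshot()
	snap.Phase = "tampered" // mutates the copy only

	fresh, _ := s.SessionSnapshot()
	fmt.Println(fresh.Phase, s.HasActiveSession())
}
```

Returning a plain struct by value means callers have nothing to mutate, which is the property the encapsulation test is described as checking.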
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Frozen target is now unconditional:
- FrozenTargetLSN field on RecoverySession, set by BeginCatchUp
- RecordCatchUpProgress enforces FrozenTargetLSN regardless of Budget
- Catch-up is always a bounded (R, H0] contract
Rebuild completion exclusivity:
- CompleteSessionByID explicitly rejects SessionRebuild by kind
- Rebuild sessions can ONLY complete via CompleteRebuild
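Both invariants can be illustrated with a small sketch. `FrozenTargetLSN`, `BeginCatchUp`, `RecordCatchUpProgress`, and the rebuild session kind are named in this message; the lowercase types, the function signatures, and the error text are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

type sessionKind int

const (
	sessionCatchUp sessionKind = iota
	sessionRebuild
)

// recoverySession is a hypothetical reduction of RecoverySession.
type recoverySession struct {
	kind            sessionKind
	frozenTargetLSN uint64
	recoveredTo     uint64
}

// beginCatchUp freezes the target at the head observed at start (H0),
// so the session covers the bounded range (R, H0].
func (s *recoverySession) beginCatchUp(replicaLSN, headLSN uint64) {
	s.recoveredTo = replicaLSN
	s.frozenTargetLSN = headLSN
}

// recordCatchUpProgress rejects progress past the frozen target,
// whether or not any budget is configured.
func (s *recoverySession) recordCatchUpProgress(toLSN uint64) error {
	if toLSN > s.frozenTargetLSN {
		return fmt.Errorf("progress %d beyond frozen target %d", toLSN, s.frozenTargetLSN)
	}
	s.recoveredTo = toLSN
	return nil
}

// completeSession models the generic completion path, which refuses
// rebuild sessions by kind.
func completeSession(s *recoverySession) error {
	if s.kind == sessionRebuild {
		return errors.New("rebuild sessions must complete via the rebuild path")
	}
	if s.recoveredTo != s.frozenTargetLSN {
		return errors.New("frozen target not reached")
	}
	return nil
}

func main() {
	c := &recoverySession{kind: sessionCatchUp}
	c.beginCatchUp(70, 100)
	fmt.Println(c.recordCatchUpProgress(101)) // beyond H0, rejected
	fmt.Println(c.recordCatchUpProgress(100)) // reaches H0
	fmt.Println(completeSession(c))

	r := &recoverySession{kind: sessionRebuild}
	fmt.Println(completeSession(r)) // rejected by kind
}
```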
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>