Pinner (pinner.go):
- HoldWALRetention: validates startLSN >= current tail, tracks hold
- HoldSnapshot: validates checkpoint exists + trusted
- HoldFullBase: tracks hold by ID
- MinWALRetentionFloor: returns minimum held position across all
WAL/snapshot holds — designed for flusher RetentionFloorFn hookup
- Release functions remove holds from tracking map
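A minimal sketch of how the retention floor might be computed over tracked holds (the struct shape and hold maps here are assumptions for illustration, not the actual pinner.go layout):

```go
package main

import "fmt"

// Hypothetical hold-tracking shape; the real pinner.go fields may differ.
type Pinner struct {
	walHolds      map[string]uint64 // holdID -> held WAL start LSN
	snapshotHolds map[string]uint64 // holdID -> held checkpoint LSN
}

// MinWALRetentionFloor returns the minimum held LSN across all WAL and
// snapshot holds, and false when nothing is held (no floor to enforce).
func (p *Pinner) MinWALRetentionFloor() (uint64, bool) {
	var min uint64
	found := false
	for _, lsn := range p.walHolds {
		if !found || lsn < min {
			min, found = lsn, true
		}
	}
	for _, lsn := range p.snapshotHolds {
		if !found || lsn < min {
			min, found = lsn, true
		}
	}
	return min, found
}

func main() {
	p := &Pinner{
		walHolds:      map[string]uint64{"h1": 120, "h2": 95},
		snapshotHolds: map[string]uint64{"s1": 110},
	}
	floor, ok := p.MinWALRetentionFloor()
	fmt.Println(floor, ok) // 95 true
}
```

The (value, ok) return lets the flusher distinguish "no holds, reclaim freely" from "floor held at LSN 0".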
Executor (executor.go):
- StreamWALEntries: validates range against real WAL tail/head
(actual ScanFrom integration deferred to network-layer wiring)
- TransferSnapshot/TransferFullBase/TruncateWAL: stubs for P1
Key integration points:
- Pinner reads real StatusSnapshot for validation
- Pinner.MinWALRetentionFloor can wire into flusher.RetentionFloorFn
- Executor validates WAL range availability from real state
Carry-forward:
- Real ScanFrom wiring needs WAL fd + offset (network layer)
- TransferSnapshot/TransferFullBase need extent I/O
- Control intent from confirmed failover (master-side)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
CatchUpExecutor.OnStep: optional callback fired between executor-managed
progress steps. Enables deterministic fault injection (epoch bump)
between steps without racing or manual sender calls.
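The hook-between-steps pattern can be sketched as follows (the executor shape and error value are assumptions based on this summary, not the real API):

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical executor shape illustrating the OnStep hook.
type CatchUpExecutor struct {
	Steps  int
	OnStep func(stepIndex int) // fired after each completed step
	epoch  int
}

var errInvalidated = errors.New("session_invalidated")

func (e *CatchUpExecutor) Execute(sessionEpoch int) (completed int, err error) {
	for i := 0; i < e.Steps; i++ {
		// Invalidation check before each step; an epoch bump injected by
		// OnStep is detected here, deterministically, with no racing.
		if e.epoch != sessionEpoch {
			return completed, errInvalidated
		}
		completed++ // the step's actual progress work would run here
		if e.OnStep != nil {
			e.OnStep(i)
		}
	}
	return completed, nil
}

func main() {
	e := &CatchUpExecutor{Steps: 5}
	e.OnStep = func(i int) {
		if i == 1 { // bump epoch after step index 1 (two steps done)
			e.epoch++
		}
	}
	done, err := e.Execute(0)
	fmt.Println(done, err) // 2 session_invalidated
}
```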
E2_EpochBump_MidExecutorLoop:
- Executor runs 5 progress steps
- OnStep hook bumps the epoch after step index 1 (i.e., after 2 successful steps)
- Executor's own loop detects invalidation at step 2's check
- Resources released by executor's release path (not manual cancel)
- Log shows session_invalidated + exec_resources_released
This closes the remaining FC2 gap: invalidation is now detected
and cleaned up by the executor itself, not by external code.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Planner/executor contract:
- RebuildExecutor.Execute() takes no arguments — consumes plan-bound
RebuildSource, RebuildSnapshotLSN, RebuildTargetLSN
- RecoveryPlan binds all rebuild targets at plan time
- Executor cannot re-derive policy from caller-supplied history
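The plan-binding contract might look like this minimal sketch (field names follow the summary above; everything else is assumed):

```go
package main

import "fmt"

// Sketch of the plan-time binding contract.
type RecoveryPlan struct {
	RebuildSource      string // e.g. "snapshot+tail" or "full_base"
	RebuildSnapshotLSN uint64
	RebuildTargetLSN   uint64
}

type RebuildExecutor struct {
	plan RecoveryPlan // bound once, at plan time
}

// Execute takes no arguments: it can only consume plan-bound values,
// so later mutations to storage history cannot change the decision.
func (e *RebuildExecutor) Execute() string {
	return fmt.Sprintf("rebuild from %s: snapshot=%d target=%d",
		e.plan.RebuildSource, e.plan.RebuildSnapshotLSN, e.plan.RebuildTargetLSN)
}

func main() {
	exec := &RebuildExecutor{plan: RecoveryPlan{
		RebuildSource: "snapshot+tail", RebuildSnapshotLSN: 100, RebuildTargetLSN: 250,
	}}
	// Even if history were mutated here, Execute still uses the bound plan.
	fmt.Println(exec.Execute())
}
```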
Catch-up timing:
- Removed unused completeTick parameter from CatchUpExecutor.Execute
- Per-step ticks synthesized as startTick + stepIndex + 1
- API shape matches implementation
New test: PlanExecuteConsistency_RebuildCannotSwitchSource
- Plans snapshot+tail, then mutates storage history
- Executor succeeds using plan-bound values (not re-derived)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Full-base rebuild resource:
- StorageAdapter.PinFullBase/ReleaseFullBase for full-extent base image
- PlanRebuild full_base branch now acquires FullBasePin
- RecoveryPlan.FullBasePin field, released by ReleasePlan
Session cleanup on resource failure:
- PlanRecovery invalidates session when WAL pin fails
(no dangling live session after failed resource acquisition)
3 new tests:
- PlanRebuild_FullBase_PinsBaseImage: pin acquired + released
- PlanRebuild_FullBase_PinFailure: logged + error
- PlanRecovery_WALPinFailure_CleansUpSession: session invalidated,
sender disconnected (no dangling state)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ProcessAssignment now compares pre/post endpoint state before
logging session_invalidated with "endpoint_changed" reason.
Normal session supersede (same endpoint, assignment_intent) is no
longer mislabeled as an endpoint change.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Zero-gap completion:
- ExecuteRecovery auto-completes zero-gap sessions (no sender call needed)
- RecoveryResult.FinalState = StateInSync for zero-gap
Epoch transition:
- UpdateSenderEpoch: orchestrator-owned epoch advancement with auto-log
- InvalidateEpoch: per-replica session_invalidated events (not aggregate)
Endpoint-change invalidation:
- ProcessAssignment detects session ID change from endpoint update
- Logs per-replica session_invalidated with "endpoint_changed" reason
All integration tests now use orchestrator exclusively for core lifecycle.
No direct sender API calls for recovery execution in integration tests.
1 new test: EndpointChange_LogsInvalidation
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
RecordHandshakeFromHistory and SelectRebuildFromHistory now
return an error instead of panicking on nil history input.
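The guard pattern is roughly the following (History is a stand-in type; the real signatures differ):

```go
package main

import (
	"errors"
	"fmt"
)

type History struct{ entries []uint64 }

var errNilHistory = errors.New("nil history input")

// RecordHandshakeFromHistory returns a descriptive error on nil input
// instead of letting a nil dereference panic in the caller's goroutine.
func RecordHandshakeFromHistory(h *History) (int, error) {
	if h == nil {
		return 0, errNilHistory
	}
	return len(h.entries), nil
}

func main() {
	if _, err := RecordHandshakeFromHistory(nil); err != nil {
		fmt.Println("handled:", err)
	}
	n, _ := RecordHandshakeFromHistory(&History{entries: []uint64{1, 2, 3}})
	fmt.Println(n) // 3
}
```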
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New file: history.go — RetainedHistory connects recovery decisions
to actual WAL retention state:
- IsRecoverable: checks gap against tail/head boundaries
- MakeHandshakeResult: generates HandshakeResult from retention state
- RebuildSourceDecision: chooses snapshot+tail vs full base from
checkpoint state (trusted vs untrusted)
- ProveRecoverability: generates explicit proof explaining why
recovery is or is not allowed
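A sketch of the source-selection rule, assuming a simple checkpoint shape (the real RebuildSourceDecision signature is not shown in this summary):

```go
package main

import "fmt"

type Checkpoint struct {
	Exists  bool
	Trusted bool
	LSN     uint64
}

// RebuildSourceDecision picks snapshot+tail only when a trusted
// checkpoint exists; anything weaker forces a full base transfer.
func RebuildSourceDecision(cp Checkpoint) string {
	if cp.Exists && cp.Trusted {
		return "snapshot+tail" // replay only (checkpoint, head]
	}
	return "full_base" // no trustworthy base: copy every extent
}

func main() {
	fmt.Println(RebuildSourceDecision(Checkpoint{Exists: true, Trusted: true, LSN: 100})) // snapshot+tail
	fmt.Println(RebuildSourceDecision(Checkpoint{Exists: true}))                          // full_base
	fmt.Println(RebuildSourceDecision(Checkpoint{}))                                      // full_base
}
```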
14 new tests (recoverability_test.go):
- Recoverable/unrecoverable gap (exact boundary, beyond head)
- Trusted/untrusted/no checkpoint → rebuild source selection
- Handshake from retained history → outcome classification
- Recoverability proofs (zero-gap, ahead, within retention, beyond)
- E2E: two replicas driven by retained history (catch-up + rebuild)
- Truncation required for replica ahead of committed
Engine module at 44 tests (12 + 18 + 14).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Entry counting:
- Session.setRange now initializes recoveredTo = startLSN
- RecordCatchUpProgress counts only the actual catch-up delta
  (recoveredTo - startLSN) against the budget, not the replica's
  pre-existing prefix
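The delta accounting can be sketched like this (session fields are assumptions drawn from the summary):

```go
package main

import "fmt"

type catchUpSession struct {
	startLSN    uint64
	recoveredTo uint64 // initialized to startLSN by setRange
	budget      uint64 // max entries of actual catch-up work
}

// recordProgress charges only work past startLSN against the budget,
// excluding the replica's pre-existing prefix.
func (s *catchUpSession) recordProgress(newRecoveredTo uint64) error {
	s.recoveredTo = newRecoveredTo
	delta := s.recoveredTo - s.startLSN
	if delta > s.budget {
		return fmt.Errorf("budget exceeded: %d entries > %d", delta, s.budget)
	}
	return nil
}

func main() {
	s := &catchUpSession{startLSN: 100, recoveredTo: 100, budget: 50}
	fmt.Println(s.recordProgress(130)) // 30 entries within the 50 budget: <nil>
	tight := &catchUpSession{startLSN: 100, recoveredTo: 100, budget: 20}
	fmt.Println(tight.recordProgress(130)) // 30 entries exceed the 20 budget
}
```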
Rebuild transfer gate:
- BeginTailReplay requires TransferredTo >= SnapshotLSN
- Prevents tail replay on incomplete base transfer
3 new regression tests:
- BudgetEntries_NonZeroStart_CountsOnlyDelta (30 entries within 50 budget)
- BudgetEntries_NonZeroStart_ExceedsBudget (30 entries exceeds 20 budget)
- Rebuild_PartialTransfer_BlocksTailReplay
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Registry is now keyed by stable ReplicaID, not by address.
DataAddr changes preserve sender identity — the core V2 invariant.
Changes:
- ReplicaAssignment{ReplicaID, Endpoint} replaces map[string]Endpoint
- AssignmentIntent.Replicas uses []ReplicaAssignment
- Registry.Reconcile takes []ReplicaAssignment
- Tests use stable IDs ("replica-1", "r1") independent of addresses
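Identity-stable reconciliation can be sketched as follows (the types are simplified stand-ins for the real registry):

```go
package main

import "fmt"

type Endpoint struct{ DataAddr string }

type ReplicaAssignment struct {
	ReplicaID string
	Endpoint  Endpoint
}

type sender struct{ endpoint Endpoint }

// Registry is keyed by stable ReplicaID, never by address, so a
// DataAddr change updates the endpoint without discarding the sender.
type Registry struct {
	senders map[string]*sender
}

func (r *Registry) Reconcile(assignments []ReplicaAssignment) {
	for _, a := range assignments {
		if s, ok := r.senders[a.ReplicaID]; ok {
			s.endpoint = a.Endpoint // identity preserved across addr change
		} else {
			r.senders[a.ReplicaID] = &sender{endpoint: a.Endpoint}
		}
	}
}

func main() {
	r := &Registry{senders: map[string]*sender{}}
	r.Reconcile([]ReplicaAssignment{{"replica-1", Endpoint{"10.0.0.1:900"}}})
	before := r.senders["replica-1"]
	r.Reconcile([]ReplicaAssignment{{"replica-1", Endpoint{"10.0.0.2:900"}}})
	after := r.senders["replica-1"]
	fmt.Println(before == after, after.endpoint.DataAddr) // true 10.0.0.2:900
}
```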
New test: ChangedDataAddr_PreservesSenderIdentity
- Same ReplicaID, different DataAddr (10.0.0.1 → 10.0.0.2)
- Sender pointer preserved, session invalidated, new session attached
- This is the exact V1/V1.5 regression that V2 must fix
doc.go: clarified Slice 1 core vs carried-forward files
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
All mutable state on Sender and Session is now unexported:
- Sender.state, .epoch, .endpoint, .session, .stopped → accessors
- Session.id, .phase, .kind, etc. → read-only accessors
- Session() replaced by SessionSnapshot() (returns disconnected copy)
- SessionID() and HasActiveSession() for common queries
- AttachSession returns (sessionID, error) not (*Session, error)
- SupersedeSession returns sessionID not *Session
Budget configuration via SessionOption:
- WithBudget(CatchUpBudget) passed to AttachSession
- No direct field mutation on session from external code
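The option pattern implied by WithBudget, as a hedged sketch (option and field names beyond WithBudget/CatchUpBudget are assumptions):

```go
package main

import "fmt"

type CatchUpBudget struct{ MaxEntries uint64 }

type session struct{ budget CatchUpBudget }

type SessionOption func(*session)

func WithBudget(b CatchUpBudget) SessionOption {
	return func(s *session) { s.budget = b }
}

// AttachSession applies options internally, so callers configure the
// session without ever mutating its fields directly.
func AttachSession(opts ...SessionOption) (sessionID string, err error) {
	s := &session{}
	for _, opt := range opts {
		opt(s)
	}
	return fmt.Sprintf("sess-budget-%d", s.budget.MaxEntries), nil
}

func main() {
	id, _ := AttachSession(WithBudget(CatchUpBudget{MaxEntries: 50}))
	fmt.Println(id) // sess-budget-50
}
```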
New test: Encapsulation_SnapshotIsReadOnly proves snapshot
mutation does not leak back to sender state.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Frozen target is now unconditional:
- FrozenTargetLSN field on RecoverySession, set by BeginCatchUp
- RecordCatchUpProgress enforces FrozenTargetLSN regardless of Budget
- Catch-up is always a bounded (R, H0] contract
Rebuild completion exclusivity:
- CompleteSessionByID explicitly rejects SessionRebuild by kind
- Rebuild sessions can ONLY complete via CompleteRebuild
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
IsRecoverable now verifies three conditions:
- startExclusive >= tailLSN (not recycled)
- endInclusive <= headLSN (within WAL)
- all LSNs in range exist contiguously (no holes)
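The three conditions can be sketched as one predicate (the real WAL store indexes entries differently; a presence callback stands in here):

```go
package main

import "fmt"

// IsRecoverable checks that the (startExclusive, endInclusive] range is
// not recycled, not beyond the head, and contains no holes.
func IsRecoverable(startExclusive, endInclusive, tailLSN, headLSN uint64,
	has func(lsn uint64) bool) bool {
	if startExclusive < tailLSN { // already recycled
		return false
	}
	if endInclusive > headLSN { // beyond what the WAL has
		return false
	}
	for lsn := startExclusive + 1; lsn <= endInclusive; lsn++ {
		if !has(lsn) { // hole in the range
			return false
		}
	}
	return true
}

func main() {
	present := map[uint64]bool{6: true, 7: true, 9: true} // 8 missing
	has := func(lsn uint64) bool { return present[lsn] }
	fmt.Println(IsRecoverable(5, 7, 5, 10, has)) // true
	fmt.Println(IsRecoverable(5, 9, 5, 10, has)) // false: hole at 8
	fmt.Println(IsRecoverable(4, 7, 5, 10, has)) // false: start recycled
}
```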
StateAt now uses base snapshot captured during AdvanceTail:
- returns nil for LSNs before snapshot boundary (unreconstructable)
- correctly includes block state from recycled entries via snapshot
5 new tests: end-beyond-head, missing entries, state after tail
advance, nil before snapshot, block last written before tail.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
AllocateBlockVolumeResponse used bs.ListenAddr() to derive replica
addresses. When the VS binds to ":port" (no explicit IP), the host
resolved to an empty string, producing ":dataPort" as the replica
address. This ":port" then propagated through master assignments to
both the primary and replica sides.
Now canonicalizes empty/wildcard host using PreferredOutboundIP()
before constructing replication addresses. Also exported
PreferredOutboundIP for use by the server package.
This is the source fix — all downstream paths (heartbeat, API
response, assignment) inherit the canonical address.
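A sketch of the canonicalization step, with a placeholder standing in for the exported PreferredOutboundIP():

```go
package main

import (
	"fmt"
	"net"
)

// preferredOutboundIP is a placeholder; the real PreferredOutboundIP
// detects the host's routable outbound address.
func preferredOutboundIP() string { return "192.168.1.10" }

// canonicalize replaces empty/wildcard hosts with a routable IP before
// the address is used to construct replication endpoints.
func canonicalize(listenAddr string) (string, error) {
	host, port, err := net.SplitHostPort(listenAddr)
	if err != nil {
		return "", err
	}
	switch host {
	case "", "0.0.0.0", "::": // wildcard binds are not routable
		host = preferredOutboundIP()
	}
	return net.JoinHostPort(host, port), nil
}

func main() {
	a, _ := canonicalize(":9000")
	b, _ := canonicalize("0.0.0.0:9000")
	c, _ := canonicalize("10.0.0.5:9000")
	fmt.Println(a, b, c) // 192.168.1.10:9000 192.168.1.10:9000 10.0.0.5:9000
}
```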
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
setupReplicaReceiver now reads back canonical addresses from
the ReplicaReceiver (which applies CP13-2 canonicalization)
instead of storing raw assignment addresses in replStates.
This fixes the API-level leak where replica_data_addr showed
":port" instead of "ip:port" in /block/volumes responses,
even though the engine-level CP13-2 fix was working.
New BlockVol.ReplicaReceiverAddr() returns canonical addresses from
the running receiver, falling back to the assignment addresses if the
receiver hasn't reported them.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
rebuildFullExtent updated superblock.WALCheckpointLSN but not the
flusher's internal checkpointLSN. NewReplicaReceiver then read
stale 0 from flusher.CheckpointLSN(), causing post-rebuild
flushedLSN to be wrong.
Added Flusher.SetCheckpointLSN() and call it after rebuild
superblock persist. TestRebuild_PostRebuild_FlushedLSN_IsCheckpoint
flips FAIL→PASS.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The test used createSyncAllPair(t) but discarded the replica
return value, leaving the volume file open. On Windows this
caused TempDir cleanup failure. All 7 CP13-1 baseline FAILs
now PASS.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds per-replica state reporting in heartbeat so master can identify
which specific replica needs rebuild, not just a volume-level boolean.
New ReplicaShipperStatus{DataAddr, State, FlushedLSN} type reported
via ReplicaShipperStates field on BlockVolumeInfoMessage. Populated
from ShipperGroup.ShipperStates() on each heartbeat. Scales to RF=3+.
V1 constraints (explicit):
- NeedsRebuild cleared only by control-plane reassignment (no local exit)
- Post-rebuild replica re-enters as Disconnected/bootstrap, not InSync
- flushedLSN = checkpointLSN after rebuild (durable baseline only)
4 new tests: heartbeat per-replica state, NeedsRebuild reporting,
rebuild-complete-reenters-InSync (full cycle), epoch mismatch abort.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Flusher now holds WAL entries needed by recoverable replicas.
Both AdvanceTail (physical space) and checkpointLSN (scan gate)
are gated by the minimum flushed LSN across catch-up-eligible
replicas.
New methods on ShipperGroup:
- MinRecoverableFlushedLSN() (uint64, bool): pure read, returns
min flushed LSN across InSync/Degraded/Disconnected/CatchingUp
replicas with known progress. Excludes NeedsRebuild.
- EvaluateRetentionBudgets(timeout): separate mutation step,
escalates replicas that exceed walRetentionTimeout (5m default)
to NeedsRebuild, releasing their WAL hold.
Flusher integration: evaluates budgets then queries floor on each
flush cycle. If floor < maxLSN, holds both checkpoint and tail.
Extent writes proceed normally (reads work), only WAL reclaim
is deferred.
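The floor query and reclaim clamp can be sketched as follows (names and shapes are assumptions; the real flusher evaluates retention budgets before querying the floor):

```go
package main

import "fmt"

// group stands in for ShipperGroup: flushed holds the min flushed LSN
// per catch-up-eligible replica (NeedsRebuild already excluded).
type group struct {
	flushed map[string]uint64
}

// minRecoverableFlushedLSN is a pure read: min across known progress.
func (g *group) minRecoverableFlushedLSN() (uint64, bool) {
	var min uint64
	found := false
	for _, lsn := range g.flushed {
		if !found || lsn < min {
			min, found = lsn, true
		}
	}
	return min, found
}

// clampReclaim returns how far checkpoint and tail may advance: held at
// the floor whenever the slowest recoverable replica still needs entries.
func clampReclaim(maxLSN uint64, g *group) uint64 {
	if floor, ok := g.minRecoverableFlushedLSN(); ok && floor < maxLSN {
		return floor
	}
	return maxLSN
}

func main() {
	g := &group{flushed: map[string]uint64{"r1": 80, "r2": 200}}
	fmt.Println(clampReclaim(150, g))        // 80: r1 still needs (80, ...]
	fmt.Println(clampReclaim(150, &group{})) // 150: no holders, full reclaim
}
```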
LastContactTime on WALShipper: updated on barrier success,
handshake success, and catch-up completion. Not on Ship (TCP
write only). Avoids misclassifying idle-but-healthy replicas.
CP13-6 ships with timeout budget only. walRetentionMaxBytes
is deferred (documented as partial slice).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Concurrent WriteLBA/Trim calls could deliver WAL entries to replicas
out of LSN order: two goroutines allocate LSN 4 and 5 concurrently,
but LSN 5 could reach the replica first via ShipAll, causing the
replica to reject it as an LSN gap.
shipMu now wraps nextLSN.Add + wal.Append + ShipAll in both
WriteLBA and Trim, guaranteeing LSN-ordered delivery to replicas
under concurrent writers.
The dirty map update and WAL pressure check happen after shipMu
is released — they don't need ordering guarantees.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
doReconnectAndCatchUp() now uses the replicaFlushedLSN returned by
the reconnect handshake as the catch-up start point, not the
shipper's stale cached value. The replica may have less durable
progress than the shipper last knew.
ReplicaReceiver initialization: flushedLSN now set from the
volume's checkpoint LSN (durable by definition), not nextLSN
(which includes unflushed entries). receivedLSN still uses
nextLSN-1 since those entries are in the WAL buffer even if
not yet synced.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Updated 3 reconnect tests to stop/restart the ReplicaReceiver on
the same addresses WITHOUT calling SetReplicaAddr. This preserves
the shipper object, its ReplicaFlushedLSN, HasFlushedProgress flag,
and catch-up state across the disconnect/reconnect cycle.
All 3 tests now PASS:
- TestReconnect_CatchupFromRetainedWal
- CatchupReplay_DataIntegrity_AllBlocksMatch
- CatchupReplay_DuplicateEntry_Idempotent
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds the sync_all reconnect protocol: when a degraded shipper
reconnects, it performs a handshake (ResumeShipReq/Resp) to
determine the replica's durable progress, then streams missed
WAL entries to close the gap before resuming live shipping.
New wire messages:
- MsgResumeShipReq (0x03): primary sends epoch, headLSN, retainStart
- MsgResumeShipResp (0x04): replica returns status + flushedLSN
- MsgCatchupDone (0x05): marks end of catch-up stream
Decision matrix after handshake:
- R == H: already caught up → InSync
- S <= R+1 <= H: recoverable gap → CatchingUp → stream → InSync
- R+1 < S: gap exceeds retained WAL → NeedsRebuild
- R > H: impossible progress → NeedsRebuild
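The matrix translates directly into a classification function (state names follow the summary; the function itself is a sketch, with S the retained-WAL start, R the replica's flushedLSN, H the primary's headLSN):

```go
package main

import "fmt"

// classify applies the post-handshake decision matrix.
func classify(s, r, h uint64) string {
	switch {
	case r > h:
		return "NeedsRebuild" // impossible progress
	case r == h:
		return "InSync" // already caught up
	case s <= r+1:
		return "CatchingUp" // recoverable gap: stream (R, H]
	default:
		return "NeedsRebuild" // gap exceeds retained WAL
	}
}

func main() {
	fmt.Println(classify(5, 10, 10)) // InSync
	fmt.Println(classify(5, 7, 10))  // CatchingUp
	fmt.Println(classify(9, 7, 10))  // NeedsRebuild: R+1 < S
	fmt.Println(classify(5, 12, 10)) // NeedsRebuild: R > H
}
```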
WALAccess interface: narrow abstraction (RetainedRange + StreamEntries)
avoids coupling shipper to raw WAL internals.
Bootstrap vs reconnect split: fresh shippers (HasFlushedProgress=false)
use CP13-4 bootstrap path. Previously-synced shippers use handshake.
Catch-up retry budget: maxCatchupRetries=3 before NeedsRebuild.
ReplicaReceiver now initializes receivedLSN/flushedLSN from volume's
nextLSN on construction (handles receiver restart on existing volume).
TestBug2_SyncAll_SyncCache_AfterDegradedShipperRecovers flips FAIL→PASS.
All previously-passing baseline tests remain green.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replaces binary degraded flag with ReplicaState type:
Disconnected, Connecting, CatchingUp, InSync, Degraded, NeedsRebuild.
Ship() allowed from Disconnected (bootstrap: data must flow before
first barrier) and InSync (steady state). Ship does NOT change state.
Barrier() gating:
- InSync: proceed normally
- Disconnected: bootstrap path (connect + barrier)
- Degraded: reconnect both data+ctrl connections, then barrier
- Connecting/CatchingUp/NeedsRebuild: rejected immediately
Only barrier success grants InSync. Reconnect alone does not.
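A sketch of the gating rule (the real Barrier() also performs connects and reconnects, which are elided here):

```go
package main

import "fmt"

type ReplicaState int

const (
	Disconnected ReplicaState = iota
	Connecting
	CatchingUp
	InSync
	Degraded
	NeedsRebuild
)

// barrierAllowed reports whether a Barrier attempt may proceed from the
// given state. Only barrier success, never reconnect alone, grants InSync.
func barrierAllowed(s ReplicaState) bool {
	switch s {
	case InSync, Disconnected, Degraded:
		return true // steady state, bootstrap, or reconnect-then-barrier
	default:
		return false // Connecting/CatchingUp/NeedsRebuild: reject now
	}
}

func main() {
	fmt.Println(barrierAllowed(Disconnected)) // true: bootstrap path
	fmt.Println(barrierAllowed(Degraded))     // true: reconnect then barrier
	fmt.Println(barrierAllowed(CatchingUp))   // false
	fmt.Println(barrierAllowed(NeedsRebuild)) // false
}
```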
IsDegraded() now means "not sync-eligible" (any non-InSync state).
InSyncCount() added to ShipperGroup.
dist_group_commit.go: removed AllDegraded short-circuit that
prevented bootstrap. Barrier attempts always run — individual
shippers handle their own state-based gating.
8 CP13-4 tests + TestBarrier_RejectsReplicaNotInSync flips FAIL→PASS.
All previously-passing baseline tests remain green.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Barrier response extended from 1-byte status to 9-byte payload
carrying the replica's durable WAL progress (FlushedLSN). Updated
only after successful fd.Sync(), never on receive/append/send.
Replica side: new flushedLSN field on ReplicaReceiver, advanced
only in handleBarrier after proven contiguous receipt + sync.
max() guard prevents regression.
Shipper side: new replicaFlushedLSN (authoritative) replacing
ShippedLSN (diagnostic only). Monotonic CAS update from barrier
response. hasFlushedProgress flag tracks whether replica supports
the extended protocol.
ShipperGroup: MinReplicaFlushedLSN() returns (uint64, bool) —
minimum across shippers with known progress. (0, false) for empty
groups or legacy replicas.
Backward compat: 1-byte legacy responses decoded as FlushedLSN=0.
Legacy replicas explicitly excluded from sync_all correctness.
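A hedged sketch of the wire format (the byte order is an assumption; only the 1-byte status plus 8-byte FlushedLSN layout is taken from the summary):

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// encodeResp builds the extended 9-byte barrier response:
// 1 status byte followed by a big-endian uint64 FlushedLSN.
func encodeResp(status byte, flushedLSN uint64) []byte {
	buf := make([]byte, 9)
	buf[0] = status
	binary.BigEndian.PutUint64(buf[1:], flushedLSN)
	return buf
}

// decodeResp accepts both the 9-byte extended form and the 1-byte
// legacy form; legacy replicas decode as FlushedLSN=0 and are
// excluded from sync_all correctness.
func decodeResp(b []byte) (status byte, flushedLSN uint64, extended bool) {
	if len(b) >= 9 {
		return b[0], binary.BigEndian.Uint64(b[1:9]), true
	}
	return b[0], 0, false
}

func main() {
	st, lsn, ext := decodeResp(encodeResp(0, 42))
	fmt.Println(st, lsn, ext) // 0 42 true
	st, lsn, ext = decodeResp([]byte{0}) // legacy 1-byte response
	fmt.Println(st, lsn, ext) // 0 0 false
}
```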
7 new tests: roundtrip, backward compat, flush-only-after-sync,
not-on-receive, shipper update, monotonicity, group minimum.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ReplicaReceiver.DataAddr()/CtrlAddr() now return canonical ip:port
instead of raw listener addresses that may be wildcard (:port,
0.0.0.0:port, [::]:port).
New canonicalizeListenerAddr() resolves wildcard IPs using the
provided advertised host (from VS listen address). Falls back to
outbound-IP detection when no advertised host is available.
NewReplicaReceiver accepts optional advertisedHost parameter for
multi-NIC correctness. In production, the assignment path already
provides canonical addresses; this fix ensures test patterns with
:0 bind also produce routable addresses.
7 new tests. TestBug3_ReplicaAddr_MustBeIPPort_WildcardBind flips
from FAIL to PASS.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Same-epoch reconciliation now trusts reported roles first:
- one claims primary, other replica → trust roles
- both claim primary → WALHeadLSN heuristic tiebreak
- both claim replica → keep existing, log ambiguity
Replaced addServerAsReplica with upsertServerAsReplica: checks
for existing replica entry by server name before appending.
Prevents duplicate ReplicaInfo rows during restart/replay windows.
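The reconciliation rule as a sketch (the report fields and the direction of the tiebreak, higher WALHeadLSN wins, are assumptions beyond what the summary states):

```go
package main

import "fmt"

type report struct {
	name          string
	claimsPrimary bool
	walHeadLSN    uint64
}

// pickPrimary trusts reported roles first, falls back to the
// WALHeadLSN heuristic when both claim primary, and keeps the
// existing primary (ambiguity logged elsewhere) when neither does.
func pickPrimary(a, b report, existing string) string {
	switch {
	case a.claimsPrimary && !b.claimsPrimary:
		return a.name
	case b.claimsPrimary && !a.claimsPrimary:
		return b.name
	case a.claimsPrimary && b.claimsPrimary:
		if a.walHeadLSN >= b.walHeadLSN {
			return a.name
		}
		return b.name
	default:
		return existing
	}
}

func main() {
	fmt.Println(pickPrimary(report{"s1", true, 0}, report{"s2", false, 0}, "s2"))   // s1
	fmt.Println(pickPrimary(report{"s1", true, 90}, report{"s2", true, 120}, "s1")) // s2
	fmt.Println(pickPrimary(report{"s1", false, 0}, report{"s2", false, 0}, "s1"))  // s1
}
```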
2 new tests: role-trusted same-epoch, duplicate replica prevention.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>