pingqiu
cfec3bff4a
fix: update contract.go field source docs to match P1 implementation
BlockVolState field mapping now matches actual StatusSnapshot():
- WALTailLSN ← super.WALCheckpointLSN (was: flusher.RetentionFloor)
- CommittedLSN ← flusher.CheckpointLSN() V1 interim (was: distCommit)
- CheckpointTrusted ← super.Validate()==nil (was: superblock.Valid)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 days ago
pingqiu
8c326c871c
feat: add contract interfaces and pin/release via release-func pattern (Phase 07 P1)
E5 handoff contract (contract.go):
- BlockVolReader: ReadState() → BlockVolState from real blockvol
- BlockVolPinner: HoldWALRetention/HoldSnapshot/HoldFullBase → release func
- BlockVolExecutor: StreamWALEntries/TransferSnapshot/TransferFullBase/TruncateWAL
- Clear import direction: weed-side imports sw-block, not reverse
StorageAdapter refactored:
- Consumes BlockVolReader + BlockVolPinner interfaces
- Pin/release uses release-func pattern (not map-based tracking)
- PushStorageAdapter for tests (push-based, no blockvol dependency)
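Illustrative sketch of the release-func pattern named above (not the code in this commit). The walPinner type, its fields, and the exact HoldWALRetention signature are assumptions made for the example; the idea is only that the caller gets back a closure instead of the adapter tracking pins in a map.

```go
package main

import (
	"errors"
	"fmt"
)

// walPinner is a hypothetical stand-in for the pinner side of the contract.
type walPinner struct {
	tailLSN uint64 // oldest LSN still retained (assumed meaning)
	holds   int    // number of outstanding retention holds
}

// HoldWALRetention pins WAL from startLSN and hands back a release func.
// A range that starts before the tail is treated as already recycled.
func (p *walPinner) HoldWALRetention(startLSN uint64) (func(), error) {
	if startLSN < p.tailLSN {
		return nil, errors.New("wal range already recycled")
	}
	p.holds++
	released := false
	return func() {
		if released {
			return // calling release twice is harmless
		}
		released = true
		p.holds--
	}, nil
}

func main() {
	p := &walPinner{tailLSN: 10}
	release, err := p.HoldWALRetention(20)
	if err != nil {
		fmt.Println("pin rejected:", err)
		return
	}
	fmt.Println("holds while pinned:", p.holds)
	release()
	release()
	fmt.Println("holds after release:", p.holds)
}
```

Because the release func closes over its own state, no pin ID has to be looked up later, which is the main difference from map-based tracking.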
10 bridge tests:
- 4 control adapter (identity, address change, role mapping, primary)
- 4 storage adapter (retained history, WAL pin reject, snapshot reject, symmetry)
- 1 E2E (assignment → adapter → engine → plan → execute → InSync)
- 1 contract interface verification
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 days ago
pingqiu
05daede7f9
feat: add V2 bridge adapters for blockvol (Phase 07 P0)
Creates sw-block/bridge/blockvol/ — concrete adapters connecting
the V2 engine to real blockvol storage and control-plane state.
control_adapter.go:
- MakeReplicaID: volume-name/server-id (NOT address-derived)
- ToAssignmentIntent: maps master assignment → engine intent
- Role → SessionKind translation (pure mapping, no policy)
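Illustrative sketch of the identity rule above (not the code in this commit). The MakeReplicaID signature and the replica struct are assumptions; the point is that the ID is built from volume name and server ID only, so a DataAddr change cannot alter it.

```go
package main

import "fmt"

// replica is a hypothetical record pairing a stable ID with a mutable address.
type replica struct {
	ID       string
	DataAddr string
}

// MakeReplicaID builds "volume-name/server-id". The address is deliberately
// not an input, so identity survives address changes.
func MakeReplicaID(volumeName, serverID string) string {
	return volumeName + "/" + serverID
}

func main() {
	r := replica{ID: MakeReplicaID("vol1", "server-a"), DataAddr: "10.0.0.1:9333"}
	before := r.ID
	r.DataAddr = "10.0.0.2:9333" // address change after restart
	fmt.Println(before == r.ID, r.ID)
}
```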
storage_adapter.go:
- BlockVolState: maps to real blockvol fields (WAL head/tail,
committed, checkpoint) — NOT reconstructed from metadata
- GetRetainedHistory from real state
- PinSnapshot rejects untrusted checkpoint
- PinWALRetention rejects recycled range
- PinFullBase / ReleaseFullBase
8 bridge tests:
- StableIdentity: ReplicaID = vol/server (not address)
- AddressChangePreservesIdentity: same ID, different address
- RebuildRoleMapping: "rebuilding" → SessionRebuild
- PrimaryNoRecovery: no recovery targets for primary
- RetainedHistoryFromRealState: all fields from BlockVolState
- WALPinRejectsRecycled: tail validation
- SnapshotPinRejectsInvalid: trust validation
- E2E_AssignmentToRecovery: master assignment → adapter →
engine intent → plan → execute → InSync
Adapter replacement order:
P0: control_adapter + storage_adapter (this delivery)
P1: executor_bridge + observe_adapter (deferred)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 days ago
pingqiu
4df61f290b
fix: true mid-executor invalidation test via OnStep hook
CatchUpExecutor.OnStep: optional callback fired between executor-managed
progress steps. Enables deterministic fault injection (epoch bump)
between steps without racing or manual sender calls.
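Illustrative sketch of how an OnStep-style hook allows deterministic fault injection (not the code in this commit). The executor and session types here are simplified assumptions; only the hook idea is taken from the description above.

```go
package main

import (
	"errors"
	"fmt"
)

// session is a hypothetical holder of the current epoch.
type session struct{ epoch uint64 }

// executor is a simplified stand-in for a stepwise executor.
type executor struct {
	sess      *session
	planEpoch uint64         // epoch the plan was made under
	OnStep    func(step int) // optional; fired after each completed step
	released  bool
}

var errInvalidated = errors.New("session_invalidated")

// Execute runs up to steps progress steps, checking session validity
// before each one and releasing resources on every exit path.
func (e *executor) Execute(steps int) (completed int, err error) {
	defer func() { e.released = true }()
	for i := 0; i < steps; i++ {
		if e.sess.epoch != e.planEpoch {
			return completed, errInvalidated
		}
		completed++
		if e.OnStep != nil {
			e.OnStep(i)
		}
	}
	return completed, nil
}

func main() {
	s := &session{epoch: 1}
	e := &executor{sess: s, planEpoch: 1}
	e.OnStep = func(step int) {
		if step == 1 {
			s.epoch++ // inject the epoch bump between steps
		}
	}
	n, err := e.Execute(5)
	fmt.Println(n, err, e.released) // 2 session_invalidated true
}
```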
E2_EpochBump_MidExecutorLoop:
- Executor runs 5 progress steps
- OnStep hook bumps epoch after step index 1 (that is, after 2 successful steps)
- Executor's own loop detects invalidation at the check for step index 2
- Resources released by executor's release path (not manual cancel)
- Log shows session_invalidated + exec_resources_released
This closes the remaining FC2 gap: invalidation is now detected
and cleaned up by the executor itself, not by external code.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 days ago
pingqiu
5b63d34d6b
fix: snapshot+tail WAL pin failure cleanup + true mid-executor epoch test
Finding 1: PlanRebuild snapshot+tail WAL pin failure now fail-closed
- InvalidateSession("wal_pin_failed_during_rebuild", StateNeedsRebuild)
- Snapshot pin released, session invalidated, no dangling state
- New test: E2_RebuildWALPinFailure_SessionCleaned
Finding 2: True mid-executor invalidation test
- Executor makes 2 successful progress steps (60, 70)
- Epoch bumps BETWEEN steps (real mid-execution)
- Third progress step fails — session invalidated
- Resources released via executor cancel
- New test: E2_EpochBump_AfterExecutorProgress
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
4 days ago
pingqiu
332f598606
fix: close P3 failure classes — session cleanup, causal logging, CancelPlan
Finding 1: PlanRebuild now invalidates session on pin failure
- FullBasePin failure → InvalidateSession("full_base_pin_failed", StateNeedsRebuild)
- SnapshotPin failure → InvalidateSession("snapshot_pin_failed", StateNeedsRebuild)
- No dangling rebuild session after resource acquisition failure
Finding 2: Rebuild source logging shows causal reason
- plan_rebuild_full_base now logs: untrusted_checkpoint,
trusted_checkpoint_unreplayable_tail, or no_checkpoint
Finding 3: CancelPlan for address-change cleanup
- New RecoveryDriver.CancelPlan(plan, reason): releases resources +
invalidates session + logs plan_cancelled with reason
- Changed-address test uses CancelPlan (not manual ReleasePlan)
Finding 4: Executor-level epoch-bump test
- Executor's mid-step invalidation detection catches stale session
- Resources released via executor release path, not manual cancel
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
4 days ago
pingqiu
56afa55f13
feat: add P3 failure-class validation through planner/executor (Phase 06)
6 new tests (validation_test.go) mapped to tester expectations E1-E5:
E1/FC1: Changed-address restart through planner/executor
- Active session invalidated by address change
- Sender identity preserved, old plan resources released
- Log shows: endpoint_changed → new session → plan → execute
E2/FC2: Epoch bump mid-execution step
- Partial progress, epoch bumps between steps
- Further progress rejected, executor cancels with resource release
- Log shows: session_invalidated + exec_resources_released
E3/FC5: Cross-layer proof — trusted base + unreplayable tail
- Storage: checkpoint=50, tail=80 → unreplayable
- RebuildSourceDecision → FullBase (not SnapshotTail)
- FullBasePin acquired, executed through RebuildExecutor, released
- Log shows: plan_rebuild_full_base (observable reason)
E4/FC8: Rebuild fallback when trusted-base proof fails
- Untrusted checkpoint → full-base, full-base pin fails → error
- Untrusted checkpoint → full-base, full-base pin succeeds → InSync
- Log shows: full_base_pin_failed
E5: Observability — full recovery chain logged
- Verifies 7 required log events from assignment through completion
Delivery template:
Changed contracts: P3 validates the planner/executor path, not the convenience wrappers
Fail-closed: epoch bump mid-step releases resources + logs cause
Resources: cross-layer proof chain validated end-to-end
Carry-forward: FC3/FC4/FC6/FC7 coverage from prior phases remains sufficient
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
4 days ago
pingqiu
f5c0aab454
fix: rebuild executor consumes bound plan, fix catch-up timing
Planner/executor contract:
- RebuildExecutor.Execute() takes no arguments — consumes plan-bound
RebuildSource, RebuildSnapshotLSN, RebuildTargetLSN
- RecoveryPlan binds all rebuild targets at plan time
- Executor cannot re-derive policy from caller-supplied history
Catch-up timing:
- Removed unused completeTick parameter from CatchUpExecutor.Execute
- Per-step ticks synthesized as startTick + stepIndex + 1
- API shape matches implementation
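Illustrative sketch of the tick synthesis rule above (not the code in this commit). The helper name stepTick is an assumption; the formula is the one stated in the message.

```go
package main

import "fmt"

// stepTick is a hypothetical helper: the tick for a step is derived from
// the start tick and the zero-based step index, so no completion tick
// needs to be supplied by the caller.
func stepTick(startTick uint64, stepIndex int) uint64 {
	return startTick + uint64(stepIndex) + 1
}

func main() {
	for i := 0; i < 3; i++ {
		fmt.Println(stepTick(100, i)) // 101, 102, 103
	}
}
```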
New test: PlanExecuteConsistency_RebuildCannotSwitchSource
- Plans snapshot+tail, then mutates storage history
- Executor succeeds using plan-bound values (not re-derived)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
4 days ago
pingqiu
50442acb2e
feat: add stepwise executor with release symmetry (Phase 06 P2)
New: executor.go — CatchUpExecutor + RebuildExecutor
Replaces convenience wrappers with stepwise execution that owns
resource lifecycle on every exit path.
CatchUpExecutor.Execute:
1. BeginCatchUp (freezes target)
2. Stepwise RecordCatchUpProgress + CheckBudget per step
3. RecordTruncation (if required)
4. CompleteSessionByID
5. Release resources (success or failure)
RebuildExecutor.Execute:
1. BeginConnect + RecordHandshake
2. SelectRebuildFromHistory
3. BeginRebuildTransfer + progress
4. BeginRebuildTailReplay + progress (snapshot+tail)
5. CompleteRebuild
6. Release resources (success or failure)
Both executors:
- Release all pins on every exit path (success, failure, cancellation)
- Check session validity mid-execution (detect epoch bump / endpoint change)
- Log resource release with causal reason
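Illustrative sketch of release symmetry via defer (not the code in this commit). The plan type, the execute function, and the log line format are assumptions; exec_resources_released is the event name used elsewhere in this history.

```go
package main

import (
	"errors"
	"fmt"
)

// plan is a hypothetical recovery plan carrying one release func per pin.
type plan struct {
	releases []func()
}

// errStep is a sample failure used to exercise the error path.
var errStep = errors.New("session_invalidated")

// execute runs steps in order. The deferred block runs on success and on
// failure alike, so pins are released on every exit path and the log line
// carries the causal reason.
func execute(p *plan, steps []func() error) (log string, err error) {
	defer func() {
		for _, release := range p.releases {
			release()
		}
		reason := "completed"
		if err != nil {
			reason = err.Error()
		}
		log = "exec_resources_released: " + reason
	}()
	for _, step := range steps {
		if err = step(); err != nil {
			return "", err
		}
	}
	return "", nil
}

func main() {
	pins := 2
	p := &plan{releases: []func(){func() { pins-- }, func() { pins-- }}}
	steps := []func() error{
		func() error { return nil },
		func() error { return errStep },
	}
	log, err := execute(p, steps)
	fmt.Println(pins, log, err)
}
```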
14 new tests (executor_test.go), mapped to tester expectations:
- E1: Partial catch-up failure releases WAL pin (2 tests)
- E2: Partial rebuild failure releases all pins (1 test)
- E3: Epoch bump / cancel releases resources (3 tests)
- E4: Successful execution releases resources (2 tests)
- E5: Stepwise execution, not convenience wrappers (2 tests)
Delivery template:
Changed contracts: executor owns resource lifecycle (not caller)
Fail-closed: session check mid-execution, release on every error
Resources: WAL/snapshot/full-base pins released on all exit paths
Carry-forward: CompleteCatchUp/CompleteRebuild remain test-only
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
4 days ago
pingqiu
45bf111ce8
fix: derive WAL pin from actual replay need, PlanRebuild fails closed
WAL pin tied to actual recovery contract:
- Truncation-only (replica ahead): no WAL pin acquired
- Real catch-up: pins from replicaFlushedLSN (actual replay start)
- Logs distinguish plan_truncate_only from plan_catchup
PlanRebuild precondition checks:
- Error on missing sender
- Error on no active session
- Error on non-rebuild session kind
- All fail closed with clear error messages
4 new tests:
- ReplicaAhead_NoWALPin: truncation-only, no WAL resources
- PlanRebuild_MissingSender: returns error
- PlanRebuild_NoSession: returns error
- PlanRebuild_NonRebuildSession: returns error
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
4 days ago
pingqiu
d4f7697dd8
fix: add full-base pin and clean up session on WAL pin failure
Full-base rebuild resource:
- StorageAdapter.PinFullBase/ReleaseFullBase for full-extent base image
- PlanRebuild full_base branch now acquires FullBasePin
- RecoveryPlan.FullBasePin field, released by ReleasePlan
Session cleanup on resource failure:
- PlanRecovery invalidates session when WAL pin fails
(no dangling live session after failed resource acquisition)
3 new tests:
- PlanRebuild_FullBase_PinsBaseImage: pin acquired + released
- PlanRebuild_FullBase_PinFailure: logged + error
- PlanRecovery_WALPinFailure_CleansUpSession: session invalidated,
sender disconnected (no dangling state)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
4 days ago
pingqiu
f73a3fdab2
feat: add storage/control adapters and recovery driver (Phase 06 P0/P1)
Phase 06 module boundaries:
adapter.go — StorageAdapter + ControlPlaneAdapter interfaces:
- GetRetainedHistory: real WAL retention state
- PinSnapshot / ReleaseSnapshot: rebuild resource management
- PinWALRetention / ReleaseWALRetention: catch-up resource management
- HandleHeartbeat / HandleFailover: control-plane event conversion
driver.go — RecoveryDriver replaces synchronous convenience:
- PlanRecovery: connect + handshake from storage state + acquire resources
- PlanRebuild: acquire snapshot + WAL pins for rebuild
- ReleasePlan: release all acquired resources
Convenience flow classification:
- ProcessAssignment, UpdateSenderEpoch, InvalidateEpoch → stepwise engine tasks
- ExecuteRecovery → planner (connect + classify)
- CompleteCatchUp, CompleteRebuild → TEST-ONLY convenience
7 new tests (driver_test.go):
- CatchUp plan + execute with WAL pin
- ZeroGap plan (no resources pinned)
- NeedsRebuild → rebuild plan with resource acquisition
- WAL pin failure → logged + error
- Snapshot pin failure → logged + error
- ReplicaAhead truncation through driver
- Cross-layer: storage proves recoverability, engine consumes proof
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
4 days ago
pingqiu
512bb5bcf6
fix: orchestrator owns full catch-up contract (budget + truncation)
CompleteCatchUp now integrates:
- BeginCatchUp with start tick (freezes target)
- RecordCatchUpProgress (skips if already converged, e.g., truncation-only)
- CheckBudget at completion tick (escalates to NeedsRebuild + logs)
- RecordTruncation before completion (logs truncation_recorded)
- Logs causal reason for every rejection/escalation
CatchUpOptions: StartTick/CompleteTick (separate) + TruncateLSN.
3 new orchestrator-level tests:
- ReplicaAhead_TruncateViaOrchestrator: truncation through entry path
- ReplicaAhead_NoTruncate_CompletionRejected: logs completion_rejected
- BudgetEscalation_ViaOrchestrator: budget violation → NeedsRebuild + logs
Observability tests relabeled as sender-level (not entry-path).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
4 days ago
pingqiu
adaff8ddb3
fix: only log endpoint_changed when endpoint actually changed
ProcessAssignment now compares pre/post endpoint state before
logging session_invalidated with "endpoint_changed" reason.
Normal session supersede (same endpoint, assignment_intent) no
longer mislabeled as endpoint change.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
4 days ago
pingqiu
5cdee4a011
fix: orchestrator owns zero-gap completion and per-replica invalidation logging
Zero-gap completion:
- ExecuteRecovery auto-completes zero-gap sessions (no sender call needed)
- RecoveryResult.FinalState = StateInSync for zero-gap
Epoch transition:
- UpdateSenderEpoch: orchestrator-owned epoch advancement with auto-log
- InvalidateEpoch: per-replica session_invalidated events (not aggregate)
Endpoint-change invalidation:
- ProcessAssignment detects session ID change from endpoint update
- Logs per-replica session_invalidated with "endpoint_changed" reason
All integration tests now use orchestrator exclusively for core lifecycle.
No direct sender API calls for recovery execution in integration tests.
1 new test: EndpointChange_LogsInvalidation
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
4 days ago
pingqiu
47238df0d7
fix: add RecoveryOrchestrator as real integrated entry path
New: orchestrator.go — RecoveryOrchestrator drives recovery lifecycle
from assignment through execution to completion/escalation:
- ProcessAssignment: reconcile + session creation + auto-log
- ExecuteRecovery: connect → handshake from RetainedHistory → outcome
- CompleteCatchUp: begin catch-up → progress → complete + auto-log
- CompleteRebuild: connect → handshake → history-driven source →
transfer → tail replay → complete + auto-log
- InvalidateEpoch: invalidate stale sessions + auto-log
All integration tests rewritten to use orchestrator as entry path.
No direct sender API calls in recovery lifecycle.
SessionSnapshot now includes: TruncateRequired/ToLSN/Recorded,
RebuildSource, RebuildPhase.
RecoveryLog is auto-populated by orchestrator at every transition.
7 integration tests via orchestrator:
- ChangedAddress, NeedsRebuild→Rebuild, EpochBump, MultiReplica
- Observability: session snapshot, rebuild snapshot, auto-populated log
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
4 days ago
pingqiu
7436b3b79c
feat: add integration closure and observability (Phase 05 Slice 4)
New files:
- observe.go: RegistryStatus, SenderStatus, RecoveryLog for debugging
- integration_test.go: V2-boundary integration tests through real
engine entry path
Observability:
- Registry.Status() returns full snapshot: per-sender state, session
snapshots, counts by category (InSync, Recovering, Rebuilding)
- RecoveryLog: append-only event log for recovery lifecycle debugging
Integration tests (6):
- ChangedAddress_FullFlow: initial recovery → address change →
sender preserved → new session → recovery with proof
- NeedsRebuild_ThenRebuildAssignment: catch-up fails → NeedsRebuild
→ rebuild assignment → history-driven source → InSync
- EpochBump_DuringRecovery: mid-recovery epoch bump → old session
rejected → new assignment at new epoch → InSync
- MultiReplica_MixedOutcomes: 3 replicas, 3 outcomes via
RetainedHistory proofs, registry status verified
- RegistryStatus_Snapshot: observability snapshot structure
- RecoveryLog: event recording and filtering
Engine module at 54 tests (12 + 18 + 18 + 6).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
4 days ago
pingqiu
4d06622c01
fix: add nil check for RetainedHistory in sender APIs
RecordHandshakeFromHistory and SelectRebuildFromHistory now
return an error instead of panicking on nil history input.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
4 days ago
pingqiu
cc8c529962
fix: connect recovery decisions to RetainedHistory, fix rebuild source
RetainedHistory as engine input:
- RecordHandshakeFromHistory: sender-level API consuming RetainedHistory
directly, returns RecoverabilityProof alongside outcome
- SelectRebuildFromHistory: sender-level API consuming RetainedHistory
for rebuild-source decision
RebuildSourceDecision soundness:
- Now requires BOTH trusted checkpoint AND replayable tail
(CheckpointLSN >= TailLSN and CommittedLSN <= HeadLSN)
- Trusted checkpoint with unreplayable tail falls back to full_base
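Illustrative sketch of the soundness rule above (not the code in this commit). The struct layout and the function name rebuildSource are assumptions; the field names and the two inequalities come from the message.

```go
package main

import "fmt"

// RetainedHistory here is a simplified, assumed shape holding only the
// fields the decision needs.
type RetainedHistory struct {
	HeadLSN           uint64
	TailLSN           uint64
	CommittedLSN      uint64
	CheckpointLSN     uint64
	CheckpointTrusted bool
}

// rebuildSource picks snapshot_tail only when the checkpoint is trusted
// AND the tail from the checkpoint to the committed point is replayable.
func rebuildSource(h RetainedHistory) string {
	replayable := h.CheckpointLSN >= h.TailLSN && h.CommittedLSN <= h.HeadLSN
	if h.CheckpointTrusted && replayable {
		return "snapshot_tail"
	}
	return "full_base"
}

func main() {
	// Trusted checkpoint at 50, but the WAL tail has advanced to 80.
	h := RetainedHistory{HeadLSN: 100, TailLSN: 80, CommittedLSN: 100,
		CheckpointLSN: 50, CheckpointTrusted: true}
	fmt.Println(rebuildSource(h)) // full_base
}
```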
4 new tests:
- TrustedCheckpoint_UnreplayableTail (the regression case)
- SenderDriven_CatchUp (history → proof → outcome → complete)
- SenderDriven_Rebuild_SnapshotTail (history → source → rebuild)
- SenderDriven_Rebuild_FallsBackToFullBase (unreplayable tail)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
4 days ago
pingqiu
ff7ea41099
feat: add engine data/recoverability core (Phase 05 Slice 3)
New file: history.go — RetainedHistory connects recovery decisions
to actual WAL retention state:
- IsRecoverable: checks gap against tail/head boundaries
- MakeHandshakeResult: generates HandshakeResult from retention state
- RebuildSourceDecision: chooses snapshot+tail vs full base from
checkpoint state (trusted vs untrusted)
- ProveRecoverability: generates explicit proof explaining why
recovery is or is not allowed
14 new tests (recoverability_test.go):
- Recoverable/unrecoverable gap (exact boundary, beyond head)
- Trusted/untrusted/no checkpoint → rebuild source selection
- Handshake from retained history → outcome classification
- Recoverability proofs (zero-gap, ahead, within retention, beyond)
- E2E: two replicas driven by retained history (catch-up + rebuild)
- Truncation required for replica ahead of committed
Engine module at 44 tests (12 + 18 + 14).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
4 days ago
pingqiu
368a956aee
fix: correct catch-up entry counting and rebuild transfer gate
Entry counting:
- Session.setRange now initializes recoveredTo = startLSN
- RecordCatchUpProgress delta counts only actual catch-up work
(recoveredTo - startLSN), not the replica's pre-existing prefix
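Illustrative sketch of the counting fix above (not the code in this commit). The session type and method names are simplified assumptions, and the LSN values are made up for the example.

```go
package main

import "fmt"

// session is a hypothetical, minimal catch-up session.
type session struct {
	startLSN    uint64
	recoveredTo uint64
	entries     uint64 // entries replayed during this catch-up only
}

// setRange starts recoveredTo at startLSN so the replica's pre-existing
// prefix is never counted as catch-up work.
func (s *session) setRange(startLSN uint64) {
	s.startLSN = startLSN
	s.recoveredTo = startLSN
}

// recordProgress adds only the LSN delta since the last recorded point.
func (s *session) recordProgress(toLSN uint64) {
	if toLSN <= s.recoveredTo {
		return
	}
	s.entries += toLSN - s.recoveredTo
	s.recoveredTo = toLSN
}

func main() {
	s := &session{}
	s.setRange(70) // replica already holds LSNs up to 70
	s.recordProgress(85)
	s.recordProgress(100)
	fmt.Println(s.entries) // 30, not 100
}
```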
Rebuild transfer gate:
- BeginTailReplay requires TransferredTo >= SnapshotLSN
- Prevents tail replay on incomplete base transfer
3 new regression tests:
- BudgetEntries_NonZeroStart_CountsOnlyDelta (30 entries, within a budget of 50)
- BudgetEntries_NonZeroStart_ExceedsBudget (30 entries exceed a budget of 20)
- Rebuild_PartialTransfer_BlocksTailReplay
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
4 days ago
pingqiu
930de4ba78
feat: add Slice 2 recovery execution tests (Phase 05)
15 new engine-level recovery execution tests:
- Zero-gap / catch-up / needs-rebuild branching (3 tests)
- Stale execution rejection during active recovery (2 tests)
- Bounded catch-up: frozen target, duration, entries, stall (5 tests)
- Completion before convergence rejected
- Rebuild exclusivity: catch-up APIs excluded (1 test)
- Rebuild lifecycle: snapshot+tail, full base, stale ID (3 tests)
- Assignment-driven recovery flow
Engine module now at 27 tests (12 Slice 1 + 15 Slice 2).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
4 days ago
pingqiu
61e9408261
fix: separate stable ReplicaID from Endpoint in registry
Registry is now keyed by stable ReplicaID, not by address.
DataAddr changes preserve sender identity — the core V2 invariant.
Changes:
- ReplicaAssignment{ReplicaID, Endpoint} replaces map[string]Endpoint
- AssignmentIntent.Replicas uses []ReplicaAssignment
- Registry.Reconcile takes []ReplicaAssignment
- Tests use stable IDs ("replica-1", "r1") independent of addresses
New test: ChangedDataAddr_PreservesSenderIdentity
- Same ReplicaID, different DataAddr (10.0.0.1 → 10.0.0.2)
- Sender pointer preserved, session invalidated, new session attached
- This is the exact V1/V1.5 regression that V2 must fix
doc.go: clarified Slice 1 core vs carried-forward files
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
4 days ago
pingqiu
bb24b4b039
fix: encapsulate engine sender/session authority state
All mutable state on Sender and Session is now unexported:
- Sender.state, .epoch, .endpoint, .session, .stopped → accessors
- Session.id, .phase, .kind, etc. → read-only accessors
- Session() replaced by SessionSnapshot() (returns disconnected copy)
- SessionID() and HasActiveSession() for common queries
- AttachSession returns (sessionID, error) not (*Session, error)
- SupersedeSession returns sessionID not *Session
Budget configuration via SessionOption:
- WithBudget(CatchUpBudget) passed to AttachSession
- No direct field mutation on session from external code
New test: Encapsulation_SnapshotIsReadOnly proves snapshot
mutation does not leak back to sender state.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
4 days ago
pingqiu
20d70f9fb6
feat: add V2 engine replication core (Phase 05 Slice 1)
Creates sw-block/engine/replication/ — the real V2 engine ownership core,
promoted from sw-block/prototype/enginev2/ with all accepted invariants.
Files:
- types.go: Endpoint, ReplicaState, SessionKind, SessionPhase, FSM transitions
- sender.go: per-replica Sender with full execution + rebuild APIs
- session.go: Session with identity, phases, frozen target, truncation, budget
- registry.go: Registry with reconcile + assignment intent + epoch invalidation
- budget.go: CatchUpBudget (duration, entries, stall detection)
- rebuild.go: RebuildState FSM (snapshot+tail vs full base)
- outcome.go: HandshakeResult + ClassifyRecoveryOutcome
Tests (ownership_test.go, 13 tests):
- Changed-address invalidation (A10)
- Stale session ID rejected at all APIs (A3)
- Stale completion after supersede (A3)
- Epoch bump invalidates all sessions (A3)
- Stale assignment epoch rejected
- Rebuild exclusivity (catch-up APIs rejected)
- Rebuild full lifecycle
- Frozen target rejects chase (A5)
- Budget violation escalates (A5)
- E2E: 3 replicas, 3 outcomes
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
4 days ago
pingqiu
26a1b33c2e
feat: add A5-A8 acceptance traceability and rebuild-source evidence
Cleanup: removed redundant TargetLSNAtStart from CatchUpBudget.
FrozenTargetLSN on RecoverySession is the single source of truth.
Acceptance traceability (acceptance_test.go):
- A5: 3 evidence tests (unrecoverable gap, budget escalation, frozen target)
- A6: 2 evidence tests (exact boundary, contiguity required)
- A7: 3 evidence tests (snapshot history, catch-up replay, truncation)
- A8: 2 evidence tests (convergence required, truncation required)
Rebuild-source decision evidence:
- snapshot_tail when trusted base exists
- full_base when no snapshot or untrusted
- 3 explicit tests
13 new tests total.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
5 days ago
pingqiu
8f5070679c
fix: make frozen target intrinsic and rebuild completion exclusive
Frozen target is now unconditional:
- FrozenTargetLSN field on RecoverySession, set by BeginCatchUp
- RecordCatchUpProgress enforces FrozenTargetLSN regardless of Budget
- Catch-up is always a bounded (R, H0] contract
Rebuild completion exclusivity:
- CompleteSessionByID explicitly rejects SessionRebuild by kind
- Rebuild sessions can ONLY complete via CompleteRebuild
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
5 days ago
pingqiu
8e4028758f
fix: make rebuild path exclusive, enforce phase discipline, require tick for stall budget
Rebuild exclusivity:
- BeginCatchUp rejects SessionRebuild ("must use rebuild APIs")
- RecordCatchUpProgress rejects SessionRebuild
- Rebuild sessions can only be completed via CompleteRebuild
- All legacy rebuild-through-catch-up paths in tests converted
Phase discipline:
- SelectRebuildSource requires session.Phase == PhaseHandshake
- Cannot skip BeginConnect + RecordHandshake
Stall budget:
- RecordCatchUpProgress requires tick parameter when
ProgressDeadlineTicks > 0 (no silent stall budget bypass)
3 new tests: rebuild exclusivity (catch-up APIs rejected),
rebuild source requires handshake phase, stall budget requires tick.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
5 days ago
pingqiu
5b66a85f92
fix: wire rebuild FSM into sender, enforce frozen target, fix entry counting
Rebuild execution path:
- newRecoverySession auto-initializes RebuildState for SessionRebuild
- Sender rebuild APIs: SelectRebuildSource, BeginRebuildTransfer,
RecordRebuildTransferProgress, BeginRebuildTailReplay,
RecordRebuildTailProgress, CompleteRebuild
- All rebuild APIs are sender-authority-gated by sessionID
- E2E rebuild test now drives through rebuild FSM, not catch-up APIs
Bounded CatchUp enforcement:
- BeginCatchUp freezes TargetLSNAtStart from session.TargetLSN
- RecordCatchUpProgress rejects progress beyond frozen target
- Entry counting uses LSN delta (recoveredTo - previous), not call count
- Merged RecordCatchUpProgressAt into RecordCatchUpProgress (tick param)
5 new tests: target-frozen enforcement, sender-level rebuild via
rebuild APIs, reject non-rebuild, reject stale ID on rebuild.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
5 days ago
pingqiu
3f0048cbd9
feat: add bounded CatchUp budget and Rebuild mode state machine (Phase 4.5 P0)
Bounded CatchUp:
- CatchUpBudget: MaxDurationTicks, MaxEntries, ProgressDeadlineTicks
- BudgetCheck: runtime consumption tracker (StartTick, EntriesReplayed, LastProgressTick)
- Sender.CheckBudget: evaluates budget, escalates to NeedsRebuild on violation
- RecordCatchUpProgressAt: tracks progress tick for stall detection
- BeginCatchUp accepts optional startTick for budget tracking
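Illustrative sketch of a budget evaluation (not the code in this commit). The struct field names come from the message; the violation function, its reason strings, and the rule that a zero limit disables a check are assumptions. It also assumes now is never earlier than the recorded ticks.

```go
package main

import "fmt"

// CatchUpBudget holds the configured limits.
type CatchUpBudget struct {
	MaxDurationTicks      uint64
	MaxEntries            uint64
	ProgressDeadlineTicks uint64
}

// BudgetCheck tracks runtime consumption.
type BudgetCheck struct {
	StartTick        uint64
	EntriesReplayed  uint64
	LastProgressTick uint64
}

// violation returns "" while within budget, else a hypothetical reason
// string. A zero limit is treated as "no limit" (assumption).
func violation(b CatchUpBudget, c BudgetCheck, now uint64) string {
	switch {
	case b.MaxDurationTicks > 0 && now-c.StartTick > b.MaxDurationTicks:
		return "duration_exceeded"
	case b.MaxEntries > 0 && c.EntriesReplayed > b.MaxEntries:
		return "entries_exceeded"
	case b.ProgressDeadlineTicks > 0 && now-c.LastProgressTick > b.ProgressDeadlineTicks:
		return "progress_stalled"
	}
	return ""
}

func main() {
	b := CatchUpBudget{MaxDurationTicks: 100, MaxEntries: 20, ProgressDeadlineTicks: 5}
	c := BudgetCheck{StartTick: 10, EntriesReplayed: 30, LastProgressTick: 12}
	fmt.Println(violation(b, c, 14)) // entries_exceeded
}
```

A caller that sees a non-empty reason would escalate the replica to NeedsRebuild, as described for Sender.CheckBudget.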
Rebuild state machine:
- RebuildSource: snapshot_tail (preferred) vs full_base (fallback)
- RebuildPhase: init → source_select → transfer → tail_replay → completed|aborted
- SelectSource: chooses based on snapshot availability
- Phase ordering enforced, transfer regression rejected
- ReadyToComplete validates target reached
13 new tests: budget enforcement (duration, entries, stall, no-budget),
sender budget integration, rebuild lifecycle (snapshot+tail, full base,
abort, phase order, regression), E2E bounded catch-up → rebuild.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
5 days ago
pingqiu
90c39b549d
feat: add prototype scenario closure (Phase 04 P4)
Maps V2 acceptance criteria A1-A7, A10 to enginev2 prototype evidence.
Adds 4 V2-boundary scenarios against the prototype.
Scenario tests:
- A1: committed data survives promotion (WAL truncation boundary)
- A2: uncommitted data truncated, not revived
- A3: stale epoch fenced at sender + session + assignment layers
- A4: short-gap catch-up with WAL-backed proof + data verification
- A5: unrecoverable gap escalates to NeedsRebuild with proof
- A6: recoverability boundary exact (tail +/- 1 LSN)
- A7: historical data correct after tail advancement (snapshot)
- A10: changed-address → invalidation → new assignment → recovery
V2-boundary scenarios:
- NeedsRebuild persists across topology update
- catch-up does not overwrite safe data
- 5 disconnect/reconnect cycles preserve sender identity
- full V2 harness: 3 replicas, 3 outcomes (zero-gap, catch-up, rebuild)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
5 days ago
pingqiu
942a0b7da7
fix: strengthen IsRecoverable contiguity check and StateAt snapshot correctness
IsRecoverable now verifies three conditions:
- startExclusive >= tailLSN (not recycled)
- endInclusive <= headLSN (within WAL)
- all LSNs in range exist contiguously (no holes)
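Illustrative sketch of the three-condition check above (not the code in this commit). The wal type and its map-based storage are assumptions chosen to keep the example small; the parameter names and conditions follow the message.

```go
package main

import "fmt"

// wal is a hypothetical retained-history model: entries with
// tailLSN < lsn <= headLSN are expected to be present.
type wal struct {
	tailLSN uint64
	headLSN uint64
	entries map[uint64]bool
}

// IsRecoverable reports whether (startExclusive, endInclusive] can be
// replayed: not recycled, within the WAL, and with no holes.
func (w *wal) IsRecoverable(startExclusive, endInclusive uint64) bool {
	if startExclusive < w.tailLSN {
		return false // range begins in recycled history
	}
	if endInclusive > w.headLSN {
		return false // range extends past what was written
	}
	for lsn := startExclusive + 1; lsn <= endInclusive; lsn++ {
		if !w.entries[lsn] {
			return false // hole in the retained range
		}
	}
	return true
}

func main() {
	w := &wal{tailLSN: 10, headLSN: 20, entries: map[uint64]bool{}}
	for lsn := uint64(11); lsn <= 20; lsn++ {
		w.entries[lsn] = true
	}
	fmt.Println(w.IsRecoverable(10, 20), w.IsRecoverable(9, 20), w.IsRecoverable(10, 21))
}
```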
StateAt now uses base snapshot captured during AdvanceTail:
- returns nil for LSNs before snapshot boundary (unreconstructable)
- correctly includes block state from recycled entries via snapshot
5 new tests: end-beyond-head, missing entries, state after tail
advance, nil before snapshot, block last written before tail.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
5 days ago
pingqiu
c89709e47e
feat: add WAL history model and recoverability proof (Phase 04 P3)
Adds minimal historical-data prototype to enginev2:
- WALHistory: retained-prefix model with Append, Commit, AdvanceTail,
Truncate, EntriesInRange, IsRecoverable, StateAt
- MakeHandshakeResult connects WAL state to outcome classification
- RecordTruncation execution API for divergent tail cleanup
- CompleteSessionByID gates on truncation when required
- Zero-gap requires exact equality (FlushedLSN == CommittedLSN)
- Replica-ahead classified as CatchUp with mandatory truncation
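Illustrative sketch of the outcome branching described above (not the code in this commit). The outcome struct, the classify function, the Kind strings, and the use of tailLSN as the needs-rebuild boundary are assumptions; the zero-gap equality rule and the replica-ahead truncation rule come from the message.

```go
package main

import "fmt"

// outcome is a hypothetical classification result.
type outcome struct {
	Kind             string
	TruncateRequired bool
	TruncateToLSN    uint64
}

// classify maps a replica's flushed LSN to a recovery outcome.
func classify(flushedLSN, committedLSN, tailLSN uint64) outcome {
	switch {
	case flushedLSN == committedLSN:
		return outcome{Kind: "zero_gap"} // exact equality only
	case flushedLSN > committedLSN:
		// Replica is ahead: catch-up with mandatory truncation of the
		// divergent tail back to the committed point.
		return outcome{Kind: "catch_up", TruncateRequired: true, TruncateToLSN: committedLSN}
	case flushedLSN >= tailLSN:
		return outcome{Kind: "catch_up"} // gap is still retained in the WAL
	default:
		return outcome{Kind: "needs_rebuild"} // gap starts in recycled history
	}
}

func main() {
	fmt.Println(classify(100, 100, 40))
	fmt.Println(classify(110, 100, 40))
	fmt.Println(classify(60, 100, 40))
	fmt.Println(classify(20, 100, 40))
}
```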
15 new tests: WAL basics, provable recoverability, unprovable gap,
exact boundary, truncation enforcement, WAL-backed end-to-end
recovery with data verification.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
6 days ago
pingqiu
edec7098e8
feat: add V2 protocol simulator and enginev2 sender/session prototype
Adds sw-block/ directory with:
- distsim: protocol correctness simulator (96 tests)
- cluster model with epoch fencing, barrier semantics, commit modes
- endpoint identity, control-plane flow, candidate eligibility
- timeout events, timer races, same-tick ordering
- session ownership tracking with ID-based stale fencing
- enginev2: standalone V2 sender/session implementation (63 tests)
- per-replica Sender with identity-preserving reconciliation
- RecoverySession with FSM phase transitions and session ID
- execution APIs: BeginConnect, RecordHandshake, BeginCatchUp,
RecordCatchUpProgress, CompleteSessionByID — all sender-authority-gated
- recovery outcome branching: zero-gap, catch-up, needs-rebuild
- assignment-intent orchestration with epoch fencing
- design docs: acceptance criteria, open questions, first-slice spec,
protocol development process
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
6 days ago