seaweedfs

Commit Graph

Author	SHA1	Message	Date
pingqiu	cfec3bff4a	fix: update contract.go field source docs to match P1 implementation BlockVolState field mapping now matches actual StatusSnapshot(): - WALTailLSN ← super.WALCheckpointLSN (was: flusher.RetentionFloor) - CommittedLSN ← flusher.CheckpointLSN() V1 interim (was: distCommit) - CheckpointTrusted ← super.Validate()==nil (was: superblock.Valid) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	3 days ago
pingqiu	8c326c871c	feat: add contract interfaces and pin/release via release-func pattern (Phase 07 P1) E5 handoff contract (contract.go): - BlockVolReader: ReadState() → BlockVolState from real blockvol - BlockVolPinner: HoldWALRetention/HoldSnapshot/HoldFullBase → release func - BlockVolExecutor: StreamWALEntries/TransferSnapshot/TransferFullBase/TruncateWAL - Clear import direction: weed-side imports sw-block, not reverse StorageAdapter refactored: - Consumes BlockVolReader + BlockVolPinner interfaces - Pin/release uses release-func pattern (not map-based tracking) - PushStorageAdapter for tests (push-based, no blockvol dependency) 10 bridge tests: - 4 control adapter (identity, address change, role mapping, primary) - 4 storage adapter (retained history, WAL pin reject, snapshot reject, symmetry) - 1 E2E (assignment → adapter → engine → plan → execute → InSync) - 1 contract interface verification Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	3 days ago
pingqiu	05daede7f9	feat: add V2 bridge adapters for blockvol (Phase 07 P0) Creates sw-block/bridge/blockvol/ — concrete adapters connecting the V2 engine to real blockvol storage and control-plane state. control_adapter.go: - MakeReplicaID: volume-name/server-id (NOT address-derived) - ToAssignmentIntent: maps master assignment → engine intent - Role → SessionKind translation (pure mapping, no policy) storage_adapter.go: - BlockVolState: maps to real blockvol fields (WAL head/tail, committed, checkpoint) — NOT reconstructed from metadata - GetRetainedHistory from real state - PinSnapshot rejects untrusted checkpoint - PinWALRetention rejects recycled range - PinFullBase / ReleaseFullBase 8 bridge tests: - StableIdentity: ReplicaID = vol/server (not address) - AddressChangePreservesIdentity: same ID, different address - RebuildRoleMapping: "rebuilding" → SessionRebuild - PrimaryNoRecovery: no recovery targets for primary - RetainedHistoryFromRealState: all fields from BlockVolState - WALPinRejectsRecycled: tail validation - SnapshotPinRejectsInvalid: trust validation - E2E_AssignmentToRecovery: master assignment → adapter → engine intent → plan → execute → InSync Adapter replacement order: P0: control_adapter + storage_adapter (this delivery) P1: executor_bridge + observe_adapter (deferred) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	3 days ago
pingqiu	4df61f290b	fix: true mid-executor invalidation test via OnStep hook CatchUpExecutor.OnStep: optional callback fired between executor-managed progress steps. Enables deterministic fault injection (epoch bump) between steps without racing or manual sender calls. E2_EpochBump_MidExecutorLoop: - Executor runs 5 progress steps - OnStep hook bumps epoch after step 1 (after 2 successful steps) - Executor's own loop detects invalidation at step 2's check - Resources released by executor's release path (not manual cancel) - Log shows session_invalidated + exec_resources_released This closes the remaining FC2 gap: invalidation is now detected and cleaned up by the executor itself, not by external code. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	3 days ago
pingqiu	5b63d34d6b	fix: snapshot+tail WAL pin failure cleanup + true mid-executor epoch test Finding 1: PlanRebuild snapshot+tail WAL pin failure now fail-closed - InvalidateSession("wal_pin_failed_during_rebuild", StateNeedsRebuild) - Snapshot pin released, session invalidated, no dangling state - New test: E2_RebuildWALPinFailure_SessionCleaned Finding 2: True mid-executor invalidation test - Executor makes 2 successful progress steps (60, 70) - Epoch bumps BETWEEN steps (real mid-execution) - Third progress step fails — session invalidated - Resources released via executor cancel - New test: E2_EpochBump_AfterExecutorProgress Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	4 days ago
pingqiu	332f598606	fix: close P3 failure classes — session cleanup, causal logging, CancelPlan Finding 1: PlanRebuild now invalidates session on pin failure - FullBasePin failure → InvalidateSession("full_base_pin_failed", StateNeedsRebuild) - SnapshotPin failure → InvalidateSession("snapshot_pin_failed", StateNeedsRebuild) - No dangling rebuild session after resource acquisition failure Finding 2: Rebuild source logging shows causal reason - plan_rebuild_full_base now logs: untrusted_checkpoint, trusted_checkpoint_unreplayable_tail, or no_checkpoint Finding 3: CancelPlan for address-change cleanup - New RecoveryDriver.CancelPlan(plan, reason): releases resources + invalidates session + logs plan_cancelled with reason - Changed-address test uses CancelPlan (not manual ReleasePlan) Finding 4: Executor-level epoch-bump test - Executor's mid-step invalidation detection catches stale session - Resources released via executor release path, not manual cancel Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	4 days ago
pingqiu	56afa55f13	feat: add P3 failure-class validation through planner/executor (Phase 06) 6 new tests (validation_test.go) mapped to tester expectations E1-E5: E1/FC1: Changed-address restart through planner/executor - Active session invalidated by address change - Sender identity preserved, old plan resources released - Log shows: endpoint_changed → new session → plan → execute E2/FC2: Epoch bump mid-execution step - Partial progress, epoch bumps between steps - Further progress rejected, executor cancels with resource release - Log shows: session_invalidated + exec_resources_released E3/FC5: Cross-layer proof — trusted base + unreplayable tail - Storage: checkpoint=50, tail=80 → unreplayable - RebuildSourceDecision → FullBase (not SnapshotTail) - FullBasePin acquired, executed through RebuildExecutor, released - Log shows: plan_rebuild_full_base (observable reason) E4/FC8: Rebuild fallback when trusted-base proof fails - Untrusted checkpoint → full-base, full-base pin fails → error - Untrusted checkpoint → full-base, full-base pin succeeds → InSync - Log shows: full_base_pin_failed E5: Observability — full recovery chain logged - Verifies 7 required log events from assignment through completion Delivery template: Changed contracts: P3 validates planner/executor path, not convenience Fail-closed: epoch bump mid-step releases resources + logs cause Resources: cross-layer proof chain validated end-to-end Carry-forward: FC3/FC4/FC6/FC7 sufficient from prior phases Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	4 days ago
pingqiu	f5c0aab454	fix: rebuild executor consumes bound plan, fix catch-up timing Planner/executor contract: - RebuildExecutor.Execute() takes no arguments — consumes plan-bound RebuildSource, RebuildSnapshotLSN, RebuildTargetLSN - RecoveryPlan binds all rebuild targets at plan time - Executor cannot re-derive policy from caller-supplied history Catch-up timing: - Removed unused completeTick parameter from CatchUpExecutor.Execute - Per-step ticks synthesized as startTick + stepIndex + 1 - API shape matches implementation New test: PlanExecuteConsistency_RebuildCannotSwitchSource - Plans snapshot+tail, then mutates storage history - Executor succeeds using plan-bound values (not re-derived) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	4 days ago
pingqiu	50442acb2e	feat: add stepwise executor with release symmetry (Phase 06 P2) New: executor.go — CatchUpExecutor + RebuildExecutor Replaces convenience wrappers with stepwise execution that owns resource lifecycle on every exit path. CatchUpExecutor.Execute: 1. BeginCatchUp (freezes target) 2. Stepwise RecordCatchUpProgress + CheckBudget per step 3. RecordTruncation (if required) 4. CompleteSessionByID 5. Release resources (success or failure) RebuildExecutor.Execute: 1. BeginConnect + RecordHandshake 2. SelectRebuildFromHistory 3. BeginRebuildTransfer + progress 4. BeginRebuildTailReplay + progress (snapshot+tail) 5. CompleteRebuild 6. Release resources (success or failure) Both executors: - Release all pins on every exit path (success, failure, cancellation) - Check session validity mid-execution (detect epoch bump / endpoint change) - Log resource release with causal reason 14 new tests (executor_test.go), mapped to tester expectations: - E1: Partial catch-up failure releases WAL pin (2 tests) - E2: Partial rebuild failure releases all pins (1 test) - E3: Epoch bump / cancel releases resources (3 tests) - E4: Successful execution releases resources (2 tests) - E5: Stepwise not convenience (2 tests) Delivery template: Changed contracts: executor owns resource lifecycle (not caller) Fail-closed: session check mid-execution, release on every error Resources: WAL/snapshot/full-base pins released on all exit paths Carry-forward: CompleteCatchUp/CompleteRebuild remain test-only Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	4 days ago
pingqiu	45bf111ce8	fix: derive WAL pin from actual replay need, PlanRebuild fails closed WAL pin tied to actual recovery contract: - Truncation-only (replica ahead): no WAL pin acquired - Real catch-up: pins from replicaFlushedLSN (actual replay start) - Logs distinguish plan_truncate_only from plan_catchup PlanRebuild precondition checks: - Error on missing sender - Error on no active session - Error on non-rebuild session kind - All fail closed with clear error messages 4 new tests: - ReplicaAhead_NoWALPin: truncation-only, no WAL resources - PlanRebuild_MissingSender: returns error - PlanRebuild_NoSession: returns error - PlanRebuild_NonRebuildSession: returns error Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	4 days ago
pingqiu	d4f7697dd8	fix: add full-base pin and clean up session on WAL pin failure Full-base rebuild resource: - StorageAdapter.PinFullBase/ReleaseFullBase for full-extent base image - PlanRebuild full_base branch now acquires FullBasePin - RecoveryPlan.FullBasePin field, released by ReleasePlan Session cleanup on resource failure: - PlanRecovery invalidates session when WAL pin fails (no dangling live session after failed resource acquisition) 3 new tests: - PlanRebuild_FullBase_PinsBaseImage: pin acquired + released - PlanRebuild_FullBase_PinFailure: logged + error - PlanRecovery_WALPinFailure_CleansUpSession: session invalidated, sender disconnected (no dangling state) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	4 days ago
pingqiu	f73a3fdab2	feat: add storage/control adapters and recovery driver (Phase 06 P0/P1) Phase 06 module boundaries: adapter.go — StorageAdapter + ControlPlaneAdapter interfaces: - GetRetainedHistory: real WAL retention state - PinSnapshot / ReleaseSnapshot: rebuild resource management - PinWALRetention / ReleaseWALRetention: catch-up resource management - HandleHeartbeat / HandleFailover: control-plane event conversion driver.go — RecoveryDriver replaces synchronous convenience: - PlanRecovery: connect + handshake from storage state + acquire resources - PlanRebuild: acquire snapshot + WAL pins for rebuild - ReleasePlan: release all acquired resources Convenience flow classification: - ProcessAssignment, UpdateSenderEpoch, InvalidateEpoch → stepwise engine tasks - ExecuteRecovery → planner (connect + classify) - CompleteCatchUp, CompleteRebuild → TEST-ONLY convenience 7 new tests (driver_test.go): - CatchUp plan + execute with WAL pin - ZeroGap plan (no resources pinned) - NeedsRebuild → rebuild plan with resource acquisition - WAL pin failure → logged + error - Snapshot pin failure → logged + error - ReplicaAhead truncation through driver - Cross-layer: storage proves recoverability, engine consumes proof Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	4 days ago
pingqiu	512bb5bcf6	fix: orchestrator owns full catch-up contract (budget + truncation) CompleteCatchUp now integrates: - BeginCatchUp with start tick (freezes target) - RecordCatchUpProgress (skips if already converged, e.g., truncation-only) - CheckBudget at completion tick (escalates to NeedsRebuild + logs) - RecordTruncation before completion (logs truncation_recorded) - Logs causal reason for every rejection/escalation CatchUpOptions: StartTick/CompleteTick (separate) + TruncateLSN. 3 new orchestrator-level tests: - ReplicaAhead_TruncateViaOrchestrator: truncation through entry path - ReplicaAhead_NoTruncate_CompletionRejected: logs completion_rejected - BudgetEscalation_ViaOrchestrator: budget violation → NeedsRebuild + logs Observability tests relabeled as sender-level (not entry-path). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	4 days ago
pingqiu	adaff8ddb3	fix: only log endpoint_changed when endpoint actually changed ProcessAssignment now compares pre/post endpoint state before logging session_invalidated with "endpoint_changed" reason. Normal session supersede (same endpoint, assignment_intent) no longer mislabeled as endpoint change. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	4 days ago
pingqiu	5cdee4a011	fix: orchestrator owns zero-gap completion and per-replica invalidation logging Zero-gap completion: - ExecuteRecovery auto-completes zero-gap sessions (no sender call needed) - RecoveryResult.FinalState = StateInSync for zero-gap Epoch transition: - UpdateSenderEpoch: orchestrator-owned epoch advancement with auto-log - InvalidateEpoch: per-replica session_invalidated events (not aggregate) Endpoint-change invalidation: - ProcessAssignment detects session ID change from endpoint update - Logs per-replica session_invalidated with "endpoint_changed" reason All integration tests now use orchestrator exclusively for core lifecycle. No direct sender API calls for recovery execution in integration tests. 1 new test: EndpointChange_LogsInvalidation Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	4 days ago
pingqiu	47238df0d7	fix: add RecoveryOrchestrator as real integrated entry path New: orchestrator.go — RecoveryOrchestrator drives recovery lifecycle from assignment through execution to completion/escalation: - ProcessAssignment: reconcile + session creation + auto-log - ExecuteRecovery: connect → handshake from RetainedHistory → outcome - CompleteCatchUp: begin catch-up → progress → complete + auto-log - CompleteRebuild: connect → handshake → history-driven source → transfer → tail replay → complete + auto-log - InvalidateEpoch: invalidate stale sessions + auto-log All integration tests rewritten to use orchestrator as entry path. No direct sender API calls in recovery lifecycle. SessionSnapshot now includes: TruncateRequired/ToLSN/Recorded, RebuildSource, RebuildPhase. RecoveryLog is auto-populated by orchestrator at every transition. 7 integration tests via orchestrator: - ChangedAddress, NeedsRebuild→Rebuild, EpochBump, MultiReplica - Observability: session snapshot, rebuild snapshot, auto-populated log Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	4 days ago
pingqiu	7436b3b79c	feat: add integration closure and observability (Phase 05 Slice 4) New files: - observe.go: RegistryStatus, SenderStatus, RecoveryLog for debugging - integration_test.go: V2-boundary integration tests through real engine entry path Observability: - Registry.Status() returns full snapshot: per-sender state, session snapshots, counts by category (InSync, Recovering, Rebuilding) - RecoveryLog: append-only event log for recovery lifecycle debugging Integration tests (6): - ChangedAddress_FullFlow: initial recovery → address change → sender preserved → new session → recovery with proof - NeedsRebuild_ThenRebuildAssignment: catch-up fails → NeedsRebuild → rebuild assignment → history-driven source → InSync - EpochBump_DuringRecovery: mid-recovery epoch bump → old session rejected → new assignment at new epoch → InSync - MultiReplica_MixedOutcomes: 3 replicas, 3 outcomes via RetainedHistory proofs, registry status verified - RegistryStatus_Snapshot: observability snapshot structure - RecoveryLog: event recording and filtering Engine module at 54 tests (12 + 18 + 18 + 6). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	4 days ago
pingqiu	4d06622c01	fix: add nil check for RetainedHistory in sender APIs RecordHandshakeFromHistory and SelectRebuildFromHistory now return an error instead of panicking on nil history input. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	4 days ago
pingqiu	cc8c529962	fix: connect recovery decisions to RetainedHistory, fix rebuild source RetainedHistory as engine input: - RecordHandshakeFromHistory: sender-level API consuming RetainedHistory directly, returns RecoverabilityProof alongside outcome - SelectRebuildFromHistory: sender-level API consuming RetainedHistory for rebuild-source decision RebuildSourceDecision soundness: - Now requires BOTH trusted checkpoint AND replayable tail (CheckpointLSN >= TailLSN and CommittedLSN <= HeadLSN) - Trusted checkpoint with unreplayable tail falls back to full_base 4 new tests: - TrustedCheckpoint_UnreplayableTail (the regression case) - SenderDriven_CatchUp (history → proof → outcome → complete) - SenderDriven_Rebuild_SnapshotTail (history → source → rebuild) - SenderDriven_Rebuild_FallsBackToFullBase (unreplayable tail) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	4 days ago
pingqiu	ff7ea41099	feat: add engine data/recoverability core (Phase 05 Slice 3) New file: history.go — RetainedHistory connects recovery decisions to actual WAL retention state: - IsRecoverable: checks gap against tail/head boundaries - MakeHandshakeResult: generates HandshakeResult from retention state - RebuildSourceDecision: chooses snapshot+tail vs full base from checkpoint state (trusted vs untrusted) - ProveRecoverability: generates explicit proof explaining why recovery is or is not allowed 14 new tests (recoverability_test.go): - Recoverable/unrecoverable gap (exact boundary, beyond head) - Trusted/untrusted/no checkpoint → rebuild source selection - Handshake from retained history → outcome classification - Recoverability proofs (zero-gap, ahead, within retention, beyond) - E2E: two replicas driven by retained history (catch-up + rebuild) - Truncation required for replica ahead of committed Engine module at 44 tests (12 + 18 + 14). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	4 days ago
pingqiu	368a956aee	fix: correct catch-up entry counting and rebuild transfer gate Entry counting: - Session.setRange now initializes recoveredTo = startLSN - RecordCatchUpProgress delta counts only actual catch-up work (recoveredTo - startLSN), not the replica's pre-existing prefix Rebuild transfer gate: - BeginTailReplay requires TransferredTo >= SnapshotLSN - Prevents tail replay on incomplete base transfer 3 new regression tests: - BudgetEntries_NonZeroStart_CountsOnlyDelta (30 entries within 50 budget) - BudgetEntries_NonZeroStart_ExceedsBudget (30 entries exceeds 20 budget) - Rebuild_PartialTransfer_BlocksTailReplay Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	4 days ago
pingqiu	930de4ba78	feat: add Slice 2 recovery execution tests (Phase 05) 15 new engine-level recovery execution tests: - Zero-gap / catch-up / needs-rebuild branching (3 tests) - Stale execution rejection during active recovery (2 tests) - Bounded catch-up: frozen target, duration, entries, stall (5 tests) - Completion before convergence rejected - Rebuild exclusivity: catch-up APIs excluded (1 test) - Rebuild lifecycle: snapshot+tail, full base, stale ID (3 tests) - Assignment-driven recovery flow Engine module now at 27 tests (12 Slice 1 + 15 Slice 2). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	4 days ago
pingqiu	61e9408261	fix: separate stable ReplicaID from Endpoint in registry Registry is now keyed by stable ReplicaID, not by address. DataAddr changes preserve sender identity — the core V2 invariant. Changes: - ReplicaAssignment{ReplicaID, Endpoint} replaces map[string]Endpoint - AssignmentIntent.Replicas uses []ReplicaAssignment - Registry.Reconcile takes []ReplicaAssignment - Tests use stable IDs ("replica-1", "r1") independent of addresses New test: ChangedDataAddr_PreservesSenderIdentity - Same ReplicaID, different DataAddr (10.0.0.1 → 10.0.0.2) - Sender pointer preserved, session invalidated, new session attached - This is the exact V1/V1.5 regression that V2 must fix doc.go: clarified Slice 1 core vs carried-forward files Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	4 days ago
pingqiu	bb24b4b039	fix: encapsulate engine sender/session authority state All mutable state on Sender and Session is now unexported: - Sender.state, .epoch, .endpoint, .session, .stopped → accessors - Session.id, .phase, .kind, etc. → read-only accessors - Session() replaced by SessionSnapshot() (returns disconnected copy) - SessionID() and HasActiveSession() for common queries - AttachSession returns (sessionID, error) not (Session, error) - SupersedeSession returns sessionID not Session Budget configuration via SessionOption: - WithBudget(CatchUpBudget) passed to AttachSession - No direct field mutation on session from external code New test: Encapsulation_SnapshotIsReadOnly proves snapshot mutation does not leak back to sender state. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	4 days ago
pingqiu	20d70f9fb6	feat: add V2 engine replication core (Phase 05 Slice 1) Creates sw-block/engine/replication/ — the real V2 engine ownership core, promoted from sw-block/prototype/enginev2/ with all accepted invariants. Files: - types.go: Endpoint, ReplicaState, SessionKind, SessionPhase, FSM transitions - sender.go: per-replica Sender with full execution + rebuild APIs - session.go: Session with identity, phases, frozen target, truncation, budget - registry.go: Registry with reconcile + assignment intent + epoch invalidation - budget.go: CatchUpBudget (duration, entries, stall detection) - rebuild.go: RebuildState FSM (snapshot+tail vs full base) - outcome.go: HandshakeResult + ClassifyRecoveryOutcome Tests (ownership_test.go, 13 tests): - Changed-address invalidation (A10) - Stale session ID rejected at all APIs (A3) - Stale completion after supersede (A3) - Epoch bump invalidates all sessions (A3) - Stale assignment epoch rejected - Rebuild exclusivity (catch-up APIs rejected) - Rebuild full lifecycle - Frozen target rejects chase (A5) - Budget violation escalates (A5) - E2E: 3 replicas, 3 outcomes Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	4 days ago
pingqiu	26a1b33c2e	feat: add A5-A8 acceptance traceability and rebuild-source evidence Cleanup: removed redundant TargetLSNAtStart from CatchUpBudget. FrozenTargetLSN on RecoverySession is the single source of truth. Acceptance traceability (acceptance_test.go): - A5: 3 evidence tests (unrecoverable gap, budget escalation, frozen target) - A6: 2 evidence tests (exact boundary, contiguity required) - A7: 3 evidence tests (snapshot history, catch-up replay, truncation) - A8: 2 evidence tests (convergence required, truncation required) Rebuild-source decision evidence: - snapshot_tail when trusted base exists - full_base when no snapshot or untrusted - 3 explicit tests 13 new tests total. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	5 days ago
pingqiu	8f5070679c	fix: make frozen target intrinsic and rebuild completion exclusive Frozen target is now unconditional: - FrozenTargetLSN field on RecoverySession, set by BeginCatchUp - RecordCatchUpProgress enforces FrozenTargetLSN regardless of Budget - Catch-up is always a bounded (R, H0] contract Rebuild completion exclusivity: - CompleteSessionByID explicitly rejects SessionRebuild by kind - Rebuild sessions can ONLY complete via CompleteRebuild Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	5 days ago
pingqiu	8e4028758f	fix: make rebuild path exclusive, enforce phase discipline, require tick for stall budget Rebuild exclusivity: - BeginCatchUp rejects SessionRebuild ("must use rebuild APIs") - RecordCatchUpProgress rejects SessionRebuild - Rebuild sessions can only be completed via CompleteRebuild - All legacy rebuild-through-catch-up paths in tests converted Phase discipline: - SelectRebuildSource requires session.Phase == PhaseHandshake - Cannot skip BeginConnect + RecordHandshake Stall budget: - RecordCatchUpProgress requires tick parameter when ProgressDeadlineTicks > 0 (no silent stall budget bypass) 3 new tests: rebuild exclusivity (catch-up APIs rejected), rebuild source requires handshake phase, stall budget requires tick. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	5 days ago
pingqiu	5b66a85f92	fix: wire rebuild FSM into sender, enforce frozen target, fix entry counting Rebuild execution path: - newRecoverySession auto-initializes RebuildState for SessionRebuild - Sender rebuild APIs: SelectRebuildSource, BeginRebuildTransfer, RecordRebuildTransferProgress, BeginRebuildTailReplay, RecordRebuildTailProgress, CompleteRebuild - All rebuild APIs are sender-authority-gated by sessionID - E2E rebuild test now drives through rebuild FSM, not catch-up APIs Bounded CatchUp enforcement: - BeginCatchUp freezes TargetLSNAtStart from session.TargetLSN - RecordCatchUpProgress rejects progress beyond frozen target - Entry counting uses LSN delta (recoveredTo - previous), not call count - Merged RecordCatchUpProgressAt into RecordCatchUpProgress (tick param) 5 new tests: target-frozen enforcement, sender-level rebuild via rebuild APIs, reject non-rebuild, reject stale ID on rebuild. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	5 days ago
pingqiu	3f0048cbd9	feat: add bounded CatchUp budget and Rebuild mode state machine (Phase 4.5 P0) Bounded CatchUp: - CatchUpBudget: MaxDurationTicks, MaxEntries, ProgressDeadlineTicks - BudgetCheck: runtime consumption tracker (StartTick, EntriesReplayed, LastProgressTick) - Sender.CheckBudget: evaluates budget, escalates to NeedsRebuild on violation - RecordCatchUpProgressAt: tracks progress tick for stall detection - BeginCatchUp accepts optional startTick for budget tracking Rebuild state machine: - RebuildSource: snapshot_tail (preferred) vs full_base (fallback) - RebuildPhase: init → source_select → transfer → tail_replay → completed|aborted - SelectSource: chooses based on snapshot availability - Phase ordering enforced, transfer regression rejected - ReadyToComplete validates target reached 13 new tests: budget enforcement (duration, entries, stall, no-budget), sender budget integration, rebuild lifecycle (snapshot+tail, full base, abort, phase order, regression), E2E bounded catch-up → rebuild. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	5 days ago
pingqiu	90c39b549d	feat: add prototype scenario closure (Phase 04 P4) Maps V2 acceptance criteria A1-A7, A10 to enginev2 prototype evidence. Adds 4 V2-boundary scenarios against the prototype. Scenario tests: - A1: committed data survives promotion (WAL truncation boundary) - A2: uncommitted data truncated, not revived - A3: stale epoch fenced at sender + session + assignment layers - A4: short-gap catch-up with WAL-backed proof + data verification - A5: unrecoverable gap escalates to NeedsRebuild with proof - A6: recoverability boundary exact (tail +/- 1 LSN) - A7: historical data correct after tail advancement (snapshot) - A10: changed-address → invalidation → new assignment → recovery V2-boundary scenarios: - NeedsRebuild persists across topology update - catch-up does not overwrite safe data - 5 disconnect/reconnect cycles preserve sender identity - full V2 harness: 3 replicas, 3 outcomes (zero-gap, catch-up, rebuild) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	5 days ago
pingqiu	942a0b7da7	fix: strengthen IsRecoverable contiguity check and StateAt snapshot correctness IsRecoverable now verifies three conditions: - startExclusive >= tailLSN (not recycled) - endInclusive <= headLSN (within WAL) - all LSNs in range exist contiguously (no holes) StateAt now uses base snapshot captured during AdvanceTail: - returns nil for LSNs before snapshot boundary (unreconstructable) - correctly includes block state from recycled entries via snapshot 5 new tests: end-beyond-head, missing entries, state after tail advance, nil before snapshot, block last written before tail. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	5 days ago
pingqiu	c89709e47e	feat: add WAL history model and recoverability proof (Phase 04 P3) Adds minimal historical-data prototype to enginev2: - WALHistory: retained-prefix model with Append, Commit, AdvanceTail, Truncate, EntriesInRange, IsRecoverable, StateAt - MakeHandshakeResult connects WAL state to outcome classification - RecordTruncation execution API for divergent tail cleanup - CompleteSessionByID gates on truncation when required - Zero-gap requires exact equality (FlushedLSN == CommittedLSN) - Replica-ahead classified as CatchUp with mandatory truncation 15 new tests: WAL basics, provable recoverability, unprovable gap, exact boundary, truncation enforcement, WAL-backed end-to-end recovery with data verification. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	6 days ago
pingqiu	edec7098e8	feat: add V2 protocol simulator and enginev2 sender/session prototype Adds sw-block/ directory with: - distsim: protocol correctness simulator (96 tests) - cluster model with epoch fencing, barrier semantics, commit modes - endpoint identity, control-plane flow, candidate eligibility - timeout events, timer races, same-tick ordering - session ownership tracking with ID-based stale fencing - enginev2: standalone V2 sender/session implementation (63 tests) - per-replica Sender with identity-preserving reconciliation - RecoverySession with FSM phase transitions and session ID - execution APIs: BeginConnect, RecordHandshake, BeginCatchUp, RecordCatchUpProgress, CompleteSessionByID — all sender-authority-gated - recovery outcome branching: zero-gap, catch-up, needs-rebuild - assignment-intent orchestration with epoch fencing - design docs: acceptance criteria, open questions, first-slice spec, protocol development process Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	6 days ago

34 Commits (cd8bfb21d4a5d039b91dcd1cce71ff67ba5b2e4f)
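
Several of the commits above (ff7ea41099, cc8c529962, 332f598606) describe how RetainedHistory drives the recoverability check and the rebuild-source decision. The Go sketch below illustrates those rules as the commit messages state them; it is not the code in history.go. The struct layout, the method signatures, and the convention that CheckpointLSN == 0 means "no checkpoint" are assumptions made for illustration. The contiguity (no holes) check that 942a0b7da7 adds to the prototype's IsRecoverable is left out here.

```go
package main

import "fmt"

// RebuildSource names follow the commit messages (snapshot_tail, full_base).
type RebuildSource string

const (
	SnapshotTail RebuildSource = "snapshot_tail"
	FullBase     RebuildSource = "full_base"
)

// RetainedHistory is a simplified, hypothetical stand-in for the engine type
// of the same name. The field set is an assumption, not the real definition.
type RetainedHistory struct {
	TailLSN           uint64 // oldest LSN still retained in the WAL
	HeadLSN           uint64 // newest LSN present in the WAL
	CommittedLSN      uint64 // highest committed LSN
	CheckpointLSN     uint64 // 0 is treated as "no checkpoint" (assumption)
	CheckpointTrusted bool
}

// IsRecoverable reports whether the range (startExclusive, endInclusive]
// lies inside the retained WAL: the start has not been recycled and the
// end does not run past the head.
func (h RetainedHistory) IsRecoverable(startExclusive, endInclusive uint64) bool {
	return startExclusive >= h.TailLSN && endInclusive <= h.HeadLSN
}

// RebuildSourceDecision returns the rebuild source plus a causal reason.
// snapshot_tail needs BOTH a trusted checkpoint AND a replayable tail
// (CheckpointLSN >= TailLSN and CommittedLSN <= HeadLSN); every other case
// falls back to full_base with one of the reasons logged by
// plan_rebuild_full_base.
func (h RetainedHistory) RebuildSourceDecision() (RebuildSource, string) {
	switch {
	case h.CheckpointLSN == 0:
		return FullBase, "no_checkpoint"
	case !h.CheckpointTrusted:
		return FullBase, "untrusted_checkpoint"
	case h.CheckpointLSN >= h.TailLSN && h.CommittedLSN <= h.HeadLSN:
		return SnapshotTail, "trusted_checkpoint_replayable_tail" // reason string is hypothetical
	default:
		return FullBase, "trusted_checkpoint_unreplayable_tail"
	}
}

func main() {
	// The E3/FC5 case from 56afa55f13: checkpoint=50, tail=80.
	h := RetainedHistory{TailLSN: 80, HeadLSN: 100, CommittedLSN: 100,
		CheckpointLSN: 50, CheckpointTrusted: true}
	src, reason := h.RebuildSourceDecision()
	fmt.Println(src, reason) // prints "full_base trusted_checkpoint_unreplayable_tail"
}
```

With checkpoint=50 and tail=80, the WAL entries between the checkpoint and the tail have already been recycled, so the sketch selects full_base even though the checkpoint is trusted. This is the regression case named in cc8c529962.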