
feat: add V2 protocol simulator and enginev2 sender/session prototype

Adds sw-block/ directory with:

- distsim: protocol correctness simulator (96 tests)
  - cluster model with epoch fencing, barrier semantics, commit modes
  - endpoint identity, control-plane flow, candidate eligibility
  - timeout events, timer races, same-tick ordering
  - session ownership tracking with ID-based stale fencing

- enginev2: standalone V2 sender/session implementation (63 tests)
  - per-replica Sender with identity-preserving reconciliation
  - RecoverySession with FSM phase transitions and session ID
  - execution APIs: BeginConnect, RecordHandshake, BeginCatchUp,
    RecordCatchUpProgress, CompleteSessionByID — all sender-authority-gated
  - recovery outcome branching: zero-gap, catch-up, needs-rebuild
  - assignment-intent orchestration with epoch fencing

- design docs: acceptance criteria, open questions, first-slice spec,
  protocol development process

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Branch: feature/sw-block
Author: pingqiu, 2 days ago
Commit: edec7098e8
  1. sw-block/.gocache_v2/README (+4)
  2. sw-block/.gocache_v2/trim.txt (+1)
  3. sw-block/.private/README.md (+27)
  4. sw-block/.private/phase/README.md (+36)
  5. sw-block/.private/phase/phase-01-decisions.md (+97)
  6. sw-block/.private/phase/phase-01-log.md (+67)
  7. sw-block/.private/phase/phase-01-v2-scenarios.md (+11)
  8. sw-block/.private/phase/phase-01.md (+164)
  9. sw-block/.private/phase/phase-02-decisions.md (+51)
  10. sw-block/.private/phase/phase-02-log.md (+93)
  11. sw-block/.private/phase/phase-02.md (+191)
  12. sw-block/.private/phase/phase-03-decisions.md (+97)
  13. sw-block/.private/phase/phase-03-log.md (+36)
  14. sw-block/.private/phase/phase-03.md (+193)
  15. sw-block/.private/phase/phase-04-decisions.md (+97)
  16. sw-block/.private/phase/phase-04-log.md (+46)
  17. sw-block/.private/phase/phase-04.md (+153)
  18. sw-block/.private/phase/phase-04a-decisions.md (+49)
  19. sw-block/.private/phase/phase-04a-log.md (+22)
  20. sw-block/.private/phase/phase-04a.md (+113)
  21. sw-block/README.md (+18)
  22. sw-block/design/README.md (+26)
  23. sw-block/design/protocol-development-process.md (+288)
  24. sw-block/design/protocol-version-simulation.md (+252)
  25. sw-block/design/v1-v15-v2-comparison.md (+314)
  26. sw-block/design/v1-v15-v2-simulator-goals.md (+281)
  27. sw-block/design/v2-acceptance-criteria.md (+280)
  28. sw-block/design/v2-dist-fsm.md (+234)
  29. sw-block/design/v2-first-slice-sender-ownership.md (+159)
  30. sw-block/design/v2-first-slice-session-ownership.md (+193)
  31. sw-block/design/v2-open-questions.md (+161)
  32. sw-block/design/v2-prototype-roadmap-and-gates.md (+239)
  33. sw-block/design/v2-scenario-sources-from-v1.md (+249)
  34. sw-block/design/v2_scenarios.md (+638)
  35. sw-block/design/wal-replication-v2-orchestrator.md (+359)
  36. sw-block/design/wal-replication-v2-state-machine.md (+632)
  37. sw-block/design/wal-replication-v2.md (+401)
  38. sw-block/design/wal-v1-to-v2-mapping.md (+349)
  39. sw-block/design/wal-v2-tiny-prototype.md (+277)
  40. sw-block/private/README.md (+14)
  41. sw-block/prototype/README.md (+23)
  42. sw-block/prototype/distsim/cluster.go (+1120)
  43. sw-block/prototype/distsim/cluster_test.go (+1004)
  44. sw-block/prototype/distsim/distsim.test.exe (BIN)
  45. sw-block/prototype/distsim/eventsim.go (+266)
  46. sw-block/prototype/distsim/phase02_advanced_test.go (+213)
  47. sw-block/prototype/distsim/phase02_candidate_test.go (+445)
  48. sw-block/prototype/distsim/phase02_network_test.go (+371)
  49. sw-block/prototype/distsim/phase02_test.go (+359)
  50. sw-block/prototype/distsim/phase02_v1_failures_test.go (+434)
  51. sw-block/prototype/distsim/phase03_p2_race_test.go (+287)
  52. sw-block/prototype/distsim/phase03_race_test.go (+281)
  53. sw-block/prototype/distsim/phase03_timeout_test.go (+333)
  54. sw-block/prototype/distsim/phase04a_ownership_test.go (+243)
  55. sw-block/prototype/distsim/protocol.go (+102)
  56. sw-block/prototype/distsim/protocol_test.go (+84)
  57. sw-block/prototype/distsim/random.go (+256)
  58. sw-block/prototype/distsim/random_test.go (+43)
  59. sw-block/prototype/distsim/reference.go (+95)
  60. sw-block/prototype/distsim/reference_test.go (+66)
  61. sw-block/prototype/distsim/simulator.go (+581)
  62. sw-block/prototype/distsim/simulator_test.go (+285)
  63. sw-block/prototype/distsim/storage.go (+129)
  64. sw-block/prototype/enginev2/assignment.go (+64)
  65. sw-block/prototype/enginev2/execution_test.go (+420)
  66. sw-block/prototype/enginev2/go.mod (+3)
  67. sw-block/prototype/enginev2/outcome.go (+39)
  68. sw-block/prototype/enginev2/p2_test.go (+482)
  69. sw-block/prototype/enginev2/sender.go (+347)
  70. sw-block/prototype/enginev2/sender_group.go (+119)
  71. sw-block/prototype/enginev2/sender_group_test.go (+203)
  72. sw-block/prototype/enginev2/sender_test.go (+407)
  73. sw-block/prototype/enginev2/session.go (+151)
  74. sw-block/prototype/fsmv2/apply.go (+162)
  75. sw-block/prototype/fsmv2/events.go (+37)
  76. sw-block/prototype/fsmv2/fsm.go (+73)
  77. sw-block/prototype/fsmv2/fsm_test.go (+95)
  78. sw-block/prototype/fsmv2/fsmv2.test.exe (BIN)
  79. sw-block/prototype/run-tests.ps1 (+37)
  80. sw-block/prototype/volumefsm/events.go (+148)
  81. sw-block/prototype/volumefsm/format.go (+38)
  82. sw-block/prototype/volumefsm/model.go (+142)
  83. sw-block/prototype/volumefsm/model_test.go (+421)
  84. sw-block/prototype/volumefsm/recovery.go (+70)
  85. sw-block/prototype/volumefsm/scenario.go (+61)
  86. sw-block/prototype/volumefsm/volumefsm.test.exe (BIN)
  87. sw-block/test/README.md (+17)
  88. sw-block/test/test_db.md (+1675)
  89. sw-block/test/test_db_v2.md (+105)
  90. sw-block/test/v2_selected.md (+115)

sw-block/.gocache_v2/README
@@ -0,0 +1,4 @@
This directory holds cached build artifacts from the Go build system.
Run "go clean -cache" if the directory is getting too large.
Run "go clean -fuzzcache" to delete the fuzz cache.
See go.dev to learn more about Go.

sw-block/.gocache_v2/trim.txt
@@ -0,0 +1 @@
1774577367

sw-block/.private/README.md
@@ -0,0 +1,27 @@
# .private
Private working area for `sw-block`.
Use this for:
- phase development notes
- roadmap/progress tracking
- draft handoff notes
- temporary design comparisons
- prototype scratch work not ready for `design/` or `prototype/`
Recommended layout:
- `.private/phase/`: phase-by-phase development notes
- `.private/roadmap/`: short-term and medium-term execution notes
- `.private/handoff/`: notes for `sw`, `qa`, or future sessions
Phase protocol:
- each phase should normally have:
  - `phase-xx.md`
  - `phase-xx-log.md`
  - `phase-xx-decisions.md`
- details are defined in `.private/phase/README.md`
Promotion rules:
- stable vision/design docs go to `../design/`
- real prototype code stays in `../prototype/`
- `.private/` is for working material, not source of truth

sw-block/.private/phase/README.md
@@ -0,0 +1,36 @@
# Phase Dev
Use this directory for private phase development notes.
## Phase Protocol
Each phase should use this file set:
- `phase-01.md`
  - plan
  - scope
  - progress
  - active tasks
  - exit criteria
- `phase-01-log.md`
  - dated development log
  - experiments
  - test runs
  - failures and findings
- `phase-01-decisions.md`
  - key algorithm decisions
  - tradeoffs
  - rejected alternatives
Suggested naming pattern:
- `phase-01.md`
- `phase-01-log.md`
- `phase-01-decisions.md`
- `phase-02.md`
- `phase-02-log.md`
- `phase-02-decisions.md`
Rule of use:
1. if it is what we are doing -> `phase-xx.md`
2. if it is what happened -> `phase-xx-log.md`
3. if it is why we chose something -> `phase-xx-decisions.md`

sw-block/.private/phase/phase-01-decisions.md
@@ -0,0 +1,97 @@
# Phase 01 Decisions
Date: 2026-03-26
Status: active
## Purpose
Capture the key design decisions made during Phase 01 simulator work.
## Initial Decisions
### 1. `design/` vs `.private/phase/`
Decision:
- `sw-block/design/` holds shared design truth
- `sw-block/.private/phase/` holds execution planning and progress
Reason:
- design backlog and execution checklist should not be mixed
### 2. Scenario source of truth
Decision:
- `sw-block/design/v2_scenarios.md` is the scenario backlog and coverage matrix
Reason:
- all contributors need one visible scenario list
### 3. Phase 01 priority
Decision:
- first close:
  - `S19`
  - `S20`
Reason:
- they are the biggest remaining distributed lineage/partition scenarios
### 4. Current simulator scope
Decision:
- use the simulator as a V2 design-validation tool, not a product/perf harness
Reason:
- current goal is correctness and protocol coverage, not productization
### 5. Phase execution format
Decision:
- keep phase execution in three files:
- `phase-xx.md`
- `phase-xx-log.md`
- `phase-xx-decisions.md`
Reason:
- separates plan, evidence, and reasoning
- reduces drift between roadmap and findings
### 6. Design backlog vs execution plan
Decision:
- `sw-block/design/v2_scenarios.md` remains the source of truth for scenario backlog and coverage
- `.private/phase/phase-01.md` is the execution layer for `sw`
Reason:
- design truth should be stable and shareable
- execution tasks should be easier to edit without polluting design docs
### 7. Immediate Phase 01 priorities
Decision:
- prioritize:
  - `S19` chain of custody across multiple promotions
  - `S20` live partition with competing writes
Reason:
- these are the biggest remaining distributed-lineage gaps after current simulator milestone
### 8. Coverage status should be conservative
Decision:
- mark scenarios as `partial` unless the test actually exercises the core protocol obligation rather than just a simplified happy path
Reason:
- avoids overstating simulator coverage
- keeps the backlog honest for follow-up strengthening
### 9. Protocol-version comparison belongs in the simulator
Decision:
- compare `V1`, `V1.5`, and `V2` using the same scenario set where possible
Reason:
- this is the clearest way to show:
  - where V1 breaks
  - where V1.5 improves but still strains
  - why V2 is architecturally cleaner

sw-block/.private/phase/phase-01-log.md
@@ -0,0 +1,67 @@
# Phase 01 Log
Date: 2026-03-26
Status: active
## Log Protocol
Use dated entries like:
## 2026-03-26
- work completed
- tests run
- failures found
- seeds/traces worth keeping
- follow-up items
## Initial State
- Phase 01 created from the earlier `phase-01-v2-scenarios.md` working note
- scenario source of truth remains:
- `sw-block/design/v2_scenarios.md`
- current active asks for `sw`:
- `S19`
- `S20`
## 2026-03-26
- created Phase 01 file set:
- `phase-01.md`
- `phase-01-log.md`
- `phase-01-decisions.md`
- promoted scenario execution checklist into `phase-01.md`
- kept `sw-block/design/v2_scenarios.md` as the shared backlog and coverage matrix
- current simulator milestone:
- `fsmv2` passing
- `volumefsm` passing
- `distsim` passing
- randomized `distsim` seeds passing
- event/interleaving simulator work present in `sw-block/prototype/distsim/simulator.go`
- current immediate development priority for `sw`:
- implement `S19`
- implement `S20`
- `sw` added Phase 01 P0/P1 scenario tests in `distsim`:
- `S19`
- `S20`
- `S5`
- `S6`
- `S18`
- stronger `S12`
- review result:
- `S19` looks solid
- stronger `S12` now looks solid
- `S20`, `S5`, `S6`, `S18` are better classified as `partial` than fully closed
- updated `v2_scenarios.md` coverage matrix to reflect actual status
- next development focus:
- P2 scenarios
- stronger versions of current partial scenarios
- added protocol-version comparison design:
- `sw-block/design/protocol-version-simulation.md`
- added minimal protocol policy prototype in `distsim`:
  - `ProtocolV1`
  - `ProtocolV15`
  - `ProtocolV2`
- focused on:
  - catch-up policy
  - tail-chasing outcome policy
  - restart/rejoin policy

sw-block/.private/phase/phase-01-v2-scenarios.md
@@ -0,0 +1,11 @@
# Deprecated
This file is deprecated.
Use instead:
- `phase-01.md`
- `phase-01-log.md`
- `phase-01-decisions.md`
The scenario source of truth remains:
- `sw-block/design/v2_scenarios.md`

sw-block/.private/phase/phase-01.md
@@ -0,0 +1,164 @@
# Phase 01
Date: 2026-03-26
Status: completed
Purpose: drive V2 simulator development by closing the scenario backlog in `sw-block/design/v2_scenarios.md`
## Goal
Make the V2 simulator cover the important protocol scenarios as explicitly as possible.
This phase is about:
- simulator fidelity
- scenario coverage
- invariant quality
This phase is not about:
- product integration
- SPDK
- raw allocator
- production transport
## Source Of Truth
Design/source-of-truth:
- `sw-block/design/v2_scenarios.md`
Prototype code:
- `sw-block/prototype/fsmv2/`
- `sw-block/prototype/volumefsm/`
- `sw-block/prototype/distsim/`
## Assigned Tasks For `sw`
### P0
1. `S19` chain of custody across multiple promotions
- add fixed test(s)
- verify committed data from `A -> B -> C`
- update coverage matrix
2. `S20` live partition with competing writes
- add fixed test(s)
- stale side must not advance committed lineage
- update coverage matrix
### P1
3. `S5` flapping replica stays recoverable
- repeated disconnect/reconnect
- no unnecessary rebuild while recovery remains possible
4. `S6` tail-chasing under load
- primary keeps writing while replica catches up
- explicit outcome:
- converge and promote
- or abort to rebuild
5. `S18` primary restart without failover
- same-lineage restart behavior
- no stale session assumptions
6. stronger `S12`
- more than one promotion candidate
- choose valid lineage, not merely highest apparent LSN
### P2
7. protocol-version comparison support
- model:
- `V1`
- `V1.5`
- `V2`
- use the same scenario set to show:
- V1 breaks
- V1.5 improves but still strains
- V2 handles recovery more explicitly
8. richer Smart WAL scenarios
- time-varying `ExtentReferenced` availability
- recoverable then unrecoverable transitions
9. delayed/drop network scenarios beyond simple disconnect
10. multi-node reservation expiry / rebuild timeout cases
## Invariants To Preserve
After every scenario or random run, preserve:
1. committed data is durable per policy
2. uncommitted data is not revived as committed
3. stale epoch traffic does not mutate current lineage
4. recovered/promoted node matches reference state at target `LSN`
5. committed prefix remains contiguous
## Required Updates Per Task
For each completed scenario:
1. add or update test(s)
2. update `sw-block/design/v2_scenarios.md`
- package
- test name
- status
3. note any missing simulator capability
## Current Progress
Already in place before this phase:
- `fsmv2` local FSM prototype
- `volumefsm` orchestrator prototype
- `distsim` distributed simulator
- randomized `distsim` runs
- first event/interleaving simulator work in `distsim/simulator.go`
Open focus:
- `S19` covered in `distsim`
- `S20` partially covered in `distsim`
- `S5` partially covered in `distsim`
- `S6` partially covered in `distsim`
- `S18` partially covered in `distsim`
- stronger `S12` covered in `distsim`
- protocol-version comparison design added in:
- `sw-block/design/protocol-version-simulation.md`
- remaining focus is now P2 plus stronger versions of partial scenarios
## Phase Status
### P0
- `S19` chain of custody across multiple promotions: done
- `S20` live partition with competing writes: partial
### P1
- `S5` flapping replica stays recoverable: partial
- `S6` tail-chasing under load: partial
- `S18` primary restart without failover: partial
- stronger `S12`: done
### P2
- active next step:
- protocol-version comparison support
- stronger versions of current partial scenarios
## Exit Criteria
Phase 01 is done when:
1. `S19` and `S20` are covered
2. `S5`, `S6`, `S18`, and stronger `S12` are at least partially covered
3. coverage matrix in `v2_scenarios.md` is current
4. random simulation still passes after added scenarios
## Completion Note
Phase 01 completed with:
- `S19` covered
- stronger `S12` covered
- `S20`, `S5`, `S6`, `S18` strengthened but correctly left as `partial`
Next execution phase:
- `sw-block/.private/phase/phase-02.md`

sw-block/.private/phase/phase-02-decisions.md
@@ -0,0 +1,51 @@
# Phase 02 Decisions
Date: 2026-03-26
Status: active
## Decision 1: Extend `distsim` Instead Of Forking A New Protocol Simulator
Reason:
- current `distsim` already has:
- node/storage model
- coordinator/epoch model
- reference oracle
- randomized runs
- the missing layer is protocol-state fidelity, not a new simulation foundation
Implication:
- add lightweight per-node replication state and protocol decisions to `distsim`
- do not build a separate fourth simulator yet
## Decision 2: Keep Coverage Status Conservative
Reason:
- `S20`, `S6`, and `S18` currently prove important safety properties
- but they do not yet fully assert message-level or explicit state-transition behavior
Implication:
- leave them `partial` until the model can assert protocol behavior directly
## Decision 3: Use Versioned Scenario Comparison To Justify V2
Reason:
- the simulator should not only say "V2 works"
- it should show:
- where `V1` fails
- where `V1.5` improves but still strains
- why `V2` is worth the complexity
Implication:
- Phase 02 includes explicit `V1` / `V1.5` / `V2` scenario comparison work
## Decision 4: V2 Must Not Be Described As "Always Catch-Up"
Reason:
- that wording is too optimistic and hides the real V2 design rule
- V2 is better because it makes recoverability explicit, not because it retries forever
Implication:
- describe V2 as:
- catch-up if explicitly recoverable
- otherwise explicit rebuild
- keep this wording consistent in tests and docs

sw-block/.private/phase/phase-02-log.md
@@ -0,0 +1,93 @@
# Phase 02 Log
Date: 2026-03-26
Status: active
## 2026-03-26
- Phase 02 created to move `distsim` from final-state safety validation toward explicit protocol-state simulation.
- Initial focus:
- close `S20`, `S6`, and `S18` at protocol level
- compare `V1`, `V1.5`, and `V2` on the same scenarios
- Known model gap at phase start:
- current `distsim` is strong at final-state safety invariants
- current `distsim` is weaker at mid-flow protocol assertions and message-level rejection reasons
- Phase 02 progress now in place:
- delivery accept/reject tracking
- protocol-level stale-epoch rejection assertions
- explicit non-convergent catch-up state transition assertions
- initial version-comparison tests for disconnect, tail-chasing, and restart/rejoin policy
- Next simulator target:
- reproduce real `V1.5` address-instability and control-plane-recovery failures as named scenarios
- Immediate coding asks for `sw`:
- changed-address restart failure in `V1.5`
- same-address transient outage comparison across `V1` / `V1.5` / `V2`
- slow control-plane reassignment scenario derived from `CP13-8 T4b`
- Local housekeeping done:
- corrected V2 wording from "always catch-up" to "catch-up if explicitly recoverable; otherwise rebuild"
- added explicit brief-disconnect and changed-address restart policy helpers
- verified `distsim` test suite still passes with the Windows-safe runner
- Scenario status update:
- `S20` now covered via protocol-level stale-traffic rejection + committed-prefix stability
- `S6` now covered via explicit `CatchingUp -> NeedsRebuild` assertions
- `S18` now covered via explicit stale `MsgBarrierAck` rejection + prefix stability
- Next asks for `sw` after this closure:
- changed-address restart scenario tied directly to `CP13-8 T4b`
- same-address transient outage comparison across `V1` / `V1.5` / `V2`
- slow control-plane reassignment scenario
- Smart WAL recoverable -> unrecoverable transition scenarios
- Additional closure completed:
- `S5` now covered with both:
- repeated recoverable flapping
- budget-exceeded escalation to `NeedsRebuild`
- Smart WAL transitions now exercised with:
- recoverable -> unrecoverable during active recovery
- mixed `WALInline` + `ExtentReferenced` success
- time-varying payload availability
- Updated next asks for `sw`:
- changed-address restart scenario tied directly to `CP13-8 T4b`
- same-address transient outage comparison across `V1` / `V1.5` / `V2`
- slow control-plane reassignment scenario
- delayed/drop network beyond simple disconnect
- multi-node reservation expiry / rebuild timeout cases
- Additional Phase 02 coverage delivered:
- delayed stale messages after promote/failover
- delayed stale barrier ack rejection
- selective write-drop with barrier delivery under `sync_all`
- multi-node mixed reservation expiry outcome
- multi-node `NeedsRebuild` / snapshot rebuild recovery
- partial rebuild timeout / retry completion
- Remaining asks are now narrower:
- changed-address restart scenario tied directly to `CP13-8 T4b`
- same-address transient outage comparison across `V1` / `V1.5` / `V2`
- slow control-plane reassignment scenario
- stronger coordinator candidate-selection scenarios
- Additional closure after review:
- safe default promotion selector now refuses `NeedsRebuild` candidates
- explicit desperate-promotion API separated from safe selection
- changed-address and slow-control-plane comparison tests now prove actual data divergence / healing, not only policy shape
- New next-step assignment:
- strengthen model depth around endpoint identity and control-plane reassignment
- replace abstract repair helpers with more explicit event flow where practical
- reduce direct recovery state injection in comparison tests
- extend candidate selection from ranking into validity rules
## 2026-03-27
- Phase 02 core simulator hardening is effectively complete.
- Delivered since the previous checkpoint:
- endpoint identity / endpoint-version modeling
- stale-endpoint rejection in delivery path
- heartbeat -> coordinator detect -> assignment-update control-plane flow
- recovery-session trigger API for `V1.5` and `V2`
- explicit candidate eligibility checks:
  - running
  - epoch alignment
  - state eligibility
  - committed-prefix sufficiency
- safe default promotion now rejects candidates without the committed prefix
- Current `distsim` status at latest review:
- 73 tests passing
- Manager bookkeeping decision:
- keep Phase 02 active only for doc maintenance / wrap-up
- treat further simulator depth as likely Phase 03 work, not unbounded Phase 02 scope creep

sw-block/.private/phase/phase-02.md
@@ -0,0 +1,191 @@
# Phase 02
Date: 2026-03-27
Status: active
Purpose: extend the V2 simulator from final-state safety checking into protocol-state simulation that can reproduce `V1`, `V1.5`, and `V2` behavior on the same scenarios
## Goal
Make the simulator model enough node-local replication state and message-level behavior to:
1. reproduce `V1` / `V1.5` failure modes
2. show why those failures are structural
3. close the current `partial` V2 scenarios with stronger protocol assertions
This phase is about:
- protocol-version comparison
- per-node replication state
- message-level fencing / accept / reject behavior
- explicit catch-up abort / rebuild transitions
This phase is not about:
- product integration
- production transport
- SPDK
- raw allocator
## Source Of Truth
Design/source-of-truth:
- `sw-block/design/v2_scenarios.md`
- `sw-block/design/protocol-version-simulation.md`
- `sw-block/design/v1-v15-v2-simulator-goals.md`
Prototype code:
- `sw-block/prototype/distsim/`
## Assigned Tasks For `sw`
### P0
1. Add per-node replication state to `distsim`
- minimum states:
- `InSync`
- `Lagging`
- `CatchingUp`
- `NeedsRebuild`
- `Rebuilding`
- keep state lightweight; do not clone full `fsmv2` into `distsim`
2. Add message-level protocol decisions
- stale-epoch write / ship / barrier traffic must be explicitly rejected
- record whether a message was:
- accepted
- rejected by epoch
- rejected by state
3. Add explicit catch-up abort / rebuild entry
- non-convergent catch-up must move to explicit modeled failure:
- `NeedsRebuild`
- or equivalent abort outcome
### P1
4. Re-close `S20` at protocol level
- stale-side writes must go through protocol delivery path
- prove stale-side traffic cannot advance committed lineage
5. Re-close `S6` at protocol level
- assert explicit abort/escalation on non-convergence
- not only final-state safety
6. Re-close `S18` at protocol level
- assert committed-prefix behavior around delayed old ack / restart races
- not only final-state oracle checks
### P2
7. Expand protocol-version comparison
- run selected scenarios under:
- `V1`
- `V1.5`
- `V2`
- at minimum:
- brief disconnect
- restart with changed address
- tail-chasing
8. Add V1.5-derived failure scenarios
- replica restart with changed receiver address
- same-address transient outage
- slow control-plane recovery vs fast local reconnect
9. Prepare richer recovery modeling
- time-varying recoverability
- reservation loss during active catch-up
- rebuild timeout / retry in mixed-state cluster
## Invariants To Preserve
After every scenario or random run, preserve:
1. committed data is durable per policy
2. uncommitted data is not revived as committed
3. stale epoch traffic does not mutate current lineage
4. recovered/promoted node matches reference state at target `LSN`
5. committed prefix remains contiguous
6. protocol-state transitions are explicit, not inferred from final data only
## Required Updates Per Task
For each completed task:
1. add or update test(s)
2. update `sw-block/design/v2_scenarios.md`
- package
- test name
- status
- source if new scenario was derived from V1/V1.5 behavior
3. add a short note to:
- `sw-block/.private/phase/phase-02-log.md`
4. if a design choice changed, record it in:
- `sw-block/.private/phase/phase-02-decisions.md`
## Current Progress
Already in place before this phase:
- `distsim` final-state safety invariants
- randomized simulation
- event/interleaving simulator work
- initial `ProtocolVersion` / policy scaffold
- `S19` covered
- stronger `S12` covered
Known partials to close in this phase:
- none in the current named backlog slice
Delivered in this phase so far:
- delivery accept/reject tracking added
- protocol-level rejection assertions added
- explicit `CatchingUp -> NeedsRebuild` state transition tested
- selected protocol-version comparison tests added
- `S20`, `S6`, and `S18` moved from `partial` to `covered`
- Smart WAL transition scenarios added
- `S5` moved from `partial` to `covered`
- endpoint identity / endpoint-version modeling added
- explicit heartbeat -> detect -> assignment-update control-plane flow added for changed-address restart
- explicit recovery-session triggers added for `V1.5` and `V2`
- promotion selection now uses explicit eligibility, including committed-prefix gating
- safe and desperate promotion paths are separated
- full `distsim` suite at latest review: 73 tests passing
Remaining focus for `sw`:
- Phase 02 core scope is now largely delivered
- remaining work should be treated as future-strengthening, not baseline closure
- if more simulator depth is needed next, it should likely start as Phase 03:
- timeout semantics
- timer races
- richer event/interleaving behavior
- stronger endpoint/control-plane realism beyond the current abstract model
## Immediate Next Tasks For `sw`
1. Add a documented compare artifact for new scenarios
- for each new `V1` / `V1.5` / `V2` comparison:
- record scenario name
- what fails in `V1`
- what improves in `V1.5`
- what is explicit in `V2`
- keep `sw-block/design/v1-v15-v2-comparison.md` updated
2. Keep the coverage matrix honest
- do not mark a scenario `covered` unless the test asserts protocol behavior directly
- final-state oracle checks alone are not enough
3. Prepare Phase 03 proposal instead of broadening ad hoc
- if more depth is needed, define it cleanly first:
- timers / timeout events
- event ordering races
- richer endpoint lifecycle
- recovery-session uniqueness across competing triggers
## Exit Criteria
Phase 02 is done when:
1. `S5`, `S6`, `S18`, and `S20` are covered at protocol level
2. `distsim` can reproduce at least one `V1` failure, one `V1.5` failure, and the corresponding `V2` behavior on the same named scenario
3. protocol-level rejection/accept behavior is asserted in tests, not only inferred from final-state oracle checks
4. coverage matrix in `v2_scenarios.md` is current
5. changed-address and reconnect scenarios are modeled through explicit endpoint / control-plane behavior rather than helper-only abstraction
6. promotion selection uses explicit eligibility, including committed-prefix safety

sw-block/.private/phase/phase-03-decisions.md
@@ -0,0 +1,97 @@
# Phase 03 Decisions
Date: 2026-03-27
Status: initial
## Why Phase 03 Exists
Phase 02 already covered the main protocol-state story:
- V1 / V1.5 / V2 comparison
- stale traffic rejection
- catch-up vs rebuild
- changed-address restart control-plane flow
- committed-prefix-safe promotion eligibility
The next simulator problems are different:
- timer semantics
- timeout races
- event ordering under contention
That deserves a separate phase so the model boundary stays clear.
## Initial Boundary
### `distsim`
Keep for:
- protocol correctness
- reference-state validation
- recoverability logic
- promotion / lineage rules
### `eventsim`
Grow for:
- explicit event queue behavior
- timeout events
- equal-time scheduling choices
- race exploration
## Working Rule
Do not move all scenarios into `eventsim`.
Only move or duplicate scenarios when:
- timer or event ordering is the real bug surface
- `distsim` abstraction hides the important behavior
## Accepted Phase 03 Decisions
### Same-tick rule
Within one tick:
- data/message delivery is evaluated before timeout firing
Meaning:
- if an ack arrives in the same tick as a timeout deadline, the ack wins and may cancel the timeout
This is now an explicit simulator rule, not accidental behavior.
### Timeout authority
Not every timeout that reaches its deadline still has authority to mutate state.
So we now distinguish:
- `FiredTimeouts`
- timeout had authority and changed the model
- `IgnoredTimeouts`
- timeout reached deadline but was stale and ignored
This keeps replay/debug output honest.
### Late barrier ack rule
Once a barrier instance times out:
- it is marked expired
- late ack for that barrier instance is rejected
That prevents a stale ack from reviving old durability state.
### Review gate rule for timer work
Timer/race work is easy to get subtly wrong while still having green tests.
So timer-related work is not accepted until:
- code path is reviewed
- tests assert the real protocol obligation
- stale and authoritative timer behavior are clearly distinguished

sw-block/.private/phase/phase-03-log.md
@@ -0,0 +1,36 @@
# Phase 03 Log
Date: 2026-03-27
Status: active
## 2026-03-27
- Phase 03 created after Phase 02 core scope was effectively delivered.
- Reason for new phase:
- remaining simulator work is about timer semantics and race behavior, not basic protocol-state coverage
- Initial target:
- define `distsim` vs `eventsim` split more clearly
- add explicit timeout semantics
- add timer-race scenarios without bloating `distsim` ad hoc
- P0 delivered:
- timeout model added for barrier / catch-up / reservation
- timeout-backed scenarios added
- same-tick ordering rule defined as data-before-timers
- First review result:
- timeout semantics accepted only after making cancellation model-driven
- late barrier ack after timeout required explicit rejection
- P0 hardening delivered:
- recovery timeout cancellation moved into model logic
- stale late barrier ack rejected via expired-barrier tracking
- stale vs authoritative timeout distinction added:
- `FiredTimeouts`
- `IgnoredTimeouts`
- P1 delivered and reviewed:
- promotion vs stale timeout race
- rebuild completion vs epoch bump race
- trace builder moved into reusable code
- Current suite state at latest accepted review:
- 86 `distsim` tests passing
- Manager decision:
- Phase 03 P0/P1 are accepted
- next work should move to deliberate P2 selection rather than broadening the phase ad hoc

193
sw-block/.private/phase/phase-03.md

@ -0,0 +1,193 @@
# Phase 03
Date: 2026-03-27
Status: active
Purpose: define the next simulator tier after Phase 02, focused on timeout semantics, timer races, and a cleaner split between protocol simulation and event/interleaving simulation
## Goal
Phase 03 exists to cover behavior that current `distsim` still abstracts away:
1. timeout semantics
2. timer races
3. event ordering under competing triggers
4. clearer separation between:
- protocol / lineage simulation
- event / race simulation
This phase should not reopen already-closed Phase 02 protocol scope unless a clear bug is found.
## Why A New Phase
Phase 02 already delivered:
- protocol-state assertions
- V1 / V1.5 / V2 comparison scenarios
- endpoint identity modeling
- control-plane assignment-update flow
- committed-prefix-aware promotion eligibility
What remains is different in character:
- timers
- delayed events racing with each other
- timeout-triggered state changes
- more explicit event scheduling
That deserves a new phase boundary.
## Source Of Truth
Design/source-of-truth:
- `sw-block/design/v2_scenarios.md`
- `sw-block/design/v2-dist-fsm.md`
- `sw-block/design/v2-scenario-sources-from-v1.md`
- `sw-block/design/v1-v15-v2-comparison.md`
Current prototype base:
- `sw-block/prototype/distsim/`
- `sw-block/prototype/distsim/simulator.go`
## Scope
### In scope
1. timeout semantics
- barrier timeout
- catch-up timeout
- reservation expiry timeout
- rebuild timeout
2. timer races
- delayed ack vs timeout
- timeout vs promotion
- reconnect vs timeout
- catch-up completion vs expiry
- rebuild completion vs epoch bump
3. simulator split clarification
- `distsim` keeps:
- protocol correctness
- lineage
- recoverability
- reference-state checking
- `eventsim` grows into:
- event scheduling
- timer firing
- same-time interleavings
- race exploration
### Out of scope
- production integration
- real transport
- real disk timings
- SPDK
- raw allocator
## Assigned Tasks For `sw`
### P0
1. Write a concrete `eventsim` scope note in code/docs
- define what stays in `distsim`
- define what moves to `eventsim`
- avoid overlap and duplicated semantics
2. Add minimal timeout event model
- first-class timeout event type(s)
- at minimum:
- barrier timeout
- catch-up timeout
- reservation expiry
3. Add timeout-backed scenarios
- stale delayed ack vs timeout
- catch-up timeout before convergence
- reservation expiry during active recovery
### P1
4. Add race-focused tests
- promotion vs delayed stale ack
- rebuild completion vs epoch bump
- reconnect success vs timeout firing
5. Keep traces debuggable
- failing runs must dump:
- seed
- event order
- timer events
- node states
- committed prefix
### P2
6. Decide whether selected `distsim` scenarios should also exist in `eventsim`
- only when timer/event ordering is the real point
- do not duplicate every scenario blindly
## Current Progress
Delivered in this phase so far:
- `eventsim` scope note added in code
- explicit timeout model added:
- barrier timeout
- catch-up timeout
- reservation timeout
- timeout-backed scenarios added and reviewed
- same-tick rule made explicit:
- data before timers
- recovery timeout cancellation is now model-driven, not test-driven
- stale barrier ack after timeout is explicitly rejected
- stale timeouts are separated from authoritative timeouts:
- `FiredTimeouts`
- `IgnoredTimeouts`
- race-focused scenarios added and reviewed:
- promotion vs stale catch-up timeout
- promotion vs stale barrier timeout
- rebuild completion vs epoch bump
- epoch bump vs stale catch-up timeout
- reusable trace builder added for replay/debug support
- current `distsim` suite at latest review:
- 86 tests passing
Remaining focus for `sw`:
- Phase 03 P0 and P1 are effectively complete
- Phase 03 P2 is also effectively complete after review
- any further simulator work should now be narrow and evidence-driven
- recommended next simulator additions only:
- control-plane latency parameter
- sustained-write convergence / tail-chasing load test
- one multi-promotion lineage extension
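The same-tick data-before-timers rule recorded above can be sketched as a deterministic event ordering (all names are illustrative, not the simulator's real scheduler):

```go
package main

import "sort"

// Event is a hypothetical scheduled simulator event.
type Event struct {
	Tick    int
	IsTimer bool // timer events yield to data events in the same tick
	Seq     int  // arrival order, used as a stable tiebreaker
}

// OrderTick sorts events so that, within a tick, data events are delivered
// before timer events (the "data before timers" rule), with arrival order
// as the final tiebreaker.
func OrderTick(events []Event) {
	sort.SliceStable(events, func(i, j int) bool {
		a, b := events[i], events[j]
		if a.Tick != b.Tick {
			return a.Tick < b.Tick
		}
		if a.IsTimer != b.IsTimer {
			return !a.IsTimer // data first
		}
		return a.Seq < b.Seq
	})
}
```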
## Invariants To Preserve
1. committed data remains durable per policy
2. uncommitted data is never revived as committed
3. stale epoch traffic never mutates current lineage
4. committed prefix remains contiguous
5. timeout-triggered transitions are explicit and explainable
6. races do not silently bypass fencing or rebuild boundaries
## Required Updates Per Task
For each completed task:
1. add or update tests
2. update `sw-block/design/v2_scenarios.md` if scenario coverage changed
3. add a short note to:
- `sw-block/.private/phase/phase-03-log.md`
4. if the simulator boundary changed, record it in:
- `sw-block/.private/phase/phase-03-decisions.md`
## Exit Criteria
Phase 03 is done when:
1. timeout semantics exist as explicit simulator behavior
2. at least three important timer-race scenarios are modeled and tested
3. `distsim` vs `eventsim` responsibilities are clearly separated
4. failure traces from race/timeout scenarios are replayable enough to debug

97
sw-block/.private/phase/phase-04-decisions.md

@ -0,0 +1,97 @@
# Phase 04 Decisions
Date: 2026-03-27
Status: initial
## First Slice Decision
The first standalone V2 implementation slice is:
- per-replica sender ownership
- one active recovery session per replica per epoch
## Why Not Start In V1
V1/V1.5 remain:
- the production line
- the maintenance/fix line
It should not be the place where V2 architecture is first implemented.
## Why This Slice
This slice:
- directly addresses the clearest V1.5 structural pain
- maps cleanly to the V2-boundary tests
- is narrow enough to implement without dragging in the entire future architecture
## Accepted P0 Refinements
### Sender epoch coherence
Sender-owned epoch is real state, not decoration.
So:
- reconcile/update paths must refresh sender epoch
- stale active session must be invalidated on epoch advance
### Session lifecycle
The first slice should not use a totally loose lifecycle shell.
So:
- session phase changes now follow an explicit transition map
- invalid jumps are rejected
### Session attach rule
Attaching a session at the wrong epoch is invalid.
So:
- `AttachSession(epoch, kind)` must reject epoch mismatch with the owning sender
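Taken together, the P0 refinements might look like this in Go (a sketch with illustrative phase names and a toy transition map, not the actual `enginev2` code):

```go
package main

import "errors"

// Phase is a hypothetical recovery-session phase.
type Phase string

const (
	PhaseIdle      Phase = "idle"
	PhaseConnect   Phase = "connect"
	PhaseHandshake Phase = "handshake"
	PhaseCatchUp   Phase = "catchup"
	PhaseDone      Phase = "done"
)

// transitions is an illustrative explicit transition map; invalid jumps
// are rejected rather than silently applied.
var transitions = map[Phase][]Phase{
	PhaseIdle:      {PhaseConnect},
	PhaseConnect:   {PhaseHandshake},
	PhaseHandshake: {PhaseCatchUp, PhaseDone}, // zero-gap fast path
	PhaseCatchUp:   {PhaseDone},
}

type Session struct{ phase Phase }

// Advance rejects any jump not present in the transition map.
func (s *Session) Advance(next Phase) error {
	for _, p := range transitions[s.phase] {
		if p == next {
			s.phase = next
			return nil
		}
	}
	return errors.New("invalid phase transition")
}

type Sender struct {
	Epoch   uint64
	session *Session
}

// AttachSession rejects attaching a session at the wrong epoch.
func (s *Sender) AttachSession(epoch uint64) error {
	if epoch != s.Epoch {
		return errors.New("epoch mismatch")
	}
	s.session = &Session{phase: PhaseIdle}
	return nil
}

// AdvanceEpoch refreshes the sender-owned epoch and invalidates any
// stale active session.
func (s *Sender) AdvanceEpoch(epoch uint64) {
	if epoch > s.Epoch {
		s.Epoch = epoch
		s.session = nil // stale active session loses authority
	}
}
```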
## Accepted P1 Refinements
### Session identity fencing
The standalone V2 slice must reject stale completion by explicit session identity.
So:
- `RecoverySession` has stable unique identity
- sender completion must be by session ID, not by "current pointer"
- stale session results are rejected at the sender authority boundary
### Ownership vs execution
Ownership creation is not the same as execution start.
So:
- `AttachSession()` and `SupersedeSession()` establish ownership only
- `BeginConnect()` is the first execution-state mutation
### Completion authority
An ID match alone is not enough to complete recovery.
So:
- completion must require a valid completion-ready phase
- normal completion requires converged catch-up
- zero-gap fast completion is allowed explicitly from handshake
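Combining identity fencing with completion authority, the completion path might be sketched like this (field and phase names are illustrative assumptions, not the real `enginev2` API):

```go
package main

import "errors"

// RecoverySession carries a stable unique identity plus the state needed
// to decide whether it is at a valid completion point.
type RecoverySession struct {
	ID        uint64
	Phase     string // "handshake", "catchup", ...
	ZeroGap   bool   // set at handshake when no entries are missing
	Converged bool   // set when catch-up has fully converged
}

type Sender struct{ active *RecoverySession }

// CompleteSessionByID enforces both fences: the session-ID match rejects
// stale completions, and the phase check rejects completions that are not
// at a valid completion-ready point.
func (s *Sender) CompleteSessionByID(id uint64) error {
	sess := s.active
	if sess == nil || sess.ID != id {
		return errors.New("stale session: completion rejected")
	}
	switch {
	case sess.Phase == "handshake" && sess.ZeroGap:
		// explicit zero-gap fast completion from handshake
	case sess.Phase == "catchup" && sess.Converged:
		// normal completion requires converged catch-up
	default:
		return errors.New("not at a valid completion point")
	}
	s.active = nil
	return nil
}
```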
## P2 Direction
The next prototype step is not broader simulation.
It is:
- recovery outcome branching
- assignment-intent orchestration
- prototype-level end-to-end recovery flow

46
sw-block/.private/phase/phase-04-log.md

@ -0,0 +1,46 @@
# Phase 04 Log
Date: 2026-03-27
Status: active
## 2026-03-27
- Phase 04 created to start the first standalone V2 implementation slice.
- Decision:
- do not begin in `weed/storage/blockvol/`
- begin under `sw-block/`
- first slice chosen:
- per-replica sender ownership
- explicit recovery-session ownership
- Initial slice delivered under `sw-block/prototype/enginev2/`:
- sender
- recovery session
- sender group
- First review found:
- sender/session epoch coherence gap
- session lifecycle was shell-only, not enforcing real transitions
- attach-session epoch mismatch was not rejected
- Follow-up delivered and accepted:
- reconcile updates preserved sender epoch
- epoch bump invalidates stale session
- session transition map enforced
- attach-session rejects epoch mismatch
- enginev2 tests increased to 26 passing
- Phase 04a created to close the ownership-validation gap:
- explicit session identity in `distsim`
- bridge tests into `enginev2`
- Phase 04a closed the ownership problem well enough:
- stale completion rejected by session ID
- endpoint invalidation includes `CtrlAddr`
- boundary doc aligned with real simulator/prototype evidence
- Phase 04 P1 delivered and accepted:
- sender-owned execution APIs added
- all execution APIs fence on `sessionID`
- completion now requires valid completion point
- attach/supersede now establish ownership only
- handshake range validation added
- enginev2 tests increased to 46 passing
- Next phase focus narrowed to P2:
- recovery outcome branching
- assignment-intent orchestration
- prototype end-to-end recovery flow

153
sw-block/.private/phase/phase-04.md

@ -0,0 +1,153 @@
# Phase 04
Date: 2026-03-27
Status: active
Purpose: start the first standalone V2 implementation slice under `sw-block/`, centered on per-replica sender ownership and explicit recovery-session ownership
## Goal
Build the first real V2 implementation slice without destabilizing V1.
This slice should prove:
1. per-replica sender identity
2. explicit one-session-per-replica recovery ownership
3. endpoint/assignment-driven recovery updates
4. clean handoff between normal sender and recovery session
## Why This Phase Exists
The simulator and design work are now strong enough to support a narrow implementation slice.
We should not start with:
- Smart WAL
- new storage engine
- frontend integration
We should start with the ownership problem that most clearly separates V2 from V1.5.
## Source Of Truth
Design:
- `sw-block/design/v2-first-slice-session-ownership.md`
- `sw-block/design/v2-acceptance-criteria.md`
- `sw-block/design/v2-open-questions.md`
Simulator reference:
- `sw-block/prototype/distsim/`
## Scope
### In scope
1. per-replica sender owner object
2. explicit recovery session object
3. session lifecycle rules
4. endpoint update handling
5. basic tests for sender/session ownership
### Out of scope
- Smart WAL in production code
- real block backend redesign
- V1 integration
- frontend publication
## Assigned Tasks For `sw`
### P0
1. create standalone V2 implementation area under `sw-block/`
- recommended:
- `sw-block/prototype/enginev2/`
2. define sender/session types
- sender owner per replica
- recovery session per replica per epoch
3. implement basic lifecycle
- create sender
- attach session
- supersede stale session
- close session on success / invalidation
### P1
4. implement endpoint update handling
- changed-address update must refresh the right sender owner
5. implement epoch invalidation
- stale session must stop after epoch bump
6. add tests matching the slice acceptance
### P2
7. add recovery outcome branching
- distinguish:
- zero-gap fast completion
- positive-gap catch-up completion
- unrecoverable gap / `NeedsRebuild`
8. add assignment-intent driven orchestration
- move beyond raw reconcile-only tests
- make sender-group react to explicit recovery intent
9. add prototype-level end-to-end flow tests
- assignment/update
- session creation
- execution
- completion / invalidation
- rebuild escalation
## Current Progress
Delivered in this phase so far:
- standalone V2 area created under:
- `sw-block/prototype/enginev2/`
- core types added:
- `Sender`
- `RecoverySession`
- `SenderGroup`
- sender/session lifecycle shell implemented
- per-replica ownership implemented
- endpoint-change invalidation implemented
- sender epoch coherence implemented
- session epoch attach validation implemented
- session phase transitions now enforce a real transition map
- session identity fencing implemented
- stale completion rejected by session ID
- execution APIs implemented:
- `BeginConnect`
- `RecordHandshake`
- `BeginCatchUp`
- `RecordCatchUpProgress`
- `CompleteSessionByID`
- completion authority tightened:
- catch-up must converge
- zero-gap handshake fast path allowed
- attach/supersede now establish ownership only
- sender-group orchestration tests added
- current `enginev2` test state at latest review:
- 46 tests passing
Next focus for `sw`:
- continue Phase 04 beyond execution gating:
- recovery outcome branching
- sender-group orchestration from assignment intent
- prototype-level end-to-end recovery flow
- do not integrate into V1 production tree yet
## Exit Criteria
Phase 04 is done when:
1. standalone V2 sender/session slice exists under `sw-block/`
2. sender ownership is per replica, not set-global
3. one active recovery session per replica per epoch is enforced
4. endpoint update and epoch invalidation are tested
5. sender-owned execution flow is validated
6. recovery outcome branching exists at prototype level

49
sw-block/.private/phase/phase-04a-decisions.md

@ -0,0 +1,49 @@
# Phase 04a Decisions
Date: 2026-03-27
Status: initial
## Core Decision
The next must-fix validation problem is:
- sender/session ownership semantics
This outranks:
- more timing realism
- more WAL detail
- broader scenario growth
## Why
V2's core claim over V1.5 is not only:
- better recovery policy
It is also:
- stable per-replica sender identity
- one active recovery owner
- stale work cannot mutate current state
If those ownership rules are not validated, the simulator can overstate confidence.
## Validation Rule
For this phase, a scenario is only complete when it is expressed at two levels:
1. simulator ownership model (`distsim`)
2. standalone implementation slice (`enginev2`)
Real `weed/` adversarial tests remain the system-level gate.
## Scope Discipline
Do not expand this phase into:
- generic simulator feature growth
- Smart WAL design growth
- V1 integration work
Keep it focused on the ownership model.

22
sw-block/.private/phase/phase-04a-log.md

@ -0,0 +1,22 @@
# Phase 04a Log
Date: 2026-03-27
Status: active
## 2026-03-27
- Phase 04a created as a narrow validation phase.
- Reason:
- the biggest remaining V2 validation gap is ownership semantics
- not general scenario count
- not more timer realism
- not more WAL detail
- Scope chosen:
- sender identity
- recovery session identity
- supersede / invalidate rules
- stale completion rejection
- `distsim` to `enginev2` bridge tests
- This phase is intentionally separate from broad Phase 04 implementation growth.
- Goal:
- gain confidence that V2 is validated as owned session/sender protocol state, not only as policy

113
sw-block/.private/phase/phase-04a.md

@ -0,0 +1,113 @@
# Phase 04a
Date: 2026-03-27
Status: active
Purpose: close the critical V2 ownership-validation gap by making sender/session ownership explicit in both simulation and the standalone `enginev2` slice
## Goal
Validate the core V2 claim more deeply:
1. one stable sender identity per replica
2. one active recovery session per replica
3. endpoint change, epoch bump, and supersede rules invalidate stale work
4. stale late results from old sessions cannot mutate current state
This phase is not about adding broad new simulator surface.
It is about proving the ownership model that is supposed to make V2 better than V1.5.
## Why This Phase Exists
Current simulation is already strong on:
- quorum / commit rules
- stale epoch rejection
- catch-up vs rebuild
- timeout / race ordering
- changed-address recovery at the policy level
The remaining critical risk is narrower:
- the simulator still validates V2 strongly as policy
- but not yet strongly enough as owned sender/session protocol state
That is the highest-value validation gap to close before placing too much trust in V2.
## Source Of Truth
Design:
- `sw-block/design/v2-first-slice-session-ownership.md`
- `sw-block/design/v2-acceptance-criteria.md`
- `sw-block/design/v2-open-questions.md`
- `sw-block/design/protocol-development-process.md`
Simulator / prototype:
- `sw-block/prototype/distsim/`
- `sw-block/prototype/enginev2/`
Historical / review context:
- `learn/projects/sw-block/phases/phase-13-v2-boundary-tests.md`
- `sw-block/design/v2-scenario-sources-from-v1.md`
## Scope
### In scope
1. explicit sender/session identity validation in `distsim`
2. explicit stale-session invalidation rules
3. bridge tests from `distsim` scenarios to `enginev2` sender/session invariants
4. doc cleanup so V2-boundary tests point to real simulator and `enginev2` coverage
### Out of scope
- Smart WAL expansion
- broad new timing realism
- TCP / disk realism
- V1 production integration
- new backend/storage engine work
## Critical Questions To Close
1. can an old session completion mutate state after a new session supersedes it?
2. does endpoint change invalidate or supersede the active session cleanly?
3. does epoch bump remove all authority from prior sessions?
4. can duplicate recovery triggers create overlapping active sessions?
## Assigned Tasks For `sw`
### P0
1. add explicit session identity to `distsim`
- model session ID or equivalent ownership token
- make stale session results rejectable by identity, not just by coarse state
2. add ownership scenarios to `distsim`
- endpoint change during active catch-up
- epoch bump during active catch-up
- stale late completion from old session
- duplicate recovery trigger while a session is already active
3. add bridge tests in `enginev2`
- same-address reconnect preserves sender identity
- endpoint bump supersedes or invalidates active session
- epoch bump rejects stale completion
- only one active session per sender
### P1
4. tighten `learn/projects/sw-block/phases/phase-13-v2-boundary-tests.md`
- point to actual `distsim` scenarios
- point to actual `enginev2` bridge tests
- state what remains real-engine-only
5. only add simulator mechanics if a bridge test exposes a real ownership gap
## Exit Criteria
Phase 04a is done when:
1. `distsim` explicitly validates sender/session ownership invariants
2. `enginev2` has bridge tests for the same invariants
3. stale session work is shown unable to mutate current sender state
4. V2-boundary doc no longer has stale simulator references
5. we can say with confidence that V2 ownership semantics, not just V2 policy, are validated at prototype level

18
sw-block/README.md

@ -0,0 +1,18 @@
# sw-block
Private WAL V2 and standalone block-service workspace.
Purpose:
- keep WAL V2 design/prototype work isolated from WAL V1 production code in `weed/storage/blockvol`
- allow private design notes and experiments to evolve without polluting V1 delivery paths
- keep the future standalone `sw-block` product structure clean enough to split into a separate repo later if needed
Suggested layout:
- `design/`: shared V2 design docs
- `prototype/`: code prototypes and experiments
- `.private/`: private notes, phase development, roadmap, and non-public working material
Repository direction:
- current state: `sw-block/` is an isolated workspace inside `seaweedfs`
- likely future state: `sw-block` becomes a standalone sibling repo/product
- design and prototype structure should therefore stay product-oriented and not depend on SeaweedFS-specific paths

26
sw-block/design/README.md

@ -0,0 +1,26 @@
# V2 Design
Current WAL V2 design set:
- `wal-replication-v2.md`
- `wal-replication-v2-state-machine.md`
- `wal-replication-v2-orchestrator.md`
- `wal-v2-tiny-prototype.md`
- `wal-v1-to-v2-mapping.md`
- `v2-dist-fsm.md`
- `v2_scenarios.md`
- `v1-v15-v2-comparison.md`
- `v2-scenario-sources-from-v1.md`
- `protocol-development-process.md`
- `v2-acceptance-criteria.md`
- `v2-open-questions.md`
- `v2-first-slice-session-ownership.md`
- `v2-prototype-roadmap-and-gates.md`
These documents are the working design home for the V2 line.
The original project-level copies under `learn/projects/sw-block/design/` remain as shared references for now.
Execution note:
- active development tracking for the simulator and implementation phases lives under:
- `../.private/phase/phase-01.md`
- `../.private/phase/phase-02.md`
- `../.private/phase/phase-03.md`
- `../.private/phase/phase-04.md`

288
sw-block/design/protocol-development-process.md

@ -0,0 +1,288 @@
# Protocol Development Process
Date: 2026-03-27
## Purpose
This document defines how `sw-block` protocol work should be developed.
The process is meant to work for:
- V2
- future V3
- or a later block algorithm that is not WAL-based
The point is to make protocol work systematic rather than reactive.
## Core Philosophy
### 1. Design before implementation
Do not start with production code and hope the protocol becomes clear later.
Start with:
1. system contract
2. invariants
3. state model
4. scenario backlog
Only then move to implementation.
### 2. Real failures are inputs, not just bugs
When V1 or V1.5 fails in real testing, treat that as:
- a design requirement
- a scenario source
- a simulator input
Do not patch and forget.
### 3. Simulator is part of the protocol, not a side tool
The simulator exists to answer:
- what should happen
- what must never happen
- which old designs fail
- why the new design is better
It is not a replacement for real testing.
It is the design-validation layer before production implementation.
### 4. Passing tests are not enough
Green tests are necessary, not sufficient.
We also require:
- explicit invariants
- explicit scenario intent
- clear state transitions
- review of assumptions and abstraction boundaries
### 5. Keep hot-path and recovery-path reasoning separate
Healthy steady-state behavior and degraded recovery behavior are different problems.
Both must be designed explicitly.
## Development Ladder
Every major protocol feature should move through these steps:
1. **Problem statement**
- what real bug, limit, or product goal is driving the work
2. **Contract**
- what the protocol guarantees
- what it does not guarantee
3. **State model**
- node state
- coordinator state
- recovery state
- role / epoch / lineage rules
4. **Scenario backlog**
- named scenarios
- source:
- real failure
- design obligation
- adversarial distributed case
5. **Prototype / simulator**
- reduced but explicit model
- invariant checks
- V1 / V1.5 / V2 comparison where relevant
6. **Implementation**
- production code only after the protocol shape is clear enough
7. **Real validation**
- unit
- component
- integration
- real hardware where needed
8. **Feedback loop**
- turn new failures back into scenario/design inputs
## Required Artifacts
For protocol work to be considered real progress, we usually want:
### Design
- design doc
- scenario doc
- comparison doc when replacing an older approach
### Prototype
- simulator or prototype code
- tests that assert protocol behavior
### Implementation
- production patch
- production tests
- docs updated to match the actual algorithm
### Review
- implementation gate
- design/protocol gate
## Two-Gate Rule
We use two acceptance gates.
### Gate 1: implementation
Owned by the coding side.
Questions:
- does it build?
- do tests pass?
- does it behave as intended in code?
### Gate 2: protocol/design
Owned by the design/review side.
Questions:
- is the logic actually sound?
- do tests prove the intended thing?
- are assumptions explicit?
- is the abstraction boundary honest?
A task is not accepted until both gates pass.
## Layering Rule
Keep simulation layers separate.
### `distsim`
Use for:
- protocol correctness
- state transitions
- fencing
- recoverability
- promotion / lineage
- reference-state checking
### `eventsim`
Use for:
- timeout behavior
- timer races
- event ordering
- same-tick / delayed event interactions
Do not duplicate scenarios blindly across both layers.
## Test Selection Rule
Do not choose simulator inputs only from failing tests.
Review all relevant tests and classify them by:
- protocol significance
- simulator value
- implementation specificity
Good simulator candidates often come from:
- barrier truth
- catch-up vs rebuild
- stale message rejection
- failover / promotion safety
- changed-address restart
- mode semantics
Keep real-only tests for:
- wire format
- OS timing
- exact WAL file behavior
- frontend transport specifics
## Version Comparison Rule
When designing a successor protocol:
- keep the old version visible
- reproduce the old failure or limitation
- show the improved behavior in the new version
For `sw-block`, that means:
- `V1`
- `V1.5`
- `V2`
should be compared explicitly where possible.
## Documentation Rule
The docs must track three different things:
### `learn/projects/sw-block/`
Use for:
- project history
- V1/V1.5 algorithm records
- phase records
- real test history
### `sw-block/design/`
Use for:
- active design truth
- V2 and later protocol docs
- scenario backlog
- comparison docs
### `sw-block/.private/phase/`
Use for:
- active execution plan
- log
- decisions
## What Good Progress Looks Like
A good protocol iteration usually has this pattern:
1. real failure or design pressure identified
2. scenario named and written down
3. simulator reproduces the bad case
4. new protocol handles it explicitly
5. implementation follows
6. real tests validate it
If one of those steps is missing, confidence is weaker.
## Bottom Line
The process is:
1. design the contract
2. model the state
3. define the scenarios
4. simulate the protocol
5. implement carefully
6. validate in real tests
7. feed failures back into design
That is the process we should keep using for V2 and any later protocol line.

252
sw-block/design/protocol-version-simulation.md

@ -0,0 +1,252 @@
# Protocol Version Simulation
Date: 2026-03-26
Status: design proposal
Purpose: define how the simulator should model WAL V1, WAL V1.5 (Phase 13), and WAL V2 on the same scenario set
## Why This Exists
The simulator is more valuable if the same scenario can answer:
1. how WAL V1 behaves
2. how WAL V1.5 behaves
3. how WAL V2 should behave
That turns the simulator into:
- a regression tool for V1/V1.5
- a justification tool for V2
- a comparison framework across protocol generations
## Principle
Do not fork three separate simulators.
Instead:
- keep one simulator core
- add protocol-version behavior modes
- run the same named scenario under different modes
## Proposed Versions
### `ProtocolV1`
Intent:
- represent pre-Phase-13 behavior
Behavior shape:
- WAL is streamed optimistically
- lagging replica is degraded/excluded quickly
- no real short-gap catch-up contract
- no retention-backed recovery window
- replica usually falls toward rebuild rather than incremental recovery
What scenarios should expose:
- short outage still causes unnecessary degrade/rebuild
- transient jitter may be over-penalized
- poor graceful rejoin story
### `ProtocolV15`
Intent:
- represent Phase-13 WAL V1.5 behavior
Behavior shape:
- reconnect handshake exists
- WAL catch-up exists
- primary may retain WAL longer for lagging replica
- recovery still depends heavily on address stability and control-plane timing
- catch-up may still tail-chase or stall operationally
What scenarios should expose:
- transient disconnects may recover
- restart with new receiver address may still fail practical recovery
- tail-chasing / retention pressure remain structural risks
### `ProtocolV2`
Intent:
- represent the target design
Behavior shape:
- explicit recovery reservation
- explicit catch-up vs rebuild boundary
- lineage-first promotion
- version-correct recovery sources
- explicit abort/rebuild path on non-convergence or lost recoverability
What scenarios should show:
- short gap recovers cleanly
- impossible catch-up fails cleanly
- rebuild is explicit, not accidental
## Behavior Axes To Toggle
The simulator does not need completely different code paths.
It needs protocol-version-sensitive policy on these axes:
### 1. Lagging replica treatment
`V1`:
- degrade quickly
- no meaningful WAL catch-up window
`V1.5`:
- allow WAL catch-up while history remains available
`V2`:
- allow catch-up only with explicit recoverability / reservation
### 2. WAL retention / recoverability
`V1`:
- little or no retention for lagging-replica recovery
`V1.5`:
- retention-based recovery window
- but no strong reservation contract
`V2`:
- recoverability check plus reservation
### 3. Restart / address stability
`V1`:
- generally poor rejoin path
`V1.5`:
- reconnect may work only if replica address is stable
`V2`:
- address/identity assumptions should be explicit in the model
### 4. Tail-chasing behavior
`V1`:
- usually degrades rather than catches up
`V1.5`:
- catch-up may be attempted but may never converge
`V2`:
- non-convergence should explicitly abort/escalate
### 5. Promotion policy
`V1`:
- weaker lineage reasoning
`V1.5`:
- improved epoch/LSN handling
`V2`:
- lineage-first promotion is a first-class rule
## Recommended Simulator API
Add a version enum, for example:
```go
type ProtocolVersion string

const (
	ProtocolV1  ProtocolVersion = "v1"
	ProtocolV15 ProtocolVersion = "v1_5"
	ProtocolV2  ProtocolVersion = "v2"
)
```
Attach it to the simulator or cluster:
```go
type Cluster struct {
	Protocol ProtocolVersion
	// ... existing cluster fields
}
```
## Policy Hooks
Rather than branching everywhere, centralize the differences in a few hooks:
1. `CanAttemptCatchup(...)`
2. `CatchupConvergencePolicy(...)`
3. `RecoverabilityPolicy(...)`
4. `RestartRejoinPolicy(...)`
5. `PromotionPolicy(...)`
That keeps the simulator readable.
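One possible shape for that policy layer, with a `ProtocolV1`-style implementation to illustrate the degrade-quickly behavior (interface and parameter shapes are assumptions for illustration, not the simulator's real signatures):

```go
package main

// VersionPolicy centralizes version-sensitive behavior behind a few hooks
// so the simulator core does not branch on protocol version everywhere.
type VersionPolicy interface {
	CanAttemptCatchup(gap uint64, historyAvailable bool) bool
	CatchupConvergencePolicy(lagTrend int) (abort bool)
	RecoverabilityPolicy(gap uint64, retained uint64) bool
	RestartRejoinPolicy(sameAddress bool) bool
	PromotionPolicy(candidates []string) (winner string, ok bool)
}

// v1Policy sketches pre-Phase-13 behavior: no meaningful catch-up window,
// so a lagging replica falls toward degrade/rebuild.
type v1Policy struct{}

func (v1Policy) CanAttemptCatchup(gap uint64, historyAvailable bool) bool { return false }
func (v1Policy) CatchupConvergencePolicy(lagTrend int) bool               { return true }
func (v1Policy) RecoverabilityPolicy(gap, retained uint64) bool           { return false }
func (v1Policy) RestartRejoinPolicy(sameAddress bool) bool                { return false }
func (v1Policy) PromotionPolicy(c []string) (string, bool) {
	if len(c) == 0 {
		return "", false
	}
	return c[0], true // weaker lineage reasoning: first candidate wins
}
```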
## Example Scenario Comparisons
### Scenario: brief disconnect
`V1`:
- likely degrade / no efficient catch-up
`V1.5`:
- catch-up may succeed if address/history remain stable
`V2`:
- explicit recoverability + reservation
- catch-up only if the missing window is still recoverable
- otherwise explicit rebuild
### Scenario: replica restart with new receiver port
`V1`:
- poor recovery path
`V1.5`:
- background reconnect fails if it retries stale address
`V2`:
- identity/address model must make this explicit
- direct reconnect is not assumed
- use explicit reassignment plus catch-up if recoverable, otherwise rebuild cleanly
### Scenario: primary writes faster than catch-up
`V1`:
- replica degrades
`V1.5`:
- may tail-chase indefinitely or pin WAL too long
`V2`:
- explicit non-convergence detection -> abort / rebuild
## What To Measure
For each scenario, compare:
1. does committed data remain safe?
2. does uncommitted data stay out of committed lineage?
3. does recovery complete or stall?
4. does protocol choose catch-up or rebuild?
5. is the outcome explicit or accidental?
## Immediate Next Step
Start with a minimal versioned policy layer:
1. add `ProtocolVersion`
2. implement one or two version-sensitive hooks:
- `CanAttemptCatchup`
- `CatchupConvergencePolicy`
3. run existing scenarios under:
- `ProtocolV1`
- `ProtocolV15`
- `ProtocolV2`
That is enough to begin proving:
- V1 breaks
- V1.5 improves but still strains
- V2 handles the same scenario more cleanly
## Bottom Line
The same scenario set should become a comparison harness across protocol generations.
That is one of the strongest uses of the simulator:
- not only "does V2 work?"
- but "why is V2 better than V1 and V1.5?"

314
sw-block/design/v1-v15-v2-comparison.md

@ -0,0 +1,314 @@
# V1, V1.5, and V2 Comparison
Date: 2026-03-27
## Purpose
This document compares:
- `V1`: original replicated WAL shipping model
- `V1.5`: Phase 13 catch-up-first improvements on top of V1
- `V2`: explicit FSM / orchestrator / recoverability-driven design under `sw-block/`
It is a design comparison, not a marketing document.
## 1. One-line summary
- `V1` is simple but weak on short-gap recovery.
- `V1.5` materially improves recovery, but still relies on assumptions and incremental control-plane fixes.
- `V2` is structurally cleaner, more explicit, and easier to validate, but is not yet a production engine.
## 2. Steady-State Hot Path
In the healthy case, all three versions can look similar:
1. primary appends ordered WAL
2. primary ships entries to replicas
3. replicas apply in order
4. durability barrier determines when client-visible commit completes
### V1
- simplest replication path
- lagging replica typically degrades quickly
- little explicit recovery structure
### V1.5
- same basic hot path as V1
- WAL retention and reconnect/catch-up improve short outage handling
- extra logic exists, but much of it is off the hot path
### V2
- can keep a similar hot path if implemented carefully
- extra complexity is mainly in:
- recovery planner
- replica state machine
- coordinator/orchestrator
- recoverability checks
### Performance expectation
In a normal healthy cluster:
- `V2` should not be much heavier than `V1.5`
- most V2 complexity sits in failure/recovery/control paths
- there is no proof yet that V2 has better steady-state throughput or latency
## 3. Recovery Behavior
### V1
Recovery is weakly structured:
- lagging replica tends to degrade
- short outage often becomes rebuild or long degraded state
- little explicit catch-up boundary
### V1.5
Recovery is improved:
- short outage can recover by retained-WAL catch-up
- background reconnect closes the `sync_all` dead-loop
- catch-up-first is preferred before rebuild
But the model is still partly implicit:
- reconnect depends on endpoint stability unless control plane refreshes assignment
- recoverability boundary is not as explicit as V2
- tail-chasing and retention pressure still need policy care
### V2
Recovery is explicit by design:
- `InSync`
- `Lagging`
- `CatchingUp`
- `NeedsRebuild`
- `Rebuilding`
And explicit decisions exist for:
- catch-up vs rebuild
- stale-epoch rejection
- promotion candidate choice
- recoverable vs unrecoverable gap
## 4. Real V1.5 Lessons
The main V2 requirements come from real V1.5 behavior.
### 4.1 Changed-address restart
Observed in `CP13-8 T4b`:
- replica restarted
- endpoint changed
- primary shipper held stale address
- direct reconnect could not succeed until control plane refreshed assignment
V1.5 fix:
- saved address used only as hint
- heartbeat-reported address becomes source of truth
- master refreshes primary assignment
Lesson for V2:
- endpoint is not identity
- reassignment must be explicit
### 4.2 Reconnect race
Observed in Phase 13 review:
- barrier path and background reconnect path could both trigger reconnect
V1.5 fix:
- `reconnectMu` serializes reconnect / catch-up
Lesson for V2:
- one active recovery session per replica should be a protocol rule, not just a local mutex trick
### 4.3 Tail-chasing
Even with retained WAL:
- primary may write faster than a lagging replica can recover
- catch-up may not converge
Lesson for V2:
- explicit abort / `NeedsRebuild`
- do not pretend catch-up will always work
### 4.4 Control-plane recovery latency
V1.5 can be correct but still operationally slow if recovery waits on slower management cycles.
Lesson for V2:
- keep authority in coordinator
- but make recovery decisions explicit and fast when possible
## 5. V2 Structural Improvements
V2 is better primarily because it is easier to reason about and validate.
### 5.1 Better state model
Instead of implicit recovery behavior, V2 has:
- per-replica FSM
- volume/orchestrator model
- distributed simulator with scenario coverage
### 5.2 Better validation
V2 has:
- named scenario backlog
- protocol-state assertions
- randomized simulation
- V1/V1.5/V2 comparison tests
This is a major difference from V1/V1.5, where many fixes were discovered through implementation and hardware testing first.
### 5.3 Better correctness boundaries
V2 makes these explicit:
- recoverable gap vs rebuild
- stale traffic rejection
- promotion lineage safety
- reservation or payload availability transitions
## 6. Stability Comparison
### Current judgment
- `V1`: least stable under failure/recovery stress
- `V1.5`: meaningfully better and now functionally validated by real tests
- `V2`: best protocol structure and best simulator confidence
### Important limit
`V2` is not yet proven more stable in production because:
- it is not a production engine yet
- confidence comes from simulator/design work, not real block workload deployment
So the accurate statement is:
- `V2` is more stable **architecturally**
- `V1.5` is more stable **operationally today** because it is implemented and tested on real hardware
## 7. Performance Comparison
### What is likely true
`V2` should perform better than rebuild-heavy recovery approaches when:
- outage is short
- gap is recoverable
- catch-up avoids full rebuild
It should also behave better under:
- flapping replicas
- stale delayed messages
- mixed-state replica sets
### What is not yet proven
We do not yet know whether `V2`, compared with `V1.5`, has:
- better steady-state throughput
- lower p99 latency
- lower CPU overhead
- lower memory overhead
That requires real implementation and benchmarking.
## 8. Smart WAL Fit
### Why Smart WAL is awkward in V1/V1.5
V1/V1.5 do not naturally model:
- payload classes
- recoverability reservations
- historical payload resolution
- explicit recoverable/unrecoverable transition
So Smart WAL would be harder to add cleanly there.
### Why Smart WAL fits V2 better
V2 already has the right conceptual slots:
- `RecoveryClass`
- `WALInline`
- `ExtentReferenced`
- recoverability planner
- catch-up vs rebuild decision point
- simulator for payload-availability transitions
### Important rule
Smart WAL must not mean:
- “read current extent for old LSN”
That is incorrect.
Historical correctness requires:
- WAL inline payload
- or pinned snapshot/versioned extent state
- not current live extent contents
## 9. What Is Proven Today
### Proven
- `V1.5` significantly improves V1 recovery behavior
- real `CP13-8` testing validated the V1.5 data path and `sync_all` behavior
- the V2 simulator covers:
- stale traffic rejection
- tail-chasing
- flapping replicas
- multi-promotion lineage
- changed-address restart comparison
- same-address transient outage comparison
- Smart WAL availability transitions
### Not yet proven
- V2 production implementation quality
- V2 steady-state performance advantage
- V2 real hardware recovery performance
## 10. Bottom Line
If choosing based on current evidence:
- use `V1.5` as the production line today
- use `V2` as the better long-term architecture
If choosing based on protocol quality:
- `V2` is clearly better structured
- `V1.5` is still more ad hoc, even after successful fixes
If choosing based on current real-world proof:
- `V1.5` has the stronger operational evidence today
- `V2` has the stronger design and simulation evidence today

281
sw-block/design/v1-v15-v2-simulator-goals.md

@ -0,0 +1,281 @@
# V1 / V1.5 / V2 Simulator Goals
Date: 2026-03-26
Status: working design note
Purpose: define how the simulator should be used against WAL V1, Phase-13 V1.5, and WAL V2
## Why This Exists
The simulator is not only for validating V2.
It should also be used to:
1. break WAL V1
2. stress WAL V1.5 / Phase 13
3. justify why WAL V2 is needed
This note defines what failures we want the simulator to find in each protocol generation.
## What The Simulator Can And Cannot Do
### What it is good at
The simulator is good at:
1. finding concrete counterexamples
2. exposing bad protocol assumptions
3. checking commit / failover / fencing invariants
4. checking historical data correctness at target `LSN`
### What it is not
The simulator is not a full proof unless promoted to formal model checking.
So the right claim is:
- "no issue found under these modeled runs"
not:
- "protocol proven correct in all implementations"
## Protocol Targets
### WAL V1
Core shape:
- primary ships WAL out
- lagging replica degrades quickly
- no real recoverability contract
- no strong short-gap catch-up window
Primary risk:
- a briefly lagging replica gets downgraded too early and forced into rebuild
### WAL V1.5 / Phase 13
Core shape:
- primary retains WAL longer for lagging replicas
- reconnect / catch-up exists
- rebuild fallback exists
- primary may wait before releasing WAL
Primary risks:
- WAL pinning
- tail chasing
- slow availability recovery
- recoverability assumptions that do not hold long enough
### WAL V2
Core shape:
- explicit state machine
- explicit recoverability / reservation
- catch-up vs rebuild boundary is formalized
- eventual support for `WALInline` vs `ExtentReferenced`
Primary goal:
- no committed data loss
- no false recovery
- cheaper and clearer short-gap recovery
## What To Find In WAL V1
The simulator should try to find scenarios where V1 fails operationally or structurally.
### V1-F1. Short Disconnect Still Forces Rebuild
Sequence:
1. replica disconnects briefly
2. primary continues writing
3. replica returns quickly
Expected ideal behavior:
- short-gap catch-up
What V1 may do:
- downgrade replica too early
- no usable catch-up path
- rebuild required unnecessarily
### V1-F2. Jitter Causes Avoidable Degrade
Sequence:
1. replica is alive but sees delayed/reordered delivery
2. primary interprets this as lag/failure
Failure signal:
- unnecessary downgrade or exclusion
### V1-F3. Repeated Brief Flaps Cause Thrash
Sequence:
1. repeated short disconnect/reconnect
2. primary repeatedly degrades replica
Failure signal:
- poor availability
- excessive rebuild churn
### V1-F4. No Efficient Path Back To Healthy State
Sequence:
1. replica becomes degraded
2. network recovers
Failure signal:
- control plane or protocol provides no clean short recovery path
## What To Find In WAL V1.5 / Phase 13
The simulator should stress whether retention-based catch-up is actually enough.
### V15-F1. Tail Chasing Under Ongoing Writes
Sequence:
1. replica reconnects behind
2. primary keeps writing
3. catch-up tries to close the gap
Failure signal:
- replica never converges
- stays forever behind
- no clean escalation path
### V15-F2. WAL Pinning Harms System Progress
Sequence:
1. replica lags
2. primary retains WAL to help recovery
3. lag persists
Failure signal:
- WAL window remains pinned too long
- reclaim stalls
- system availability or throughput suffers
### V15-F3. Catch-Up Window Expires Mid-Recovery
Sequence:
1. catch-up begins
2. primary continues advancing
3. required recoverability disappears before completion
Failure signal:
- protocol still claims success
- or lacks a clean abort-to-rebuild path
### V15-F4. Restart Recovery Too Slow
Sequence:
1. replica restarts
2. primary blocks writes correctly under `sync_all`
3. service recovery takes too long
Failure signal:
- correctness preserved
- but availability recovery is operationally unacceptable
### V15-F5. Multiple Lagging Replicas Poison Progress
Sequence:
1. more than one replica lags
2. retention and recovery obligations interact
Failure signal:
- one slow replica or mixed states poison the entire volume behavior
## What WAL V2 Should Survive
V2 should not merely avoid V1/V1.5 failures.
It should make them explicit and manageable.
### V2-S1. Short Gap Recovers Cheaply
Expected:
- brief disconnect -> catch-up -> promote
- no rebuild
### V2-S2. Impossible Catch-Up Fails Cleanly
Expected:
- not fully recoverable -> `NeedsRebuild`
- no pretend success
### V2-S3. Reservation Loss Forces Correct Abort
Expected:
- once recoverability is lost, catch-up aborts
- rebuild path takes over
### V2-S4. Promotion Is Lineage-First
Expected:
- new primary chosen from valid lineage
- not simply highest apparent `LSN`
### V2-S5. Historical Data Correctness Is Preserved
Expected:
- no rebuild from current extent pretending to be old state
- correct snapshot/base + replay behavior
## Simulation Strategy By Version
### For V1
Use simulator to:
- break it
- demonstrate avoidable rebuilds and downgrade behavior
The simulator is mainly a diagnostic and justification tool here.
### For V1.5
Use simulator to:
- stress retention-based catch-up
- find operational limits
- expose where retention alone is not enough
The simulator is a stress and tradeoff tool here.
### For V2
Use simulator to:
- validate named protocol scenarios
- validate random/adversarial runs
- confirm state + data correctness under failover/recovery
The simulator is a design-validation tool here.
## Practical Outcome
If the simulator finds:
### On V1
- short outages still lead to rebuild
Then conclusion:
- V1 lacks a real short-gap recovery story
### On V1.5
- retention helps but can still tail-chase or pin WAL too long
Then conclusion:
- V1.5 is a useful bridge, but not the final architecture
### On V2
- catch-up/rebuild boundary is explicit and safe
Then conclusion:
- V2 solves the protocol problem more cleanly
## Bottom Line
Use the simulator differently for each generation:
1. WAL V1: find where it breaks
2. WAL V1.5: find where it strains
3. WAL V2: validate that it behaves correctly and more cleanly
That is how the simulator justifies the architectural move from V1 to V2.

280
sw-block/design/v2-acceptance-criteria.md

@ -0,0 +1,280 @@
# V2 Acceptance Criteria
Date: 2026-03-27
## Purpose
This document defines the minimum protocol-validation bar for V2.
It is not the full scenario backlog.
It is the smaller acceptance set that should be true before we claim:
- the V2 protocol shape is validated enough to guide implementation
## Scope
This acceptance set is about:
- protocol correctness
- recovery correctness
- lineage / fencing correctness
- data correctness at target `LSN`
This acceptance set is not yet about:
- production performance
- frontend integration
- wire protocol
- disk implementation details
## Acceptance Rule
A V2 acceptance item should satisfy all of:
1. named scenario
2. explicit expected behavior
3. simulator coverage
4. clear invariant or pass condition
5. mapped reason why it matters
## Acceptance Set
### A1. Committed Data Survives Failover
Must prove:
- acknowledged data is not lost after primary failure and promotion
Evidence:
- `S1`
- distributed simulator pass
Pass condition:
- promoted node matches reference state at committed `LSN`
### A2. Uncommitted Data Is Not Revived
Must prove:
- non-acknowledged writes do not become committed after failover
Evidence:
- `S2`
Pass condition:
- committed prefix remains at the previous valid boundary
### A3. Stale Epoch Traffic Is Fenced
Must prove:
- old primary / stale sender traffic cannot mutate current lineage
Evidence:
- `S3`
- stale write / stale barrier / stale delayed ack scenarios
Pass condition:
- stale traffic is rejected
- committed prefix does not change
### A4. Short-Gap Catch-Up Works
Must prove:
- brief outage with recoverable gap returns via catch-up, not rebuild
Evidence:
- `S4`
- same-address transient outage comparison
Pass condition:
- recovered replica returns to `InSync`
- final state matches reference
### A5. Non-Convergent Catch-Up Escalates Explicitly
Must prove:
- tail-chasing or failed catch-up does not pretend success
Evidence:
- `S6`
Pass condition:
- explicit `CatchingUp -> NeedsRebuild`
### A6. Recoverability Boundary Is Explicit
Must prove:
- recoverable vs unrecoverable gap is decided explicitly
Evidence:
- `S7`
- Smart WAL availability transition scenarios
Pass condition:
- recovery aborts when reservation/payload availability is lost
- rebuild becomes the explicit fallback
### A7. Historical Data Correctness Holds
Must prove:
- recovered data for target `LSN` is historically correct
- current extent cannot fake old history
Evidence:
- `S8`
- `S9`
Pass condition:
- snapshot + tail rebuild matches reference state
- current-extent reconstruction of old `LSN` fails correctness
### A8. Durability Mode Semantics Are Correct
Must prove:
- `best_effort`, `sync_all`, and `sync_quorum` behave as intended under mixed replica states
Evidence:
- `S10`
- `S11`
- timeout-backed quorum/all race tests
Pass condition:
- `sync_all` remains strict
- `sync_quorum` commits only with true durable quorum
- invalid `sync_quorum` topology assumptions are rejected
### A9. Promotion Uses Safe Candidate Eligibility
Must prove:
- promotion requires:
- running
- epoch alignment
- state eligibility
- committed-prefix sufficiency
Evidence:
- stronger `S12`
- candidate eligibility tests
Pass condition:
- unsafe candidates are rejected by default
- desperate promotion, if any, is explicit and separate
### A10. Changed-Address Restart Is Explicitly Recoverable
Must prove:
- endpoint is not identity
- changed-address restart does not rely on stale endpoint reuse
Evidence:
- V1 / V1.5 / V2 changed-address comparison
- endpoint-version / assignment-update simulator flow
Pass condition:
- stale endpoint is rejected
- control-plane update refreshes primary view
- recovery proceeds only after explicit update
### A11. Timeout Semantics Are Explicit
Must prove:
- barrier, catch-up, and reservation timeouts are first-class protocol behavior
Evidence:
- Phase 03 P0 timeout tests
Pass condition:
- timeout effects are explicit
- stale timeouts do not regress recovered state
- late barrier ack after timeout is rejected
### A12. Timer Races Are Stable
Must prove:
- timer/event ordering does not silently break protocol guarantees
Evidence:
- Phase 03 P1/P2 race tests
Pass condition:
- same-tick ordering is explicit
- promotion / epoch bump / timeout interactions preserve invariants
- traces are debuggable
## Compare Requirement
Where meaningful, V2 acceptance should include comparison against:
- `V1`
- `V1.5`
Especially for:
- changed-address restart
- same-address transient outage
- tail-chasing
- slow control-plane recovery
## Required Evidence
Before calling V2 protocol validation “good enough”, we want:
1. scenario coverage in `v2_scenarios.md`
2. selected simulator tests in `distsim`
3. timing/race tests in `eventsim`
4. V1 / V1.5 / V2 comparison where relevant
5. review sign-off that the tests prove the right thing
## What This Does Not Prove
Even if all acceptance items pass, this still does not prove:
- production implementation quality
- wire protocol correctness
- real performance
- disk-level behavior
Those require later implementation and real-system validation.
## Bottom Line
If A1 through A12 are satisfied, V2 is validated enough at the protocol/design level to justify:
1. implementation slicing
2. Smart WAL design refinement
3. later real-engine integration

234
sw-block/design/v2-dist-fsm.md

@ -0,0 +1,234 @@
# WAL V2 Distributed Simulator
Date: 2026-03-26
Status: design proposal
Purpose: define the next prototype layer above `ReplicaFSM` and `VolumeModel` so WAL V2 can be validated as a distributed state machine rather than only a local state machine
## Why This Exists
The current V2 prototype already has:
- `ReplicaFSM`
- `VolumeModel`
- `RecoveryPlanner`
- scenario tracing
That is enough to reason about local recovery logic and volume-level admission.
It is not enough to prove the distributed safety claim.
The real system question is:
- when time moves forward, nodes start/stop/disconnect/reconnect, and the coordinator changes epoch,
- do all acknowledged writes remain recoverable according to the configured durability policy?
That requires a distributed simulator.
## Core Idea
Model the system as:
1. node-local state machines
2. a coordinator state machine
3. a time-driven message simulator
4. a reference data model used as the correctness oracle
## Layers
### 1. `NodeModel`
Each node has:
- role
- epoch seen
- local WAL state
- head
- tail
- `receivedLSN`
- `flushedLSN`
- checkpoint/snapshot state
- `cpLSN`
- local extent state
- local connectivity state
- local `ReplicaFSM` for each remote relationship as needed
### 2. `CoordinatorModel`
The coordinator owns:
- current epoch
- primary assignment
- membership
- durability policy
- rebuild assignments
- promotion decisions
### 3. `Network/Time Simulator`
The simulator owns:
- logical time ticks
- message delivery queues
- delay, drop, and disconnect events
- node start/stop/restart
### 4. `Reference Model`
The reference model is the correctness oracle.
It applies the committed write history to an idealized block map.
At any target `LSN = X`, it can answer:
- what value should each block contain at `X`?
## Data Correctness Model
### Synthetic 4K writes
For simulation, each 4K write should be represented as:
- block ID
- value
A simple deterministic choice is:
- `value = LSN`
Example:
- `LSN 10`: write block 7 = 10
- `LSN 11`: write block 2 = 11
- `LSN 12`: write block 7 = 12
This makes correctness checks trivial.
### Why this matters
This catches the exact extent-recovery trap:
1. `LSN 10`: block 7 = 10
2. `LSN 12`: block 7 = 12
If recovery claims to rebuild state at `LSN 10` using current extent and returns block 7 = 12, the simulator detects the bug immediately.
## Golden Invariant
For any node declared recovered to target `LSN = T`:
- node extent state must equal the reference model's state at `T`
Not:
- equal to current latest state
- equal to any valid-looking value
Exactly:
- the reference state at target `LSN`
## Recovery Correctness Rules
### WAL replay correctness
For `(startLSN, endLSN]` replay to be valid:
- every record in the interval must exist
- every payload must be the correct historical version for its LSN
- no replay gaps are allowed
- no stale-epoch records are allowed
### Extent/snapshot correctness
Extent-based recovery is valid only if the data source is version-correct.
Allowed examples:
- immutable snapshot at `cpLSN`
- pinned copy-on-write generation
- pinned payload object referenced by a recovery record
Not allowed:
- current live extent used as if it were historical state at old `cpLSN`
## Suggested Prototype Package
Prototype location:
- `sw-block/prototype/distsim/`
Suggested files:
- `types.go`
- `node.go`
- `coordinator.go`
- `network.go`
- `reference.go`
- `scenario.go`
- `sim_test.go`
## Minimal First Milestone
Do not try to simulate the whole product first.
First milestone:
1. one primary
2. one replica
3. time ticks
4. synthetic 4K writes with deterministic values
5. canonical reference model
6. simple recovery check:
- WAL replay recovers correct value
- current extent alone does not recover old `LSN`
- snapshot/base image at `cpLSN` does recover correct value
If that milestone is solid, then add:
- failover
- quorum
- multi-replica
- coordinator promotion rules
## Test Cases To Add Early
### 1. WAL replay preserves historical values
- write block 7 = 10
- write block 7 = 12
- replay only to `LSN 10`
- expect block 7 = 10
### 2. Current extent cannot reconstruct old `LSN`
- same write sequence
- try rebuilding `LSN 10` from latest extent
- expect mismatch/error
### 3. Snapshot at `cpLSN` works
- snapshot at `LSN 10`
- later overwrite block 7 at `LSN 12`
- rebuild from snapshot `LSN 10`
- expect block 7 = 10
### 4. Reservation expiration invalidates recovery
- recovery window initially valid
- time advances
- reservation expires
- recovery must abort rather than return partial or wrong state
## Relationship To Existing Prototype
This simulator should reuse existing prototype concepts where possible:
- `fsmv2` for node-local recovery lifecycle
- `volumefsm` ideas for mode semantics and admission
- `RecoveryPlanner` for recoverability decisions
The simulator is the next proof layer:
- not just whether transitions are legal
- but whether data remains correct under those transitions
## Bottom Line
WAL V2 correctness is not only a state problem.
It is also a data-version problem.
The distributed simulator should therefore prove two things together:
1. state-machine safety
2. data correctness at target `LSN`
That is the right next prototype layer if the goal is to prove:
- quorum commit safety
- no committed data loss
- no incorrect recovery from later extent state

159
sw-block/design/v2-first-slice-sender-ownership.md

@ -0,0 +1,159 @@
# V2 First Slice: Per-Replica Sender/Session Ownership
Date: 2026-03-27
Status: implementation-ready
Depends-on: Q1 (recovery session), Q6 (orchestrator scope), Q7 (first slice)
## Problem
`SetReplicaAddrs()` replaces the entire `ShipperGroup` atomically. This causes:
1. **State loss on topology change.** All shippers are destroyed and recreated.
Recovery state (`replicaFlushedLSN`, `lastContactTime`, catch-up progress) is lost.
After a changed-address restart, the new shipper starts from scratch.
2. **No per-replica identity.** Shippers are identified by array index. The master
cannot target a specific replica for rebuild/catch-up — it must re-issue the
entire address set.
3. **Background reconnect races.** A reconnect cycle may be in progress when
`SetReplicaAddrs` replaces the group. The in-progress reconnect's connection
objects become orphaned.
## Design
### Per-replica sender identity
`ShipperGroup` changes from `[]*WALShipper` to `map[string]*WALShipper`, keyed by
the replica's canonical data address. Each shipper stores its own `ReplicaID`.
```go
type WALShipper struct {
	ReplicaID string // canonical data address — identity across reconnects
	// ... existing fields
}

type ShipperGroup struct {
	mu       sync.RWMutex
	shippers map[string]*WALShipper // keyed by ReplicaID
}
```
### ReconcileReplicas replaces SetReplicaAddrs
Instead of replacing the entire group, `ReconcileReplicas` diffs old vs new:
```
ReconcileReplicas(newAddrs []ReplicaAddr):
    for each existing shipper:
        if NOT in newAddrs → Stop and remove
    for each newAddr:
        if matching shipper exists → keep (preserve state)
        if no match → create new shipper
```
This preserves `replicaFlushedLSN`, `lastContactTime`, catch-up progress, and
background reconnect goroutines for replicas that stay in the set.
`SetReplicaAddrs` becomes a wrapper:
```go
func (v *BlockVol) SetReplicaAddrs(addrs []ReplicaAddr) {
	if v.shipperGroup == nil {
		v.shipperGroup = NewShipperGroup(nil)
	}
	v.shipperGroup.ReconcileReplicas(addrs, v.makeShipperFactory())
}
```
### Changed-address restart flow
1. Replica restarts on new port. Heartbeat reports new address.
2. Master detects endpoint change (address differs, same volume).
3. Master sends assignment update to primary with new replica address.
4. Primary's `ReconcileReplicas` receives `[oldAddr1, newAddr2]`.
5. Old shipper for the changed replica is stopped (old address gone from set).
6. New shipper created with new address — but this is a fresh shipper.
7. New shipper bootstraps: Disconnected → Connecting → CatchingUp → InSync.
The improvement over V1.5: the **other** replicas in the set are NOT disturbed.
Only the changed replica gets a fresh shipper. Recovery state for stable replicas
is preserved.
### Recovery session
Each WALShipper already contains the recovery state machine:
- `state` (Disconnected → Connecting → CatchingUp → InSync → Degraded → NeedsRebuild)
- `replicaFlushedLSN` (authoritative progress)
- `lastContactTime` (retention budget)
- `catchupFailures` (escalation counter)
- Background reconnect goroutine
No separate `RecoverySession` object is needed. The WALShipper IS the per-replica
recovery session. The state machine already tracks the session lifecycle.
What changes: the session is no longer destroyed on topology change (unless the
replica itself is removed from the set).
### Coordinator vs primary responsibilities
| Responsibility | Owner |
|---------------|-------|
| Endpoint truth (canonical address) | Coordinator (master) |
| Assignment updates (add/remove replicas) | Coordinator |
| Epoch authority | Coordinator |
| Session creation trigger | Coordinator (via assignment) |
| Session execution (reconnect, catch-up, barrier) | Primary (via WALShipper) |
| Timeout enforcement | Primary |
| Ordered receive/apply | Replica |
| Barrier ack | Replica |
| Heartbeat reporting | Replica |
### Migration from current code
| Current | V2 |
|---------|-----|
| `ShipperGroup.shippers []*WALShipper` | `ShipperGroup.shippers map[string]*WALShipper` |
| `SetReplicaAddrs()` creates all new | `ReconcileReplicas()` diffs and preserves |
| `StopAll()` in demote | `StopAll()` unchanged (stops all) |
| `ShipAll(entry)` iterates slice | `ShipAll(entry)` iterates map values |
| `BarrierAll(lsn)` parallel slice | `BarrierAll(lsn)` parallel map values |
| `MinReplicaFlushedLSN()` iterates slice | Same, iterates map values |
| `ShipperStates()` iterates slice | Same, iterates map values |
| No per-shipper identity | `WALShipper.ReplicaID` = canonical data addr |
### Files changed
| File | Change |
|------|--------|
| `wal_shipper.go` | Add `ReplicaID` field, pass in constructor |
| `shipper_group.go` | `map[string]*WALShipper`, `ReconcileReplicas`, update iterators |
| `blockvol.go` | `SetReplicaAddrs` calls `ReconcileReplicas`, shipper factory |
| `promotion.go` | No change (StopAll unchanged) |
| `dist_group_commit.go` | No change (uses ShipperGroup API) |
| `block_heartbeat.go` | No change (uses ShipperStates) |
### Acceptance bar
The following existing tests must continue to pass:
- All CP13-1 through CP13-7 protocol tests (sync_all_protocol_test.go)
- All adversarial tests (sync_all_adversarial_test.go)
- All baseline tests (sync_all_bug_test.go)
- All rebuild tests (rebuild_v1_test.go)
The following CP13-8 tests validate the V2 improvement:
- `TestCP13_SyncAll_ReplicaRestart_Rejoin` — changed-address recovery
- `TestAdversarial_ReconnectUsesHandshakeNotBootstrap` — V2 reconnect protocol
- `TestAdversarial_CatchupMultipleDisconnects` — state preservation across reconnects
New tests to add:
- `TestReconcileReplicas_PreservesExistingShipper` — stable replica keeps state
- `TestReconcileReplicas_RemovesStaleShipper` — removed replica stopped
- `TestReconcileReplicas_AddsNewShipper` — new replica bootstraps
- `TestReconcileReplicas_MixedUpdate` — one kept, one removed, one added
## Non-goals for this slice
- Smart WAL payload classes
- Recovery reservation protocol
- Full coordinator orchestration
- New transport layer

193
sw-block/design/v2-first-slice-session-ownership.md

@ -0,0 +1,193 @@
# V2 First Slice: Per-Replica Sender and Recovery Session Ownership
Date: 2026-03-27
## Purpose
This document defines the first real V2 implementation slice.
The slice is intentionally narrow:
- per-replica sender ownership
- explicit recovery session ownership
- clear coordinator vs primary responsibility
This is the first step toward a standalone V2 block engine under `sw-block/`.
## Why This Slice First
It directly addresses the clearest V1.5 structural limits:
- sender identity loss when replica sets are refreshed
- changed-address restart recovery complexity
- repeated reconnect cycles without stable per-replica ownership
- adversarial Phase 13 boundary tests that V1.5 cannot cleanly satisfy
It also avoids jumping too early into:
- Smart WAL
- new backend storage layout
- full production transport redesign
## Core Decision
Use:
- **one sender owner per replica**
- **at most one active recovery session per replica per epoch**
Healthy replicas may only need their steady sender object.
Degraded / reconnecting replicas gain an explicit recovery session owned by the primary.
## Ownership Split
### Coordinator
Owns:
- replica identity / endpoint truth
- assignment updates
- epoch authority
- session creation / destruction intent
Does not own:
- byte-by-byte catch-up execution
- local sender loop scheduling
### Primary
Owns:
- per-replica sender objects
- per-replica recovery session execution
- reconnect / catch-up progress
- timeout enforcement for active session
- transition from:
- normal sender
- to recovery session
- back to normal sender
### Replica
Owns:
- receive/apply path
- barrier ack
- heartbeat/reporting
Replica remains passive from the recovery-orchestration point of view.
## Data Model
### Sender Owner
Per replica, maintain a stable sender owner with:
- replica logical ID
- current endpoint
- current epoch view
- steady-state health/status
- optional active recovery session reference
### Recovery Session
Per replica, per epoch:
- `ReplicaID`
- `Epoch`
- `EndpointVersion` or equivalent endpoint truth
- `State`
- `connecting`
- `catching_up`
- `in_sync`
- `needs_rebuild`
- `StartLSN`
- `TargetLSN`
- timeout / deadline metadata
## Session Rules
1. only one active session per replica per epoch
2. new assignment for same replica:
- supersedes old session only if epoch/session generation is newer
3. stale session must not continue after:
- epoch bump
- endpoint truth change
- explicit coordinator replacement
## Minimal State Transitions
### Healthy path
1. replica sender exists
2. sender ships normally
3. replica remains `InSync`
### Recovery path
1. sender detects or is told replica is not healthy
2. coordinator provides valid assignment/endpoint truth
3. primary creates recovery session
4. session connects
5. session catches up if recoverable
6. on success:
- session closes
- steady sender resumes normal state
### Rebuild path
1. session determines catch-up is not sufficient
2. session transitions to `needs_rebuild`
3. higher layer rebuild flow takes over
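The branch between the recovery path and the rebuild path can be sketched as a gap classifier. Names here (`classifyGap`, `retainedFrom`) are illustrative, not prototype APIs; the assumption is that the primary retains WAL from `retainedFrom` onward:

```go
package main

import "fmt"

// Outcome is the explicit result of examining a replica's gap.
type Outcome string

const (
	ZeroGap      Outcome = "zero_gap"      // nothing missing: close session immediately
	CatchUp      Outcome = "catch_up"      // gap lies inside the retained window
	NeedsRebuild Outcome = "needs_rebuild" // gap extends past retained WAL: hand off to rebuild
)

// classifyGap decides which path a recovery session takes, given the
// replica's flushed position, the primary's tail, and the oldest retained LSN.
func classifyGap(replicaFlushed, primaryTail, retainedFrom uint64) Outcome {
	switch {
	case replicaFlushed >= primaryTail:
		return ZeroGap
	case replicaFlushed+1 >= retainedFrom:
		return CatchUp
	default:
		return NeedsRebuild
	}
}

func main() {
	fmt.Println(classifyGap(120, 120, 80)) // zero_gap
	fmt.Println(classifyGap(100, 120, 80)) // catch_up
	fmt.Println(classifyGap(50, 120, 80))  // needs_rebuild
}
```

The point is that `needs_rebuild` is a first-class outcome, not a failure mode discovered after catch-up stalls.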
## What This Slice Does Not Include
Not in the first slice:
- Smart WAL payload classes in production
- snapshot pinning / GC logic
- new on-disk engine
- frontend publication changes
- full production event scheduler
## Proposed V2 Workspace Target
Do this under `sw-block/`, not `weed/storage/blockvol/`.
Suggested area:
- `sw-block/prototype/enginev2/`
Suggested first files:
- `sw-block/prototype/enginev2/session.go`
- `sw-block/prototype/enginev2/sender.go`
- `sw-block/prototype/enginev2/group.go`
- `sw-block/prototype/enginev2/session_test.go`
The first code does not need full storage I/O.
It should prove ownership and transition shape first.
## Acceptance For This Slice
The slice is good enough when:
1. sender identity is stable per replica
2. changed-address reassignment updates the right sender owner
3. multiple reconnect cycles do not lose recovery ownership
4. stale session does not survive epoch bump
5. the 4 Phase 13 V2-boundary tests have a clear path to become satisfiable
## Relationship To Existing Simulator
This slice should align with:
- `v2-acceptance-criteria.md`
- `v2-open-questions.md`
- `v1-v15-v2-comparison.md`
- `distsim` / `eventsim` behavior
The simulator remains the design oracle.
The first implementation slice should not contradict it.

161
sw-block/design/v2-open-questions.md

@@ -0,0 +1,161 @@
# V2 Open Questions
Date: 2026-03-27
## Purpose
This document records what is still algorithmically open in V2.
These are not bugs.
They are design questions that should be closed deliberately before or during implementation slicing.
## 1. Recovery Session Ownership
Open question:
- what is the exact ownership model for one active recovery session per replica?
Need to decide:
- session identity fields
- supersede vs reject vs join behavior
- how epoch/session invalidates old recovery work
Why it matters:
- V1.5 needed local reconnect serialization
- V2 should make this a protocol rule
## 2. Promotion Threshold Strictness
Open question:
- must a promotion candidate always have `FlushedLSN >= CommittedLSN`, or is there any narrower safe exception?
Current prototype:
- uses committed-prefix sufficiency as the safety gate
Why it matters:
- determines how strict real failover behavior should be
## 3. Recovery Reservation Shape
Open question:
- what exactly is reserved during catch-up?
Need to decide:
- WAL range only?
- payload pins?
- snapshot pin?
- expiry semantics?
Why it matters:
- recoverability must be explicit, not hopeful
## 4. Smart WAL Payload Classes
Open question:
- which payload classes are allowed in V2 first?
Current model has:
- `WALInline`
- `ExtentReferenced`
Need to decide:
- whether first real implementation includes both
- whether `ExtentReferenced` requires pinned snapshot/versioned extent only
## 5. Smart WAL Garbage Collection Boundary
Open question:
- when can a referenced payload stop being recoverable?
Need to decide:
- GC interaction
- timeout interaction
- recovery session pinning
Why it matters:
- this is the line between catch-up and rebuild
## 6. Exact Orchestrator Scope
Open question:
- how much of the final V2 control logic belongs in:
- local node state
- coordinator
- transport/session manager
Why it matters:
- avoid V1-style scattered state ownership
## 7. First Real Implementation Slice
Open question:
- what is the first production slice of V2?
Candidates:
1. per-replica sender/session ownership
2. explicit recovery-session management
3. catch-up/rebuild decision plumbing
Recommended default:
- per-replica sender/session ownership
## 8. Steady-State Overhead Budget
Open question:
- what overhead is acceptable in the normal healthy case?
Need to decide:
- metadata checks on hot path
- extra state bookkeeping
- what stays off the hot path
Why it matters:
- V2 should be structurally better without becoming needlessly heavy
## 9. Smart WAL First-Phase Goal
Open question:
- is the first Smart WAL goal:
- lower recovery cost
- lower steady-state WAL volume
- or simply proving the historical correctness model?
Recommended answer:
- first prove correctness model, then optimize
## 10. End Condition For Simulator Work
Open question:
- when do we stop adding simulator depth and start implementation?
Suggested answer:
- once acceptance criteria are satisfied
- and the first implementation slice is clear
- and remaining simulator additions are no longer changing core protocol decisions

239
sw-block/design/v2-prototype-roadmap-and-gates.md

@@ -0,0 +1,239 @@
# V2 Prototype Roadmap And Gates
Date: 2026-03-27
Status: active
Purpose: define the remaining prototype roadmap, the validation gates between stages, and the decision point between real V2 engine work and possible V2.5 redesign
## Current Position
V2 design/FSM/simulator work is sufficiently closed for serious prototyping, but not frozen against later `V2.5` adjustments.
Current state:
- design proof: high
- execution proof: medium
- data/recovery proof: low
- prototype end-to-end proof: low
Rough prototype progress:
- `25%` to `35%`
This is an early executable prototype, not an engine-ready one.
## Roadmap Goal
Answer this question with prototype evidence:
- can V2 become a real engine path?
- or should it become `V2.5` before real implementation begins?
## Step 1: Execution Authority Closure
Purpose:
- finish the sender / recovery-session authority model so stale work is unambiguously rejected
Scope:
1. ownership-only `AttachSession()` / `SupersedeSession()`
2. execution begins only through execution APIs
3. stale handshake / progress / completion fenced by `sessionID`
4. endpoint bump / epoch bump invalidate execution authority
5. sender-group preserve-or-kill behavior is explicit
Done when:
1. all execution APIs are sender-gated and reject stale `sessionID`
2. session creation is separated from execution start
3. phase ordering is enforced
4. endpoint bump / epoch bump invalidate execution authority correctly
5. mixed add/remove/update reconciliation preserves or kills state exactly as intended
Main files:
- `sw-block/prototype/enginev2/`
- `sw-block/prototype/distsim/`
- `learn/projects/sw-block/phases/phase-13-v2-boundary-tests.md`
Key gate:
- old recovery work cannot mutate current sender state at any execution stage
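The sessionID fencing in items 1-4 might look like this in miniature. `RecordHandshake` and `SupersedeSession` are API names from this roadmap; the struct body and field names are simplified stand-ins, not the enginev2 implementation:

```go
package main

import (
	"errors"
	"fmt"
)

var errStaleSession = errors.New("stale session id")

// Sender sketch: every execution API checks the caller's sessionID against
// the currently attached session before mutating any state.
type Sender struct {
	activeSessionID uint64
	handshakeDone   bool
}

// RecordHandshake is sender-gated: a stale sessionID is rejected outright,
// so old recovery work cannot mutate current sender state.
func (s *Sender) RecordHandshake(sessionID uint64) error {
	if sessionID != s.activeSessionID {
		return errStaleSession
	}
	s.handshakeDone = true
	return nil
}

// SupersedeSession rotates the id; an endpoint bump or epoch bump would call
// this, invalidating the previous session's execution authority.
func (s *Sender) SupersedeSession(newID uint64) {
	s.activeSessionID = newID
	s.handshakeDone = false
}

func main() {
	s := &Sender{activeSessionID: 1}
	fmt.Println(s.RecordHandshake(1)) // <nil>
	s.SupersedeSession(2)
	fmt.Println(s.RecordHandshake(1)) // stale session id
}
```

Because the gate is on every execution API, the ordering of supersede versus in-flight completion stops mattering: the stale side always loses.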
## Step 2: Orchestrated Recovery Prototype
Purpose:
- move from good local sender APIs to an actual prototype recovery flow driven by assignment/update intent
Scope:
1. assignment/update intent creates or supersedes recovery attempts
2. reconnect / reassignment / catch-up / rebuild decision path
3. sender-group becomes orchestration entry point
4. explicit outcome branching:
- zero-gap fast completion
- positive-gap catch-up
- unrecoverable gap -> `NeedsRebuild`
Done when:
1. the prototype expresses a realistic recovery flow from topology/control intent
2. sender-group drives recovery creation, not only unit helpers
3. recovery outcomes are explicit and testable
4. orchestrator responsibility is clear enough to narrow `v2-open-questions.md` item 6
Key gate:
- recovery control is no longer scattered across helper calls; it has one clear orchestration path
## Step 3: Minimal Historical Data Prototype
Purpose:
- prove the recovery model against real data-history assumptions, not only control logic
Scope:
1. minimal WAL/history model, not full engine
2. enough to exercise:
- catch-up range
- retained prefix/window
- rebuild fallback
- historical correctness at target LSN
3. enough reservation/recoverability state to make recovery explicit
Done when:
1. the prototype can prove why a gap is recoverable or unrecoverable
2. catch-up and rebuild decisions are backed by minimal data/history state
3. `v2-open-questions.md` items 3, 4, 5 are closed or sharply narrowed
4. prototype evidence strengthens acceptance criteria `A5`, `A6`, and `A7`
Key gate:
- the prototype must explain why recovery is allowed, not just that policy says it is
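One way to make the "explain why" gate concrete is a decision function that returns its reason alongside the verdict. A sketch under the assumption of a single retained-window lower bound (`retainedFrom` is a hypothetical name):

```go
package main

import "fmt"

// explainRecoverable decides whether a gap is recoverable and says why,
// derived from retained history rather than a bare policy flag.
// retainedFrom is the oldest LSN still replayable from retained WAL.
func explainRecoverable(gapStart, retainedFrom uint64) (bool, string) {
	if gapStart >= retainedFrom {
		return true, fmt.Sprintf(
			"gap start %d lies inside retained window [%d, ...]", gapStart, retainedFrom)
	}
	return false, fmt.Sprintf(
		"gap start %d precedes retained window start %d; rebuild required", gapStart, retainedFrom)
}

func main() {
	ok, why := explainRecoverable(90, 80)
	fmt.Println(ok, "-", why)
	ok, why = explainRecoverable(40, 80)
	fmt.Println(ok, "-", why)
}
```

Carrying the reason makes the catch-up versus rebuild boundary auditable in tests and logs, which is exactly what this gate asks for.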
## Step 4: Prototype Scenario Closure
Purpose:
- make the prototype itself demonstrate the V2 story end-to-end
Scope:
1. map key V2 scenarios onto the prototype
2. express the 4 V2-boundary cases against prototype behavior
3. add one small end-to-end harness inside `sw-block/prototype/`
4. align prototype evidence with acceptance criteria
Done when:
1. prototype behavior can be reviewed scenario-by-scenario
2. key V1/V1.5 failures have prototype equivalents
3. prototype outcomes match intended V2 design claims
4. remaining gaps are clearly real-engine gaps, not protocol/prototype ambiguity
Key gate:
- a reviewer can trace acceptance criteria -> scenario -> prototype behavior without hand-waving
## Gates
### Gate 1: Design Closed Enough
Status:
- mostly passed
Meaning:
1. acceptance criteria exist
2. core simulator exists
3. ownership gap from V1.5 is understood
### Gate 2: Execution Authority Closed
Passes after Step 1.
Meaning:
- stale execution results cannot mutate current authority
### Gate 3: Orchestrated Recovery Closed
Passes after Step 2.
Meaning:
- recovery flow is controlled by one coherent orchestration model
### Gate 4: Historical Data Model Closed
Passes after Step 3.
Meaning:
- catch-up vs rebuild is backed by executable data-history logic
### Gate 5: Prototype Convincing
Passes after Step 4.
Meaning:
- enough evidence exists to choose:
- real V2 engine path
- or `V2.5` redesign
## Decision Gate After Step 4
### Path A: Real V2 Engine Planning
Choose this if:
1. prototype control logic is coherent
2. recovery boundary is explicit
3. boundary cases are convincing
4. no major structural flaw remains
Outputs:
1. real engine slicing plan
2. migration/integration plan into future standalone `sw-block`
3. explicit non-goals for first production version
### Path B: V2.5 Redesign
Choose this if the prototype reveals:
1. ownership/orchestration still too fragile
2. recovery boundary still too implicit
3. historical correctness model too costly or too unclear
4. too much complexity leaks into the hot path
Output:
- write `V2.5` as a design/prototype correction before engine work
## What Not To Do Yet
1. no Smart WAL expansion beyond what Step 3 minimally needs
2. no backend/storage-engine redesign
3. no V1 production integration
4. no frontend/wire protocol work
5. no performance optimization as a primary goal
## Practical Summary
Current sequence:
1. finish execution authority
2. build orchestrated recovery
3. add minimal historical-data proof
4. close key scenarios against the prototype
5. decide:
- V2 engine
- or `V2.5`

249
sw-block/design/v2-scenario-sources-from-v1.md

@@ -0,0 +1,249 @@
# V2 Scenario Sources From V1 and V1.5
Date: 2026-03-27
## Purpose
This document distills V1 / V1.5 real-test material into V2 scenario inputs.
Sources:
- `learn/projects/sw-block/phases/phase13_test.md`
- `learn/projects/sw-block/phases/phase-13-v2-boundary-tests.md`
This is not the active scenario backlog.
Use:
- `v2_scenarios.md` for the active V2 scenario set
- this file for historical source and rationale
## How To Use This File
For each item below:
1. keep the real V1/V1.5 test as implementation evidence
2. create or maintain a V2 simulator scenario for the protocol core
3. define the expected V2 behavior explicitly
## Source Buckets
### 1. Core protocol behavior
These are the highest-value simulator inputs.
- barrier durability truth
- reconnect + catch-up
- non-convergent catch-up -> rebuild
- rebuild fallback
- failover / promotion safety
- WAL retention / tail-chasing
- durability mode semantics
Recommended V2 treatment:
- `sim_core`
### 2. Supporting invariants
These matter, but usually as reduced simulator checks.
- canonical address handling
- replica role/epoch gating
- committed-prefix rules
- rebuild publication cleanup
- assignment refresh behavior
Recommended V2 treatment:
- `sim_reduced`
### 3. Real-only implementation behavior
These should usually stay in real-engine tests.
- actual wire encoding / decode bugs
- real disk / `fdatasync` timing
- NVMe / iSCSI frontend behavior
- Go concurrency artifacts tied to concrete implementation
Recommended V2 treatment:
- `real_only`
### 4. V2 boundary items
These are especially important.
They should remain visible as:
- current V1/V1.5 limitation
- explicit V2 acceptance target
Recommended V2 treatment:
- `v2_boundary`
## Distilled Scenario Inputs
### A. Barrier truth uses durable replica progress
Real source:
- Phase 13 barrier / `replicaFlushedLSN` tests
Why it matters:
- commit must follow durable replica progress, not send progress
V2 target:
- barrier completion counted only from explicit durable progress state
### B. Same-address transient outage
Real source:
- Phase 13 reconnect / catch-up tests
- `CP13-8` short outage recovery
Why it matters:
- proves cheap short-gap recovery path
V2 target:
- explicit recoverability check
- catch-up if recoverable
- rebuild otherwise
### C. Changed-address restart
Real source:
- `CP13-8 T4b`
- changed-address refresh fixes
Why it matters:
- endpoint is not identity
- stale endpoint must not remain authoritative
V2 target:
- heartbeat/control-plane learns new endpoint
- reassignment updates sender target
- recovery session starts only after endpoint truth is updated
### D. Non-convergent catch-up / tail-chasing
Real source:
- Phase 13 retention + catch-up + rebuild fallback line
Why it matters:
- “catch-up exists” is not enough
- must know when to stop and rebuild
V2 target:
- explicit `CatchingUp -> NeedsRebuild`
- no fake success
### E. Slow control-plane recovery
Real source:
- `CP13-8 T4b` hardware behavior before fix
Why it matters:
- safety can be correct while availability recovery is poor
V2 target:
- explicit fast recovery path when possible
- explicit fallback when only control-plane repair can help
### F. Stale message / delayed ack fencing
Real source:
- Phase 13 epoch/fencing tests
- V2 scenario work already mirrors this
Why it matters:
- old lineage must not mutate committed prefix
V2 target:
- stale message rejection is explicit and testable
### G. Promotion candidate safety
Real source:
- failover / promotion gating tests
- V2 candidate-selection work
Why it matters:
- wrong promotion loses committed lineage
V2 target:
- candidate must satisfy:
- running
- epoch aligned
- state eligible
- committed-prefix sufficient
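The four conditions can be folded into one eligibility predicate. A sketch with illustrative field and state names, not the distsim types:

```go
package main

import "fmt"

// Candidate carries the promotion-relevant view of one replica.
type Candidate struct {
	Running    bool
	Epoch      uint64
	State      string // e.g. "in_sync", "promotion_hold", "rebuilding"
	FlushedLSN uint64
}

// eligible applies the four conditions above: running, epoch aligned,
// state eligible, and committed-prefix sufficient.
func eligible(c Candidate, currentEpoch, committedLSN uint64) bool {
	return c.Running &&
		c.Epoch == currentEpoch &&
		(c.State == "in_sync" || c.State == "promotion_hold") &&
		c.FlushedLSN >= committedLSN // committed-prefix sufficiency
}

func main() {
	good := Candidate{Running: true, Epoch: 5, State: "in_sync", FlushedLSN: 200}
	behind := Candidate{Running: true, Epoch: 5, State: "in_sync", FlushedLSN: 150}
	fmt.Println(eligible(good, 5, 180))   // true
	fmt.Println(eligible(behind, 5, 180)) // false: would lose committed lineage
}
```

The last clause is what scenario G protects: a candidate missing part of the committed prefix is never promotable, however high its apparent progress elsewhere.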
### H. Rebuild boundary after failed catch-up
Real source:
- Phase 13 rebuild fallback behavior
Why it matters:
- rebuild is required when retained WAL cannot safely close the gap
V2 target:
- rebuild is explicit fallback, not ad hoc recovery
## Immediate Feed Into `v2_scenarios.md`
These are the most important V1/V1.5-derived V2 scenarios:
1. same-address transient outage
2. changed-address restart
3. non-convergent catch-up / tail-chasing
4. stale delayed message / barrier ack rejection
5. committed-prefix-safe promotion
6. control-plane-latency recovery shape
## What Should Not Be Copied Blindly
Do not clone every real-engine test into the simulator.
Do not use the simulator for:
- exact OS timing
- exact socket/wire bugs
- exact block frontend behavior
- implementation-specific lock races
Instead:
- extract the protocol invariant
- model the reduced scenario if the protocol value is high
## Bottom Line
V1 / V1.5 tests should feed V2 in two ways:
1. as historical evidence of what failed or mattered in real life
2. as scenario seeds for the V2 simulator and acceptance backlog

638
sw-block/design/v2_scenarios.md

@@ -0,0 +1,638 @@
# WAL V2 Scenarios
Date: 2026-03-26
Status: working scenario backlog
Purpose: define the scenario set that proves why WAL V2 exists, what it must do better than WAL V1, and what it should handle better than rebuild-heavy systems
Execution note:
- active implementation planning for these scenarios lives under `../.private/phase/`
- `design/` is the design/source-of-truth view
- `.private/phase/` is the execution/checklist view for `sw`
## Why This File Exists
V2 should not grow by adding random simulations.
Each new scenario should prove one of these claims:
1. committed data is never lost
2. uncommitted data is never falsely revived
3. epoch and promotion lineage are safe
4. short-gap recovery is cheaper and cleaner than rebuild
5. catch-up vs rebuild boundary is explicit and correct
6. historical data correctness is preserved
## Scenario Sources
The backlog draws scenarios from three sources:
1. **V1 / V1.5 real failures**
- real bugs and real-hardware gaps observed during Phase 12 / Phase 13
- these are the highest-value scenarios because they came from actual system behavior
2. **V2 design obligations**
- scenarios required by the intended V2 protocol shape
- examples:
- reservations
- lineage-first promotion
- explicit catch-up vs rebuild boundary
3. **Distributed-systems adversarial cases**
- scenarios not yet seen in production, but known to be dangerous
- examples:
- zombie primary
- partitions
- message reordering
- multi-promotion lineage chains
This file is the shared backlog for anyone extending:
- `sw-block/prototype/fsmv2/`
- `sw-block/prototype/volumefsm/`
- `sw-block/prototype/distsim/`
For active development sequencing, see:
- `sw-block/.private/phase/phase-01.md`
- `sw-block/.private/phase/phase-02.md`
- `sw-block/design/v2-scenario-sources-from-v1.md`
Current simulator note:
- current `distsim` coverage already includes:
- changed-address restart comparison across `V1` / `V1.5` / `V2`
- same-address transient outage comparison
- slow control-plane recovery comparison
- stale-endpoint rejection
- committed-prefix-aware promotion eligibility
## V2 Goals
Compared with WAL V1, V2 should improve:
1. state clarity
2. recovery boundary clarity
3. fencing and promotion correctness
4. testability of distributed behavior
5. proof of data correctness at a target `LSN`
Compared with rebuild-heavy systems, V2 should improve:
1. short-gap recovery cost
2. explicit progress semantics
3. catch-up vs rebuild decision quality
## Scenario Format
Each scenario should eventually define:
1. setup
2. event sequence
3. expected commit/ack behavior
4. expected promotion/fencing behavior
5. expected final data state at target `LSN`
Where possible, use synthetic 4K writes with:
- `value = LSN`
That makes correctness assertions trivial.
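With the value = LSN convention, the reference model reduces to a map from block to last writer LSN, and the correctness assertion becomes a map comparison. A sketch with illustrative types:

```go
package main

import "fmt"

// refModel maps block id -> last writer LSN. Under the value = LSN
// convention this is both the reference state and the expected block content.
type refModel map[int]uint64

// apply records a synthetic write of value = lsn to a block.
func (m refModel) apply(block int, lsn uint64) { m[block] = lsn }

// matches reports whether a replica's state equals the reference model.
func matches(replica, reference refModel) bool {
	if len(replica) != len(reference) {
		return false
	}
	for b, lsn := range reference {
		if replica[b] != lsn {
			return false
		}
	}
	return true
}

func main() {
	ref := refModel{}
	ref.apply(10, 10) // block 10 written at LSN 10
	ref.apply(12, 12) // block 12 written at LSN 12
	replica := refModel{10: 10, 12: 12}
	fmt.Println(matches(replica, ref)) // true
}
```

Every scenario's final-state check then collapses to one `matches` call against the reference replayed up to the target LSN.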
## Priority 1: Commit Safety
These scenarios prove the most important distributed claim:
- if the system ACKed a write under the configured policy, that write is not lost
### S1. ACK Then Primary Crash
Goal:
- prove a quorum-acknowledged write survives failover
Sequence:
1. primary commits a write
2. replicas durable-ACK enough nodes for policy
3. primary crashes immediately
4. coordinator promotes a valid replica
Expect:
- promoted node contains the committed `LSN`
- final state matches reference model at committed `LSN`
### S2. Non-Quorum Write Then Primary Crash
Goal:
- prove uncommitted data is not revived after failover
Sequence:
1. primary accepts a write locally
2. quorum durability is not reached
3. primary crashes
4. coordinator promotes another node
Expect:
- promoted node does not expose the uncommitted write
- committed `LSN` stays at previous value
### S3. Zombie Old Primary Is Fenced
Goal:
- prove old-epoch traffic cannot corrupt new lineage
Sequence:
1. primary loses lease
2. coordinator bumps epoch and promotes new primary
3. old primary continues trying to send writes / barriers
Expect:
- all old-epoch traffic is rejected
- no stale write becomes committed under the new epoch
## Priority 2: Short-Gap Recovery
These scenarios justify V2 over rebuild-heavy designs.
### S4. Brief Disconnect, WAL Catch-Up Only
Goal:
- prove a short outage recovers via WAL catch-up, not rebuild
Sequence:
1. replica disconnects briefly
2. primary continues writing
3. gap stays inside recoverable window
4. replica reconnects and catches up
Expect:
- `CatchingUp -> PromotionHold -> InSync`
- no rebuild required
- final state matches reference at target `LSN`
### S5. Flapping Replica Stays Recoverable
Goal:
- prove transient disconnects do not force unnecessary rebuild
Sequence:
1. replica disconnects and reconnects repeatedly
2. gaps stay within reserved recoverable windows
Expect:
- replica may move between `Lagging`, `CatchingUp`, and `PromotionHold`
- replica does not enter `NeedsRebuild` unless recoverability is actually lost
### S6. Tail-Chasing Under Load
Goal:
- prove behavior when primary writes faster than catch-up rate
Sequence:
1. replica reconnects behind
2. primary continues writing quickly
3. catch-up target may be reached or may fall behind again
Expect:
- explicit result:
- converge and promote
- or abort to `NeedsRebuild`
- never silently pretend the replica is current
## Priority 3: Catch-Up vs Rebuild Boundary
These scenarios justify the V2 recoverability model.
### S7. Recovery Initially Possible, Then Reservation Expires
Goal:
- prove `check -> reserve -> recover` is enforced
Sequence:
1. primary grants a recoverability reservation
2. catch-up starts
3. reservation expires or is revoked before completion
Expect:
- catch-up aborts
- replica transitions to `NeedsRebuild`
- no partial recovery is treated as success
### S8. Current Extent Cannot Recover Old LSN
Goal:
- prove the historical correctness trap
Sequence:
1. write block `B = 10` at `LSN 10`
2. later write block `B = 12` at `LSN 12`
3. attempt to recover state at `LSN 10` from current extent
Expect:
- mismatch detected
- scenario must fail correctness check
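The trap can be shown in a few lines: the current extent holds only the latest value per block, so reading it while targeting an older LSN returns the wrong value. An illustrative sketch under the value = LSN convention:

```go
package main

import "fmt"

// readAtCurrent returns what the current extent reports for a block. Note it
// takes no history parameter at all — which is exactly the trap: the answer
// is the same regardless of the target LSN the caller wants to recover.
func readAtCurrent(extent map[int]uint64, block int) uint64 {
	return extent[block]
}

func main() {
	const block = 7 // hypothetical block id "B"
	extent := map[int]uint64{}
	extent[block] = 10 // write at LSN 10 (value = LSN)
	extent[block] = 12 // later write at LSN 12 overwrites in place
	targetLSN := uint64(10)
	got := readAtCurrent(extent, block)
	fmt.Println(got == targetLSN) // false: the extent returns 12, not the state at LSN 10
}
```

A correct recovery source must be versioned (retained WAL, pinned snapshot, or versioned extent); the correctness check above is what forces the scenario to fail when it is not.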
### S9. Snapshot + Tail Rebuild Works
Goal:
- prove correct long-gap reconstruction
Sequence:
1. take snapshot at `cpLSN`
2. later writes extend head
3. lagging replica rebuilds from snapshot
4. replay trailing WAL tail
Expect:
- final state matches reference at target `LSN`
## Priority 4: Quorum and Mixed Replica States
These scenarios justify V2 mode clarity.
### S10. Mixed States Under `sync_quorum`
Goal:
- prove `sync_quorum` remains available with mixed replica states
Sequence:
1. one replica `InSync`
2. one replica `CatchingUp`
3. one replica `Rebuilding`
Expect:
- writes may continue if durable quorum exists
- ACK gating follows quorum rules exactly
### S11. Mixed States Under `sync_all`
Goal:
- prove `sync_all` remains strict
Sequence:
1. same mixed-state setup as above
Expect:
- writes/acks block or fail according to `sync_all`
- no silent downgrade to quorum or best effort
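The two modes differ only in the ack threshold. A sketch of the gating rule, assuming durability is counted across the replica set (the threshold shapes are illustrative, not the distsim policy):

```go
package main

import "fmt"

// canAck decides whether a write may be acknowledged, given how many members
// of the replica set are durably caught up. Mixed states like CatchingUp or
// Rebuilding simply do not count toward durable.
func canAck(mode string, durable, total int) bool {
	switch mode {
	case "sync_all":
		return durable == total // strict: every replica must be durable
	case "sync_quorum":
		return durable >= total/2+1 // majority of the replica set
	default:
		return false // unknown mode: never downgrade silently
	}
}

func main() {
	// one InSync (durable), one CatchingUp, one Rebuilding => durable = 1 of 3
	fmt.Println(canAck("sync_quorum", 1, 3)) // false: majority needs 2
	fmt.Println(canAck("sync_quorum", 2, 3)) // true
	fmt.Println(canAck("sync_all", 2, 3))    // false: no silent downgrade
}
```

Keeping the mode explicit in the gate is what S11 demands: `sync_all` blocking is a correct outcome, never a trigger for quietly falling back to quorum.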
### S12. Promotion Chooses Best Valid Lineage
Goal:
- prove promotion is correctness-first, not “highest apparent LSN wins”
Sequence:
1. candidate nodes have different:
- flushed LSN
- rebuild state
- epoch lineage
2. coordinator chooses a new primary
Expect:
- only a valid-lineage node is promotable
- stale or inconsistent node is rejected
## Priority 5: Smart WAL / Recovery Classes
These scenarios justify V2’s future adaptive write path.
### S13. `WALInline` Window Is Recoverable
Goal:
- prove inline WAL payload replay works directly
Sequence:
1. missing range consists of `WALInline` records
2. planner grants reservation
Expect:
- catch-up allowed
- final state correct
### S14. `ExtentReferenced` Payload Still Resolvable
Goal:
- prove direct-extent records can still support catch-up when pinned
Sequence:
1. missing range includes `ExtentReferenced` records
2. payload objects / generations are still resolvable
3. reservation pins those dependencies
Expect:
- catch-up allowed
- final state correct
### S15. `ExtentReferenced` Payload Lost
Goal:
- prove metadata alone is not enough
Sequence:
1. missing range includes `ExtentReferenced` records
2. metadata still exists
3. payload object / version is no longer resolvable
Expect:
- planner returns `NeedsRebuild`
- catch-up is forbidden
## Priority 6: Restart and Rebuild Robustness
These scenarios justify operational resilience.
### S16. Replica Restarts During Catch-Up
Goal:
- prove restart does not corrupt catch-up state
Sequence:
1. replica is catching up
2. replica restarts
3. reconnect and recover again
Expect:
- no false promotion
- resume or restart recovery cleanly
### S17. Replica Restarts During Rebuild
Goal:
- prove rebuild interruption is safe
Sequence:
1. replica is rebuilding from snapshot
2. replica restarts mid-copy
Expect:
- rebuild aborts or restarts safely
- no partial base image is treated as valid
### S18. Primary Restarts Without Failover
Goal:
- prove restart with same lineage is handled explicitly
Sequence:
1. primary stops and restarts
2. coordinator either preserves or changes epoch depending on policy
Expect:
- replicas react consistently
- no stale assumptions about previous sender sessions
### S19. Chain Of Custody Across Multiple Promotions
Goal:
- prove committed data survives more than one failover lineage step
Sequence:
1. primary `A` commits writes
2. fail over to `B`
3. `B` commits additional writes
4. fail over to `C`
Expect:
- `C` contains all writes committed by `A` and `B`
- no committed data disappears across multiple promotions
- final state matches reference model at committed `LSN`
### S20. Network Partition With Concurrent Write Attempts
Goal:
- prove epoch fencing prevents split-brain writes during partition
Sequence:
1. cluster partitions into two live sides
2. old primary side continues trying to write
3. coordinator promotes a new primary on the surviving side
4. both sides attempt to send control/data traffic
Expect:
- only the current-epoch side can advance committed state
- stale-side writes are rejected or ignored
- no conflicting committed lineage appears
## Suggested Implementation Order
Implement in this order:
1. `S1` ACK then primary crash
2. `S2` non-quorum write then primary crash
3. `S3` zombie old primary fenced
4. `S4` brief disconnect with WAL catch-up
5. `S7` reservation expiry aborts catch-up
6. `S10` mixed-state quorum policy
7. `S9` long-lag rebuild from snapshot + tail
8. `S13-S15` Smart WAL recoverability
## Coverage Matrix
Status values:
- `covered`
- `partial`
- `not_started`
- `needs_richer_model`
| Scenario | Package | Test / Artifact | Status | Notes |
|---|---|---|---|---|
| `S1` ACK then primary crash | `distsim` | `TestQuorumCommitSurvivesPrimaryFailover` | `covered` | quorum commit survives failover |
| `S2` non-quorum write then primary crash | `distsim` | `TestUncommittedWriteNotPreservedAfterPrimaryLoss` | `covered` | no false revival |
| `S3` zombie old primary fenced | `distsim` | `TestZombieOldPrimaryWritesAreFenced` | `covered` | stale epoch traffic ignored |
| `S4` brief disconnect, WAL catch-up only | `distsim` | `TestReplicaCatchupFromPrimaryWAL` | `covered` | short-gap recovery |
| `S5` flapping replica stays recoverable | `distsim` | `TestS5_FlappingReplica_NoUnnecessaryRebuild`, `TestS5_FlappingWithStateTracking`, `TestS5_FlappingExceedsBudget_EscalatesToNeedsRebuild` | `covered` | both recoverable flapping and explicit budget-exceeded escalation are now asserted |
| `S6` tail-chasing under load | `distsim` | `TestS6_TailChasing_ConvergesOrAborts`, `TestS6_TailChasing_NonConvergent_Aborts`, `TestS6_TailChasing_NonConvergent_EscalatesToNeedsRebuild`, `TestP02_S6_NonConvergent_ExplicitStateTransition` | `covered` | explicit non-convergent `CatchingUp -> NeedsRebuild` path now asserted |
| `S7` reservation expiry aborts catch-up | `fsmv2`, `volumefsm`, `distsim` | `TestFSMReservationLostNeedsRebuild`, `TestModelReservationLostDuringCatchupAfterRebuild`, `TestReservationExpiryAbortsCatchup` | `covered` | present at 3 layers |
| `S8` current extent cannot recover old LSN | `distsim` | `TestCurrentExtentCannotRecoverOldLSN` | `covered` | historical correctness trap |
| `S9` snapshot + tail rebuild works | `distsim` | `TestReplicaRebuildFromSnapshotAndTail`, `TestSnapshotPlusTrailingReplayReachesTargetLSN` | `covered` | long-gap reconstruction |
| `S10` mixed states under `sync_quorum` | `volumefsm`, `distsim` | `TestModelSyncQuorumWithThreeReplicasMixedStates`, `TestSyncQuorumWithMixedReplicaStates` | `covered` | quorum stays available |
| `S11` mixed states under `sync_all` | `distsim` | `TestSyncAllBlocksWithMixedReplicaStates` | `covered` | strict sync_all behavior |
| `S12` promotion chooses best valid lineage | `distsim` | `TestPromotionUsesValidLineageNode`, `TestS12_PromotionChoosesBestLineage_NotHighestLSN`, `TestS12_PromotionRejectsRebuildingCandidate` | `covered` | lineage-first promotion now exercised beyond simple LSN comparison |
| `S13` `WALInline` window recoverable | `distsim` | `TestWALInlineRecordsAreRecoverable` | `covered` | inline payload recoverability |
| `S14` `ExtentReferenced` payload resolvable | `distsim` | `TestExtentReferencedResolvableRecordsAreRecoverable`, `TestMixedClassRecovery_FullSuccess` | `covered` | recoverable direct-extent and mixed-class recovery case |
| `S15` `ExtentReferenced` payload lost | `distsim` | `TestExtentReferencedUnresolvableForcesRebuild`, `TestRecoverableThenUnrecoverable`, `TestTimeVaryingAvailability` | `covered` | metadata alone not enough; active recovery can transition from recoverable to unrecoverable |
| `S16` replica restarts during catch-up | `distsim` | `TestReplicaRestartDuringCatchupRestartsSafely` | `covered` | safe recovery restart |
| `S17` replica restarts during rebuild | `distsim` | `TestReplicaRestartDuringRebuildRestartsSafely` | `covered` | rebuild interruption safe |
| `S18` primary restarts without failover | `distsim` | `TestS18_PrimaryRestart_SameLineage`, `TestS18_PrimaryRestart_ReplicasRejectOldEpoch`, `TestS18_PrimaryRestart_DelayedOldAck_DoesNotAdvancePrefix`, `TestS18_PrimaryRestart_InFlightBarrierDropped`, `TestP02_S18_DelayedAck_ExplicitRejection` | `covered` | delayed stale ack rejection and committed-prefix stability are now asserted directly |
| `S19` chain of custody across promotions | `distsim` | `TestS19_ChainOfCustody_MultiplePromotions`, `TestS19_ChainOfCustody_ThreePromotions` | `covered` | multi-promotion lineage continuity covered |
| `S20` live partition with competing writes | `distsim` | `TestS20_LivePartition_StaleWritesNotCommitted`, `TestS20_LivePartition_HealRecovers`, `TestS20_StalePartition_ProtocolRejectsStaleWrites`, `TestP02_S20_StaleTraffic_CommittedPrefixUnchanged` | `covered` | stale-side protocol traffic is explicitly rejected and committed prefix remains unchanged |
## Ownership Notes
When adding a scenario:
1. add or extend the relevant prototype test:
- `fsmv2`
- `volumefsm`
- `distsim`
2. update this file with:
- status
- package location
3. keep correctness checks tied to:
- committed `LSN`
- reference model state
## Current Coverage Snapshot
Already covered in some form:
- quorum commit survives primary failover
- uncommitted write not preserved after primary loss
- zombie old primary fenced by epoch
- lagging replica catch-up from primary WAL
- reservation expiry aborts catch-up in distributed sim
- `sync_quorum` continues with one lagging replica
- `sync_all` blocks with one lagging replica
- `sync_quorum` with mixed replica states
- `sync_all` with mixed replica states
- rebuild from snapshot + tail
- promotion uses valid lineage node
- flapping recoverable vs budget-exceeded rebuild path
- tail-chasing explicit escalation to rebuild
- restart during catch-up recovers safely
- restart during rebuild recovers safely
- primary restart delayed stale ack rejection
- `WALInline` recoverability
- `ExtentReferenced` resolvable vs unresolvable boundary
- mixed-class Smart WAL recovery and time-varying payload availability
- delayed stale messages and selective drop behavior
- multi-node reservation expiry and rebuild-timeout behavior
- current extent cannot reconstruct old `LSN`
Still important to add:
- explicit coordinator-driven candidate selection among competing valid/invalid lineages
- control-plane latency scenarios derived from `CP13-8 T4b`
- explicit V1 / V1.5 / V2 comparison scenarios for:
- changed-address restart
- same-address transient outage
- slow reassignment recovery
## V1.5 Lessons To Add Or Strengthen
These come directly from WAL V1.5 / Phase 13 behavior and should be treated as high-priority scenario drivers.
### L1. Replica Restart With New Receiver Port
Observed:
- replica VS restarts
- receiver comes back on a new random port
- primary background reconnect retries old address and fails
Implication:
- direct reconnect only works if replica address is stable
Backlog impact:
- strengthen `S18`
- add a restart/address-change sub-scenario under `S20` or a future network/control-plane recovery scenario
### L2. Slow Control-Plane Reassignment Dominates Recovery
Observed:
- sync correctness preserved
- write availability recovery waits for heartbeat/reassignment cycle
Implication:
- "recoverable in theory" is not enough
- recovery latency is part of protocol quality
Backlog impact:
- `S5` is now covered at current simulator level
- strengthen `S18`
- add long-running restart/rejoin timing scenarios
### L3. Background Reconnect Helps Only Same-Address Recovery
Observed:
- background reconnect is useful for transient network failure
- not sufficient for process restart with address change
Implication:
- scenarios must distinguish:
- transient disconnect
- process restart
- address change
Backlog impact:
- keep `S4` as transient disconnect
- strengthen `S18` with restart/address-stability cases
### L4. Tail-Chasing And Retention Pressure Are Structural Risks
Observed:
- Phase 13 reasoning repeatedly exposed:
- lagging replica may pin WAL
- catch-up may not converge while primary keeps advancing
Implication:
- V2 must explicitly model convergence, abort, and rebuild boundaries
Backlog impact:
- strengthen `S6`
- add multi-node retention / timeout variants
### L5. Current Extent Is Not Historical State
Observed:
- using current extent to reconstruct old `LSN` can return later values
Implication:
- V2 must require version-correct base images or resolvable historical payloads
Backlog impact:
- already covered by `S8`
- should remain a permanent regression scenario
## Randomized Simulation
In addition to fixed scenarios, V2 should keep a randomized simulator suite.
Purpose:
1. discover paths that were not explicitly written as named scenarios
2. stress promotion, restart, and recovery ordering
3. check invariants after each random step
Current prototype:
- `sw-block/prototype/distsim/random.go`
- `sw-block/prototype/distsim/random_test.go`
Current invariants checked:
1. current committed `LSN` remains a committed prefix
2. promotable nodes match reference state at committed `LSN`
3. current primary, if valid/running, matches reference state at committed `LSN`
This does not replace named scenarios.
It complements them.
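The committed-prefix invariant (item 1) can be sketched as a per-step check. This is a simplified model, not the prototype's actual checker: `refLog`/`nodeLog` are illustrative names, and the committed prefix is modeled as the first `committed` entries.

```go
package main

// committedPrefixIntact is a minimal sketch of invariant 1: every entry up
// to the committed point must still match the reference model after each
// random simulation step.
func committedPrefixIntact(refLog, nodeLog []string, committed int) bool {
	if committed > len(refLog) || committed > len(nodeLog) {
		return false // committed entries were lost
	}
	for i := 0; i < committed; i++ {
		if refLog[i] != nodeLog[i] {
			return false // divergence inside the committed prefix
		}
	}
	return true
}
```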
## Scenario Summary
When reviewing or adding scenarios, always record the source:
1. from real V1/V1.5 behavior
2. from explicit V2 design obligation
3. from adversarial distributed-systems reasoning
The best scenarios are the ones that come from real failures first, then are generalized into V2 requirements.
## Development Phases
Execution detail is tracked in:
- `sw-block/.private/phase/phase-01.md`
- `sw-block/.private/phase/phase-02.md`
High-level phase order:
1. close explicit scenario backlog
- `S19`
- `S20`
2. strengthen missing lifecycle scenarios
- `S5`
- `S6`
- `S18`
- stronger `S12`
3. extend protocol-state simulation and version comparison
- `V1`
- `V1.5`
- `V2`
- stronger closure of current `partial` scenarios
4. strengthen random/adversarial simulation
5. add timeout-based scenarios only when the execution path is modeled

sw-block/design/wal-replication-v2-orchestrator.md
# WAL Replication V2 Orchestrator
Date: 2026-03-26
Status: design proposal
Purpose: define the volume-level orchestration model that sits above the per-replica WAL V2 FSM
## Why This Document Exists
`ReplicaFSM` alone is not enough.
It can describe one replica relative to the current primary, but it cannot by itself model:
- primary head continuing to advance
- multiple replicas in different states
- durability mode semantics
- primary lease loss and epoch change
- primary failover and replica promotion
- fencing of old recovery sessions
So WAL V2 needs a second layer:
- per-replica `ReplicaFSM`
- volume-level `Orchestrator`
## Scope
This document defines the volume-level logic only.
It does not define:
- exact network protocol
- exact master RPCs
- exact storage backend internals
It assumes the per-replica state machine from:
- `wal-replication-v2-state-machine.md`
## Core Model
The orchestrator owns:
1. current primary lineage
- `epoch`
- lease/authority state
2. volume durability mode
- `best_effort`
- `sync_all`
- `sync_quorum`
3. moving primary progress
- `headLSN`
- checkpoint/snapshot anchors
4. replica set
- one `ReplicaFSM` per replica
- per-replica role in the current volume topology
5. volume-level admission decision
- can writes proceed?
- can sync requests complete?
- must promotion/failover occur?
## Two FSM Layers
### Layer A: `ReplicaFSM`
Owns per-replica state such as:
- `Bootstrapping`
- `InSync`
- `Lagging`
- `CatchingUp`
- `PromotionHold`
- `NeedsRebuild`
- `Rebuilding`
- `CatchUpAfterRebuild`
- `Failed`
### Layer B: `VolumeOrchestrator`
Owns system-wide state such as:
- current `epoch`
- current primary identity
- durability mode
- set of required replicas
- current `headLSN`
- whether writes or promotions are allowed
The orchestrator does not replace `ReplicaFSM`.
It drives it.
## Volume State
The orchestrator should track at least:
```go
type VolumeMode string

type PrimaryState string

const (
	PrimaryServing  PrimaryState = "Serving"
	PrimaryDraining PrimaryState = "Draining"
	PrimaryLost    PrimaryState = "Lost"
)

type VolumeModel struct {
	Epoch              uint64
	PrimaryID          string
	PrimaryState       PrimaryState
	Mode               VolumeMode
	HeadLSN            uint64
	CheckpointLSN      uint64
	RequiredReplicaIDs []string
	Replicas           map[string]*ReplicaFSM
}
```
This is a model shape, not a required production struct.
## Orchestrator Responsibilities
### 1. Advance primary head
When primary commits a new write:
- increment `headLSN`
- enqueue/send to replica sender loops
- evaluate whether the current mode still allows ACK
### 2. Evaluate sync eligibility
The orchestrator computes volume-level durability from replica states.
Derived rule:
- only replicas for which `ReplicaFSM.IsSyncEligible()` is true count
### 3. Drive recovery entry
When a replica disconnects or falls behind:
- feed disconnect/lag events into that replica FSM
- decide whether to try catch-up or rebuild
- acquire recovery reservation if required
### 4. Handle primary authority changes
When lease is lost or a new primary is chosen:
- increment epoch
- abort stale recovery sessions
- reevaluate all replica relationships from the new primary's perspective
### 5. Drive promotion / failover
When current primary is lost:
- choose promotion candidate
- assign new epoch
- move old primary to stale/lost
- convert the promoted replica into the new serving primary
- reclassify remaining replicas relative to the new primary
## Required Volume-Level Events
The orchestrator should be able to simulate at least these events.
### Write/progress events
- `WriteCommitted(lsn)`
- `CheckpointAdvanced(lsn)`
- `BarrierCompleted(replicaID, flushedLSN)`
### Replica health events
- `ReplicaDisconnected(replicaID)`
- `ReplicaReconnect(replicaID, flushedLSN)`
- `ReplicaReservationLost(replicaID)`
- `ReplicaCatchupTimeout(replicaID)`
- `ReplicaRebuildTooSlow(replicaID)`
### Topology/control events
- `PrimaryLeaseLost()`
- `EpochChanged(newEpoch)`
- `PromoteReplica(replicaID)`
- `ReplicaAssigned(replicaID)`
- `ReplicaRemoved(replicaID)`
## Mode Semantics
### `best_effort`
Rules:
- ACK after primary local durability
- replicas may be `Lagging`, `CatchingUp`, `NeedsRebuild`, or `Rebuilding`
- background recovery continues
Volume implication:
- primary can keep serving while replicas recover
### `sync_all`
Rules:
- ACK only when all required replicas are `InSync` and durable through target LSN
- bounded retry only
- no silent downgrade
Volume implication:
- one lagging required replica can block sync completion
- orchestrator may fail requests, but must not silently reinterpret policy
### `sync_quorum`
Rules:
- ACK when quorum of required nodes are durable through target LSN
- lagging replicas may recover in background as long as quorum remains
Volume implication:
- orchestrator must count eligible replicas, not just healthy sockets
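The three mode rules above reduce to one admission function over per-replica eligibility. A sketch, with illustrative names (`eligible` stands in for the `ReplicaFSM.IsSyncEligible()` results, not a real orchestrator API):

```go
package main

type Mode string

const (
	BestEffort Mode = "best_effort"
	SyncAll    Mode = "sync_all"
	SyncQuorum Mode = "sync_quorum"
)

// canAckSync models the volume-level ACK decision. eligible maps
// replicaID -> sync eligibility; quorum is only consulted for sync_quorum.
func canAckSync(mode Mode, required []string, eligible map[string]bool, quorum int) bool {
	switch mode {
	case BestEffort:
		return true // ACK after primary local durability only
	case SyncAll:
		for _, id := range required {
			if !eligible[id] {
				return false // one lagging required replica blocks sync
			}
		}
		return true
	case SyncQuorum:
		n := 0
		for _, id := range required {
			if eligible[id] {
				n++ // count eligible replicas, not healthy sockets
			}
		}
		return n >= quorum
	}
	return false
}
```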
## Primary-Head Simulation Rules
The orchestrator must explicitly model that the primary keeps moving.
### Rule 1: head moves independently of replica recovery
A replica entering `CatchingUp` does not freeze `headLSN`.
### Rule 2: each recovery attempt uses explicit targets
For a replica in recovery, orchestrator chooses:
- `catchupTargetLSN = H0`
- or `snapshotCpLSN = C` and replay target `H0`
### Rule 3: promotion is explicit
A replica is not restored to `InSync` just because it reaches `H0`.
It must still pass:
- barrier confirmation
- `PromotionHold`
## Failover / Promotion Model
The orchestrator must be able to simulate:
1. old primary loses lease
2. old primary is fenced by epoch change
3. one replica is promoted
4. promoted replica becomes new primary under a higher epoch
5. all old recovery sessions from the old primary are invalidated
6. remaining replicas are reevaluated relative to the new primary's head and retained history
Important consequence:
- failover is not a `ReplicaFSM` transition only
- it is a volume-level re-rooting of all replica relationships
## Suggested Promotion Rules
Promotion candidate should prefer:
1. highest valid durable progress
2. current epoch-consistent history
3. healthiest replica among tied candidates
After promotion:
- `PrimaryID` changes
- `Epoch` increments
- all replica reservations from the previous primary are void
- all non-primary replicas must renegotiate recovery against the new primary
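A sketch of this preference order. The `Candidate` fields are illustrative; `EpochConsistent` stands in for "current epoch-consistent history":

```go
package main

// Candidate is an illustrative shape for a promotion candidate.
type Candidate struct {
	ID              string
	FlushedLSN      uint64
	EpochConsistent bool
	Healthy         bool
}

// pickPromotionCandidate prefers the highest valid durable progress, then
// health among tied candidates; epoch-inconsistent lineages are never
// promotable.
func pickPromotionCandidate(cands []Candidate) (string, bool) {
	best := -1
	for i, c := range cands {
		if !c.EpochConsistent {
			continue // invalid lineage is never promotable
		}
		if best == -1 {
			best = i
			continue
		}
		if c.FlushedLSN > cands[best].FlushedLSN ||
			(c.FlushedLSN == cands[best].FlushedLSN && c.Healthy && !cands[best].Healthy) {
			best = i
		}
	}
	if best == -1 {
		return "", false
	}
	return cands[best].ID, true
}
```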
## Multi-Replica Examples
### Example 1: `sync_all`
- replica A = `InSync`
- replica B = `Lagging`
- replica C = `InSync`
If A, B, and C are the required replicas in RF=3 `sync_all`:
- writes needing sync durability fail or wait
- even though two replicas are still healthy
### Example 2: `sync_quorum`
- replica A = `InSync`
- replica B = `CatchingUp`
- replica C = `InSync`
If quorum is 2:
- volume can continue serving sync requests
- B recovers in background
### Example 3: failover
- old primary lost
- replica A promoted
- replica B was previously `CatchingUp` under old epoch
After promotion:
- B's old session is aborted
- B re-enters evaluation against A's history
## What The Tiny Prototype Should Simulate
The V2 prototype should be able to drive at least these scenarios:
1. steady state keep-up
- primary head advances
- all required replicas remain `InSync`
2. short outage
- one replica disconnects
- primary keeps writing
- reconnect succeeds within recoverable window
- replica returns via `PromotionHold`
3. long outage
- one replica disconnects too long
- recoverability expires
- replica goes `NeedsRebuild`
- rebuild and trailing replay complete
4. tail chasing
- replica catch-up speed is below primary ingest speed
- orchestrator chooses fail, throttle, or rebuild path depending on mode
5. failover
- primary lease lost
- new epoch assigned
- replica promoted
- old recovery sessions fenced
6. mixed-state quorum
- different replicas in different states
- orchestrator computes correct `sync_all` / `sync_quorum` result
## Relationship To WAL V1
WAL V1 already contains pieces of this logic, but they are scattered across:
- shipper state
- barrier code
- retention code
- assignment/promotion code
- rebuild code
- heartbeat/master logic
V2 should separate these into:
- per-replica recovery FSM
- volume-level orchestrator
## Bottom Line
The next step after `ReplicaFSM` is not `Smart WAL`.
The next step is the volume-level orchestrator model.
Why:
- primary keeps moving
- durability mode is volume-scoped
- failover/promotion is volume-scoped
- replica recovery must be evaluated in the context of the whole volume
So V2 needs:
- `ReplicaFSM` for one replica
- `VolumeOrchestrator` for the moving multi-replica system

sw-block/design/wal-replication-v2-state-machine.md
# WAL Replication V2 State Machine
Date: 2026-03-26
Status: design proposal
Purpose: define the V2 replication state machine for a moving-head primary where replicas may transition between keep-up, catch-up, and reconstruction while the primary continues accepting writes
## Why This Document Exists
The hard part of V2 is not the existence of three modes:
- keep-up
- catch-up
- reconstruction
The hard part is that the primary head continues advancing while replicas move between those modes.
So V2 must be specified as a real state machine:
- state definitions
- state-owned LSN anchors
- allowed transitions
- retention obligations
- abort rules
This document treats edge cases as state-transition cases.
## Scope
This is a protocol/state-machine design.
It does not yet define:
- exact RPC payloads
- exact snapshot storage format
- exact implementation package boundaries
Those can follow after the state model is stable.
## Core Terms
### `headLSN`
The primary's current highest WAL LSN.
### `replicaFlushedLSN`
The highest LSN durably persisted on the replica.
### `cpLSN`
A checkpoint/snapshot base point. A snapshot at `cpLSN` represents the block state exactly at that LSN.
### `promotionBarrierLSN`
The LSN a replica must durably reach before it can re-enter `InSync`.
### `Recovery Feasibility`
Whether `(startLSN, endLSN]` can be reconstructed completely, in order, under the current epoch.
This is not a static fact. It changes over time as WAL is reclaimed, payload generations are garbage-collected, or snapshots are released.
### `Recovery Reservation`
A bounded primary-side reservation proving a recovery window is recoverable and pinning all dependencies needed to finish the current catch-up or rebuild-tail replay.
A transition into recovery is valid only after the reservation is granted.
## State Set
Replica may be in one of these states:
1. `Bootstrapping`
2. `InSync`
3. `Lagging`
4. `CatchingUp`
5. `PromotionHold`
6. `NeedsRebuild`
7. `Rebuilding`
8. `CatchUpAfterRebuild`
9. `Failed`
Only `InSync` replicas count for sync durability.
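The state set can be sketched as a Go type; the eligibility method mirrors the `ReplicaFSM.IsSyncEligible()` rule used by the orchestrator doc (the identifiers below are the state names from this document, the type shape itself is illustrative):

```go
package main

// ReplicaState enumerates the V2 replica states listed above.
type ReplicaState string

const (
	Bootstrapping       ReplicaState = "Bootstrapping"
	InSync              ReplicaState = "InSync"
	Lagging             ReplicaState = "Lagging"
	CatchingUp          ReplicaState = "CatchingUp"
	PromotionHold       ReplicaState = "PromotionHold"
	NeedsRebuild        ReplicaState = "NeedsRebuild"
	Rebuilding          ReplicaState = "Rebuilding"
	CatchUpAfterRebuild ReplicaState = "CatchUpAfterRebuild"
	Failed              ReplicaState = "Failed"
)

// IsSyncEligible encodes the core rule: only InSync replicas count for
// sync durability.
func (s ReplicaState) IsSyncEligible() bool { return s == InSync }
```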
## State Semantics
### 1. `Bootstrapping`
Replica has not yet earned sync eligibility and does not yet have trusted reconnect progress.
Properties:
- fresh replica identity or newly assigned replica
- may receive initial baseline/live stream
- not yet eligible for `sync_all`
Counts for:
- `sync_all`: no
- `sync_quorum`: no
- `best_effort`: background/bootstrap only
Owned anchors:
- current assignment epoch
### 2. `InSync`
Replica is eligible for sync durability.
Properties:
- receiving live ordered stream
- `replicaFlushedLSN` is near the primary head
- normal barrier protocol is valid
Counts for:
- `sync_all`: yes
- `sync_quorum`: yes
- `best_effort`: yes, but not required for ACK
Owned anchors:
- `replicaFlushedLSN`
### 3. `Lagging`
Replica has fallen out of the normal live-stream envelope but recovery path is not yet chosen.
Properties:
- primary no longer treats it as sync-eligible
- replica may still be recoverable from WAL or extent-backed recovery records
- or may require rebuild
Counts for:
- `sync_all`: no
- `sync_quorum`: no
- `best_effort`: background recovery only
Owned anchors:
- last known `replicaFlushedLSN`
### 4. `CatchingUp`
Replica is replaying from its own durable point toward a chosen target.
Properties:
- short-gap recovery mode
- primary must reserve and pin the required recovery window
- primary head continues to move
Counts for:
- `sync_all`: no
- `sync_quorum`: no
- `best_effort`: background recovery only
Owned anchors:
- `catchupStartLSN = replicaFlushedLSN`
- `catchupTargetLSN`
- `promotionBarrierLSN`
- `recoveryReservationID`
- `reservationExpiry`
### 5. `PromotionHold`
Replica has reached the chosen promotion point but must demonstrate short stability before re-entering `InSync`.
Properties:
- prevents immediate flapping back into sync eligibility
- replica has already reached `promotionBarrierLSN`
- promotion requires stable barriers or elapsed hold time
Counts for:
- `sync_all`: no
- `sync_quorum`: no
- `best_effort`: stabilization only
Owned anchors:
- `promotionBarrierLSN`
- `promotionHoldUntil` or equivalent hold criterion
### 6. `NeedsRebuild`
Replica cannot recover from retained recovery records alone.
Properties:
- catch-up window is insufficient or no longer provable
- replica must not count toward sync durability
- replica no longer pins old catch-up history
Counts for:
- `sync_all`: no
- `sync_quorum`: no
- `best_effort`: background repair candidate only
Owned anchors:
- last known `replicaFlushedLSN`
### 7. `Rebuilding`
Replica is fetching and installing a checkpoint/snapshot base image.
Properties:
- primary must preserve the chosen snapshot/base
- primary must preserve the required WAL or recovery tail after `cpLSN`
Counts for:
- `sync_all`: no
- `sync_quorum`: no
- `best_effort`: background rebuild only
Owned anchors:
- `snapshotID`
- `snapshotCpLSN`
- `tailReplayStartLSN = snapshotCpLSN + 1`
- `recoveryReservationID`
- `reservationExpiry`
### 8. `CatchUpAfterRebuild`
Replica has installed the base image and is replaying trailing history after it.
Properties:
- semantically similar to `CatchingUp`
- base point is the checkpoint/snapshot, not the replica's own prior state
Counts for:
- `sync_all`: no
- `sync_quorum`: no
- `best_effort`: background recovery only
Owned anchors:
- `snapshotCpLSN`
- `catchupTargetLSN`
- `promotionBarrierLSN`
- `recoveryReservationID`
- `reservationExpiry`
### 9. `Failed`
Replica recovery failed in a way that needs operator/control-plane action beyond normal retry.
Properties:
- terminal or semi-terminal fault state
- may require delete/recreate/manual intervention
Counts for:
- `sync_all`: no
- `sync_quorum`: no
- `best_effort`: no direct role
## Transition Rules
### `Bootstrapping -> InSync`
Trigger:
- initial bootstrap completes
- barrier confirms durable progress under the current epoch
Action:
- establish trusted `replicaFlushedLSN`
- grant sync eligibility for the first time
### `InSync -> Lagging`
Trigger:
- disconnect
- barrier timeout
- barrier fsync failure
- stream error
Action:
- remove sync eligibility immediately
### `Lagging -> CatchingUp`
Trigger:
- reconnect succeeds
- primary grants a recovery reservation proving `(replicaFlushedLSN, catchupTargetLSN]` is recoverable for a bounded window
Action:
- choose `catchupTargetLSN`
- pin required recovery dependencies for the reservation lifetime
### `Lagging -> NeedsRebuild`
Trigger:
- required recovery window is not recoverable
- impossible progress reported
- epoch mismatch invalidates direct catch-up
- background janitor determines the replica is outside recoverable budget
Action:
- stop treating replica as a catch-up candidate
### `CatchingUp -> PromotionHold`
Trigger:
- replica replays to `catchupTargetLSN`
- barrier confirms `promotionBarrierLSN`
Action:
- start promotion debounce window
### `PromotionHold -> InSync`
Trigger:
- promotion hold criteria satisfied
- stable barrier successes
- or elapsed hold time
Action:
- restore sync eligibility
- clear promotion anchors
### `PromotionHold -> Lagging`
Trigger:
- disconnect
- failed barrier
- failed live stream health check
Action:
- cancel promotion attempt
- remove sync eligibility
### `CatchingUp -> NeedsRebuild`
Trigger:
- catch-up cannot converge
- recovery reservation is lost
- catch-up timeout policy exceeded
- epoch changes
Action:
- abandon WAL-only catch-up
- move to reconstruction path
### `NeedsRebuild -> Rebuilding`
Trigger:
- control plane or primary chooses reconstruction base
- snapshot/base image transfer starts
- primary grants a rebuild reservation
Action:
- bind replica to `snapshotID` and `snapshotCpLSN`
### `Rebuilding -> CatchUpAfterRebuild`
Trigger:
- snapshot/base image installed successfully
- trailing recovery reservation is still valid
Action:
- replay trailing history after `snapshotCpLSN`
### `Rebuilding -> NeedsRebuild`
Trigger:
- rebuild copy fails
- rebuild reservation is lost
- rebuild WAL-tail budget is exceeded
- epoch changes
Action:
- abort current rebuild session
- remain excluded from sync durability
### `CatchUpAfterRebuild -> PromotionHold`
Trigger:
- trailing replay reaches target
- barrier confirms durable replay through `promotionBarrierLSN`
Action:
- start promotion debounce
### `CatchUpAfterRebuild -> NeedsRebuild`
Trigger:
- reservation is lost
- replay cannot converge
- epoch changes
Action:
- abandon current attempt
- require a fresh rebuild plan
### Any state -> `Failed`
Trigger examples:
- unrecoverable protocol inconsistency
- repeated rebuild failure beyond retry policy
- snapshot corruption
- local replica storage failure
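The edges above can be collapsed into a lookup table, which the simulator can use to reject illegal transitions. States are plain strings here to keep the sketch self-contained; `Failed` is reachable from any state:

```go
package main

// legalNext encodes the allowed transitions defined in this section.
var legalNext = map[string][]string{
	"Bootstrapping":       {"InSync"},
	"InSync":              {"Lagging"},
	"Lagging":             {"CatchingUp", "NeedsRebuild"},
	"CatchingUp":          {"PromotionHold", "NeedsRebuild"},
	"PromotionHold":       {"InSync", "Lagging"},
	"NeedsRebuild":        {"Rebuilding"},
	"Rebuilding":          {"CatchUpAfterRebuild", "NeedsRebuild"},
	"CatchUpAfterRebuild": {"PromotionHold", "NeedsRebuild"},
}

// canTransition reports whether from -> to is a legal edge.
func canTransition(from, to string) bool {
	if to == "Failed" {
		return true // any state may fault out
	}
	for _, s := range legalNext[from] {
		if s == to {
			return true
		}
	}
	return false
}
```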
## Retention Obligations By State
The key V2 rule is:
- recoverability is not a static fact
- it is a bounded promise the primary must honor once it admits a replica into recovery
### `InSync`
Primary must retain:
- recent WAL under normal retention policy
Primary does not need:
- snapshot pin purely for this replica
### `Lagging`
Primary must retain:
- enough recent information to evaluate recoverability or intentionally declare `NeedsRebuild`
This state should be short-lived.
### `CatchingUp`
Primary must retain for the reservation lifetime:
- recovery metadata for `(catchupStartLSN, promotionBarrierLSN]`
- every payload referenced by that recovery window
- current epoch lineage for the session
### `PromotionHold`
Primary must retain:
- whatever live-stream and barrier state is required to validate promotion
This state should be brief and must not pin long-lived history.
### `NeedsRebuild`
Primary retains:
- no special old recovery window for this replica
This state explicitly releases the old catch-up hold.
### `Rebuilding`
Primary must retain for the reservation lifetime:
- chosen `snapshotID`
- any base-image dependencies
- trailing history after `snapshotCpLSN`
### `CatchUpAfterRebuild`
Primary must retain for the reservation lifetime:
- recovery metadata for `(snapshotCpLSN, promotionBarrierLSN]`
- every payload referenced by that trailing window
## Moving-Head Rules
The primary head continues advancing during:
- `CatchingUp`
- `Rebuilding`
- `CatchUpAfterRebuild`
Therefore transitions must never use current head at finish time as an implicit target.
Instead, each transition must select explicit targets.
### Catch-up target
When catch-up starts, choose:
- `catchupTargetLSN = H0`
Replica first chases to `H0`, not to an infinite moving head.
Then:
- either enter `PromotionHold` and promote
- or begin another bounded cycle
- or abort to rebuild
### Rebuild target
When rebuild starts, choose:
- `snapshotCpLSN = C`
- trailing replay target `H0`
Replica installs the snapshot at `C`, then replays `(C, H0]`, then enters `PromotionHold`.
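The rebuild sequence can be sketched as below. The callback shapes are illustrative; success here only makes the replica a `PromotionHold` candidate, never `InSync` directly:

```go
package main

// rebuildToHold sketches: install the snapshot at C, then replay the
// trailing window (C, H0]. A failed install means Rebuilding -> NeedsRebuild.
func rebuildToHold(
	installSnapshot func() (cpLSN uint64, ok bool),
	replayTail func(from, to uint64) bool,
	h0 uint64,
) bool {
	c, ok := installSnapshot()
	if !ok {
		return false // Rebuilding -> NeedsRebuild
	}
	return replayTail(c, h0) // replay (C, H0], then PromotionHold
}
```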
## Tail-Chasing Rule
Replica may fail to converge if:
- catch-up speed < primary ingest speed
V2 must define bounded behavior:
1. bounded catch-up window
2. bounded catch-up time
3. policy after failure to converge:
- for `sync_all`: bounded retry, then fail requests
- for `best_effort`: keep serving and continue background recovery or escalate to rebuild
No silent downgrade of `sync_all` is allowed.
## Recovery Feasibility
The primary must not admit a replica into catch-up based on a best-effort guess.
It must prove the requested recovery window is recoverable and then reserve it.
Recommended abstraction:
- `CheckRecoveryFeasibility(startLSN, endLSN) -> fully recoverable | needs rebuild`
- `ReserveRecoveryWindow(startLSN, endLSN) -> reservation`
Only a successful reservation may drive:
- `Lagging -> CatchingUp`
- `NeedsRebuild -> Rebuilding`
- `Rebuilding -> CatchUpAfterRebuild`
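A sketch of the admission flow: a recovery transition is driven only by a granted reservation. The callbacks stand in for the recommended `CheckRecoveryFeasibility` / `ReserveRecoveryWindow` abstractions; the `Reservation` shape is illustrative:

```go
package main

// Reservation is an illustrative shape for a granted recovery window.
type Reservation struct {
	ID       uint64
	StartLSN uint64
	EndLSN   uint64
}

// admitCatchUp returns a reservation only if the window is provably
// recoverable AND the reservation is granted; otherwise the caller must
// route the replica to NeedsRebuild.
func admitCatchUp(
	feasible func(start, end uint64) bool,
	reserve func(start, end uint64) (Reservation, bool),
	start, end uint64,
) (Reservation, bool) {
	if !feasible(start, end) {
		return Reservation{}, false
	}
	return reserve(start, end) // may still fail under concurrent reclamation
}
```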
## Recovery Classes
V2 must support more than one local record type without leaking that detail into replica state.
### `WALInline`
Properties:
- payload lives directly in WAL
- recoverable while WAL is retained
### `ExtentReferenced`
Properties:
- recovery metadata points at payload outside WAL
- payload must be resolved from extent/snapshot generation state
The FSM does not care how payload is stored.
It only cares whether the requested window is fully recoverable for the lifetime of the reservation.
The engine-level rule is:
- every record in `(startLSN, endLSN]` must be payload-resolvable
- the resolved version must correspond to that record's historical state
- the payload must stay pinned until the reservation ends
If any required payload is not resolvable:
- the window is not recoverable
- the replica must go to `NeedsRebuild`
## Snapshot Rule
Rebuild must use a real checkpoint/snapshot base image.
Valid:
- immutable snapshot at `cpLSN`
- copy-on-write checkpoint image
- frozen base image with exact `cpLSN`
Invalid:
- current extent treated as historical `cpLSN`
## Epoch / Fencing Rule
Every transition is epoch-bound.
If epoch changes during:
- `Bootstrapping`
- `Lagging`
- `CatchingUp`
- `PromotionHold`
- `Rebuilding`
- `CatchUpAfterRebuild`
Then:
- abort current transition
- discard old sender assumptions
- restart negotiation under the new epoch
This prevents stale-primary recovery traffic from being accepted.
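A minimal sketch of the fencing mechanics: each session records the epoch it was admitted under, and an epoch change fences every older session. `FencedSession` is illustrative, not the prototype's `RecoverySession`:

```go
package main

// FencedSession is an epoch-bound recovery session.
type FencedSession struct {
	ID    uint64
	Epoch uint64
}

// fenceStale keeps only sessions admitted under the current epoch; all
// others must be aborted and renegotiated under the new epoch.
func fenceStale(sessions []FencedSession, currentEpoch uint64) []FencedSession {
	var live []FencedSession
	for _, s := range sessions {
		if s.Epoch == currentEpoch {
			live = append(live, s)
		}
	}
	return live
}
```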
## Multi-Replica Volume Rules
Different replicas may be in different states simultaneously.
Example:
- replica A = `InSync`
- replica B = `CatchingUp`
- replica C = `Rebuilding`
Volume-level durability policy is computed per mode.
### `sync_all`
- all required replicas must be `InSync`
### `sync_quorum`
- enough replicas must be `InSync`
### `best_effort`
- primary local durability only
- replicas recover in background
## Illegal or Suspicious Conditions
These should force rejection or abort:
1. replica reports `replicaFlushedLSN > headLSN`
2. replica progress belongs to wrong epoch
3. requested recovery window is not recoverable
4. recovery reservation cannot be granted
5. snapshot base does not match claimed `cpLSN`
6. replay stream shows impossible gap/ordering after reconstruction
## Design Guidance
V2 should be implemented so that:
1. state owns recovery semantics
2. anchors make transitions explicit
3. retention obligations are derived from state
4. catch-up admission requires reservation, not guesswork
5. mode semantics are derived from `InSync` eligibility
This is better than burying recovery behavior across many ad hoc code paths.
## Bottom Line
V2 is fundamentally a state machine problem.
The correct abstraction is not:
- some edge cases around WAL replay
It is:
- replicas move through explicit states while the primary head continues advancing and recovery windows must be provable and reserved
So V2 must be designed around:
- state definitions
- anchor LSNs
- transition rules
- retention obligations
- recoverability checks
- recovery reservations
- abort conditions

sw-block/design/wal-replication-v2.md
# WAL Replication V2
Date: 2026-03-26
Status: design proposal
Purpose: redesign WAL-based block replication around explicit short-gap catch-up and long-gap reconstruction
## Goal
Provide a replication architecture that:
- keeps the primary write path fast
- supports correct synchronous durability semantics
- supports short-gap reconnect catch-up using WAL
- avoids paying unbounded WAL retention tax for long-lag replicas
- uses reconstruction from a real checkpoint/snapshot base for larger lag
This design replaces a "WAL does everything" mindset with a 3-tier recovery model.
## Core Principle
WAL is excellent for:
- recent ordered delta
- local crash recovery
- short-gap replica catch-up
WAL is not the right long-range recovery mechanism for lagging block replicas.
Long-gap recovery should use:
- a real checkpoint/snapshot base image
- plus WAL tail replay after that base point
## Correctness Boundary
Never reconstruct old state from current extent alone.
Example:
1. `LSN 100`: block `A = foo`
2. `LSN 120`: block `A = bar`
If a replica needs state at `LSN 100`, current extent contains `bar`, not `foo`.
Therefore:
- current extent is latest state
- not historical state
So long-gap recovery must use a base image that is known to represent a real checkpoint/snapshot `cpLSN`.
## 3-Tier Replication Model
### Tier A: Keep-up
Replica is close enough to the primary that normal ordered streaming keeps it current.
Properties:
- normal steady-state mode
- no special recovery path
- replica stays `InSync`
### Tier B: Lagging Catch-up
Replica fell behind, but the primary still has enough recoverable history covering the missing range.
Properties:
- reconnect handshake determines the replica durable point
- primary proves and reserves a bounded recovery window
- primary replays missing history
- replica returns to `InSync` only after replay, barrier confirmation, and promotion hold
### Tier C: Reconstruction
Replica is too far behind for direct replay.
Properties:
- replica must rebuild from a real checkpoint/snapshot base
- after base image install, primary replays trailing history after `cpLSN`
- replica only re-enters `InSync` after durable catch-up completes
## Architecture
### Primary Artifacts
The primary owns three forms of state:
1. `Active WAL`
- recent ordered metadata/delta stream
- bounded by retention policy
2. `Checkpoint Snapshot`
- immutable point-in-time base image at `cpLSN`
- used for long-gap reconstruction
3. `Current Extent`
- latest live block state
- not a substitute for historical checkpoint state
### Replica Artifacts
Replica maintains:
1. local WAL or equivalent recovery log
2. replica `receivedLSN`
3. replica `flushedLSN`
4. local extent state
## Sender Model
Do not ship recovery data inline from foreground write goroutines.
Per replica, use:
- one ordered send queue
- one sender loop
The sender loop owns:
- live stream shipping
- reconnect handling
- short-gap catch-up
- reconstruction tail replay
This guarantees:
- strict LSN order per replica
- clean transport state ownership
- no inline shipping races in the primary write path
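A sketch of the per-replica sender model: one ordered queue drained by a single loop, so foreground writers only enqueue. `Record` and the callbacks are illustrative shapes, not the prototype's types:

```go
package main

// Record is an illustrative replicated-write record.
type Record struct {
	LSN uint64
}

// senderLoop preserves strict LSN order because it is the only goroutine
// touching the transport for this replica. On send failure it hands the
// failed record to recovery handling and exits; the sender owns reconnect.
func senderLoop(queue <-chan Record, send func(Record) error, onError func(Record)) {
	for rec := range queue {
		if err := send(rec); err != nil {
			onError(rec)
			return
		}
	}
}
```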
## Write Path
Primary write path:
1. allocate monotonic `LSN`
2. append recovery metadata to local WAL or journal
3. enqueue the record to each replica sender queue
4. return according to durability mode semantics
Flusher later:
- flushes dirty data to extent
- manages checkpoints
- manages bounded retention of WAL and other recovery dependencies
## Recovery Classes
V2 supports more than one local record type.
### `WALInline`
Properties:
- payload lives directly in WAL
- recoverable while WAL is retained
### `ExtentReferenced`
Properties:
- journal entry contains metadata only
- payload is resolved from extent/snapshot generation state
- direct-extent writes and future smart-WAL paths fall into this class
Replica state does not encode these classes.
Instead, the primary must answer a stricter question for reconnect:
- is `(startLSN, endLSN]` fully recoverable under the current epoch, and can it be reserved for the duration of recovery?
## Replica Progress Model
Each replica reports progress explicitly.
### `receivedLSN`
- highest LSN received and appended locally
- not yet a durability guarantee
### `flushedLSN`
- highest LSN durably persisted on the replica
- authoritative sync durability signal
Only `flushedLSN` counts for:
- `sync_all`
- `sync_quorum`
## Replica States
Replica state is defined by `wal-replication-v2-state-machine.md`.
Important highlights:
- `Bootstrapping`
- `InSync`
- `Lagging`
- `CatchingUp`
- `PromotionHold`
- `NeedsRebuild`
- `Rebuilding`
- `CatchUpAfterRebuild`
- `Failed`
Only `InSync` replicas count toward sync durability.
## Protocol
### 1. Normal Streaming
Primary sender loop:
- sends ordered replicated write records
Replica:
1. validates ordering
2. appends locally
3. advances `receivedLSN`
### 2. Barrier / Sync
Primary sends:
- `BarrierReq{LSN, Epoch}`
Replica:
1. wait until `receivedLSN >= LSN`
2. flush durable local state
3. set `flushedLSN = LSN`
4. reply `BarrierResp{Status, FlushedLSN}`
Primary uses this to evaluate mode policy.
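The replica-side barrier handling can be sketched as below. The `replica` struct and field names are assumptions for this note; the real flush would fsync durable state before advancing `flushedLSN`.

```go
type BarrierReq struct{ LSN, Epoch uint64 }

type BarrierResp struct {
	OK         bool
	FlushedLSN uint64
}

// replica is a minimal stand-in for the receiver's local state.
type replica struct {
	epoch       uint64
	receivedLSN uint64
	flushedLSN  uint64
}

// handleBarrier: fence by epoch, require receipt, flush, then report
// durable progress. A not-yet-received barrier is retried later rather
// than answered with a guess.
func (r *replica) handleBarrier(req BarrierReq) BarrierResp {
	if req.Epoch != r.epoch {
		return BarrierResp{OK: false, FlushedLSN: r.flushedLSN} // stale primary
	}
	if r.receivedLSN < req.LSN {
		return BarrierResp{OK: false, FlushedLSN: r.flushedLSN} // wait for receipt
	}
	r.flushedLSN = r.receivedLSN // flush durable local state (fsync elided)
	return BarrierResp{OK: true, FlushedLSN: r.flushedLSN}
}
```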
### 3. Reconnect Handshake
On reconnect, primary obtains:
- current epoch
- primary head
- replica durable `flushedLSN`
Then primary evaluates recovery feasibility.
Possible outcomes:
1. replica already caught up
- state -> `PromotionHold` or `InSync` depending on policy
2. bounded catch-up possible
- reserve recovery window
- state -> `CatchingUp`
3. direct replay not possible
- state -> `NeedsRebuild`
## Recovery Feasibility and Reservation
The key V2 rule is:
- `fully recoverable` is not enough
- the primary must also reserve the recovery window
Recommended engine-side flow:
1. `CheckRecoveryFeasibility(startLSN, endLSN)`
2. if feasible, `ReserveRecoveryWindow(startLSN, endLSN)`
3. only then start `CatchingUp` or `CatchUpAfterRebuild`
A recovery reservation pins:
- recovery metadata
- referenced payload generations
- required snapshots/base images
- current epoch lineage for the session
If the reservation is lost during recovery:
- abort the current attempt
- fall back to `NeedsRebuild`
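The feasibility-then-reserve flow can be sketched as follows; `RecoveryEngine`, its fields, and `outcome` are illustrative stand-ins, not the engine API.

```go
// Reservation pins a recovery window (Start, End] until recovery
// completes or the reservation is lost.
type Reservation struct {
	Start, End uint64
	Lost       bool
}

// RecoveryEngine is a stand-in for primary-side recoverability state.
type RecoveryEngine struct {
	retainedFloor uint64 // oldest LSN still fully replayable
	head          uint64 // current primary head
}

func (e *RecoveryEngine) CheckRecoveryFeasibility(start, end uint64) bool {
	return start >= e.retainedFloor && end <= e.head
}

// ReserveRecoveryWindow only grants a reservation over a feasible range;
// in a real engine this would pin metadata and payload generations.
func (e *RecoveryEngine) ReserveRecoveryWindow(start, end uint64) (*Reservation, bool) {
	if !e.CheckRecoveryFeasibility(start, end) {
		return nil, false
	}
	return &Reservation{Start: start, End: end}, true
}

// outcome shows the abort rule: no (or lost) reservation means rebuild.
func outcome(res *Reservation) string {
	if res == nil || res.Lost {
		return "NeedsRebuild"
	}
	return "CatchingUp"
}
```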
## Tier B: Lagging Catch-up Algorithm
When a replica is behind but within a recoverable retained window:
1. choose a bounded target `H0`
2. reserve `(ReplicaFlushedLSN, H0]`
3. replay the missing range
4. barrier confirms durable `flushedLSN >= H0`
5. enter `PromotionHold`
6. only then restore `InSync`
### Tail-chasing problem
If the primary is writing faster than the replica can catch up, the replica may never converge.
To handle this:
1. define a bounded catch-up window
2. if catch-up rate is slower than ingest rate for too long:
- either temporarily throttle primary admission for strict `sync_all`
- or fail `sync_all` requests and let control-plane policy react
- or abort to rebuild
3. do not let a replica remain in unbounded perpetual `CatchingUp`
### Important rule
For `sync_all`, the data path must not silently downgrade to `best_effort`.
Correct behavior:
- bounded retry
- then fail
Any mode change must be explicit policy, not silent transport behavior.
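The bounded-window escalation rule can be captured in one small policy function; the rate/tick thresholds here are illustrative policy knobs, not prescribed values.

```go
// catchupDecision encodes the tail-chasing guard: if the replica's
// catch-up rate stays below the ingest rate past a bounded window,
// escalate explicitly instead of chasing the head forever.
func catchupDecision(replicaRate, ingestRate float64, behindTicks, maxBehindTicks int) string {
	if replicaRate >= ingestRate {
		return "continue" // converging
	}
	if behindTicks < maxBehindTicks {
		return "continue" // still inside the bounded catch-up window
	}
	// Past the budget: explicit policy, never a silent downgrade.
	// Options: throttle admission, fail sync_all, or abort to rebuild.
	return "escalate"
}
```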
## Tier C: Reconstruction Algorithm
When a replica is too far behind for direct replay:
1. mark replica `NeedsRebuild`
2. choose a real checkpoint/snapshot base at `cpLSN`
3. create a rebuild reservation
4. replica enters `Rebuilding`
5. replica pulls immutable checkpoint/snapshot image
6. replica installs that base image and sets base progress to `cpLSN`
7. primary replays trailing history `(cpLSN, H0]`
8. barrier confirms durable replay
9. replica enters `PromotionHold`
10. replica returns to `InSync`
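The ten steps above collapse into a fixed phase order on the replica FSM side. A sketch of that ordering as data (the state names follow the doc's FSM; the helper is illustrative):

```go
// rebuildPhases lists the replica-visible FSM phases of Tier C
// reconstruction, in the only legal order.
var rebuildPhases = []string{
	"NeedsRebuild",        // steps 1-3: marked, base chosen, reservation created
	"Rebuilding",          // steps 4-6: pull and install base image at cpLSN
	"CatchUpAfterRebuild", // steps 7-8: replay (cpLSN, H0], barrier-confirmed
	"PromotionHold",       // step 9: debounce before rejoining the sync set
	"InSync",              // step 10: durable proof complete
}

// nextRebuildPhase returns the successor phase, or false at the end.
func nextRebuildPhase(cur string) (string, bool) {
	for i, p := range rebuildPhases {
		if p == cur && i+1 < len(rebuildPhases) {
			return rebuildPhases[i+1], true
		}
	}
	return "", false
}
```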
### Why snapshot/base image must be real
If the replica needs state at `cpLSN`, the base image must represent exactly that checkpoint.
Invalid:
- current extent copied at some later time and treated as historical `cpLSN`
Valid:
- immutable snapshot
- copy-on-write checkpoint image
- frozen base image
## Retention and Budget
V2 retention is bounded.
### WAL / recovery metadata retention
Primary keeps only a bounded recent recovery window:
- `max_retained_wal_bytes`
- optionally `max_retained_wal_time`
### Recovery reservation budget
Reservations are also bounded:
- timeout
- bytes pinned
- snapshot dependency lifetime
If a catch-up or rebuild session exceeds its reservation budget:
- primary aborts the session
- replica falls back to `NeedsRebuild`
- a newer rebuild plan may be chosen later
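A minimal sketch of the budget check that triggers that abort; the budget dimensions mirror the knobs above, with illustrative types.

```go
// ReservationBudget bounds a catch-up or rebuild session.
type ReservationBudget struct {
	TimeoutTicks   uint64
	MaxPinnedBytes uint64
}

// SessionUsage is the session's consumption so far.
type SessionUsage struct {
	ElapsedTicks uint64
	PinnedBytes  uint64
}

// Exceeded aborts the session if ANY budget dimension is blown;
// the replica then falls back to NeedsRebuild.
func (b ReservationBudget) Exceeded(u SessionUsage) bool {
	return u.ElapsedTicks > b.TimeoutTicks || u.PinnedBytes > b.MaxPinnedBytes
}
```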
## Sync Modes
### `best_effort`
- ACK after primary local durability
- replicas may lag
- background catch-up or rebuild allowed
### `sync_all`
- ACK only when all required replicas are `InSync` and durably at target LSN
- bounded retry only
- no silent downgrade
### `sync_quorum`
- ACK when enough replicas are `InSync` and durably at target LSN
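The three modes reduce to one ack predicate over per-replica durable progress. A sketch, assuming the caller already holds each replica's FSM state and `flushedLSN` (types and names are illustrative):

```go
// replicaView is the primary's view of one replica at ack time.
type replicaView struct {
	inSync     bool
	flushedLSN uint64
}

// ackAllowed counts only replicas that are InSync AND durably at the
// target LSN, then applies the mode policy. best_effort acks on
// primary-local durability alone.
func ackAllowed(mode string, target uint64, replicas []replicaView, quorum int) bool {
	durable := 0
	for _, r := range replicas {
		if r.inSync && r.flushedLSN >= target {
			durable++
		}
	}
	switch mode {
	case "best_effort":
		return true
	case "sync_all":
		return durable == len(replicas)
	case "sync_quorum":
		return durable >= quorum
	}
	return false
}
```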
## Why This Direction
V2 separates three different concerns cleanly:
1. fast steady-state replication
2. short-gap replay
3. long-gap reconstruction
This avoids forcing WAL alone to solve all recovery cases.
## Implementation Order
Recommended order:
1. pure FSM
2. ordered sender loop
3. bounded direct replay
4. checkpoint/snapshot reconstruction
5. smarter local write path and recovery classes
6. policy and control-plane integration
## Phase 13 current direction
Current Phase 13 / WAL V1 work is still:
- fixing the correctness of WAL-centered sync replication
- focused mainly on bounded WAL replay and rebuild fallback
That is the right bridge.
V2 should follow after WAL V1 closes.
## Bottom Line
V2 is not "more WAL features."
It is:
- explicit recovery feasibility
- explicit recovery reservations
- ordered sender loops
- short-gap replay for recent lag
- checkpoint/snapshot reconstruction for long lag
- promotion back to `InSync` only after durable proof

349
sw-block/design/wal-v1-to-v2-mapping.md

@@ -0,0 +1,349 @@
# WAL V1 To V2 Mapping
Date: 2026-03-26
Status: working note
Purpose: map the current WAL V1 scattered state across `sw-block` into the proposed WAL V2 FSM vocabulary
## Why This Note Exists
Current WAL V1 correctness logic is spread across:
- `wal_shipper.go`
- `replica_apply.go`
- `dist_group_commit.go`
- `blockvol.go`
- `promotion.go`
- `rebuild.go`
- heartbeat/master reporting
This note does not propose immediate code changes.
It exists to answer two questions:
1. what state already exists in WAL V1 today?
2. how does that state map into the cleaner WAL V2 FSM model?
## Current V1 State Owners
### 1. Shipper state
Primary-side per-replica transport and recovery state lives mainly in:
- `weed/storage/blockvol/wal_shipper.go`
Current V1 shipper states:
- `ReplicaDisconnected`
- `ReplicaConnecting`
- `ReplicaCatchingUp`
- `ReplicaInSync`
- `ReplicaDegraded`
- `ReplicaNeedsRebuild`
Other shipper-owned flags/anchors:
- `replicaFlushedLSN`
- `hasFlushedProgress`
- `catchupFailures`
- `lastContactTime`
### 2. Replica receiver progress
Replica-side receive/apply progress lives mainly in:
- `weed/storage/blockvol/replica_apply.go`
Current V1 replica progress:
- `receivedLSN`
- `flushedLSN`
- duplicate/gap handling in `applyEntry()`
### 3. Volume-level durability policy
Volume-level sync semantics live mainly in:
- `weed/storage/blockvol/dist_group_commit.go`
Current V1 policy uses:
- local WAL sync result
- per-shipper barrier results
- `DurabilityBestEffort`
- `DurabilitySyncAll`
- `DurabilitySyncQuorum`
### 4. Volume-level retention/checkpoint state
Primary-side local checkpoint and WAL retention state lives mainly in:
- `weed/storage/blockvol/blockvol.go`
- `weed/storage/blockvol/flusher.go`
Current V1 anchors:
- `nextLSN`
- `CheckpointLSN()`
- WAL retained range
- retention-floor callbacks from `ShipperGroup`
### 5. Role/assignment state
Master-driven volume role state lives mainly in:
- `weed/storage/blockvol/promotion.go`
- `weed/storage/blockvol/blockvol.go`
- `weed/server/volume_server_block.go`
Current V1 roles:
- `RolePrimary`
- `RoleReplica`
- `RoleStale`
- `RoleRebuilding`
- `RoleDraining`
### 6. Rebuild state
Existing V1 rebuild transport/process lives mainly in:
- `weed/storage/blockvol/rebuild.go`
Current V1 rebuild phases:
- WAL catch-up attempt
- full extent copy
- trailing WAL catch-up
- rejoin via assignment + fresh shipper bootstrap
### 7. Heartbeat/master-visible replication state
Master-visible state lives mainly in:
- `weed/storage/blockvol/block_heartbeat.go`
- `weed/storage/blockvol/blockvol.go`
- server-side registry/master handling
Current V1 visible fields include:
- `ReplicaDegraded`
- `ReplicaShipperStates []ReplicaShipperStatus`
- role/epoch/checkpoint/head state
## V1 To V2 Mapping
### Shipper state mapping
| WAL V1 shipper state | Proposed WAL V2 FSM state | Notes |
| --- | --- | --- |
| `ReplicaDisconnected` | `Bootstrapping` or `Lagging` | Fresh shipper with no durable progress maps to `Bootstrapping`; previously-synced disconnected replica maps to `Lagging`. |
| `ReplicaConnecting` | transitional part of `Lagging -> CatchingUp` | V2 should model this as an event/session phase, not a durable steady state. |
| `ReplicaCatchingUp` | `CatchingUp` | Direct mapping for short-gap replay. |
| `ReplicaInSync` | `InSync` | Direct mapping. |
| `ReplicaDegraded` | `Lagging` | V1 transport failure state becomes the cleaner V2 recovery-needed state. |
| `ReplicaNeedsRebuild` | `NeedsRebuild` | Direct mapping. |
Main V1 cleanup opportunity:
- V1 mixes transport/session detail (`Connecting`) with recovery lifecycle state.
- V2 should keep the long-lived FSM smaller and push connection mechanics into sender-loop/session logic.
### Replica receiver progress mapping
| WAL V1 field | WAL V2 concept | Notes |
| --- | --- | --- |
| `receivedLSN` | `receivedLSN` | Keep as transport/apply progress only. |
| `flushedLSN` | `replicaFlushedLSN` | Keep as authoritative durability anchor. |
| duplicate/gap rules | replay validity rules | These become part of the V2 replay contract, not ad hoc receiver behavior. |
Main V1 cleanup opportunity:
- V1 receiver progress is already conceptually sound.
- V2 should keep it but drive it from explicit FSM transitions and replay reservations.
### Volume durability policy mapping
| WAL V1 behavior | WAL V2 concept | Notes |
| --- | --- | --- |
| `BarrierAll` against current shippers | promotion and sync gate | V2 should keep barrier-based durability truth. |
| `sync_all` requires all barriers | `InSync` eligibility gate | Same rule, but V2 eligibility should come from FSM state rather than scattered checks. |
| `best_effort` ignores barrier failures | background recovery mode | Same high-level policy. |
| `sync_quorum` counts successful barriers | quorum over `InSync` replicas | Same direction, but should be derived from explicit FSM state. |
Main V1 cleanup opportunity:
- durability mode logic should depend on `IsSyncEligible()`-style state, not raw shipper state enums spread across code.
### Retention/checkpoint mapping
| WAL V1 concept | WAL V2 concept | Notes |
| --- | --- | --- |
| `CheckpointLSN()` | checkpoint/base anchor | Keep, but V2 also adds explicit `cpLSN` snapshot semantics. |
| retention floor from recoverable replicas | recoverability budget | Keep the idea, but V2 turns this into explicit reservation management. |
| timeout-based `NeedsRebuild` | janitor-driven `Lagging -> NeedsRebuild` | Keep as background control logic, not hot-path mutation. |
Main V1 cleanup opportunity:
- V1 retains data because replicas might need it.
- V2 should reserve specific recovery windows, not rely only on ambient retention conditions.
### Role/assignment mapping
| WAL V1 role state | WAL V2 meaning | Notes |
| --- | --- | --- |
| `RolePrimary` | primary ownership / epoch authority | Not a replica FSM state; remains volume/control-plane state. |
| `RoleReplica` | replica service role | Orthogonal to replication FSM state. A replica volume may be `RoleReplica` while its sender-facing state is `Bootstrapping`, `Lagging`, or `InSync`. |
| `RoleStale` | pre-rebuild/non-serving | Closest to `NeedsRebuild` preparation on the volume role side. |
| `RoleRebuilding` | rebuild session role | Maps to volume-wide orchestration around V2 `Rebuilding`. |
| `RoleDraining` | assignment/failover coordination | Outside replica FSM; remains a volume transition role. |
Main V1 cleanup opportunity:
- role state and replication FSM state are different dimensions.
- V1 sometimes implicitly blends them.
- V2 should keep them separate:
- control-plane role FSM
- per-replica replication FSM
### Rebuild flow mapping
| WAL V1 rebuild phase | WAL V2 FSM phase | Notes |
| --- | --- | --- |
| WAL catch-up pre-pass | `Lagging -> CatchingUp` if feasible | Same idea, but V2 requires recoverability proof and reservation. |
| full extent copy | `NeedsRebuild -> Rebuilding` | Same high-level phase. |
| trailing WAL catch-up | `CatchUpAfterRebuild` | Direct conceptual mapping. |
| fresh shipper bootstrap after reassignment | `Bootstrapping` then promotion | V1 does this through assignment refresh; V2 may eventually do it with cleaner local transitions. |
Main V1 cleanup opportunity:
- V1 rebuild success is currently rejoined indirectly through control-plane reassignment.
- V2 should eventually make rebuild completion and promotion explicit FSM transitions.
### Heartbeat/master state mapping
| WAL V1 visible state | WAL V2 meaning | Notes |
| --- | --- | --- |
| `ReplicaShipperStatus{DataAddr, State, FlushedLSN}` | control-plane view of per-replica FSM | Good starting shape. |
| `ReplicaDegraded` | derived summary only | Too coarse for V2 decision-making; keep only as convenience/compat field. |
| role/epoch/head/checkpoint | role FSM + replication anchors | Continue reporting; V2 may need richer recovery reservation visibility later. |
Main V1 cleanup opportunity:
- master-facing replication state should be per replica, not summarized as one degraded bit.
## Current V1 Event Sources vs V2 Events
### V1 event source: `Barrier()` outcome
Current effects:
- mark `InSync`
- update `replicaFlushedLSN`
- mark degraded on error
V2 event mapping:
- `BarrierSuccess`
- `BarrierFailure`
- `PromotionHealthy`
### V1 event source: reconnect handshake
Current effects:
- `Connecting`
- choose `InSync`, `CatchingUp`, or `NeedsRebuild`
V2 event mapping:
- `ReconnectObserved`
- `RecoveryFeasible`
- `RecoveryReservationGranted`
- `ReconnectNeedsRebuild`
### V1 event source: retention budget evaluation
Current effects:
- stale replica becomes `NeedsRebuild`
V2 event mapping:
- `RecoverabilityExpired`
- `BackgroundJanitorNeedsRebuild`
### V1 event source: rebuild assignment and `StartRebuild`
Current effects:
- role becomes `RoleRebuilding`
- run baseline + trailing catch-up
- rejoin later via reassignment
V2 event mapping:
- `StartRebuild`
- `RebuildBaseApplied`
- `RebuildReservationLost`
- `RebuildCompleteReadyForPromotion`
## Main Gaps Between V1 And V2
### 1. V1 has shipper state, but not a pure FSM
Current V1 state is embedded in:
- transport logic
- barrier logic
- retention logic
- rebuild orchestration
V2 goal:
- one pure FSM that owns state and anchors
- transport/session code only executes actions
### 2. V1 does not model reservation explicitly
Current V1 asks, roughly:
- is WAL still retained?
V2 must ask:
- is `(startLSN, endLSN]` fully recoverable?
- can the primary reserve that window until recovery completes?
### 3. V1 has no explicit promotion debounce state
Current V1 goes effectively:
- caught up -> `InSync`
V2 adds:
- `PromotionHold`
### 4. V1 rebuild completion is control-plane indirect
Current V1:
- old `NeedsRebuild` shipper stays stuck
- master reassigns
- fresh shipper bootstraps
V2 likely wants:
- cleaner local FSM transitions, even if control plane still participates
### 5. V1 does not yet encode recovery classes
Current V1 is mostly WAL-centric.
V2 should support:
- `WALInline`
- `ExtentReferenced`
without leaking storage details into replica state.
## What Should Stay From V1
These V1 ideas are solid and should be preserved:
1. `replicaFlushedLSN` as sync truth
2. barrier-driven durability confirmation
3. explicit `NeedsRebuild`
4. per-replica status reporting to master
5. retention budgets eventually forcing rebuild
6. rebuild as a separate path from normal catch-up
## What Should Move In V2
These are the main redesign items:
1. move scattered shipper/recovery state into one pure FSM
2. separate transport/session phases from durable FSM state
3. add `Bootstrapping` and `PromotionHold`
4. add recoverability proof and reservation as first-class concepts
5. make replay/rebuild admission depend on reservation, not just present-time checks
6. cleanly separate:
- control-plane role FSM
- per-replica replication FSM
## Bottom Line
WAL V1 already contains most of the important primitives:
- durable progress
- barrier truth
- catch-up
- rebuild detection
- master-visible per-replica state
What V2 changes is not the existence of these ideas.
It changes their organization:
- from scattered transport/rebuild logic
- to one explicit, testable FSM with recovery reservations and cleaner state boundaries

277
sw-block/design/wal-v2-tiny-prototype.md

@@ -0,0 +1,277 @@
# WAL V2 Tiny Prototype
Date: 2026-03-26
Status: design/prototyping plan
Purpose: validate the core V2 replication logic before committing to a broader redesign
## Goal
Build a small, non-production prototype that proves the core V2 ideas:
1. `ExtentBackend` abstraction
2. 3-tier replication FSM
3. async ordered sender loop
4. barrier-driven durability tracking
5. short-gap catch-up vs long-gap rebuild boundary
6. recovery feasibility and reservation semantics
This prototype is for discovering:
- state complexity
- recovery correctness
- sender-loop behavior
- performance shape
It is not for shipping.
## Prototype Scope
### 1. Extent backend isolation layer
Define a clean backend interface for extent reads/writes.
Initial implementation:
- `FileBackend`
- normal Linux file
- `pread`
- `pwrite`
- optional `fallocate`
Do not start with raw-device allocation.
The point is to stabilize:
- extent semantics
- base-image import/export assumptions
- checkpoint/snapshot integration points
### 2. V2 asynchronous replication FSM
Build a pure in-memory FSM for one replica.
FSM owns:
- state
- anchor LSNs
- transition legality
- sync eligibility
- action suggestions
- recovery reservation metadata
Target state set:
- `Bootstrapping`
- `InSync`
- `Lagging`
- `CatchingUp`
- `PromotionHold`
- `NeedsRebuild`
- `Rebuilding`
- `CatchUpAfterRebuild`
- `Failed`
The FSM must not do:
- network I/O
- disk I/O
- goroutine management
### 3. Sender loop + barrier primitive
For each replica:
- one ordered sender goroutine
- one non-blocking enqueue path from primary write path
- one barrier/progress path
Primary write path:
1. allocate `LSN`
2. append local WAL/journal metadata
3. enqueue to sender loop
4. return according to durability mode
The sender loop is responsible for:
- live ordered send
- reconnect handling
- catch-up replay
- rebuild-tail replay
## Explicit Non-Goals
These are intentionally excluded from the tiny prototype:
- raw allocator
- garbage collection
- `NVMe-oF`
- `ublk`
- chain replication
- CSI / control plane
- multi-replica quorum
- encryption
- real snapshot storage optimization
These are extension layers, not the core logic being validated here.
## Design Principle
Those excluded items are not being rejected.
They are treated as:
- extensions of the core logic
The prototype should be designed so they can later plug in without rewriting the state machine.
## Suggested Layout
One reasonable layout:
- `weed/storage/blockvol/fsmv2/`
- `fsm.go`
- `events.go`
- `actions.go`
- `fsm_test.go`
- `weed/storage/blockvol/prototypev2/`
- `backend.go`
- `file_backend.go`
- `sender_loop.go`
- `barrier.go`
- `prototype_test.go`
Preferred direction:
- keep it close enough to production packages that later reuse is easy
- but clearly marked experimental
## Core Interfaces
### Extent backend
Example direction:
```go
type ExtentBackend interface {
ReadAt(p []byte, off int64) (int, error)
WriteAt(p []byte, off int64) (int, error)
Sync() error
Size() uint64
}
```
### FSM
Example direction:
```go
type ReplicaFSM struct {
// state
// epoch
// anchor LSNs
// reservation metadata
}
func (f *ReplicaFSM) Apply(evt ReplicaEvent) ([]ReplicaAction, error)
```
### Sender loop
Example direction:
```go
type SenderLoop struct {
// input queue
// FSM
// transport mock/adapter
}
```
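The ordering property the sender loop must provide can be shown with a tiny mock: one goroutine drains the queue in arrival (LSN) order while the write path only enqueues. Types here are illustrative, not the prototype's.

```go
// rec is a minimal replicated record for the ordering sketch.
type rec struct{ lsn uint64 }

// senderLoop drains its input channel in order; the "transport" is a
// mock that just records what was shipped.
type senderLoop struct {
	in   chan rec
	sent []uint64 // records shipped, in send order
}

// run is the single ordered sender goroutine for one replica.
// Reconnect, catch-up, and rebuild-tail branches would live here too.
func (s *senderLoop) run(done chan struct{}) {
	for r := range s.in {
		s.sent = append(s.sent, r.lsn)
	}
	close(done)
}
```

Because a channel is drained by exactly one goroutine, per-replica LSN order falls out of the structure rather than from locking in the write path.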
## What The Prototype Must Prove
### A. FSM correctness
The FSM must show that the state set is sufficient and coherent.
Key scenarios:
1. `Bootstrapping -> InSync`
2. `InSync -> Lagging -> CatchingUp -> PromotionHold -> InSync`
3. `Lagging -> NeedsRebuild -> Rebuilding -> CatchUpAfterRebuild -> PromotionHold -> InSync`
4. epoch change aborts catch-up
5. epoch change aborts rebuild
6. reservation-lost aborts catch-up
7. rebuild-too-slow aborts reconstruction
8. flapping replica does not instantly re-enter `InSync`
### B. Sender ordering
The sender loop must prove:
- strict LSN order per replica
- no inline ship races from concurrent writes
- decoupled foreground write path
### C. Barrier semantics
Barrier must prove:
- it waits on replica progress
- it uses `flushedLSN`, not transport guesses
- it can drive promotion eligibility cleanly
### D. Recovery boundary
Prototype must make the handoff explicit:
- recent lag -> reserved replay window
- long lag -> rebuild from base image + trailing replay
### E. Recovery reservation
Prototype must make this explicit:
- a window is not enough
- it must be provable and then reserved
- losing the reservation must abort recovery cleanly
## Performance Questions The Prototype Should Answer
Not benchmark headlines.
Instead:
1. how much contention disappears from the hot write path after removing inline ship
2. how queue depth grows under slow replicas
3. when catch-up stops converging
4. how expensive promotion hold is
5. how much complexity is added by rebuild-tail replay
6. how much complexity is added by reservation management
## Success Criteria
The tiny prototype is successful if it gives clear answers to:
1. can the V2 FSM be made explicit and testable?
2. does sender-loop ordering materially simplify the replication path?
3. is the catch-up vs rebuild boundary coherent under a moving primary head?
4. does reservation-based recoverability make the design safer and clearer?
5. does the architecture look simpler than extending WAL V1 forever?
## Failure Criteria
The prototype should be considered unsuccessful if:
1. state count explodes and remains hard to reason about
2. sender loop does not materially simplify ordering/recovery
3. promotion and recovery rules remain too coupled to ad hoc timers and network callbacks
4. rebuild-from-base + trailing replay is still ambiguous even in a controlled prototype
5. reservation handling turns into unbounded complexity
## Relationship To WAL V1
WAL V1 remains the current delivery line.
This prototype is not a replacement for:
- `CP13-6`
- `CP13-7`
- `CP13-8`
- `CP13-9`
It exists to inform what should move into WAL V2 after WAL V1 closes.
## Bottom Line
The tiny prototype should validate the core logic only:
- clean backend boundary
- explicit FSM
- ordered async sender
- recoverability as a proof-plus-reservation problem
- rebuild as a separate recovery mode, not a WAL accident

14
sw-block/private/README.md

@@ -0,0 +1,14 @@
# private
Deprecated in favor of `../.private/`.
Private working area for:
- design sketches
- draft notes
- temporary comparison docs
- prototype experiments not ready to move into shared design docs
Keep production-independent work here until it is ready to be promoted into:
- `../design/`
- `../prototype/`
- or the main repo docs under `learn/projects/sw-block/`

23
sw-block/prototype/README.md

@@ -0,0 +1,23 @@
# V2 Prototype
Experimental WAL V2 prototype code lives here.
Current prototype:
- `fsmv2/`: pure in-memory replication FSM prototype
- `volumefsm/`: volume-level orchestrator prototype above `fsmv2`
- `distsim/`: early distributed/data-correctness simulator with synthetic 4K block values
Rules:
- do not wire this directly into WAL V1 production code
- keep interfaces and tests focused on architecture learning
- promote pieces into production only after V2 design stabilizes
## Windows test workflow
Because normal `go test` may be blocked by Windows Defender when it executes temporary test binaries from `%TEMP%`, use:
```powershell
powershell -ExecutionPolicy Bypass -File .\sw-block\prototype\run-tests.ps1
```
This builds test binaries into the workspace and runs them directly.

1120
sw-block/prototype/distsim/cluster.go
File diff suppressed because it is too large

1004
sw-block/prototype/distsim/cluster_test.go
File diff suppressed because it is too large

BIN
sw-block/prototype/distsim/distsim.test.exe

266
sw-block/prototype/distsim/eventsim.go

@@ -0,0 +1,266 @@
// eventsim.go — timeout events and timer-race infrastructure.
//
// This file implements the eventsim layer within the distsim package.
// The two conceptual layers share the Cluster model but serve different purposes:
//
// distsim (protocol layer — cluster.go, protocol.go):
// - Protocol correctness: epoch fencing, barrier semantics, commit rules
// - Reference-state validation: AssertCommittedRecoverable
// - Recoverability logic: catch-up, rebuild, reservation
// - Promotion/lineage: candidate eligibility, ranking
// - Endpoint identity: address versioning, stale endpoint rejection
// - Control-plane flow: heartbeat → detect → assignment
//
// eventsim (timing/race layer — this file):
// - Explicit timeout events: barrier, catch-up, reservation
// - Timer-triggered state transitions
// - Same-tick race resolution: data events process before timeouts
// - Timeout cancellation on successful ack/convergence
//
// Boundary rule:
// - A scenario belongs in distsim tests if the bug is protocol-level
// (wrong state, wrong commit, wrong rejection).
// - A scenario belongs in eventsim tests if the bug is timing-level
// (race between ack and timeout, ordering of concurrent events).
// - Do not duplicate scenarios across both layers unless
// timer/event ordering is the actual bug surface.
package distsim
import "fmt"
// TimeoutKind identifies the type of timeout event.
type TimeoutKind string
const (
TimeoutBarrier TimeoutKind = "barrier"
TimeoutCatchup TimeoutKind = "catchup"
TimeoutReservation TimeoutKind = "reservation"
)
// PendingTimeout represents a registered timeout that has not yet fired or been cancelled.
type PendingTimeout struct {
Kind TimeoutKind
ReplicaID string
LSN uint64 // for barrier timeouts: which LSN's barrier
DeadlineAt uint64 // absolute tick when timeout fires
Cancelled bool
}
// FiredTimeout records a timeout that actually fired (was not cancelled in time).
type FiredTimeout struct {
PendingTimeout
FiredAt uint64
}
// barrierExpiredKey uniquely identifies a timed-out barrier instance.
type barrierExpiredKey struct {
ReplicaID string
LSN uint64
}
// RegisterTimeout adds a pending timeout to the cluster.
func (c *Cluster) RegisterTimeout(kind TimeoutKind, replicaID string, lsn uint64, deadline uint64) {
c.Timeouts = append(c.Timeouts, PendingTimeout{
Kind: kind,
ReplicaID: replicaID,
LSN: lsn,
DeadlineAt: deadline,
})
}
// CancelTimeout cancels a pending timeout matching the given kind, replica, and LSN.
// For catch-up/reservation timeouts, LSN is ignored (matched by kind+replica only).
func (c *Cluster) CancelTimeout(kind TimeoutKind, replicaID string, lsn uint64) {
for i := range c.Timeouts {
t := &c.Timeouts[i]
if t.Cancelled {
continue
}
if t.Kind != kind || t.ReplicaID != replicaID {
continue
}
if kind == TimeoutBarrier && t.LSN != lsn {
continue
}
t.Cancelled = true
c.logEvent(EventTimeoutCancelled, fmt.Sprintf("%s replica=%s lsn=%d", kind, replicaID, t.LSN))
}
}
// fireTimeouts checks all pending timeouts against the current tick.
// Called by Tick() AFTER message delivery, so data events (acks) get
// a chance to cancel timeouts before they fire. This is the same-tick
// race resolution rule: data before timers.
//
// State-guard rules (prevent stale timeout from mutating post-success state):
// - CatchupTimeout only fires if replica is still CatchingUp
// - ReservationTimeout only fires if replica is still CatchingUp
// - BarrierTimeout marks the barrier instance as expired (late acks rejected)
func (c *Cluster) fireTimeouts() {
var remaining []PendingTimeout
for i := range c.Timeouts {
t := c.Timeouts[i]
if t.Cancelled {
continue
}
if c.Now < t.DeadlineAt {
remaining = append(remaining, t)
continue
}
// Check whether the timeout still has authority to mutate state.
stale := false
switch t.Kind {
case TimeoutBarrier:
// Barrier timeouts always apply — they mark the instance as expired.
case TimeoutCatchup, TimeoutReservation:
// Only valid if replica is still CatchingUp. If already recovered
// or escalated, the timeout is stale and has no authority.
if n := c.Nodes[t.ReplicaID]; n == nil || n.ReplicaState != NodeStateCatchingUp {
stale = true
}
}
if stale {
c.IgnoredTimeouts = append(c.IgnoredTimeouts, FiredTimeout{
PendingTimeout: t,
FiredAt: c.Now,
})
c.logEvent(EventTimeoutIgnored, fmt.Sprintf("%s replica=%s lsn=%d (stale)", t.Kind, t.ReplicaID, t.LSN))
continue
}
// Timeout fires with authority.
c.FiredTimeouts = append(c.FiredTimeouts, FiredTimeout{
PendingTimeout: t,
FiredAt: c.Now,
})
c.logEvent(EventTimeoutFired, fmt.Sprintf("%s replica=%s lsn=%d", t.Kind, t.ReplicaID, t.LSN))
switch t.Kind {
case TimeoutBarrier:
c.removeQueuedBarrier(t.ReplicaID, t.LSN)
c.ExpiredBarriers[barrierExpiredKey{t.ReplicaID, t.LSN}] = true
case TimeoutCatchup:
c.Nodes[t.ReplicaID].ReplicaState = NodeStateNeedsRebuild
case TimeoutReservation:
c.Nodes[t.ReplicaID].ReplicaState = NodeStateNeedsRebuild
}
}
c.Timeouts = remaining
}
// removeQueuedBarrier removes a re-queuing barrier from the message queue
// after its timeout fires. Without this, the barrier would re-queue indefinitely.
func (c *Cluster) removeQueuedBarrier(replicaID string, lsn uint64) {
var kept []inFlightMessage
for _, item := range c.Queue {
if item.msg.Kind == MsgBarrier && item.msg.To == replicaID && item.msg.TargetLSN == lsn {
continue
}
kept = append(kept, item)
}
c.Queue = kept
}
// cancelRecoveryTimeouts cancels all catch-up and reservation timeouts for a replica.
// Called automatically by CatchUpWithEscalation on convergence or escalation,
// so stale timeouts cannot regress a replica that already recovered or failed.
func (c *Cluster) cancelRecoveryTimeouts(replicaID string) {
c.CancelTimeout(TimeoutCatchup, replicaID, 0)
c.CancelTimeout(TimeoutReservation, replicaID, 0)
}
// === Tick event log ===
// TickEventKind identifies the type of event within a tick.
type TickEventKind string
const (
EventDeliveryAccepted TickEventKind = "delivery_accepted"
EventDeliveryRejected TickEventKind = "delivery_rejected"
EventTimeoutFired TickEventKind = "timeout_fired"
EventTimeoutIgnored TickEventKind = "timeout_ignored"
EventTimeoutCancelled TickEventKind = "timeout_cancelled"
)
// TickEvent records a single event within a tick, in processing order.
type TickEvent struct {
Tick uint64
Kind TickEventKind
Detail string
}
// logEvent appends a tick event to the cluster's event log.
func (c *Cluster) logEvent(kind TickEventKind, detail string) {
c.TickLog = append(c.TickLog, TickEvent{Tick: c.Now, Kind: kind, Detail: detail})
}
// TickEventsAt returns all events recorded at a specific tick.
func (c *Cluster) TickEventsAt(tick uint64) []TickEvent {
var events []TickEvent
for _, e := range c.TickLog {
if e.Tick == tick {
events = append(events, e)
}
}
return events
}
// === Trace infrastructure ===
// Trace captures a snapshot of cluster state for debugging failed scenarios.
// Reusable across test files and future replay/debug tooling.
type Trace struct {
Tick uint64
CommittedLSN uint64
PrimaryID string
Epoch uint64
NodeStates map[string]string
FiredTimeouts []string
IgnoredTimeouts []string
TickEvents []TickEvent // full ordered event log
Deliveries int
Rejections int
QueueDepth int
}
// BuildTrace captures the current cluster state as a debuggable trace.
func BuildTrace(c *Cluster) Trace {
tr := Trace{
Tick: c.Now,
CommittedLSN: c.Coordinator.CommittedLSN,
PrimaryID: c.Coordinator.PrimaryID,
Epoch: c.Coordinator.Epoch,
NodeStates: map[string]string{},
TickEvents: c.TickLog,
Deliveries: len(c.Deliveries),
Rejections: len(c.Rejected),
QueueDepth: len(c.Queue),
}
for id, n := range c.Nodes {
tr.NodeStates[id] = fmt.Sprintf("role=%s state=%s epoch=%d flushed=%d running=%v",
n.Role, n.ReplicaState, n.Epoch, n.Storage.FlushedLSN, n.Running)
}
for _, ft := range c.FiredTimeouts {
tr.FiredTimeouts = append(tr.FiredTimeouts,
fmt.Sprintf("%s replica=%s lsn=%d fired_at=%d", ft.Kind, ft.ReplicaID, ft.LSN, ft.FiredAt))
}
for _, it := range c.IgnoredTimeouts {
tr.IgnoredTimeouts = append(tr.IgnoredTimeouts,
fmt.Sprintf("%s replica=%s lsn=%d stale_at=%d", it.Kind, it.ReplicaID, it.LSN, it.FiredAt))
}
return tr
}
// === Query helpers ===
// FiredTimeoutsByKind returns the count of fired timeouts of a specific kind.
func (c *Cluster) FiredTimeoutsByKind(kind TimeoutKind) int {
count := 0
for _, ft := range c.FiredTimeouts {
if ft.Kind == kind {
count++
}
}
return count
}

213
sw-block/prototype/distsim/phase02_advanced_test.go

@@ -0,0 +1,213 @@
package distsim
import (
"testing"
)
// ============================================================
// Phase 02: Item 4 — Smart WAL recovery-class transitions
// ============================================================
// Test: recovery starts with resolvable ExtentReferenced records,
// then a payload becomes unresolvable during active recovery.
// Protocol must detect the transition and abort to NeedsRebuild.
func TestP02_SmartWAL_RecoverableThenUnrecoverable(t *testing.T) {
// Build recovery records: first 3 WALInline, then 2 ExtentReferenced.
records := []RecoveryRecord{
{Write: Write{LSN: 1, Block: 1, Value: 1}, Class: RecoveryClassWALInline},
{Write: Write{LSN: 2, Block: 2, Value: 2}, Class: RecoveryClassWALInline},
{Write: Write{LSN: 3, Block: 3, Value: 3}, Class: RecoveryClassWALInline},
{Write: Write{LSN: 4, Block: 4, Value: 4}, Class: RecoveryClassExtentReferenced, PayloadResolvable: true},
{Write: Write{LSN: 5, Block: 5, Value: 5}, Class: RecoveryClassExtentReferenced, PayloadResolvable: true},
}
// Initially fully recoverable.
if !FullyRecoverable(records) {
t.Fatal("initial records should be fully recoverable")
}
// Simulate payload becoming unresolvable (e.g., extent generation GC'd).
records[4].PayloadResolvable = false
// Now NOT recoverable — must detect and abort.
if FullyRecoverable(records) {
t.Fatal("after payload loss, records should NOT be recoverable")
}
// Apply only the recoverable prefix.
state := ApplyRecoveryRecords(records[:4], 0, 4) // only first 4
if state[4] != 4 {
t.Fatalf("partial apply: block 4 should be 4, got %d", state[4])
}
if _, has5 := state[5]; has5 {
t.Fatal("block 5 should NOT be in partial state — payload was lost")
}
}
func TestP02_SmartWAL_MixedClassRecovery_FullSuccess(t *testing.T) {
records := []RecoveryRecord{
{Write: Write{LSN: 1, Block: 0, Value: 10}, Class: RecoveryClassWALInline},
{Write: Write{LSN: 2, Block: 1, Value: 20}, Class: RecoveryClassExtentReferenced, PayloadResolvable: true},
{Write: Write{LSN: 3, Block: 0, Value: 30}, Class: RecoveryClassWALInline},
{Write: Write{LSN: 4, Block: 2, Value: 40}, Class: RecoveryClassExtentReferenced, PayloadResolvable: true},
}
if !FullyRecoverable(records) {
t.Fatal("all resolvable — should be recoverable")
}
state := ApplyRecoveryRecords(records, 0, 4)
// Block 0 overwritten: 10 then 30.
if state[0] != 30 {
t.Fatalf("block 0: got %d, want 30", state[0])
}
if state[1] != 20 {
t.Fatalf("block 1: got %d, want 20", state[1])
}
if state[2] != 40 {
t.Fatalf("block 2: got %d, want 40", state[2])
}
}
func TestP02_SmartWAL_TimeVaryingAvailability(t *testing.T) {
// Simulate time-varying payload availability:
// At time T1, all records are recoverable.
// At time T2, one becomes unrecoverable.
// At time T3, it becomes recoverable again (re-pinned).
records := []RecoveryRecord{
{Write: Write{LSN: 1, Block: 0, Value: 1}, Class: RecoveryClassWALInline},
{Write: Write{LSN: 2, Block: 1, Value: 2}, Class: RecoveryClassExtentReferenced, PayloadResolvable: true},
{Write: Write{LSN: 3, Block: 2, Value: 3}, Class: RecoveryClassExtentReferenced, PayloadResolvable: true},
}
// T1: all recoverable.
if !FullyRecoverable(records) {
t.Fatal("T1: should be recoverable")
}
// T2: payload for LSN 2 lost.
records[1].PayloadResolvable = false
if FullyRecoverable(records) {
t.Fatal("T2: should NOT be recoverable after payload loss")
}
// T3: payload re-pinned (e.g., operator restores snapshot).
records[1].PayloadResolvable = true
if !FullyRecoverable(records) {
t.Fatal("T3: should be recoverable after re-pin")
}
}
// ============================================================
// Phase 02: Item 5 — Strengthen S5 (flapping replica)
// ============================================================
// S5 strengthened: repeated disconnect/reconnect with catch-up
// state tracking. If flapping exceeds budget, escalate to NeedsRebuild.
func TestP02_S5_FlappingWithStateTracking(t *testing.T) {
c := NewCluster(CommitSyncQuorum, "p", "r1", "r2")
c.MaxCatchupAttempts = 10 // generous for flapping
// Initial writes.
c.CommitWrite(1)
c.CommitWrite(2)
c.TickN(5)
r1 := c.Nodes["r1"]
// 5 flapping cycles — each creates a small gap then catches up.
for cycle := 0; cycle < 5; cycle++ {
c.Disconnect("p", "r1")
c.Disconnect("r1", "p")
c.CommitWrite(uint64(3 + cycle*2))
c.CommitWrite(uint64(4 + cycle*2))
c.TickN(3)
c.Connect("p", "r1")
c.Connect("r1", "p")
r1.ReplicaState = NodeStateCatchingUp
converged := c.CatchUpWithEscalation("r1", 100)
if !converged {
t.Fatalf("cycle %d: catch-up should converge for small gap", cycle)
}
if r1.ReplicaState != NodeStateInSync {
t.Fatalf("cycle %d: expected InSync, got %s", cycle, r1.ReplicaState)
}
}
// After 5 successful flaps, CatchupAttempts should be 0 (reset on success).
if r1.CatchupAttempts != 0 {
t.Fatalf("CatchupAttempts should be 0 after successful catch-ups, got %d", r1.CatchupAttempts)
}
// No unnecessary rebuild — r1 should NOT have a base snapshot.
if r1.Storage.BaseSnapshot != nil {
t.Fatal("flapping replica should not have been rebuilt — only WAL catch-up")
}
if err := c.AssertCommittedRecoverable("r1"); err != nil {
t.Fatal(err)
}
}
func TestP02_S5_FlappingExceedsBudget_EscalatesToNeedsRebuild(t *testing.T) {
c := NewCluster(CommitSyncQuorum, "p", "r1", "r2")
c.MaxCatchupAttempts = 3 // tight budget
c.CommitWrite(1)
c.TickN(5)
r1 := c.Nodes["r1"]
// Each flap creates a gap, but primary writes a LOT during disconnect.
// Catch-up recovers only 1 entry per attempt. After MaxCatchupAttempts
// non-convergent attempts, escalate.
for cycle := 0; cycle < 5; cycle++ {
c.Disconnect("p", "r1")
c.Disconnect("r1", "p")
// Large writes during disconnect.
for w := 0; w < 30; w++ {
c.CommitWrite(uint64(cycle*30+w+2) % 8)
}
c.TickN(3)
c.Connect("p", "r1")
c.Connect("r1", "p")
r1.ReplicaState = NodeStateCatchingUp
// Try catch-up with small batch — will not converge.
for attempt := 0; attempt < 5; attempt++ {
c.Disconnect("p", "r1")
c.Disconnect("r1", "p")
for w := 0; w < 10; w++ {
c.CommitWrite(uint64(200+cycle*50+attempt*10+w) % 8)
}
c.TickN(2)
c.Connect("p", "r1")
c.Connect("r1", "p")
c.CatchUpWithEscalation("r1", 1)
if r1.ReplicaState == NodeStateNeedsRebuild {
t.Logf("flapping escalated to NeedsRebuild at cycle %d, attempt %d", cycle, attempt)
// Verify: NeedsRebuild is sticky.
c.CatchUpWithEscalation("r1", 100)
if r1.ReplicaState != NodeStateNeedsRebuild {
t.Fatal("NeedsRebuild should be sticky — catch-up should not reset it")
}
return
}
}
}
// Reaching here means escalation never fired despite far exceeding the budget. That's wrong.
t.Fatalf("expected NeedsRebuild escalation, but state is %s with %d attempts",
r1.ReplicaState, r1.CatchupAttempts)
}

445
sw-block/prototype/distsim/phase02_candidate_test.go

@@ -0,0 +1,445 @@
package distsim
import (
"testing"
)
// ============================================================
// Phase 02: Coordinator candidate-selection tests
// Verifies promotion ranking under mixed replica states.
// ============================================================
func TestP02_CandidateSelection_AllEqual_AlphabeticalTieBreak(t *testing.T) {
c := NewCluster(CommitSyncQuorum, "p", "r1", "r2", "r3")
c.CommitWrite(1)
c.TickN(5)
// All replicas InSync with same FlushedLSN → alphabetical tie-break.
best := c.BestPromotionCandidate()
if best != "r1" {
t.Fatalf("all equal: expected r1 (alphabetical), got %q", best)
}
candidates := c.PromotionCandidates()
if len(candidates) != 3 {
t.Fatalf("expected 3 candidates, got %d", len(candidates))
}
for i, exp := range []string{"r1", "r2", "r3"} {
if candidates[i].ID != exp {
t.Fatalf("candidate[%d]: got %q, want %q", i, candidates[i].ID, exp)
}
}
}
func TestP02_CandidateSelection_HigherLSN_Wins(t *testing.T) {
c := NewCluster(CommitSyncQuorum, "p", "r1", "r2", "r3")
// Directly set FlushedLSN to simulate different progress.
// All InSync — higher LSN wins.
for _, id := range []string{"r1", "r2", "r3"} {
c.Nodes[id].ReplicaState = NodeStateInSync
}
c.Nodes["r1"].Storage.FlushedLSN = 10
c.Nodes["r2"].Storage.FlushedLSN = 20
c.Nodes["r3"].Storage.FlushedLSN = 15
best := c.BestPromotionCandidate()
if best != "r2" {
t.Fatalf("higher LSN: expected r2, got %q", best)
}
candidates := c.PromotionCandidates()
if candidates[0].ID != "r2" || candidates[1].ID != "r3" || candidates[2].ID != "r1" {
t.Fatalf("order: got [%s, %s, %s], want [r2, r3, r1]",
candidates[0].ID, candidates[1].ID, candidates[2].ID)
}
}
func TestP02_CandidateSelection_StoppedNode_Excluded(t *testing.T) {
c := NewCluster(CommitSyncQuorum, "p", "r1", "r2")
c.Nodes["r1"].Storage.FlushedLSN = 100
c.Nodes["r2"].Storage.FlushedLSN = 50
c.StopNode("r1") // highest LSN but stopped
best := c.BestPromotionCandidate()
if best != "r2" {
t.Fatalf("stopped excluded: expected r2, got %q", best)
}
// r1 should be last in ranking (not running).
candidates := c.PromotionCandidates()
if candidates[0].ID != "r2" {
t.Fatalf("first candidate should be r2, got %s", candidates[0].ID)
}
if candidates[1].Running {
t.Fatal("r1 should be marked not running")
}
}
func TestP02_CandidateSelection_InSync_Beats_CatchingUp(t *testing.T) {
c := NewCluster(CommitSyncQuorum, "p", "r1", "r2", "r3")
// r1: CatchingUp with highest LSN.
c.Nodes["r1"].ReplicaState = NodeStateCatchingUp
c.Nodes["r1"].Storage.FlushedLSN = 100
// r2: InSync with lower LSN.
c.Nodes["r2"].ReplicaState = NodeStateInSync
c.Nodes["r2"].Storage.FlushedLSN = 50
// r3: InSync with even lower LSN.
c.Nodes["r3"].ReplicaState = NodeStateInSync
c.Nodes["r3"].Storage.FlushedLSN = 40
// InSync with lower LSN beats CatchingUp with higher LSN.
best := c.BestPromotionCandidate()
if best != "r2" {
t.Fatalf("InSync beats CatchingUp: expected r2, got %q", best)
}
candidates := c.PromotionCandidates()
// r2 (InSync, 50), r3 (InSync, 40), r1 (CatchingUp, 100)
if candidates[0].ID != "r2" || candidates[1].ID != "r3" || candidates[2].ID != "r1" {
t.Fatalf("order: got [%s, %s, %s]", candidates[0].ID, candidates[1].ID, candidates[2].ID)
}
}
func TestP02_CandidateSelection_AllCatchingUp_HighestLSN_Wins(t *testing.T) {
c := NewCluster(CommitSyncQuorum, "p", "r1", "r2", "r3")
for _, id := range []string{"r1", "r2", "r3"} {
c.Nodes[id].ReplicaState = NodeStateCatchingUp
}
c.Nodes["r1"].Storage.FlushedLSN = 30
c.Nodes["r2"].Storage.FlushedLSN = 80
c.Nodes["r3"].Storage.FlushedLSN = 50
best := c.BestPromotionCandidate()
if best != "r2" {
t.Fatalf("all CatchingUp: expected r2 (highest LSN), got %q", best)
}
}
func TestP02_CandidateSelection_NeedsRebuild_Skipped(t *testing.T) {
c := NewCluster(CommitSyncQuorum, "p", "r1", "r2", "r3")
// r1: NeedsRebuild with highest LSN.
c.Nodes["r1"].ReplicaState = NodeStateNeedsRebuild
c.Nodes["r1"].Storage.FlushedLSN = 100
// r2: InSync with moderate LSN.
c.Nodes["r2"].ReplicaState = NodeStateInSync
c.Nodes["r2"].Storage.FlushedLSN = 50
// r3: CatchingUp with low LSN.
c.Nodes["r3"].ReplicaState = NodeStateCatchingUp
c.Nodes["r3"].Storage.FlushedLSN = 20
best := c.BestPromotionCandidate()
if best != "r2" {
t.Fatalf("NeedsRebuild skipped: expected r2, got %q", best)
}
candidates := c.PromotionCandidates()
// r2 (InSync, 50), r3 (CatchingUp, 20), r1 (NeedsRebuild, 100)
if candidates[0].ID != "r2" {
t.Fatalf("first should be r2, got %s", candidates[0].ID)
}
if candidates[2].ID != "r1" {
t.Fatalf("last should be r1 (NeedsRebuild), got %s", candidates[2].ID)
}
}
func TestP02_CandidateSelection_NoRunning_ReturnsEmpty(t *testing.T) {
c := NewCluster(CommitSyncQuorum, "p", "r1", "r2")
c.StopNode("r1")
c.StopNode("r2")
best := c.BestPromotionCandidate()
if best != "" {
t.Fatalf("no running: expected empty, got %q", best)
}
}
func TestP02_CandidateSelection_AfterPartition_RankingUpdates(t *testing.T) {
c := NewCluster(CommitSyncQuorum, "p", "r1", "r2", "r3")
c.CommitWrite(1)
c.CommitWrite(2)
c.TickN(5)
// All InSync at FlushedLSN=2. Best = r1 (alphabetical).
if best := c.BestPromotionCandidate(); best != "r1" {
t.Fatalf("before partition: expected r1, got %q", best)
}
// Partition r1. Write more via p+r2+r3.
c.Disconnect("p", "r1")
c.Disconnect("r1", "p")
// With 4 voting members (p, r1, r2, r3), quorum = 4/2+1 = 3; p+r2+r3 = 3, so commits still succeed (marginally).
c.CommitWrite(3)
c.CommitWrite(4)
c.CommitWrite(5)
c.TickN(5)
// r1 lagging, r2/r3 ahead.
c.Nodes["r1"].ReplicaState = NodeStateCatchingUp
// Now r2 or r3 should win (both InSync with higher LSN).
best := c.BestPromotionCandidate()
if best == "r1" {
t.Fatal("after partition: r1 should not be best (CatchingUp)")
}
if best != "r2" {
t.Fatalf("after partition: expected r2 (InSync, alphabetical tie-break), got %q", best)
}
t.Logf("after partition: best=%s", best)
}
func TestP02_CandidateSelection_MixedStates_FullRanking(t *testing.T) {
c := NewCluster(CommitSyncQuorum, "p", "r1", "r2", "r3", "r4", "r5")
// Set up a diverse state mix:
// r1: InSync, LSN=50
// r2: InSync, LSN=60 (highest InSync)
// r3: CatchingUp, LSN=80 (highest overall but CatchingUp)
// r4: NeedsRebuild, LSN=90 (highest but NeedsRebuild)
// r5: stopped, LSN=100 (highest but not running)
c.Nodes["r1"].ReplicaState = NodeStateInSync
c.Nodes["r1"].Storage.FlushedLSN = 50
c.Nodes["r2"].ReplicaState = NodeStateInSync
c.Nodes["r2"].Storage.FlushedLSN = 60
c.Nodes["r3"].ReplicaState = NodeStateCatchingUp
c.Nodes["r3"].Storage.FlushedLSN = 80
c.Nodes["r4"].ReplicaState = NodeStateNeedsRebuild
c.Nodes["r4"].Storage.FlushedLSN = 90
c.Nodes["r5"].Storage.FlushedLSN = 100
c.StopNode("r5")
best := c.BestPromotionCandidate()
if best != "r2" {
t.Fatalf("mixed states: expected r2 (InSync+highest among InSync), got %q", best)
}
candidates := c.PromotionCandidates()
// Expected order: r2(InSync,60), r1(InSync,50), r3(CatchingUp,80),
// r4(NeedsRebuild,90), r5(stopped,100)
expected := []string{"r2", "r1", "r3", "r4", "r5"}
for i, exp := range expected {
if candidates[i].ID != exp {
t.Fatalf("candidate[%d]: got %q, want %q", i, candidates[i].ID, exp)
}
}
t.Logf("full ranking: %s(%s/%d) > %s(%s/%d) > %s(%s/%d) > %s(%s/%d) > %s(%s/%d)",
candidates[0].ID, candidates[0].State, candidates[0].FlushedLSN,
candidates[1].ID, candidates[1].State, candidates[1].FlushedLSN,
candidates[2].ID, candidates[2].State, candidates[2].FlushedLSN,
candidates[3].ID, candidates[3].State, candidates[3].FlushedLSN,
candidates[4].ID, candidates[4].State, candidates[4].FlushedLSN)
}
func TestP02_CandidateSelection_AllNeedsRebuild_SafeDefaultEmpty(t *testing.T) {
c := NewCluster(CommitSyncQuorum, "p", "r1", "r2")
c.Nodes["r1"].ReplicaState = NodeStateNeedsRebuild
c.Nodes["r1"].Storage.FlushedLSN = 50
c.Nodes["r2"].ReplicaState = NodeStateNeedsRebuild
c.Nodes["r2"].Storage.FlushedLSN = 80
// Safe default: refuses NeedsRebuild candidates.
safe := c.BestPromotionCandidate()
if safe != "" {
t.Fatalf("safe default should return empty for all-NeedsRebuild, got %q", safe)
}
}
func TestP02_CandidateSelection_DesperationPromotion_ExplicitAPI(t *testing.T) {
c := NewCluster(CommitSyncQuorum, "p", "r1", "r2", "r3")
for _, id := range []string{"r1", "r2", "r3"} {
c.Nodes[id].ReplicaState = NodeStateNeedsRebuild
}
c.Nodes["r1"].Storage.FlushedLSN = 10
c.Nodes["r2"].Storage.FlushedLSN = 30
c.Nodes["r3"].Storage.FlushedLSN = 20
safe := c.BestPromotionCandidate()
if safe != "" {
t.Fatalf("safe default should return empty, got %q", safe)
}
desperate := c.BestPromotionCandidateDesperate()
if desperate != "r2" {
t.Fatalf("desperation: expected r2 (highest LSN), got %q", desperate)
}
}
// === Candidate eligibility tests ===
func TestP02_CandidateEligibility_Running(t *testing.T) {
c := NewCluster(CommitSyncQuorum, "p", "r1", "r2")
c.CommitWrite(1)
c.TickN(5)
e := c.EvaluateCandidateEligibility("r1")
if !e.Eligible {
t.Fatalf("running InSync replica should be eligible, reasons: %v", e.Reasons)
}
c.StopNode("r1")
e = c.EvaluateCandidateEligibility("r1")
if e.Eligible {
t.Fatal("stopped replica should not be eligible")
}
if e.Reasons[0] != "not_running" {
t.Fatalf("expected not_running reason, got %v", e.Reasons)
}
}
func TestP02_CandidateEligibility_EpochAlignment(t *testing.T) {
c := NewCluster(CommitSyncQuorum, "p", "r1", "r2")
c.CommitWrite(1)
c.TickN(5)
// Manually desync r1's epoch.
c.Nodes["r1"].Epoch = c.Coordinator.Epoch - 1
e := c.EvaluateCandidateEligibility("r1")
if e.Eligible {
t.Fatal("epoch-misaligned replica should not be eligible")
}
found := false
for _, r := range e.Reasons {
if r == "epoch_misaligned" {
found = true
}
}
if !found {
t.Fatalf("expected epoch_misaligned reason, got %v", e.Reasons)
}
}
func TestP02_CandidateEligibility_StateIneligible(t *testing.T) {
c := NewCluster(CommitSyncQuorum, "p", "r1", "r2")
c.CommitWrite(1)
c.TickN(5)
for _, state := range []ReplicaNodeState{NodeStateNeedsRebuild, NodeStateRebuilding} {
c.Nodes["r1"].ReplicaState = state
e := c.EvaluateCandidateEligibility("r1")
if e.Eligible {
t.Fatalf("%s should not be eligible", state)
}
}
// CatchingUp IS eligible (data may be mostly current).
c.Nodes["r1"].ReplicaState = NodeStateCatchingUp
e := c.EvaluateCandidateEligibility("r1")
if !e.Eligible {
t.Fatalf("CatchingUp should be eligible, reasons: %v", e.Reasons)
}
}
func TestP02_CandidateEligibility_InsufficientCommittedPrefix(t *testing.T) {
c := NewCluster(CommitSyncQuorum, "p", "r1", "r2")
c.CommitWrite(1)
c.TickN(5)
// r1 has FlushedLSN=1, CommittedLSN=1 → eligible.
e := c.EvaluateCandidateEligibility("r1")
if !e.Eligible {
t.Fatalf("r1 at committed prefix should be eligible, reasons: %v", e.Reasons)
}
// Manually set r1 behind committed prefix.
c.Nodes["r1"].Storage.FlushedLSN = 0
e = c.EvaluateCandidateEligibility("r1")
if e.Eligible {
t.Fatal("FlushedLSN=0 with CommittedLSN=1 should not be eligible")
}
found := false
for _, r := range e.Reasons {
if r == "insufficient_committed_prefix" {
found = true
}
}
if !found {
t.Fatalf("expected insufficient_committed_prefix reason, got %v", e.Reasons)
}
}
func TestP02_CandidateEligibility_InSyncButLagging_Rejected(t *testing.T) {
// Scenario from a review finding: r1 is InSync with correct epoch but FlushedLSN << CommittedLSN.
// r2 is CatchingUp but has the committed prefix. r2 should be selected over r1.
c := NewCluster(CommitSyncQuorum, "p", "r1", "r2", "r3")
// Set committed prefix high.
c.Coordinator.CommittedLSN = 100
// r1: InSync, correct epoch, but FlushedLSN=1. Ineligible.
c.Nodes["r1"].ReplicaState = NodeStateInSync
c.Nodes["r1"].Storage.FlushedLSN = 1
// r2: CatchingUp, correct epoch, FlushedLSN=100. Eligible.
c.Nodes["r2"].ReplicaState = NodeStateCatchingUp
c.Nodes["r2"].Storage.FlushedLSN = 100
// r3: InSync, correct epoch, FlushedLSN=100. Eligible.
c.Nodes["r3"].ReplicaState = NodeStateInSync
c.Nodes["r3"].Storage.FlushedLSN = 100
// r1 is ineligible despite being InSync.
e1 := c.EvaluateCandidateEligibility("r1")
if e1.Eligible {
t.Fatal("r1 (InSync, FlushedLSN=1, CommittedLSN=100) should be ineligible")
}
// r2 and r3 are eligible.
e2 := c.EvaluateCandidateEligibility("r2")
if !e2.Eligible {
t.Fatalf("r2 should be eligible, reasons: %v", e2.Reasons)
}
// BestPromotionCandidate should pick r3 (InSync with prefix) over r2 (CatchingUp).
best := c.BestPromotionCandidate()
if best != "r3" {
t.Fatalf("expected r3 (InSync+prefix), got %q", best)
}
// r1 must NOT be in the eligible list at all.
eligible := c.EligiblePromotionCandidates()
for _, pc := range eligible {
if pc.ID == "r1" {
t.Fatal("r1 should not appear in eligible candidates")
}
}
t.Logf("committed-prefix gate: r1(InSync/flushed=1) rejected, r3(InSync/flushed=100) selected")
}
func TestP02_CandidateEligibility_EligiblePromotionCandidates(t *testing.T) {
c := NewCluster(CommitSyncQuorum, "p", "r1", "r2", "r3", "r4")
c.CommitWrite(1)
c.TickN(5)
// r1: InSync, eligible
// r2: NeedsRebuild, ineligible
c.Nodes["r2"].ReplicaState = NodeStateNeedsRebuild
// r3: stopped, ineligible
c.StopNode("r3")
// r4: epoch misaligned, ineligible
c.Nodes["r4"].Epoch = 0
eligible := c.EligiblePromotionCandidates()
if len(eligible) != 1 {
t.Fatalf("expected 1 eligible candidate, got %d", len(eligible))
}
if eligible[0].ID != "r1" {
t.Fatalf("expected r1 as only eligible, got %s", eligible[0].ID)
}
// BestPromotionCandidate uses eligibility.
best := c.BestPromotionCandidate()
if best != "r1" {
t.Fatalf("BestPromotionCandidate should return r1, got %q", best)
}
}

371
sw-block/prototype/distsim/phase02_network_test.go

@@ -0,0 +1,371 @@
package distsim
import (
"testing"
)
// ============================================================
// Phase 02: Delayed/drop network + multi-node reservation expiry
// ============================================================
// --- Item 4: Stale delayed messages after heal/promote ---
// Scenario: messages from old primary are in-flight when partition heals
// and a new primary is promoted. The stale messages arrive AFTER the
// promotion. They must be rejected by epoch fencing.
func TestP02_DelayedStaleMessages_AfterPromote(t *testing.T) {
c := NewCluster(CommitSyncQuorum, "A", "B", "C")
// Phase 1: A writes, ships to B and C.
c.CommitWrite(1)
c.CommitWrite(2)
c.TickN(5)
// Phase 2: A writes more, but we manually enqueue delayed delivery
// to simulate in-flight messages when partition happens.
c.CommitWrite(3) // LSN 3 ships normally
// Don't tick yet — messages are in the queue.
// Phase 3: Partition A from everyone, promote B.
c.Disconnect("A", "B")
c.Disconnect("B", "A")
c.Disconnect("A", "C")
c.Disconnect("C", "A")
c.StopNode("A")
c.Promote("B")
// Phase 4: Manually inject stale messages as if they were delayed in the network.
// These represent A's write(3) + barrier(3) that were in-flight when A crashed.
staleEpoch := c.Coordinator.Epoch - 1
c.InjectMessage(Message{
Kind: MsgWrite, From: "A", To: "B", Epoch: staleEpoch,
Write: Write{LSN: 3, Block: 3, Value: 3},
}, c.Now+1)
c.InjectMessage(Message{
Kind: MsgBarrier, From: "A", To: "B", Epoch: staleEpoch,
TargetLSN: 3,
}, c.Now+2)
c.InjectMessage(Message{
Kind: MsgWrite, From: "A", To: "C", Epoch: staleEpoch,
Write: Write{LSN: 3, Block: 3, Value: 3},
}, c.Now+1)
// Phase 5: Tick to deliver stale messages.
committedBefore := c.Coordinator.CommittedLSN
c.TickN(5)
// All stale messages must be rejected — either by epoch fencing or node-down.
epochRejects := c.RejectedByReason(RejectEpochMismatch)
nodeDownRejects := c.RejectedByReason(RejectNodeDown)
totalRejects := epochRejects + nodeDownRejects
if totalRejects == 0 {
t.Fatal("stale delayed messages were not rejected")
}
// Committed prefix must not change from stale messages.
if c.Coordinator.CommittedLSN != committedBefore {
t.Fatalf("stale delayed messages changed committed prefix: before=%d after=%d",
committedBefore, c.Coordinator.CommittedLSN)
}
// Data correct on new primary.
if err := c.AssertCommittedRecoverable("B"); err != nil {
t.Fatalf("data incorrect after stale delayed messages: %v", err)
}
t.Logf("stale delayed messages: %d rejected by epoch_mismatch", epochRejects)
}
// Scenario: old barrier ACK arrives after promotion with long delay.
// This is different from S18 — the delay is network-level, not restart-level.
func TestP02_DelayedBarrierAck_LongNetworkDelay(t *testing.T) {
c := NewCluster(CommitSyncAll, "p", "r1")
c.CommitWrite(1)
c.TickN(5)
// Write 2 — barrier sent to r1.
c.CommitWrite(2)
c.TickN(2) // barrier in flight
// Promote r1 (simulate primary failure + promotion).
c.StopNode("p")
c.Promote("r1")
committedBefore := c.Coordinator.CommittedLSN
// Long-delayed barrier ack from r1 → dead primary p.
c.InjectMessage(Message{
Kind: MsgBarrierAck, From: "r1", To: "p",
Epoch: c.Coordinator.Epoch - 1, TargetLSN: 2,
}, c.Now+10)
c.TickN(15)
// Must be rejected — p is dead and epoch is stale.
nodeDownRejects := c.RejectedByReason(RejectNodeDown)
epochRejects := c.RejectedByReason(RejectEpochMismatch)
if nodeDownRejects == 0 && epochRejects == 0 {
t.Fatal("delayed barrier ack should be rejected (node down or epoch mismatch)")
}
// Stale ack must not advance committed prefix.
if c.Coordinator.CommittedLSN != committedBefore {
t.Fatalf("stale ack changed committed prefix: before=%d after=%d",
committedBefore, c.Coordinator.CommittedLSN)
}
}
// Scenario: write ships to replica, network drops the write but delivers
// the barrier. Barrier should timeout or detect missing data.
func TestP02_DroppedWrite_BarrierDelivered_Stalls(t *testing.T) {
c := NewCluster(CommitSyncAll, "p", "r1")
c.CommitWrite(1)
c.TickN(5)
// Write 2 — but drop the write message to r1 (link down for data only).
// We simulate by writing but not ticking, then dropping queued writes.
c.CommitWrite(2) // enqueues write(2) + barrier(2) to r1
// Remove only the write message from the queue (simulate selective drop).
var kept []inFlightMessage
for _, item := range c.Queue {
if item.msg.Kind == MsgWrite && item.msg.To == "r1" && item.msg.Write.LSN == 2 {
continue // drop this write
}
kept = append(kept, item)
}
c.Queue = kept
// Tick — barrier arrives at r1 but r1 doesn't have LSN 2.
// Barrier should re-queue (waiting for data).
c.TickN(10)
// Assert 1: sync_all blocked — CommittedLSN stuck at 1.
if c.Coordinator.CommittedLSN != 1 {
t.Fatalf("sync_all should be blocked at LSN 1, got committed=%d", c.Coordinator.CommittedLSN)
}
// Assert 2: LSN 2 is pending but NOT committed.
p2 := c.Pending[2]
if p2 == nil {
t.Fatal("LSN 2 should be pending")
}
if p2.Committed {
t.Fatal("LSN 2 committed under sync_all but r1 never received the write — safety violation")
}
// Assert 3: barrier still re-queuing — stall proven positively.
barrierRequeued := false
for _, item := range c.Queue {
if item.msg.Kind == MsgBarrier && item.msg.To == "r1" && item.msg.TargetLSN == 2 {
barrierRequeued = true
break
}
}
if !barrierRequeued {
t.Fatal("barrier for LSN 2 should still be re-queuing — stall not proven")
}
t.Logf("dropped write stall proven: committed=%d, pending[2].committed=%v, barrier re-queuing=%v",
c.Coordinator.CommittedLSN, p2.Committed, barrierRequeued)
}
// --- Item 5: Multi-node reservation expiry / rebuild timeout ---
// Scenario: RF=3 cluster. Two replicas need catch-up. One's reservation
// expires during recovery. Must handle correctly: one rebuilds, one catches up.
func TestP02_MultiNode_ReservationExpiry_MixedOutcome(t *testing.T) {
// 5 nodes: p+r3+r4 provide quorum (3 of 5) while r1+r2 are disconnected.
c := NewCluster(CommitSyncQuorum, "p", "r1", "r2", "r3", "r4")
// Write initial data.
for i := uint64(1); i <= 10; i++ {
c.CommitWrite(i % 4)
}
c.TickN(5)
// Take snapshot for rebuild.
c.Primary().Storage.TakeSnapshot("snap-1", c.Coordinator.CommittedLSN)
// r1+r2 disconnect. r3+r4 stay for quorum (p+r3+r4 = 3 of 5).
c.Disconnect("p", "r1")
c.Disconnect("r1", "p")
c.Disconnect("p", "r2")
c.Disconnect("r2", "p")
// Write more during disconnect — committed via p+r3+r4 quorum.
for i := uint64(11); i <= 30; i++ {
c.CommitWrite(i % 4)
}
c.TickN(5)
// Reconnect both.
c.Connect("p", "r1")
c.Connect("r1", "p")
c.Connect("p", "r2")
c.Connect("r2", "p")
// r1: reserved catch-up with tight expiry — MUST expire.
// 20 entries to replay, but only 2 ticks of budget.
r1 := c.Nodes["r1"]
shortExpiry := c.Now + 2
err := c.RecoverReplicaFromPrimaryReserved("r1", r1.Storage.FlushedLSN, c.Coordinator.CommittedLSN, shortExpiry)
if err == nil {
t.Fatal("r1 reservation must expire — 20 entries with 2-tick budget")
}
r1.ReplicaState = NodeStateNeedsRebuild
t.Logf("r1 reservation expired: %v", err)
// r2: full catch-up (no reservation pressure).
r2 := c.Nodes["r2"]
if err := c.RecoverReplicaFromPrimary("r2", r2.Storage.FlushedLSN, c.Coordinator.CommittedLSN); err != nil {
t.Fatalf("r2 full catch-up failed: %v", err)
}
r2.ReplicaState = NodeStateInSync
// Deterministic mixed outcome: r1=NeedsRebuild, r2=InSync.
if r1.ReplicaState != NodeStateNeedsRebuild {
t.Fatalf("r1 should be NeedsRebuild, got %s", r1.ReplicaState)
}
if r2.ReplicaState != NodeStateInSync {
t.Fatalf("r2 should be InSync, got %s", r2.ReplicaState)
}
// r2 data correct.
if err := c.AssertCommittedRecoverable("r2"); err != nil {
t.Fatalf("r2 data incorrect: %v", err)
}
// r1 rebuild from snapshot.
c.RebuildReplicaFromSnapshot("r1", "snap-1", c.Coordinator.CommittedLSN)
r1.ReplicaState = NodeStateInSync
if err := c.AssertCommittedRecoverable("r1"); err != nil {
t.Fatalf("r1 data incorrect after rebuild: %v", err)
}
t.Logf("mixed outcome proven: r1=NeedsRebuild→rebuilt, r2=InSync")
}
// Scenario: all replicas need rebuild but only one snapshot exists.
// First replica rebuilds from snapshot, second must wait or use first
// replica as rebuild source.
func TestP02_MultiNode_AllNeedRebuild(t *testing.T) {
// Use 5 nodes so quorum (3 of 5) can be met with p+r3+r4 while r1+r2 are down.
c := NewCluster(CommitSyncQuorum, "p", "r1", "r2", "r3", "r4")
c.MaxCatchupAttempts = 2
for i := uint64(1); i <= 5; i++ {
c.CommitWrite(i)
}
c.TickN(5)
c.Primary().Storage.TakeSnapshot("snap-all", c.Coordinator.CommittedLSN)
// r1 and r2 disconnect. r3+r4 stay connected so quorum (p+r3+r4) can commit.
c.Disconnect("p", "r1")
c.Disconnect("r1", "p")
c.Disconnect("p", "r2")
c.Disconnect("r2", "p")
for i := uint64(6); i <= 100; i++ {
c.CommitWrite(i % 8)
}
c.TickN(5)
// Try catch-up for r1 and r2 — both will escalate.
// Pattern: write while target disconnected, then try partial catch-up.
for _, id := range []string{"r1", "r2"} {
n := c.Nodes[id]
n.ReplicaState = NodeStateCatchingUp
for attempt := 0; attempt < 5; attempt++ {
// Write MORE while target is still disconnected (r3+r4 provide quorum).
for w := 0; w < 20; w++ {
c.CommitWrite(uint64(101+attempt*20+w) % 8)
}
c.TickN(3) // 3 ticks: deliver writes, barriers, then acks
// Now try catch-up (partial, batch=1). Target stays disconnected —
// RecoverReplicaFromPrimaryPartial reads directly from primary WAL.
c.CatchUpWithEscalation(id, 1)
if n.ReplicaState == NodeStateNeedsRebuild {
break
}
}
}
// Reconnect all for rebuild.
c.Connect("p", "r1")
c.Connect("r1", "p")
c.Connect("p", "r2")
c.Connect("r2", "p")
// Both should be NeedsRebuild.
if c.Nodes["r1"].ReplicaState != NodeStateNeedsRebuild {
t.Fatalf("r1: expected NeedsRebuild, got %s", c.Nodes["r1"].ReplicaState)
}
if c.Nodes["r2"].ReplicaState != NodeStateNeedsRebuild {
t.Fatalf("r2: expected NeedsRebuild, got %s", c.Nodes["r2"].ReplicaState)
}
// Rebuild both from snapshot.
c.RebuildReplicaFromSnapshot("r1", "snap-all", c.Coordinator.CommittedLSN)
c.RebuildReplicaFromSnapshot("r2", "snap-all", c.Coordinator.CommittedLSN)
c.Nodes["r1"].ReplicaState = NodeStateInSync
c.Nodes["r2"].ReplicaState = NodeStateInSync
// Both correct.
if err := c.AssertCommittedRecoverable("r1"); err != nil {
t.Fatal(err)
}
if err := c.AssertCommittedRecoverable("r2"); err != nil {
t.Fatal(err)
}
t.Logf("multi-node rebuild complete: both replicas recovered from snapshot")
}
// Scenario: rebuild timeout — rebuild takes too long, coordinator
// should be able to abort and retry or fail explicitly.
func TestP02_RebuildTimeout_PartialRebuildAborts(t *testing.T) {
c := NewCluster(CommitSyncQuorum, "p", "r1", "r2")
for i := uint64(1); i <= 20; i++ {
c.CommitWrite(i % 4)
}
c.TickN(5)
c.Primary().Storage.TakeSnapshot("snap-timeout", c.Coordinator.CommittedLSN)
// Write much more.
for i := uint64(21); i <= 100; i++ {
c.CommitWrite(i % 4)
}
c.TickN(5)
// r1 needs rebuild — use partial rebuild with small max.
lastRecovered, err := c.RebuildReplicaFromSnapshotPartial("r1", "snap-timeout", c.Coordinator.CommittedLSN, 5)
if err != nil {
t.Fatalf("partial rebuild: %v", err)
}
// Partial rebuild: not complete.
if lastRecovered >= c.Coordinator.CommittedLSN {
t.Fatal("expected partial rebuild, not complete")
}
// Mark r1 as Rebuilding: partial progress must not be treated as InSync.
c.Nodes["r1"].ReplicaState = NodeStateRebuilding
if c.Nodes["r1"].ReplicaState == NodeStateInSync {
t.Fatal("partial rebuild should not grant InSync")
}
// Full rebuild to complete.
c.RebuildReplicaFromSnapshot("r1", "snap-timeout", c.Coordinator.CommittedLSN)
c.Nodes["r1"].ReplicaState = NodeStateInSync
if err := c.AssertCommittedRecoverable("r1"); err != nil {
t.Fatal(err)
}
}

359
sw-block/prototype/distsim/phase02_test.go

@ -0,0 +1,359 @@
package distsim
import (
"testing"
)
// ============================================================
// Phase 02: Protocol-state assertions + version comparison
// ============================================================
// --- P0: Protocol-level rejection assertions ---
func TestP02_EpochFencing_AllStaleTrafficRejected(t *testing.T) {
c := NewCluster(CommitSyncQuorum, "p", "r1", "r2")
c.CommitWrite(1)
c.TickN(5)
// Partition + promote.
c.Disconnect("p", "r1")
c.Disconnect("r1", "p")
c.Disconnect("p", "r2")
c.Disconnect("r2", "p")
c.Promote("r1")
staleEpoch := c.Coordinator.Epoch - 1
c.Nodes["p"].Epoch = staleEpoch
// Stale writes through protocol.
delivered := c.StaleWrite("p", staleEpoch, 99)
// Protocol-level assertion: zero accepted, all rejected by epoch.
if delivered > 0 {
t.Fatalf("stale traffic accepted: %d messages passed fencing", delivered)
}
epochRejects := c.RejectedByReason(RejectEpochMismatch)
if epochRejects == 0 {
t.Fatal("no epoch rejections recorded — fencing not tracked")
}
// Delivery log must show explicit rejections (protocol behavior, not just final state).
totalRejected := 0
for _, d := range c.Deliveries {
if !d.Accepted {
totalRejected++
}
}
if totalRejected == 0 {
t.Fatal("delivery log has no rejections — protocol behavior not recorded")
}
t.Logf("protocol-level: %d rejected, %d epoch_mismatch", totalRejected, epochRejects)
}
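The fencing rule this test asserts can be sketched independently of the simulator. This is a minimal model under assumed semantics; `Msg` and `acceptMessage` are hypothetical names, not the distsim API:

```go
package main

import "fmt"

// Msg is a hypothetical message stamped with the sender's epoch.
type Msg struct {
	Epoch uint64
}

// acceptMessage models the rule the test checks: a receiver at
// currentEpoch rejects any message carrying an older epoch, and the
// rejection is counted explicitly rather than silently dropped.
func acceptMessage(currentEpoch uint64, m Msg, rejects *int) bool {
	if m.Epoch < currentEpoch {
		*rejects++ // explicit epoch_mismatch record
		return false
	}
	return true
}

func main() {
	rejects := 0
	// A stale writer at epoch 1 against a cluster at epoch 2.
	fmt.Println(acceptMessage(2, Msg{Epoch: 1}, &rejects), rejects) // false 1
}
```

The explicit reject counter mirrors what `RejectedByReason(RejectEpochMismatch)` surfaces: the test wants evidence of protocol behavior, not just a quiet drop.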
func TestP02_AcceptedDeliveries_Tracked(t *testing.T) {
c := NewCluster(CommitSyncQuorum, "p", "r1", "r2")
c.CommitWrite(1)
c.TickN(5)
// Should have accepted write + barrier deliveries.
accepted := c.AcceptedCount()
if accepted == 0 {
t.Fatal("no accepted deliveries recorded")
}
acceptedWrites := c.AcceptedByKind(MsgWrite)
if acceptedWrites == 0 {
t.Fatal("no accepted write deliveries")
}
t.Logf("after 1 write: %d accepted total, %d writes", accepted, acceptedWrites)
}
// --- P1: S20 protocol-level closure ---
func TestP02_S20_StaleTraffic_CommittedPrefixUnchanged(t *testing.T) {
c := NewCluster(CommitSyncQuorum, "A", "B", "C")
c.CommitWrite(1)
c.CommitWrite(2)
c.TickN(5)
// Partition A, promote B.
c.Disconnect("A", "B")
c.Disconnect("B", "A")
c.Disconnect("A", "C")
c.Disconnect("C", "A")
c.Promote("B")
c.Nodes["A"].Epoch = c.Coordinator.Epoch - 1
// B writes (new epoch).
c.CommitWrite(3)
c.TickN(5)
committedBefore := c.Coordinator.CommittedLSN
// A stale writes through protocol.
c.StaleWrite("A", c.Nodes["A"].Epoch, 99)
// Protocol assertion: committed prefix unchanged by stale traffic.
committedAfter := c.Coordinator.CommittedLSN
if committedAfter != committedBefore {
t.Fatalf("stale traffic changed committed prefix: before=%d after=%d", committedBefore, committedAfter)
}
// All stale messages rejected by epoch.
if c.RejectedByReason(RejectEpochMismatch) == 0 {
t.Fatal("no epoch rejections for stale traffic")
}
}
// --- P1: S6 protocol-level closure ---
func TestP02_S6_NonConvergent_ExplicitStateTransition(t *testing.T) {
c := NewCluster(CommitSyncQuorum, "p", "r1", "r2")
c.MaxCatchupAttempts = 3
for i := uint64(1); i <= 5; i++ {
c.CommitWrite(i)
}
c.TickN(5)
c.Disconnect("p", "r1")
c.Disconnect("r1", "p")
for i := uint64(6); i <= 100; i++ {
c.CommitWrite(i % 8)
}
c.TickN(5)
c.Connect("p", "r1")
c.Connect("r1", "p")
r1 := c.Nodes["r1"]
r1.ReplicaState = NodeStateCatchingUp
// Protocol assertion: state transitions are explicit.
// Track the state at each step.
var stateTrace []ReplicaNodeState
for attempt := 0; attempt < 10; attempt++ {
c.Disconnect("p", "r1")
c.Disconnect("r1", "p")
for w := 0; w < 20; w++ {
c.CommitWrite(uint64(101+attempt*20+w) % 8)
}
c.TickN(2)
c.Connect("p", "r1")
c.Connect("r1", "p")
c.CatchUpWithEscalation("r1", 1)
stateTrace = append(stateTrace, r1.ReplicaState)
if r1.ReplicaState == NodeStateNeedsRebuild {
break
}
}
// Must have explicit state transitions: CatchingUp → ... → NeedsRebuild.
if r1.ReplicaState != NodeStateNeedsRebuild {
t.Fatalf("expected NeedsRebuild, got %s", r1.ReplicaState)
}
// Trace must show CatchingUp before NeedsRebuild.
hasCatchingUp := false
for _, s := range stateTrace {
if s == NodeStateCatchingUp {
hasCatchingUp = true
}
}
if !hasCatchingUp {
t.Fatal("state trace should include CatchingUp before NeedsRebuild")
}
t.Logf("state trace: %v", stateTrace)
}
// --- P1: S18 protocol-level closure ---
func TestP02_S18_DelayedAck_ExplicitRejection(t *testing.T) {
c := NewCluster(CommitSyncQuorum, "p", "r1", "r2")
c.CommitWrite(1)
c.TickN(5)
// Write 2 without r1 ack.
c.Disconnect("r1", "p")
c.CommitWrite(2)
c.TickN(3)
// Restart primary with epoch bump.
c.StopNode("p")
c.Coordinator.Epoch++
for _, n := range c.Nodes {
if n.Running {
n.Epoch = c.Coordinator.Epoch
}
}
c.StartNode("p")
committedBefore := c.Coordinator.CommittedLSN
deliveriesBefore := len(c.Deliveries)
// Reconnect r1, deliver stale ack.
c.Connect("r1", "p")
c.Connect("p", "r1")
oldAck := Message{
Kind: MsgBarrierAck, From: "r1", To: "p",
Epoch: c.Coordinator.Epoch - 1, TargetLSN: 2,
}
c.deliver(oldAck)
c.refreshCommits()
// Protocol assertion 1: committed prefix unchanged.
if c.Coordinator.CommittedLSN > committedBefore {
t.Fatalf("stale ack advanced prefix: %d → %d", committedBefore, c.Coordinator.CommittedLSN)
}
// Protocol assertion 2: the delivery was explicitly recorded as rejected.
newDeliveries := c.Deliveries[deliveriesBefore:]
found := false
for _, d := range newDeliveries {
if !d.Accepted && d.Reason == RejectEpochMismatch && d.Msg.Kind == MsgBarrierAck {
found = true
}
}
if !found {
t.Fatal("stale ack not recorded as epoch_mismatch rejection in delivery log")
}
}
// --- P2: Version comparison ---
func TestP02_VersionComparison_BriefDisconnect(t *testing.T) {
// Same scenario under V1, V1.5, V2 — different expected outcomes.
for _, tc := range []struct {
version ProtocolVersion
expectCatchup bool
expectRebuild bool
}{
{ProtocolV1, false, false}, // V1: no catch-up, stays degraded
{ProtocolV15, true, false}, // V1.5: catch-up possible if address stable
{ProtocolV2, true, false}, // V2: catch-up allowed for this recoverable short-gap case
} {
t.Run(string(tc.version), func(t *testing.T) {
c := NewClusterWithProtocol(CommitSyncQuorum, tc.version, "p", "r1", "r2")
c.MaxCatchupAttempts = 5
c.CommitWrite(1)
c.CommitWrite(2)
c.TickN(5)
// Brief disconnect.
c.Disconnect("p", "r1")
c.Disconnect("r1", "p")
c.CommitWrite(3)
c.CommitWrite(4)
c.TickN(5)
c.Connect("p", "r1")
c.Connect("r1", "p")
canCatchup := c.Protocol.CanAttemptCatchup(true)
if canCatchup != tc.expectCatchup {
t.Fatalf("CanAttemptCatchup: got %v, want %v", canCatchup, tc.expectCatchup)
}
if canCatchup {
// Catch up r1.
r1 := c.Nodes["r1"]
r1.ReplicaState = NodeStateCatchingUp
converged := c.CatchUpWithEscalation("r1", 100)
if !converged {
t.Fatal("expected catch-up to converge for short gap")
}
if r1.ReplicaState != NodeStateInSync {
t.Fatalf("expected InSync after catch-up, got %s", r1.ReplicaState)
}
}
})
}
}
func TestP02_VersionComparison_BriefDisconnectActions(t *testing.T) {
for _, tc := range []struct {
version ProtocolVersion
addrStable bool
recoverable bool
expectAction string
}{
{ProtocolV1, true, true, "degrade_or_rebuild"},
{ProtocolV15, true, true, "catchup_if_history_survives"},
{ProtocolV15, false, true, "stall_or_control_plane_recovery"},
{ProtocolV2, true, true, "reserved_catchup"},
{ProtocolV2, false, false, "explicit_rebuild"},
} {
t.Run(string(tc.version)+"_stable="+boolStr(tc.addrStable)+"_recoverable="+boolStr(tc.recoverable), func(t *testing.T) {
policy := ProtocolPolicy{Version: tc.version}
action := policy.BriefDisconnectAction(tc.addrStable, tc.recoverable)
if action != tc.expectAction {
t.Fatalf("BriefDisconnectAction(%v,%v): got %q, want %q", tc.addrStable, tc.recoverable, action, tc.expectAction)
}
})
}
}
func TestP02_VersionComparison_TailChasing(t *testing.T) {
for _, tc := range []struct {
version ProtocolVersion
expectAction string
}{
{ProtocolV1, "degrade"},
{ProtocolV15, "stall_or_rebuild"},
{ProtocolV2, "abort_to_rebuild"},
} {
t.Run(string(tc.version), func(t *testing.T) {
policy := ProtocolPolicy{Version: tc.version}
action := policy.TailChasingAction(false) // non-convergent
if action != tc.expectAction {
t.Fatalf("TailChasingAction(false): got %q, want %q", action, tc.expectAction)
}
})
}
}
func TestP02_VersionComparison_RestartRejoin(t *testing.T) {
for _, tc := range []struct {
version ProtocolVersion
addrStable bool
expectAction string
}{
{ProtocolV1, true, "control_plane_only"},
{ProtocolV1, false, "control_plane_only"},
{ProtocolV15, true, "background_reconnect_or_control_plane"},
{ProtocolV15, false, "control_plane_only"},
{ProtocolV2, true, "direct_reconnect_or_control_plane"},
{ProtocolV2, false, "explicit_reassignment_or_rebuild"},
} {
t.Run(string(tc.version)+"_stable="+boolStr(tc.addrStable), func(t *testing.T) {
policy := ProtocolPolicy{Version: tc.version}
action := policy.RestartRejoinAction(tc.addrStable)
if action != tc.expectAction {
t.Fatalf("RestartRejoinAction(%v): got %q, want %q", tc.addrStable, action, tc.expectAction)
}
})
}
}
func TestP02_VersionComparison_V15RestartAddressInstability(t *testing.T) {
v15 := ProtocolPolicy{Version: ProtocolV15}
v2 := ProtocolPolicy{Version: ProtocolV2}
if got := v15.RestartRejoinAction(false); got != "control_plane_only" {
t.Fatalf("v1.5 changed-address restart should fall back to control plane, got %q", got)
}
if got := v2.ChangedAddressRestartAction(true); got != "explicit_reassignment_then_catchup" {
t.Fatalf("v2 changed-address recoverable restart should use explicit reassignment + catch-up, got %q", got)
}
if got := v2.ChangedAddressRestartAction(false); got != "explicit_reassignment_or_rebuild" {
t.Fatalf("v2 changed-address unrecoverable restart should go to explicit reassignment/rebuild, got %q", got)
}
}
func boolStr(b bool) string {
if b {
return "true"
}
return "false"
}

434
sw-block/prototype/distsim/phase02_v1_failures_test.go

@ -0,0 +1,434 @@
package distsim
import (
"testing"
)
// ============================================================
// Phase 02 P2: Real V1/V1.5 failure reproductions
// Source: actual Phase 13 hardware behavior and CP13-8 findings
// ============================================================
// --- Scenario: Changed-address restart (CP13-8 T4b) ---
// Real bug: replica restarts on a different port. V1.5 shipper retries
// the old address forever. Catch-up never succeeds because the old
// address is dead.
func TestP02_V1_ChangedAddressRestart_NeverRecovers(t *testing.T) {
c := NewClusterWithProtocol(CommitSyncQuorum, ProtocolV1, "p", "r1", "r2")
c.CommitWrite(1)
c.CommitWrite(2)
c.TickN(5)
// r1 restarts with changed address — endpoint version bumps.
c.StopNode("r1")
c.Coordinator.Epoch++
for _, n := range c.Nodes {
if n.Running {
n.Epoch = c.Coordinator.Epoch
}
}
c.RestartNodeWithNewAddress("r1")
// Messages from primary to r1 now rejected: stale endpoint.
staleRejects := c.RejectedByReason(RejectStaleEndpoint)
// Writes accumulate — r1 can't receive (endpoint mismatch).
for i := uint64(3); i <= 12; i++ {
c.CommitWrite(i)
}
c.TickN(5)
// Verify: messages rejected by stale endpoint, not just link down.
newStaleRejects := c.RejectedByReason(RejectStaleEndpoint) - staleRejects
if newStaleRejects == 0 {
t.Fatal("V1: writes to r1 should be rejected by stale_endpoint")
}
// V1: no recovery trigger available.
trigger, _, ok := c.TriggerRecoverySession("r1")
if ok {
t.Fatalf("V1 should not trigger recovery, got %s", trigger)
}
// Gap confirmed.
r1 := c.Nodes["r1"]
if err := c.AssertCommittedRecoverable("r1"); err == nil {
t.Fatal("V1: r1 should have data inconsistency")
}
t.Logf("V1: gap=%d, %d stale_endpoint rejections, no recovery path",
c.Coordinator.CommittedLSN-r1.Storage.FlushedLSN, newStaleRejects)
}
func TestP02_V15_ChangedAddressRestart_RetriesToStaleAddress(t *testing.T) {
c := NewClusterWithProtocol(CommitSyncQuorum, ProtocolV15, "p", "r1", "r2")
c.CommitWrite(1)
c.CommitWrite(2)
c.TickN(5)
// r1 restarts with changed address — endpoint version bumps.
c.StopNode("r1")
c.Coordinator.Epoch++
for _, n := range c.Nodes {
if n.Running {
n.Epoch = c.Coordinator.Epoch
}
}
c.RestartNodeWithNewAddress("r1")
// Writes accumulate — rejected by stale endpoint.
for i := uint64(3); i <= 12; i++ {
c.CommitWrite(i)
}
c.TickN(5)
// V1.5: recovery trigger fails — address mismatch detected.
trigger, _, ok := c.TriggerRecoverySession("r1")
if ok {
t.Fatalf("V1.5 should not trigger recovery with changed address, got %s", trigger)
}
// Heartbeat reveals new endpoint, but V1.5 can only do control_plane_only.
report := c.ReportHeartbeat("r1")
update := c.CoordinatorDetectEndpointChange(report)
if update == nil {
t.Fatal("coordinator should detect endpoint change")
}
// V1.5: does NOT apply assignment update — no mechanism to update primary.
if got := c.Protocol.ChangedAddressRestartAction(true); got != "control_plane_only" {
t.Fatalf("V1.5: got %q, want control_plane_only", got)
}
// Gap persists, data inconsistency.
r1 := c.Nodes["r1"]
if err := c.AssertCommittedRecoverable("r1"); err == nil {
t.Fatal("V1.5: r1 should have data inconsistency")
}
t.Logf("V1.5: gap=%d, stale endpoint blocks recovery — control_plane_only",
c.Coordinator.CommittedLSN-r1.Storage.FlushedLSN)
}
func TestP02_V2_ChangedAddressRestart_ExplicitReassignment(t *testing.T) {
c := NewClusterWithProtocol(CommitSyncQuorum, ProtocolV2, "p", "r1", "r2")
c.MaxCatchupAttempts = 5
c.CommitWrite(1)
c.CommitWrite(2)
c.TickN(5)
// r1 restarts with changed address — endpoint version bumps.
c.StopNode("r1")
c.Coordinator.Epoch++
for _, n := range c.Nodes {
if n.Running {
n.Epoch = c.Coordinator.Epoch
}
}
c.RestartNodeWithNewAddress("r1")
// Writes accumulate — rejected by stale endpoint.
for i := uint64(3); i <= 12; i++ {
c.CommitWrite(i)
}
c.TickN(5)
// Before control-plane flow: recovery trigger fails (stale endpoint).
trigger, _, ok := c.TriggerRecoverySession("r1")
if ok {
t.Fatalf("V2: recovery should fail before assignment update, got %s", trigger)
}
// Step 1: heartbeat discovers new endpoint.
report := c.ReportHeartbeat("r1")
update := c.CoordinatorDetectEndpointChange(report)
if update == nil {
t.Fatal("coordinator should detect endpoint change")
}
// Step 2: coordinator applies assignment — primary learns new address.
c.ApplyAssignmentUpdate(*update)
// Step 3: recovery trigger now succeeds (endpoint matches).
trigger, _, ok = c.TriggerRecoverySession("r1")
if !ok || trigger != TriggerReassignment {
t.Fatalf("V2: expected reassignment trigger after update, got %s/%v", trigger, ok)
}
// Step 4: catch-up via protocol.
converged := c.CatchUpWithEscalation("r1", 100)
if !converged {
t.Fatal("V2: catch-up should converge after reassignment")
}
// Data correct after full control-plane flow.
if err := c.AssertCommittedRecoverable("r1"); err != nil {
t.Fatalf("V2: data incorrect after reassignment+catchup: %v", err)
}
t.Logf("V2: recovered via heartbeat→detect→assignment→trigger→catchup")
}
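The gating this test walks through can be summarized as an ordered flow: recovery is blocked until an assignment update refreshes the primary's view of the replica's endpoint. A hypothetical sketch (none of these names are the distsim API):

```go
package main

import "fmt"

// recoveryFlow models the V2 changed-address path: when the primary's
// view of the replica endpoint is stale, the control-plane leg
// (heartbeat -> detect -> assignment) must run before the recovery
// trigger can succeed; with a matching endpoint it proceeds directly.
func recoveryFlow(primaryView, actual string) []string {
	var steps []string
	if primaryView != actual {
		// Only the control plane can refresh the primary's view.
		steps = append(steps, "heartbeat", "detect_endpoint_change", "apply_assignment")
	}
	steps = append(steps, "trigger_reassignment", "catchup")
	return steps
}

func main() {
	fmt.Println(recoveryFlow("r1:9001", "r1:9002"))
	// [heartbeat detect_endpoint_change apply_assignment trigger_reassignment catchup]
}
```

This also explains why the earlier `TriggerRecoverySession` call in the test must fail: before `ApplyAssignmentUpdate`, the primary still holds the stale endpoint.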
// --- Scenario: Same-address transient outage ---
// Common case: brief network hiccup, same ports.
func TestP02_V1_TransientOutage_Degrades(t *testing.T) {
c := NewClusterWithProtocol(CommitSyncQuorum, ProtocolV1, "p", "r1", "r2")
c.CommitWrite(1)
c.TickN(5)
// Brief partition.
c.Disconnect("p", "r1")
c.Disconnect("r1", "p")
c.CommitWrite(2)
c.CommitWrite(3)
c.TickN(5)
// Heal.
c.Connect("p", "r1")
c.Connect("r1", "p")
// V1: no catch-up. r1 stays at flushed=1.
if c.Protocol.CanAttemptCatchup(true) {
t.Fatal("V1 should not catch up even with a stable address")
}
c.TickN(5)
r1 := c.Nodes["r1"]
// Note: r1 may still reach CommittedLSN here if messages enqueued before
// the disconnect are delivered after healing. That is a model artifact (a
// V1 "accident"), not a protocol guarantee, so we log instead of asserting.
t.Logf("V1 transient outage: flushed=%d committed=%d action=%s",
r1.Storage.FlushedLSN, c.Coordinator.CommittedLSN,
c.Protocol.BriefDisconnectAction(true, true))
}
func TestP02_V15_TransientOutage_CatchesUp(t *testing.T) {
c := NewClusterWithProtocol(CommitSyncQuorum, ProtocolV15, "p", "r1", "r2")
c.MaxCatchupAttempts = 5
c.CommitWrite(1)
c.TickN(5)
c.Disconnect("p", "r1")
c.Disconnect("r1", "p")
c.CommitWrite(2)
c.CommitWrite(3)
c.TickN(5)
c.Connect("p", "r1")
c.Connect("r1", "p")
// V1.5: catch-up works if address stable.
if !c.Protocol.CanAttemptCatchup(true) {
t.Fatal("V1.5 should catch up with a stable address")
}
r1 := c.Nodes["r1"]
r1.ReplicaState = NodeStateCatchingUp
converged := c.CatchUpWithEscalation("r1", 100)
if !converged {
t.Fatal("V1.5: should converge for short gap with stable address")
}
if r1.ReplicaState != NodeStateInSync {
t.Fatalf("V1.5: expected InSync, got %s", r1.ReplicaState)
}
if err := c.AssertCommittedRecoverable("r1"); err != nil {
t.Fatal(err)
}
t.Logf("V1.5 transient outage: recovered via catch-up, flushed=%d", r1.Storage.FlushedLSN)
}
func TestP02_V2_TransientOutage_ReservedCatchup(t *testing.T) {
c := NewClusterWithProtocol(CommitSyncQuorum, ProtocolV2, "p", "r1", "r2")
c.MaxCatchupAttempts = 5
c.CommitWrite(1)
c.TickN(5)
c.Disconnect("p", "r1")
c.Disconnect("r1", "p")
c.CommitWrite(2)
c.CommitWrite(3)
c.TickN(5)
c.Connect("p", "r1")
c.Connect("r1", "p")
// V2: reserved catch-up — explicit recoverability check.
action := c.Protocol.BriefDisconnectAction(true, true)
if action != "reserved_catchup" {
t.Fatalf("V2 brief disconnect: got %q, want reserved_catchup", action)
}
r1 := c.Nodes["r1"]
r1.ReplicaState = NodeStateCatchingUp
converged := c.CatchUpWithEscalation("r1", 100)
if !converged {
t.Fatal("V2: should converge for short gap")
}
if err := c.AssertCommittedRecoverable("r1"); err != nil {
t.Fatal(err)
}
t.Logf("V2 transient outage: reserved catch-up succeeded")
}
// --- Scenario: Slow control-plane recovery ---
// Source: real Phase 13 hardware behavior.
// Data path recovers fast. Control plane (master) is slow to re-issue
// assignments. During this window, V1/V1.5 behavior differs from V2.
func TestP02_SlowControlPlane_V1_WaitsForMaster(t *testing.T) {
c := NewClusterWithProtocol(CommitSyncQuorum, ProtocolV1, "p", "r1", "r2")
c.CommitWrite(1)
c.TickN(5)
// r1 disconnects. Stays disconnected through outage + control-plane delay.
c.Disconnect("p", "r1")
c.Disconnect("r1", "p")
// Writes accumulate: outage write + delay-window writes. r1 misses all.
for i := uint64(2); i <= 10; i++ {
c.CommitWrite(i)
}
c.TickN(5)
// Data path heals — but V1 has no catch-up protocol.
c.Connect("p", "r1")
c.Connect("r1", "p")
// V1: no recovery trigger even with address stable.
trigger, _, ok := c.TriggerRecoverySession("r1")
if ok {
t.Fatalf("V1 should not trigger recovery, got %s", trigger)
}
// r1 is behind: FlushedLSN=1, CommittedLSN=10. Gap = 9.
r1 := c.Nodes["r1"]
gap := c.Coordinator.CommittedLSN - r1.Storage.FlushedLSN
if gap < 9 {
t.Fatalf("V1: expected gap >= 9, got %d", gap)
}
// V1 data inconsistency: r1 missed writes 2-10. No self-heal mechanism.
err := c.AssertCommittedRecoverable("r1")
if err == nil {
t.Fatal("V1: r1 should have data inconsistency — no catch-up mechanism")
}
t.Logf("V1 slow control-plane: gap=%d, data inconsistency — %v", gap, err)
}
func TestP02_SlowControlPlane_V15_BackgroundReconnect(t *testing.T) {
c := NewClusterWithProtocol(CommitSyncQuorum, ProtocolV15, "p", "r1", "r2")
c.MaxCatchupAttempts = 5
c.CommitWrite(1)
c.TickN(5)
// r1 disconnects. Stays disconnected through outage + delay window.
c.Disconnect("p", "r1")
c.Disconnect("r1", "p")
// Writes accumulate while r1 is disconnected.
for i := uint64(2); i <= 10; i++ {
c.CommitWrite(i)
}
c.TickN(5)
// Data path heals.
c.Connect("p", "r1")
c.Connect("r1", "p")
// Before catch-up: r1 is behind (FlushedLSN=1, CommittedLSN=10).
r1 := c.Nodes["r1"]
if r1.Storage.FlushedLSN >= c.Coordinator.CommittedLSN {
t.Fatal("V1.5: r1 should be behind before catch-up")
}
if err := c.AssertCommittedRecoverable("r1"); err == nil {
t.Fatal("V1.5: r1 should have data gap before catch-up")
}
// V1.5 policy: background reconnect if address stable.
if c.Protocol.RestartRejoinAction(true) != "background_reconnect_or_control_plane" {
t.Fatal("V1.5 stable-address should be background_reconnect_or_control_plane")
}
// V1.5 recovery trigger: background reconnect (address stable → endpoint matches).
trigger, _, ok := c.TriggerRecoverySession("r1")
if !ok || trigger != TriggerBackgroundReconnect {
t.Fatalf("V1.5: expected background_reconnect trigger, got %s/%v", trigger, ok)
}
// r1.ReplicaState is now CatchingUp (set by TriggerRecoverySession).
converged := c.CatchUpWithEscalation("r1", 100)
if !converged {
t.Fatal("V1.5: should catch up with stable address")
}
// After catch-up: data correct.
if err := c.AssertCommittedRecoverable("r1"); err != nil {
t.Fatalf("V1.5: data should be correct after catch-up — %v", err)
}
// V1.5 changed-address: falls back to control plane.
if c.Protocol.RestartRejoinAction(false) != "control_plane_only" {
t.Fatal("V1.5 changed-address should fall back to control_plane_only")
}
t.Logf("V1.5 slow control-plane: caught up %d entries via background reconnect",
c.Coordinator.CommittedLSN-1)
}
func TestP02_SlowControlPlane_V2_DirectReconnect(t *testing.T) {
c := NewClusterWithProtocol(CommitSyncQuorum, ProtocolV2, "p", "r1", "r2")
c.MaxCatchupAttempts = 5
c.CommitWrite(1)
c.TickN(5)
// r1 disconnects. Stays disconnected through outage + delay window.
c.Disconnect("p", "r1")
c.Disconnect("r1", "p")
// Writes accumulate while r1 is disconnected.
for i := uint64(2); i <= 10; i++ {
c.CommitWrite(i)
}
c.TickN(5)
// Data path heals.
c.Connect("p", "r1")
c.Connect("r1", "p")
// Before catch-up: r1 is behind.
r1 := c.Nodes["r1"]
if r1.Storage.FlushedLSN >= c.Coordinator.CommittedLSN {
t.Fatal("V2: r1 should be behind before direct reconnect")
}
if err := c.AssertCommittedRecoverable("r1"); err == nil {
t.Fatal("V2: r1 should have data gap before direct reconnect")
}
// V2 policy: direct reconnect, doesn't wait for master.
if c.Protocol.RestartRejoinAction(true) != "direct_reconnect_or_control_plane" {
t.Fatal("V2 should be direct_reconnect_or_control_plane")
}
// V2 recovery trigger: reassignment (address stable → endpoint matches).
trigger, _, ok := c.TriggerRecoverySession("r1")
if !ok || trigger != TriggerReassignment {
t.Fatalf("V2: expected reassignment trigger, got %s/%v", trigger, ok)
}
converged := c.CatchUpWithEscalation("r1", 100)
if !converged {
t.Fatal("V2: should catch up directly without master intervention")
}
// After catch-up: data correct.
if err := c.AssertCommittedRecoverable("r1"); err != nil {
t.Fatalf("V2: data should be correct after direct reconnect — %v", err)
}
t.Logf("V2 slow control-plane: caught up %d entries immediately via direct reconnect",
c.Coordinator.CommittedLSN-1)
}

287
sw-block/prototype/distsim/phase03_p2_race_test.go

@ -0,0 +1,287 @@
package distsim
import (
"testing"
)
// ============================================================
// Phase 03 P2: Timer-ordering races
// ============================================================
// --- Race 1: Concurrent barrier timeouts under sync_quorum ---
func TestP03_P2_ConcurrentBarrierTimeout_QuorumEdge(t *testing.T) {
// RF=3 (p, r1, r2). sync_quorum (quorum=2).
// Both r1 and r2 have barrier timeouts. r1's ack arrives in the same tick
// as r2's timeout fires. The "data before timers" rule means:
// r1 ack processed → cancels r1 timeout → r2 timeout fires → quorum = p+r1 = 2 → committed.
c := NewCluster(CommitSyncQuorum, "p", "r1", "r2")
c.BarrierTimeoutTicks = 5
// r2 disconnected — barrier will time out. r1 connected — will ack.
c.Disconnect("p", "r2")
c.Disconnect("r2", "p")
c.CommitWrite(1) // barrier to r1 at Now+2, barrier to r2 at Now+2
// Barrier timeout for both at Now+5.
c.TickN(10)
// r1 ack arrived → cancelled r1 timeout.
// r2 barrier timed out (link down, no ack).
firedBarriers := c.FiredTimeoutsByKind(TimeoutBarrier)
if firedBarriers != 1 {
t.Fatalf("expected 1 barrier timeout (r2), got %d", firedBarriers)
}
// Event log: r1's barrier timeout was cancelled (ack arrived earlier).
// r2's barrier timeout fired. Verify both are in the TickLog.
var cancelCount, fireCount int
for _, e := range c.TickLog {
if e.Kind == EventTimeoutCancelled {
cancelCount++
}
if e.Kind == EventTimeoutFired {
fireCount++
}
}
if cancelCount != 1 {
t.Fatalf("expected 1 timeout cancel (r1 ack), got %d", cancelCount)
}
if fireCount != 1 {
t.Fatalf("expected 1 timeout fire (r2), got %d", fireCount)
}
// Quorum: p + r1 = 2 of 3 → committed.
if c.Coordinator.CommittedLSN != 1 {
t.Fatalf("LSN 1 should commit via quorum (p+r1), committed=%d", c.Coordinator.CommittedLSN)
}
// DurableOn: p=true (self-ack), r1=true (ack), r2 NOT set (timed out).
p1 := c.Pending[1]
if !p1.DurableOn["p"] || !p1.DurableOn["r1"] {
t.Fatal("DurableOn should have p and r1")
}
if p1.DurableOn["r2"] {
t.Fatal("DurableOn should NOT have r2 (timed out)")
}
t.Logf("concurrent timeout: r1 acked, r2 timed out, quorum met, committed=%d", c.Coordinator.CommittedLSN)
}
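The quorum arithmetic asserted above (p+r1 = 2 of 3 commits; r2's timeout is irrelevant once quorum is met) reduces to counting durable acks against a majority threshold. A minimal sketch with hypothetical names:

```go
package main

import "fmt"

// committedAt models the sync_quorum commit rule the test exercises: an
// LSN commits once durable acks (including the primary's self-ack)
// reach a majority of the replication factor.
func committedAt(durableOn map[string]bool, rf int) bool {
	quorum := rf/2 + 1
	acks := 0
	for _, ok := range durableOn {
		if ok {
			acks++
		}
	}
	return acks >= quorum
}

func main() {
	// p self-acked, r1 acked, r2's barrier timed out: 2 of 3 meets quorum.
	fmt.Println(committedAt(map[string]bool{"p": true, "r1": true}, 3)) // true
	// p alone: 1 of 3 does not.
	fmt.Println(committedAt(map[string]bool{"p": true}, 3)) // false
}
```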
func TestP03_P2_ConcurrentBarrierTimeout_BothTimeout_NoQuorum(t *testing.T) {
// Both r1 and r2 disconnected. Both timeouts fire. Quorum = p alone = 1 < 2.
c := NewCluster(CommitSyncQuorum, "p", "r1", "r2")
c.BarrierTimeoutTicks = 5
c.Disconnect("p", "r1")
c.Disconnect("r1", "p")
c.Disconnect("p", "r2")
c.Disconnect("r2", "p")
c.CommitWrite(1)
c.TickN(10)
// Both barriers timed out.
if c.FiredTimeoutsByKind(TimeoutBarrier) != 2 {
t.Fatalf("expected 2 barrier timeouts, got %d", c.FiredTimeoutsByKind(TimeoutBarrier))
}
// No quorum — uncommitted.
if c.Coordinator.CommittedLSN != 0 {
t.Fatalf("LSN 1 should not commit without quorum, committed=%d", c.Coordinator.CommittedLSN)
}
// Neither r1 nor r2 in DurableOn.
p1 := c.Pending[1]
if p1.DurableOn["r1"] || p1.DurableOn["r2"] {
t.Fatal("DurableOn should not have r1 or r2")
}
t.Logf("both timeouts: no quorum, LSN 1 uncommitted")
}
func TestP03_P2_ConcurrentBarrierTimeout_SameTick_AckAndTimeout(t *testing.T) {
// The precise same-tick race: r1 ack arrives at exactly the tick when r2's
// timeout fires. Verify data-before-timers ordering in the event log.
c := NewCluster(CommitSyncQuorum, "p", "r1", "r2")
c.BarrierTimeoutTicks = 4 // timeout at Now+4
c.Disconnect("p", "r2")
c.Disconnect("r2", "p")
c.CommitWrite(1)
// Write at Now+1, barrier at Now+2, ack back at Now+3.
// Timeout for r1 at Now+4, timeout for r2 at Now+4.
// Tick to barrier ack arrival (tick 3): r1 ack delivered, cancels r1 timeout.
// Tick 4: r2 timeout fires. r1 timeout already cancelled.
c.TickN(6)
// Check event ordering at the timeout tick.
timeoutTick := uint64(0)
for _, ft := range c.FiredTimeouts {
timeoutTick = ft.FiredAt
}
events := c.TickEventsAt(timeoutTick)
// At the timeout tick, we should see: r2 timeout fired (r1 was cancelled earlier).
var firedDetails []string
for _, e := range events {
if e.Kind == EventTimeoutFired {
firedDetails = append(firedDetails, e.Detail)
}
}
if len(firedDetails) != 1 {
t.Fatalf("expected 1 timeout fire at tick %d, got %d: %v", timeoutTick, len(firedDetails), firedDetails)
}
// Committed via quorum.
if c.Coordinator.CommittedLSN != 1 {
t.Fatalf("committed=%d, want 1", c.Coordinator.CommittedLSN)
}
t.Logf("same-tick race: r1 ack cancelled at tick 3, r2 timeout fired at tick %d, committed=1", timeoutTick)
}
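The "data before timers" rule this race depends on can be isolated into a two-phase tick. This is a toy model of the assumed ordering, not the simulator's actual tick loop:

```go
package main

import "fmt"

// sim sketches the same-tick race: within one tick, deliveries are
// processed before timers, so an ack arriving at exactly its timeout
// deadline cancels the timer before the fire phase runs.
type sim struct {
	pendingAck   bool // an ack is deliverable this tick
	timerArmed   bool
	timeoutFired bool
}

func (s *sim) tick() {
	// Phase 1: deliver data. The ack cancels its barrier timeout.
	if s.pendingAck {
		s.pendingAck = false
		s.timerArmed = false
	}
	// Phase 2: fire timers that are still armed.
	if s.timerArmed {
		s.timerArmed = false
		s.timeoutFired = true
	}
}

func main() {
	s := &sim{pendingAck: true, timerArmed: true}
	s.tick()
	fmt.Println(s.timeoutFired) // false: the ack won the same-tick race
}
```

Reversing the two phases would flip the outcome, which is exactly the nondeterminism the event-log assertions above are guarding against.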
// --- Race 2: Epoch bump during active barrier timeout window ---
func TestP03_P2_EpochBumpDuringBarrierTimeout_CrossSurface(t *testing.T) {
// Three cleanup mechanisms interact for the same barrier:
// 1. Epoch fencing in deliver() rejects old-epoch messages
// 2. Barrier timeout in fireTimeouts() removes queued barriers + marks expired
// 3. ExpiredBarriers in deliver() rejects late acks
//
// Scenario: barrier re-queues (r1 missing data), epoch bumps, then timeout fires.
c := NewCluster(CommitSyncQuorum, "p", "r1", "r2")
c.BarrierTimeoutTicks = 10
c.CommitWrite(1) // write+barrier to r1, r2
// Drop write to r1 so barrier keeps re-queuing.
var kept []inFlightMessage
for _, item := range c.Queue {
if item.msg.Kind == MsgWrite && item.msg.To == "r1" && item.msg.Write.LSN == 1 {
continue
}
kept = append(kept, item)
}
c.Queue = kept
// Tick 1-3: r1's barrier delivers but r1 doesn't have data → re-queues.
// r2 gets write+barrier normally → acks.
c.TickN(3)
// Epoch bump: promote r2 (p stays running as demoted replica).
// This ensures the old-epoch barrier hits epoch fencing, not node_down.
if err := c.Promote("r2"); err != nil {
t.Fatal(err)
}
// Record state before timeout window.
epochRejectsBefore := c.RejectedByReason(RejectEpochMismatch)
// Tick 4-5: old-epoch barrier (p→r1) is in queue. deliver() rejects
// with epoch_mismatch (msg epoch=1 vs coordinator epoch=2).
c.TickN(2)
// Old barrier rejected by epoch fencing.
epochRejectsAfter := c.RejectedByReason(RejectEpochMismatch)
newEpochRejects := epochRejectsAfter - epochRejectsBefore
if newEpochRejects == 0 {
t.Fatal("old-epoch barrier should be rejected by epoch fencing")
}
// Tick past barrier timeout deadline.
c.TickN(10)
// Barrier timeout fires for r1/LSN 1 (removes any remaining queued copies).
if c.FiredTimeoutsByKind(TimeoutBarrier) == 0 {
t.Fatal("barrier timeout should fire for r1/LSN 1")
}
// Expired barrier marked.
if !c.ExpiredBarriers[barrierExpiredKey{"r1", 1}] {
t.Fatal("r1/LSN 1 should be in ExpiredBarriers")
}
// Inject late ack from r1 for LSN 1 at current epoch (to new primary r2).
// The barrier is expired — ack should be rejected by barrier_expired.
deliveriesBefore := len(c.Deliveries)
c.InjectMessage(Message{
Kind: MsgBarrierAck, From: "r1", To: "r2",
Epoch: c.Coordinator.Epoch, TargetLSN: 1,
}, c.Now+1)
c.TickN(2)
// Late ack rejected by barrier_expired.
lateRejected := false
for _, d := range c.Deliveries[deliveriesBefore:] {
if d.Msg.Kind == MsgBarrierAck && d.Msg.From == "r1" && d.Msg.TargetLSN == 1 {
if !d.Accepted && d.Reason == RejectBarrierExpired {
lateRejected = true
}
}
}
if !lateRejected {
t.Fatal("late ack for expired barrier should be rejected as barrier_expired")
}
// Verify event log shows the cross-surface interaction.
var epochRejectEvents, timeoutFireEvents int
for _, e := range c.TickLog {
if e.Kind == EventDeliveryRejected {
epochRejectEvents++
}
if e.Kind == EventTimeoutFired {
timeoutFireEvents++
}
}
if epochRejectEvents == 0 || timeoutFireEvents == 0 {
t.Fatalf("event log should show both epoch rejections (%d) and timeout fires (%d)",
epochRejectEvents, timeoutFireEvents)
}
t.Logf("cross-surface: epoch_rejects=%d, timeout_fires=%d, expired_barrier=true, late_ack_rejected=true",
newEpochRejects, c.FiredTimeoutsByKind(TimeoutBarrier))
}
// --- TickEvents trace verification ---
func TestP03_P2_TickEvents_OrderingVerifiable(t *testing.T) {
// Verify that TickEvents captures delivery → timeout ordering within a tick.
c := NewCluster(CommitSyncAll, "p", "r1")
c.BarrierTimeoutTicks = 5
c.CommitWrite(1)
c.TickN(10) // normal flow: ack cancels timeout
// TickLog should have events.
if len(c.TickLog) == 0 {
t.Fatal("TickLog should record events")
}
// Find delivery events and timeout events.
var deliveries, cancels int
for _, e := range c.TickLog {
switch e.Kind {
case EventDeliveryAccepted:
deliveries++
case EventTimeoutCancelled:
cancels++
}
}
if deliveries == 0 {
t.Fatal("should have delivery events")
}
if cancels == 0 {
t.Fatal("should have timeout cancel events (ack cancelled barrier timeout)")
}
// BuildTrace includes TickEvents.
trace := BuildTrace(c)
if len(trace.TickEvents) == 0 {
t.Fatal("BuildTrace should include TickEvents")
}
t.Logf("tick events: %d deliveries, %d cancels, %d total events",
deliveries, cancels, len(c.TickLog))
}

281
sw-block/prototype/distsim/phase03_race_test.go

@ -0,0 +1,281 @@
package distsim
import (
"testing"
)
// ============================================================
// Phase 03 P1: Race-focused tests with trace quality
// ============================================================
// --- Race 1: Promotion vs delayed catch-up timeout ---
func TestP03_Race_PromotionThenStaleCatchupTimeout(t *testing.T) {
// r1 is CatchingUp with a catch-up timeout registered.
// Before the timeout fires, primary crashes and r1 is promoted.
// The stale catch-up timeout must not regress r1 (now primary) to NeedsRebuild.
c := NewCluster(CommitSyncQuorum, "p", "r1", "r2")
c.CommitWrite(1)
c.CommitWrite(2)
c.TickN(5)
// r1 falls behind, starts catching up.
c.Disconnect("p", "r1")
c.Disconnect("r1", "p")
c.CommitWrite(3)
c.TickN(5)
c.Connect("p", "r1")
c.Connect("r1", "p")
r1 := c.Nodes["r1"]
r1.ReplicaState = NodeStateCatchingUp
c.RegisterTimeout(TimeoutCatchup, "r1", 0, c.Now+10)
// r1 catches up successfully.
converged := c.CatchUpWithEscalation("r1", 100)
if !converged {
t.Fatal("r1 should converge before promotion")
}
// Primary crashes. Promote r1.
c.StopNode("p")
if err := c.Promote("r1"); err != nil {
t.Fatal(err)
}
// Tick past the catch-up timeout deadline.
c.TickN(15)
// Stale timeout must not fire (was auto-cancelled on convergence).
if c.FiredTimeoutsByKind(TimeoutCatchup) != 0 {
t.Fatal("stale catch-up timeout must not fire after promotion")
}
// r1 must remain primary and running.
if r1.Role != RolePrimary {
t.Fatalf("r1 should be primary, got %s", r1.Role)
}
if r1.ReplicaState == NodeStateNeedsRebuild {
t.Fatal("stale timeout regressed promoted r1 to NeedsRebuild")
}
t.Logf("promotion vs timeout: stale catch-up timeout suppressed, r1 is primary")
}
func TestP03_Race_PromotionThenStaleBarrierTimeout(t *testing.T) {
// Barrier timeout registered for r1 at old epoch.
// Promotion bumps epoch. The stale barrier timeout fires but must not
// affect the new epoch's commit state.
c := NewCluster(CommitSyncAll, "p", "r1")
c.BarrierTimeoutTicks = 8
// Write 1 — barrier to r1. Disconnect r1 so barrier can't ack.
c.Disconnect("p", "r1")
c.Disconnect("r1", "p")
c.CommitWrite(1)
// Tick 2 — barrier timeout registered at Now+8.
c.TickN(2)
// Primary crashes, promote r1 (even though it doesn't have write 1).
c.StopNode("p")
c.StartNode("r1")
if err := c.Promote("r1"); err != nil {
t.Fatal(err)
}
// Snapshot committed prefix before stale timeout window.
committedBefore := c.Coordinator.CommittedLSN
// r1 is now primary at new epoch. Write new data.
c.CommitWrite(10)
c.TickN(10) // well past barrier timeout deadline
// Stale barrier timeout fires (from old epoch, old primary "p" → old replica "r1").
barriersFired := c.FiredTimeoutsByKind(TimeoutBarrier)
// Assert 1: old timed-out barrier did not change committed prefix unexpectedly.
// CommittedLSN may advance from r1's new-epoch writes, but must not regress
// or be influenced by the stale barrier timeout.
if c.Coordinator.CommittedLSN < committedBefore {
t.Fatalf("committed prefix regressed: before=%d after=%d",
committedBefore, c.Coordinator.CommittedLSN)
}
// Assert 2: old-epoch barrier did not set DurableOn for new-epoch writes.
// LSN 1 was written by old primary "p". Under the new epoch, DurableOn
// should not have been modified by the stale barrier's timeout path.
if p1 := c.Pending[1]; p1 != nil {
if p1.DurableOn["r1"] {
t.Fatal("stale barrier timeout should not set DurableOn[r1] for old-epoch LSN 1")
}
}
// Assert 3: old-epoch LSN 1 barrier is marked expired (stale timeout fired correctly).
if !c.ExpiredBarriers[barrierExpiredKey{"r1", 1}] {
t.Fatal("old-epoch barrier for r1/LSN 1 should be in ExpiredBarriers")
}
t.Logf("promotion vs barrier timeout: committed=%d, fired=%d, DurableOn[r1]=%v, expired[r1/1]=%v",
c.Coordinator.CommittedLSN, barriersFired,
c.Pending[1] != nil && c.Pending[1].DurableOn["r1"],
c.ExpiredBarriers[barrierExpiredKey{"r1", 1}])
}
// --- Race 2: Rebuild completion vs epoch bump ---
func TestP03_Race_RebuildCompletes_ThenEpochBumps(t *testing.T) {
// r1 needs rebuild. Rebuild completes, but before r1 can rejoin,
// epoch bumps (another failover). The rebuild result is valid but
// the replica must re-validate against the new epoch before rejoining.
c := NewCluster(CommitSyncQuorum, "p", "r1", "r2")
for i := uint64(1); i <= 10; i++ {
c.CommitWrite(i)
}
c.TickN(5)
c.Primary().Storage.TakeSnapshot("snap-1", c.Coordinator.CommittedLSN)
// r1 needs rebuild.
r1 := c.Nodes["r1"]
r1.ReplicaState = NodeStateNeedsRebuild
// Rebuild from snapshot — succeeds.
c.RebuildReplicaFromSnapshot("r1", "snap-1", c.Coordinator.CommittedLSN)
r1.ReplicaState = NodeStateRebuilding // transitional
// Before r1 can rejoin: epoch bumps (simulate another failure/promotion).
epochBefore := c.Coordinator.Epoch
c.StopNode("p")
if err := c.Promote("r2"); err != nil {
t.Fatal(err)
}
epochAfter := c.Coordinator.Epoch
if epochAfter <= epochBefore {
t.Fatal("epoch should have bumped")
}
// Promote() sets the epoch of all running nodes to the new coordinator
// epoch. r1 is still running, so r1.Epoch == epochAfter while r1.Role
// remains RoleReplica. Its rebuilt data, however, reflects the OLD
// epoch's committed prefix, which may differ under the new primary (r2).
// r1 must NOT be marked InSync until validated against the new epoch.
// Eligibility check: r1 is Rebuilding — ineligible for promotion.
e := c.EvaluateCandidateEligibility("r1")
if e.Eligible {
t.Fatal("r1 in Rebuilding state should not be eligible")
}
// r1 should NOT be InSync until it completes catch-up from new primary.
if r1.ReplicaState == NodeStateInSync {
t.Fatal("r1 should not be InSync after epoch bump during rebuild")
}
// After catch-up from new primary (r2), r1 can rejoin.
r1.ReplicaState = NodeStateCatchingUp
converged := c.CatchUpWithEscalation("r1", 100)
if !converged {
t.Fatal("r1 should converge from new primary")
}
if err := c.AssertCommittedRecoverable("r1"); err != nil {
t.Fatalf("r1 data incorrect after post-epoch-bump catch-up: %v", err)
}
t.Logf("rebuild vs epoch bump: r1 rebuilt at epoch %d, bumped to %d, caught up from r2",
epochBefore, epochAfter)
}
func TestP03_Race_EpochBumpsDuringCatchupTimeout(t *testing.T) {
// Catch-up timeout registered. Epoch bumps before timeout fires.
// The timeout is now stale (different epoch context).
// Must not mutate state under the new epoch.
c := NewCluster(CommitSyncQuorum, "p", "r1", "r2")
c.CommitWrite(1)
c.TickN(5)
c.Disconnect("p", "r1")
c.Disconnect("r1", "p")
c.CommitWrite(2)
c.TickN(5)
c.Connect("p", "r1")
c.Connect("r1", "p")
r1 := c.Nodes["r1"]
r1.ReplicaState = NodeStateCatchingUp
c.RegisterTimeout(TimeoutCatchup, "r1", 0, c.Now+10)
// Epoch bumps (promotion) before timeout.
c.StopNode("p")
if err := c.Promote("r1"); err != nil {
t.Fatal(err)
}
// r1 is now primary. In production, promotion sets the role and resets
// the replica state; the simulator leaves ReplicaState to the test, so
// set it to InSync explicitly (as the new primary).
r1.ReplicaState = NodeStateInSync
// Tick past timeout deadline.
c.TickN(15)
// Timeout should be ignored (r1 is InSync, not CatchingUp).
if c.FiredTimeoutsByKind(TimeoutCatchup) != 0 {
t.Fatal("catch-up timeout should not fire after epoch bump + promotion")
}
if len(c.IgnoredTimeouts) != 1 {
t.Fatalf("expected 1 ignored (stale) timeout, got %d", len(c.IgnoredTimeouts))
}
if r1.ReplicaState != NodeStateInSync {
t.Fatalf("r1 should remain InSync, got %s", r1.ReplicaState)
}
t.Logf("epoch bump vs timeout: stale catch-up timeout correctly ignored")
}
// --- Trace quality: dump state on failure ---
func TestP03_TraceQuality_FailingScenarioDumpsState(t *testing.T) {
// Verify that the timeout model produces debuggable traces.
// This test does NOT intentionally fail — it verifies that trace
// information is available for inspection.
c := NewCluster(CommitSyncAll, "p", "r1")
c.BarrierTimeoutTicks = 5
c.CommitWrite(1)
c.TickN(3)
c.Disconnect("p", "r1")
c.Disconnect("r1", "p")
c.CommitWrite(2)
c.TickN(10)
// Build trace.
trace := BuildTrace(c)
// Trace must contain key debugging information.
if trace.Tick == 0 {
t.Fatal("trace should have non-zero tick")
}
if trace.CommittedLSN == 0 && len(c.Pending) == 0 {
t.Fatal("trace should reflect cluster state")
}
if len(trace.FiredTimeouts) == 0 {
t.Fatal("trace should include fired timeouts")
}
if len(trace.NodeStates) < 2 {
t.Fatal("trace should include all node states")
}
if trace.Deliveries == 0 {
t.Fatal("trace should include deliveries")
}
t.Logf("trace: tick=%d committed=%d fired_timeouts=%d deliveries=%d nodes=%v",
trace.Tick, trace.CommittedLSN, len(trace.FiredTimeouts),
trace.Deliveries, trace.NodeStates)
}
// Trace infrastructure lives in eventsim.go (BuildTrace / Trace type).
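Note for reviewers: eventsim.go is not part of this chunk. From the fields the trace tests above read (`Tick`, `CommittedLSN`, `FiredTimeouts`, `NodeStates`, `Deliveries`, `TickEvents`), the `Trace` type presumably has roughly this shape; all field types below are assumptions inferred from the assertions, not the actual definition:

```go
package main

import "fmt"

// Sketch of the Trace shape implied by the assertions in the tests above.
// The real definition lives in eventsim.go (not in this diff); every field
// type here is an assumption inferred from how the tests read the fields.
type Trace struct {
	Tick          uint64            // tick at which BuildTrace snapshotted the cluster
	CommittedLSN  uint64            // coordinator's committed prefix
	FiredTimeouts []string          // descriptions of timeouts that fired
	NodeStates    map[string]string // node ID -> role/replica-state summary
	Deliveries    int               // count of delivered messages
	TickEvents    []string          // per-tick event log (mirrors Cluster.TickLog)
}

func main() {
	tr := Trace{Tick: 13, CommittedLSN: 1, Deliveries: 4,
		NodeStates: map[string]string{"p": "primary/InSync", "r1": "replica/Lagging"}}
	fmt.Printf("trace: tick=%d committed=%d nodes=%d\n",
		tr.Tick, tr.CommittedLSN, len(tr.NodeStates))
}
```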

333
sw-block/prototype/distsim/phase03_timeout_test.go

@@ -0,0 +1,333 @@
package distsim
import (
"testing"
)
// ============================================================
// Phase 03 P0: Timeout-backed scenarios
// ============================================================
// --- Barrier timeout ---
func TestP03_BarrierTimeout_SyncAllBlocked(t *testing.T) {
// Barrier sent to replica, link goes down, ack never arrives.
// Barrier timeout fires → barrier removed from queue.
// sync_all: write stays uncommitted.
c := NewCluster(CommitSyncAll, "p", "r1")
c.BarrierTimeoutTicks = 5
c.CommitWrite(1)
c.TickN(10) // enough for barrier timeout to fire and normal commit
// LSN 1: p self-acks. r1 acks. sync_all: both must ack. Should commit.
if c.Coordinator.CommittedLSN != 1 {
t.Fatalf("LSN 1 should commit normally, got committed=%d", c.Coordinator.CommittedLSN)
}
// No timeouts fired for LSN 1 (ack arrived in time).
if c.FiredTimeoutsByKind(TimeoutBarrier) != 0 {
t.Fatal("no barrier timeouts should have fired for LSN 1")
}
// Now disconnect r1. Write LSN 2. Barrier can't be acked.
c.Disconnect("p", "r1")
c.Disconnect("r1", "p")
c.CommitWrite(2)
c.TickN(10) // barrier timeout fires after 5 ticks
// Barrier timeout should have fired for r1/LSN 2.
if c.FiredTimeoutsByKind(TimeoutBarrier) != 1 {
t.Fatalf("expected 1 barrier timeout, got %d", c.FiredTimeoutsByKind(TimeoutBarrier))
}
// sync_all: LSN 2 NOT committed (r1 never acked).
if c.Coordinator.CommittedLSN != 1 {
t.Fatalf("LSN 2 should not commit under sync_all without r1 ack, committed=%d",
c.Coordinator.CommittedLSN)
}
// Barrier removed from queue (no indefinite re-queuing).
for _, item := range c.Queue {
if item.msg.Kind == MsgBarrier && item.msg.To == "r1" && item.msg.TargetLSN == 2 {
t.Fatal("timed-out barrier should be removed from queue")
}
}
t.Logf("barrier timeout: LSN 2 uncommitted, barrier cleaned from queue")
}
func TestP03_BarrierTimeout_SyncQuorum_StillCommits(t *testing.T) {
// RF=3 sync_quorum: r1 times out, but r2 acks → quorum met → commits.
c := NewCluster(CommitSyncQuorum, "p", "r1", "r2")
c.BarrierTimeoutTicks = 5
// Disconnect r1 only. r2 stays connected.
c.Disconnect("p", "r1")
c.Disconnect("r1", "p")
c.CommitWrite(1)
c.TickN(10)
// r1 barrier times out, but r2 acked. quorum = p + r2 = 2 of 3.
if c.FiredTimeoutsByKind(TimeoutBarrier) != 1 {
t.Fatalf("expected 1 barrier timeout (r1), got %d", c.FiredTimeoutsByKind(TimeoutBarrier))
}
if c.Coordinator.CommittedLSN != 1 {
t.Fatalf("LSN 1 should commit via quorum (p+r2), committed=%d", c.Coordinator.CommittedLSN)
}
t.Logf("barrier timeout: r1 timed out, LSN 1 committed via quorum")
}
// --- Catch-up timeout ---
func TestP03_CatchupTimeout_EscalatesToNeedsRebuild(t *testing.T) {
c := NewCluster(CommitSyncQuorum, "p", "r1", "r2")
c.CommitWrite(1)
c.TickN(5)
// r1 disconnects, primary writes more.
c.Disconnect("p", "r1")
c.Disconnect("r1", "p")
for i := uint64(2); i <= 20; i++ {
c.CommitWrite(i)
}
c.TickN(5)
c.Connect("p", "r1")
c.Connect("r1", "p")
// Register catch-up timeout: 3 ticks from now.
r1 := c.Nodes["r1"]
r1.ReplicaState = NodeStateCatchingUp
c.RegisterTimeout(TimeoutCatchup, "r1", 0, c.Now+3)
// Tick 3 times — timeout fires before catch-up completes.
c.TickN(3)
if r1.ReplicaState != NodeStateNeedsRebuild {
t.Fatalf("catch-up timeout should escalate to NeedsRebuild, got %s", r1.ReplicaState)
}
if c.FiredTimeoutsByKind(TimeoutCatchup) != 1 {
t.Fatalf("expected 1 catchup timeout, got %d", c.FiredTimeoutsByKind(TimeoutCatchup))
}
t.Logf("catch-up timeout: escalated to NeedsRebuild after 3 ticks")
}
// --- Reservation expiry as timeout event ---
func TestP03_ReservationTimeout_AbortsCatchup(t *testing.T) {
c := NewCluster(CommitSyncQuorum, "p", "r1", "r2")
for i := uint64(1); i <= 10; i++ {
c.CommitWrite(i)
}
c.TickN(5)
// r1 disconnects, more writes.
c.Disconnect("p", "r1")
c.Disconnect("r1", "p")
for i := uint64(11); i <= 30; i++ {
c.CommitWrite(i)
}
c.TickN(5)
c.Connect("p", "r1")
c.Connect("r1", "p")
// Register reservation timeout: 2 ticks.
r1 := c.Nodes["r1"]
r1.ReplicaState = NodeStateCatchingUp
c.RegisterTimeout(TimeoutReservation, "r1", 0, c.Now+2)
c.TickN(2)
if r1.ReplicaState != NodeStateNeedsRebuild {
t.Fatalf("reservation timeout should escalate to NeedsRebuild, got %s", r1.ReplicaState)
}
if c.FiredTimeoutsByKind(TimeoutReservation) != 1 {
t.Fatalf("expected 1 reservation timeout, got %d", c.FiredTimeoutsByKind(TimeoutReservation))
}
}
// --- Timer-race scenarios: same-tick resolution ---
func TestP03_Race_AckArrivesBeforeTimeout_Cancels(t *testing.T) {
// Barrier ack arrives in the same tick as the timeout deadline.
// Rule: data events (ack) process before timeouts → timeout is cancelled.
c := NewCluster(CommitSyncAll, "p", "r1")
c.BarrierTimeoutTicks = 4 // timeout at Now+4
c.CommitWrite(1) // barrier enqueued at Now+2, ack back at Now+3
// Barrier timeout registered at Now+4.
// Tick 1: write delivered.
// Tick 2: barrier delivered, ack enqueued at Now+1 = tick 3.
// Tick 3: ack delivered → cancels timeout.
// Tick 4: timeout deadline reached — but already cancelled.
c.TickN(5)
// Ack arrived first → timeout cancelled → LSN 1 committed.
if c.FiredTimeoutsByKind(TimeoutBarrier) != 0 {
t.Fatal("barrier timeout should be cancelled by ack arriving first")
}
if c.Coordinator.CommittedLSN != 1 {
t.Fatalf("LSN 1 should commit (ack arrived before timeout), committed=%d",
c.Coordinator.CommittedLSN)
}
t.Logf("race resolved: ack cancelled timeout, LSN 1 committed")
}
func TestP03_Race_TimeoutBeforeAck_Fires(t *testing.T) {
// Timeout fires before barrier can deliver (timeout < barrier delivery time).
// CommitWrite enqueues barrier at Now+2. Timeout at Now+1 fires first.
c := NewCluster(CommitSyncAll, "p", "r1")
c.BarrierTimeoutTicks = 1 // timeout at Now+1 — before barrier delivers at Now+2
c.CommitWrite(1)
c.TickN(5)
// Timeout fires at tick 1. Barrier would deliver at tick 2, but timeout
// removes it from queue first.
if c.FiredTimeoutsByKind(TimeoutBarrier) != 1 {
t.Fatalf("expected barrier timeout to fire, got %d", c.FiredTimeoutsByKind(TimeoutBarrier))
}
// sync_all: uncommitted (r1 never acked).
if c.Coordinator.CommittedLSN != 0 {
t.Fatalf("LSN 1 should not commit (timeout before barrier delivery), committed=%d",
c.Coordinator.CommittedLSN)
}
t.Logf("race resolved: timeout fired before barrier delivery, LSN 1 uncommitted")
}
func TestP03_Race_CatchupConverges_CancelsTimeout(t *testing.T) {
// Catch-up completes before the timeout fires.
c := NewCluster(CommitSyncQuorum, "p", "r1", "r2")
c.CommitWrite(1)
c.TickN(5)
c.Disconnect("p", "r1")
c.Disconnect("r1", "p")
c.CommitWrite(2)
c.CommitWrite(3)
c.TickN(5)
c.Connect("p", "r1")
c.Connect("r1", "p")
// Register catch-up timeout: 10 ticks (generous).
r1 := c.Nodes["r1"]
r1.ReplicaState = NodeStateCatchingUp
c.RegisterTimeout(TimeoutCatchup, "r1", 0, c.Now+10)
// Catch-up completes immediately (small gap).
// CatchUpWithEscalation auto-cancels recovery timeouts on convergence.
converged := c.CatchUpWithEscalation("r1", 100)
if !converged {
t.Fatal("catch-up should converge for small gap")
}
// Tick past deadline — timeout should already be cancelled.
c.TickN(15)
// Timeout should NOT have fired (was cancelled).
if c.FiredTimeoutsByKind(TimeoutCatchup) != 0 {
t.Fatal("catch-up timeout should be cancelled on convergence")
}
if r1.ReplicaState != NodeStateInSync {
t.Fatalf("r1 should be InSync after convergence, got %s", r1.ReplicaState)
}
t.Logf("race resolved: catch-up converged, timeout auto-cancelled")
}
// --- Stale timeout hardening ---
func TestP03_StaleReservationTimeout_AfterRecoverySuccess(t *testing.T) {
// Reservation timeout registered, but recovery completes before deadline.
// The stale timeout must NOT regress state from InSync back to NeedsRebuild.
c := NewCluster(CommitSyncQuorum, "p", "r1", "r2")
c.CommitWrite(1)
c.CommitWrite(2)
c.TickN(5)
c.Disconnect("p", "r1")
c.Disconnect("r1", "p")
c.CommitWrite(3)
c.TickN(5)
c.Connect("p", "r1")
c.Connect("r1", "p")
// Register reservation timeout: 10 ticks.
r1 := c.Nodes["r1"]
r1.ReplicaState = NodeStateCatchingUp
c.RegisterTimeout(TimeoutReservation, "r1", 0, c.Now+10)
// Catch-up succeeds immediately — auto-cancels reservation timeout.
converged := c.CatchUpWithEscalation("r1", 100)
if !converged {
t.Fatal("catch-up should converge")
}
if r1.ReplicaState != NodeStateInSync {
t.Fatalf("expected InSync after convergence, got %s", r1.ReplicaState)
}
// Tick well past the deadline.
c.TickN(20)
// Stale reservation timeout must NOT fire (cancelled by convergence).
if c.FiredTimeoutsByKind(TimeoutReservation) != 0 {
t.Fatal("stale reservation timeout should not fire after recovery success")
}
if r1.ReplicaState != NodeStateInSync {
t.Fatalf("stale timeout regressed state: expected InSync, got %s", r1.ReplicaState)
}
t.Logf("stale reservation timeout correctly suppressed after recovery")
}
func TestP03_LateBarrierAck_AfterTimeout_Rejected(t *testing.T) {
// Barrier times out, then a late ack arrives. The late ack must be
// rejected — it must not count toward DurableOn.
c := NewCluster(CommitSyncAll, "p", "r1")
c.BarrierTimeoutTicks = 1 // timeout at Now+1
c.CommitWrite(1)
// Tick 1: write delivered, timeout fires (barrier at Now+2 not yet delivered).
c.TickN(1)
if c.FiredTimeoutsByKind(TimeoutBarrier) != 1 {
t.Fatalf("expected barrier timeout to fire, got %d", c.FiredTimeoutsByKind(TimeoutBarrier))
}
// LSN 1 should NOT be committed.
if c.Coordinator.CommittedLSN != 0 {
t.Fatalf("LSN 1 should not be committed after timeout, got %d", c.Coordinator.CommittedLSN)
}
// Now inject a late barrier ack (as if the network delayed it massively).
c.InjectMessage(Message{
Kind: MsgBarrierAck,
From: "r1",
To: "p",
Epoch: c.Coordinator.Epoch,
TargetLSN: 1,
}, c.Now+1)
c.TickN(5)
// Late ack must be rejected with barrier_expired reason.
expiredRejects := c.RejectedByReason(RejectBarrierExpired)
if expiredRejects == 0 {
t.Fatal("late barrier ack should be rejected as barrier_expired")
}
// LSN 1 must still be uncommitted (late ack did not count).
if c.Coordinator.CommittedLSN != 0 {
t.Fatalf("late ack should not commit LSN 1, got committed=%d", c.Coordinator.CommittedLSN)
}
// DurableOn should NOT include r1.
p1 := c.Pending[1]
if p1 != nil && p1.DurableOn["r1"] {
t.Fatal("late ack should not set DurableOn for r1")
}
t.Logf("late barrier ack: rejected as barrier_expired, LSN 1 stays uncommitted")
}

243
sw-block/prototype/distsim/phase04a_ownership_test.go

@@ -0,0 +1,243 @@
package distsim
import "testing"
// ============================================================
// Phase 04a: Session ownership validation in distsim
// ============================================================
// --- Scenario 1: Endpoint change during active catch-up ---
func TestP04a_EndpointChangeDuringCatchup_InvalidatesSession(t *testing.T) {
c := NewCluster(CommitSyncQuorum, "p", "r1", "r2")
c.CommitWrite(1)
c.TickN(5)
// Start catch-up session for r1.
c.Disconnect("p", "r1")
c.Disconnect("r1", "p")
c.CommitWrite(2)
c.TickN(5)
c.Connect("p", "r1")
c.Connect("r1", "p")
trigger, sessID, ok := c.TriggerRecoverySession("r1")
if !ok || trigger != TriggerReassignment {
t.Fatalf("should trigger reassignment, got %s/%v", trigger, ok)
}
// Session is active.
sess := c.Sessions["r1"]
if !sess.Active {
t.Fatal("session should be active")
}
// Endpoint changes (replica restarts on new address).
c.StopNode("r1")
c.RestartNodeWithNewAddress("r1")
// Session invalidated by endpoint change.
if sess.Active {
t.Fatal("session should be invalidated after endpoint change")
}
if sess.Reason != "endpoint_changed" {
t.Fatalf("invalidation reason: got %q, want endpoint_changed", sess.Reason)
}
// Stale completion from old session is rejected.
if c.CompleteRecoverySession("r1", sessID) {
t.Fatal("stale session completion should be rejected")
}
t.Logf("endpoint change: session %d invalidated, stale completion rejected", sessID)
}
// --- Scenario 2: Epoch bump during active catch-up ---
func TestP04a_EpochBumpDuringCatchup_InvalidatesSession(t *testing.T) {
c := NewCluster(CommitSyncQuorum, "p", "r1", "r2")
c.CommitWrite(1)
c.TickN(5)
c.Disconnect("p", "r1")
c.Disconnect("r1", "p")
c.CommitWrite(2)
c.TickN(5)
c.Connect("p", "r1")
c.Connect("r1", "p")
_, sessID, ok := c.TriggerRecoverySession("r1")
if !ok {
t.Fatal("trigger should succeed")
}
sess := c.Sessions["r1"]
// Epoch bumps (promotion).
c.StopNode("p")
c.Promote("r2")
// Session invalidated by epoch bump.
if sess.Active {
t.Fatal("session should be invalidated after epoch bump")
}
if sess.Reason != "epoch_bump_promotion" {
t.Fatalf("reason: got %q", sess.Reason)
}
// Stale completion rejected.
if c.CompleteRecoverySession("r1", sessID) {
t.Fatal("stale completion after epoch bump should be rejected")
}
t.Logf("epoch bump: session %d invalidated, completion rejected", sessID)
}
// --- Scenario 3: Stale late completion from old session ---
func TestP04a_StaleCompletion_AfterSupersede_Rejected(t *testing.T) {
c := NewCluster(CommitSyncQuorum, "p", "r1", "r2")
c.CommitWrite(1)
c.TickN(5)
c.Disconnect("p", "r1")
c.Disconnect("r1", "p")
c.CommitWrite(2)
c.TickN(5)
c.Connect("p", "r1")
c.Connect("r1", "p")
// First session.
_, oldSessID, _ := c.TriggerRecoverySession("r1")
oldSess := c.Sessions["r1"]
// Invalidate old session manually (simulate timeout or abort).
c.InvalidateReplicaSession("r1", "timeout")
if oldSess.Active {
t.Fatal("old session should be invalidated")
}
// New session triggered.
c.Nodes["r1"].ReplicaState = NodeStateLagging // reset state to allow retrigger
_, newSessID, ok := c.TriggerRecoverySession("r1")
if !ok {
t.Fatal("second trigger should succeed after invalidation")
}
newSess := c.Sessions["r1"]
// Old session completion attempt — must be rejected by ID mismatch.
if c.CompleteRecoverySession("r1", oldSessID) {
t.Fatal("old session completion must be rejected")
}
// New session still active.
if !newSess.Active {
t.Fatal("new session should still be active")
}
// New session completion succeeds.
if !c.CompleteRecoverySession("r1", newSessID) {
t.Fatal("new session completion should succeed")
}
if c.Nodes["r1"].ReplicaState != NodeStateInSync {
t.Fatalf("r1 should be InSync after new session completes, got %s", c.Nodes["r1"].ReplicaState)
}
t.Logf("stale completion: old=%d rejected, new=%d accepted", oldSessID, newSessID)
}
// --- Scenario 4: Duplicate recovery trigger while session active ---
func TestP04a_DuplicateTrigger_WhileActive_Rejected(t *testing.T) {
c := NewCluster(CommitSyncQuorum, "p", "r1", "r2")
c.CommitWrite(1)
c.TickN(5)
c.Disconnect("p", "r1")
c.Disconnect("r1", "p")
c.CommitWrite(2)
c.TickN(5)
c.Connect("p", "r1")
c.Connect("r1", "p")
// First trigger succeeds.
_, _, ok := c.TriggerRecoverySession("r1")
if !ok {
t.Fatal("first trigger should succeed")
}
// Duplicate trigger while session active — rejected.
_, _, ok = c.TriggerRecoverySession("r1")
if ok {
t.Fatal("duplicate trigger should be rejected while session active")
}
// Session count: only one in history.
sessCount := 0
for _, s := range c.SessionHistory {
if s.ReplicaID == "r1" {
sessCount++
}
}
if sessCount != 1 {
t.Fatalf("should have exactly 1 session in history, got %d", sessCount)
}
t.Logf("duplicate trigger correctly rejected")
}
// --- Scenario 5: Session tracking through full lifecycle ---
func TestP04a_FullLifecycle_SessionTracking(t *testing.T) {
c := NewCluster(CommitSyncQuorum, "p", "r1", "r2")
c.CommitWrite(1)
c.CommitWrite(2)
c.TickN(5)
// Disconnect, write, reconnect.
c.Disconnect("p", "r1")
c.Disconnect("r1", "p")
for i := uint64(3); i <= 10; i++ {
c.CommitWrite(i)
}
c.TickN(5)
c.Connect("p", "r1")
c.Connect("r1", "p")
// Trigger session.
trigger, sessID, ok := c.TriggerRecoverySession("r1")
if !ok {
t.Fatal("trigger failed")
}
if trigger != TriggerReassignment {
t.Fatalf("expected reassignment, got %s", trigger)
}
// Catch up.
converged := c.CatchUpWithEscalation("r1", 100)
if !converged {
t.Fatal("catch-up should converge")
}
// Complete session.
if !c.CompleteRecoverySession("r1", sessID) {
t.Fatal("completion should succeed")
}
// Verify final state.
if c.Nodes["r1"].ReplicaState != NodeStateInSync {
t.Fatalf("r1 should be InSync, got %s", c.Nodes["r1"].ReplicaState)
}
if err := c.AssertCommittedRecoverable("r1"); err != nil {
t.Fatalf("data incorrect: %v", err)
}
// Session in history, not active.
sess := c.Sessions["r1"]
if sess.Active {
t.Fatal("session should not be active after completion")
}
if len(c.SessionHistory) != 1 {
t.Fatalf("expected 1 session in history, got %d", len(c.SessionHistory))
}
t.Logf("full lifecycle: trigger=%s session=%d → catch-up → complete → InSync", trigger, sessID)
}
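The five scenarios above all reduce to one fencing rule: a completion is honored only if it names the currently active session's ID. A minimal sketch of that check (the `Session` fields mirror what the tests exercise; the function name is illustrative, not the simulator's `CompleteRecoverySession`):

```go
package main

import "fmt"

// Session mirrors the ownership fields the tests above exercise.
type Session struct {
	ID     uint64
	Active bool
	Reason string // set on invalidation, e.g. "endpoint_changed"
}

// completeByID applies ID-based stale fencing: only the currently
// active session with a matching ID may complete. Completions from
// superseded or invalidated sessions are rejected.
func completeByID(s *Session, id uint64) bool {
	if s == nil || !s.Active || s.ID != id {
		return false
	}
	s.Active = false
	return true
}

func main() {
	old := &Session{ID: 1, Active: false, Reason: "timeout"} // superseded
	cur := &Session{ID: 2, Active: true}
	fmt.Println(completeByID(old, 1), completeByID(cur, 1), completeByID(cur, 2))
	// stale session rejected, wrong ID rejected, matching active ID accepted
}
```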

102
sw-block/prototype/distsim/protocol.go

@@ -0,0 +1,102 @@
package distsim
type ProtocolVersion string
const (
ProtocolV1 ProtocolVersion = "v1"
ProtocolV15 ProtocolVersion = "v1_5"
ProtocolV2 ProtocolVersion = "v2"
)
type ProtocolPolicy struct {
Version ProtocolVersion
}
func (p ProtocolPolicy) CanAttemptCatchup(addressStable bool) bool {
switch p.Version {
case ProtocolV1:
return false
case ProtocolV15:
return addressStable
case ProtocolV2:
return true
default:
return false
}
}
func (p ProtocolPolicy) BriefDisconnectAction(addressStable, recoverable bool) string {
switch p.Version {
case ProtocolV1:
return "degrade_or_rebuild"
case ProtocolV15:
if addressStable && recoverable {
return "catchup_if_history_survives"
}
return "stall_or_control_plane_recovery"
case ProtocolV2:
if recoverable {
return "reserved_catchup"
}
return "explicit_rebuild"
default:
return "unknown"
}
}
func (p ProtocolPolicy) TailChasingAction(converged bool) string {
switch p.Version {
case ProtocolV1:
if converged {
return "unexpected_catchup"
}
return "degrade"
case ProtocolV15:
if converged {
return "catchup"
}
return "stall_or_rebuild"
case ProtocolV2:
if converged {
return "catchup"
}
return "abort_to_rebuild"
default:
return "unknown"
}
}
func (p ProtocolPolicy) RestartRejoinAction(addressStable bool) string {
switch p.Version {
case ProtocolV1:
return "control_plane_only"
case ProtocolV15:
if addressStable {
return "background_reconnect_or_control_plane"
}
return "control_plane_only"
case ProtocolV2:
if addressStable {
return "direct_reconnect_or_control_plane"
}
return "explicit_reassignment_or_rebuild"
default:
return "unknown"
}
}
func (p ProtocolPolicy) ChangedAddressRestartAction(recoverable bool) string {
switch p.Version {
case ProtocolV1:
return "control_plane_only"
case ProtocolV15:
return "control_plane_only"
case ProtocolV2:
if recoverable {
return "explicit_reassignment_then_catchup"
}
return "explicit_reassignment_or_rebuild"
default:
return "unknown"
}
}
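protocol.go is a pure decision table with no cluster state, so it can be exercised directly. A usage sketch for the changed-address, recoverable-history case (the one that separates the three versions); the block inlines a condensed copy of the two methods it calls so it runs standalone:

```go
package main

import "fmt"

// Condensed copy of the ProtocolPolicy decision table from protocol.go,
// reduced to the two methods exercised below.
type ProtocolVersion string

const (
	ProtocolV1  ProtocolVersion = "v1"
	ProtocolV15 ProtocolVersion = "v1_5"
	ProtocolV2  ProtocolVersion = "v2"
)

type ProtocolPolicy struct{ Version ProtocolVersion }

func (p ProtocolPolicy) CanAttemptCatchup(addressStable bool) bool {
	switch p.Version {
	case ProtocolV15:
		return addressStable
	case ProtocolV2:
		return true
	default: // v1 and unknown versions expose no meaningful catch-up path
		return false
	}
}

func (p ProtocolPolicy) BriefDisconnectAction(addressStable, recoverable bool) string {
	switch p.Version {
	case ProtocolV1:
		return "degrade_or_rebuild"
	case ProtocolV15:
		if addressStable && recoverable {
			return "catchup_if_history_survives"
		}
		return "stall_or_control_plane_recovery"
	case ProtocolV2:
		if recoverable {
			return "reserved_catchup"
		}
		return "explicit_rebuild"
	default:
		return "unknown"
	}
}

func main() {
	// addressStable=false, recoverable=true: only v2 still attempts catch-up.
	for _, v := range []ProtocolVersion{ProtocolV1, ProtocolV15, ProtocolV2} {
		p := ProtocolPolicy{Version: v}
		fmt.Printf("%s: catchup=%v action=%s\n",
			v, p.CanAttemptCatchup(false), p.BriefDisconnectAction(false, true))
	}
}
```

This prints `v1: catchup=false action=degrade_or_rebuild`, `v1_5: catchup=false action=stall_or_control_plane_recovery`, and `v2: catchup=true action=reserved_catchup`, matching the assertions in protocol_test.go below.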

84
sw-block/prototype/distsim/protocol_test.go

@@ -0,0 +1,84 @@
package distsim
import "testing"
func TestProtocolV1CannotAttemptCatchup(t *testing.T) {
p := ProtocolPolicy{Version: ProtocolV1}
if p.CanAttemptCatchup(true) {
t.Fatal("v1 should not expose meaningful catch-up path")
}
}
func TestProtocolV15CatchupDependsOnStableAddress(t *testing.T) {
p := ProtocolPolicy{Version: ProtocolV15}
if !p.CanAttemptCatchup(true) {
t.Fatal("v1.5 should allow catch-up when address is stable")
}
if p.CanAttemptCatchup(false) {
t.Fatal("v1.5 should not assume reconnect with changed address")
}
}
func TestProtocolV2AllowsCatchupByPolicy(t *testing.T) {
p := ProtocolPolicy{Version: ProtocolV2}
if !p.CanAttemptCatchup(true) || !p.CanAttemptCatchup(false) {
t.Fatal("v2 policy should allow catch-up attempt subject to explicit recoverability checks")
}
}
func TestProtocolBriefDisconnectActions(t *testing.T) {
if got := (ProtocolPolicy{Version: ProtocolV1}).BriefDisconnectAction(true, true); got != "degrade_or_rebuild" {
t.Fatalf("v1 brief-disconnect action = %s", got)
}
if got := (ProtocolPolicy{Version: ProtocolV15}).BriefDisconnectAction(true, true); got != "catchup_if_history_survives" {
t.Fatalf("v1.5 brief-disconnect action = %s", got)
}
if got := (ProtocolPolicy{Version: ProtocolV15}).BriefDisconnectAction(false, true); got != "stall_or_control_plane_recovery" {
t.Fatalf("v1.5 changed-address brief-disconnect action = %s", got)
}
if got := (ProtocolPolicy{Version: ProtocolV2}).BriefDisconnectAction(true, false); got != "explicit_rebuild" {
t.Fatalf("v2 unrecoverable brief-disconnect action = %s", got)
}
if got := (ProtocolPolicy{Version: ProtocolV2}).BriefDisconnectAction(false, true); got != "reserved_catchup" {
t.Fatalf("v2 recoverable brief-disconnect action = %s", got)
}
}
func TestProtocolTailChasingActions(t *testing.T) {
if got := (ProtocolPolicy{Version: ProtocolV1}).TailChasingAction(false); got != "degrade" {
t.Fatalf("v1 tail-chasing action = %s", got)
}
if got := (ProtocolPolicy{Version: ProtocolV15}).TailChasingAction(false); got != "stall_or_rebuild" {
t.Fatalf("v1.5 tail-chasing action = %s", got)
}
if got := (ProtocolPolicy{Version: ProtocolV2}).TailChasingAction(false); got != "abort_to_rebuild" {
t.Fatalf("v2 tail-chasing action = %s", got)
}
}
func TestProtocolRestartRejoinActions(t *testing.T) {
if got := (ProtocolPolicy{Version: ProtocolV1}).RestartRejoinAction(true); got != "control_plane_only" {
t.Fatalf("v1 restart action = %s", got)
}
if got := (ProtocolPolicy{Version: ProtocolV15}).RestartRejoinAction(false); got != "control_plane_only" {
t.Fatalf("v1.5 changed-address restart action = %s", got)
}
if got := (ProtocolPolicy{Version: ProtocolV2}).RestartRejoinAction(false); got != "explicit_reassignment_or_rebuild" {
t.Fatalf("v2 changed-address restart action = %s", got)
}
}
func TestProtocolChangedAddressRestartActions(t *testing.T) {
if got := (ProtocolPolicy{Version: ProtocolV1}).ChangedAddressRestartAction(true); got != "control_plane_only" {
t.Fatalf("v1 changed-address restart action = %s", got)
}
if got := (ProtocolPolicy{Version: ProtocolV15}).ChangedAddressRestartAction(true); got != "control_plane_only" {
t.Fatalf("v1.5 changed-address restart action = %s", got)
}
if got := (ProtocolPolicy{Version: ProtocolV2}).ChangedAddressRestartAction(true); got != "explicit_reassignment_then_catchup" {
t.Fatalf("v2 recoverable changed-address restart action = %s", got)
}
if got := (ProtocolPolicy{Version: ProtocolV2}).ChangedAddressRestartAction(false); got != "explicit_reassignment_or_rebuild" {
t.Fatalf("v2 unrecoverable changed-address restart action = %s", got)
}
}

256
sw-block/prototype/distsim/random.go

@@ -0,0 +1,256 @@
package distsim
import (
"fmt"
"math/rand"
"sort"
)
type RandomEvent string
const (
RandomCommitWrite RandomEvent = "commit_write"
RandomTick RandomEvent = "tick"
RandomDisconnect RandomEvent = "disconnect"
RandomReconnect RandomEvent = "reconnect"
RandomStopNode RandomEvent = "stop_node"
RandomStartNode RandomEvent = "start_node"
RandomPromote RandomEvent = "promote"
RandomTakeSnapshot RandomEvent = "take_snapshot"
RandomCatchup RandomEvent = "catchup"
RandomRebuild RandomEvent = "rebuild"
)
type RandomStep struct {
Step int
Event RandomEvent
Detail string
}
type RandomResult struct {
Seed int64
Steps []RandomStep
Cluster *Cluster
Snapshots []string
}
func RunRandomScenario(seed int64, steps int) (*RandomResult, error) {
rng := rand.New(rand.NewSource(seed))
cluster := NewCluster(CommitSyncQuorum, "p", "r1", "r2")
result := &RandomResult{
Seed: seed,
Cluster: cluster,
}
for i := 0; i < steps; i++ {
step, err := runRandomStep(cluster, rng, i)
if err != nil {
result.Steps = append(result.Steps, step)
return result, err
}
result.Steps = append(result.Steps, step)
if err := assertClusterInvariants(cluster); err != nil {
return result, fmt.Errorf("seed=%d step=%d event=%s detail=%s: %w", seed, i, step.Event, step.Detail, err)
}
}
return result, assertClusterInvariants(cluster)
}
func runRandomStep(c *Cluster, rng *rand.Rand, step int) (RandomStep, error) {
events := []RandomEvent{
RandomCommitWrite,
RandomTick,
RandomDisconnect,
RandomReconnect,
RandomStopNode,
RandomStartNode,
RandomPromote,
RandomTakeSnapshot,
RandomCatchup,
RandomRebuild,
}
ev := events[rng.Intn(len(events))]
rs := RandomStep{Step: step, Event: ev}
switch ev {
case RandomCommitWrite:
block := uint64(rng.Intn(8) + 1)
lsn := c.CommitWrite(block)
rs.Detail = fmt.Sprintf("block=%d lsn=%d", block, lsn)
case RandomTick:
n := rng.Intn(3) + 1
c.TickN(n)
rs.Detail = fmt.Sprintf("ticks=%d", n)
case RandomDisconnect:
from, to := randomPair(c, rng)
c.Disconnect(from, to)
rs.Detail = fmt.Sprintf("%s->%s", from, to)
case RandomReconnect:
from, to := randomPair(c, rng)
c.Connect(from, to)
rs.Detail = fmt.Sprintf("%s->%s", from, to)
case RandomStopNode:
id := randomNodeID(c, rng)
c.StopNode(id)
rs.Detail = id
case RandomStartNode:
id := randomNodeID(c, rng)
c.StartNode(id)
rs.Detail = id
case RandomPromote:
if primary := c.Primary(); primary != nil && primary.Running {
rs.Detail = "primary_still_running"
return rs, nil
}
candidates := promotableNodes(c)
if len(candidates) == 0 {
rs.Detail = "no_candidate"
return rs, nil
}
id := candidates[rng.Intn(len(candidates))]
rs.Detail = id
if err := c.Promote(id); err != nil {
return rs, err
}
case RandomTakeSnapshot:
primary := c.Primary()
if primary == nil || !primary.Running {
rs.Detail = "no_primary"
return rs, nil
}
lsn := c.Coordinator.CommittedLSN
id := fmt.Sprintf("snap-%s-%d", primary.ID, lsn)
primary.Storage.TakeSnapshot(id, lsn)
rs.Detail = fmt.Sprintf("%s@%d", id, lsn)
case RandomCatchup:
id := randomReplicaID(c, rng)
if id == "" {
rs.Detail = "no_replica"
return rs, nil
}
node := c.Nodes[id]
if node == nil || !node.Running {
rs.Detail = id + ":down"
return rs, nil
}
start := node.Storage.FlushedLSN
end := c.Coordinator.CommittedLSN
if end <= start {
rs.Detail = fmt.Sprintf("%s:no_gap", id)
return rs, nil
}
rs.Detail = fmt.Sprintf("%s:%d..%d", id, start+1, end)
if err := c.RecoverReplicaFromPrimary(id, start, end); err != nil {
return rs, err
}
case RandomRebuild:
id := randomReplicaID(c, rng)
if id == "" {
rs.Detail = "no_replica"
return rs, nil
}
primary := c.Primary()
node := c.Nodes[id]
if primary == nil || node == nil || !primary.Running || !node.Running {
rs.Detail = id + ":unavailable"
return rs, nil
}
snapshotIDs := make([]string, 0, len(primary.Storage.Snapshots))
for snapID := range primary.Storage.Snapshots {
snapshotIDs = append(snapshotIDs, snapID)
}
if len(snapshotIDs) == 0 {
rs.Detail = id + ":no_snapshot"
return rs, nil
}
sort.Strings(snapshotIDs)
snapID := snapshotIDs[rng.Intn(len(snapshotIDs))]
rs.Detail = fmt.Sprintf("%s:%s->%d", id, snapID, c.Coordinator.CommittedLSN)
if err := c.RebuildReplicaFromSnapshot(id, snapID, c.Coordinator.CommittedLSN); err != nil {
return rs, err
}
default:
return rs, fmt.Errorf("unknown random event %s", ev)
}
return rs, nil
}
func randomNodeID(c *Cluster, rng *rand.Rand) string {
ids := append([]string(nil), c.Coordinator.Members...)
sort.Strings(ids)
if len(ids) == 0 {
return ""
}
return ids[rng.Intn(len(ids))]
}
func randomReplicaID(c *Cluster, rng *rand.Rand) string {
ids := c.replicaIDs()
if len(ids) == 0 {
return ""
}
return ids[rng.Intn(len(ids))]
}
func randomPair(c *Cluster, rng *rand.Rand) (string, string) {
from := randomNodeID(c, rng)
to := randomNodeID(c, rng)
if from == to {
ids := append([]string(nil), c.Coordinator.Members...)
sort.Strings(ids)
for _, id := range ids {
if id != from {
to = id
break
}
}
}
return from, to
}
func promotableNodes(c *Cluster) []string {
out := make([]string, 0)
want := c.Reference.StateAt(c.Coordinator.CommittedLSN)
for _, id := range c.Coordinator.Members {
n := c.Nodes[id]
if n == nil || !n.Running || n.Storage.FlushedLSN < c.Coordinator.CommittedLSN {
continue
}
if !EqualState(n.Storage.StateAt(c.Coordinator.CommittedLSN), want) {
continue
}
out = append(out, id)
}
sort.Strings(out)
return out
}
func assertClusterInvariants(c *Cluster) error {
committed := c.Coordinator.CommittedLSN
want := c.Reference.StateAt(committed)
for lsn, p := range c.Pending {
if p.Committed && lsn > committed {
return fmt.Errorf("pending lsn %d marked committed above coordinator committed lsn %d", lsn, committed)
}
}
for _, id := range promotableNodes(c) {
n := c.Nodes[id]
got := n.Storage.StateAt(committed)
if !EqualState(got, want) {
return fmt.Errorf("promotable node %s mismatch at committed lsn %d: got=%v want=%v", id, committed, got, want)
}
}
primary := c.Primary()
if primary != nil && primary.Running && primary.Epoch == c.Coordinator.Epoch {
got := primary.Storage.StateAt(committed)
if !EqualState(got, want) {
return fmt.Errorf("primary %s mismatch at committed lsn %d: got=%v want=%v", primary.ID, committed, got, want)
}
}
return nil
}

43
sw-block/prototype/distsim/random_test.go

@@ -0,0 +1,43 @@
package distsim
import "testing"
func TestRandomScenarioSeeds(t *testing.T) {
seeds := []int64{
1, 2, 3, 4, 5,
11, 21, 34, 55, 89,
101, 202, 303, 404, 505,
}
for _, seed := range seeds {
seed := seed
t.Run("seed_"+itoa64(seed), func(t *testing.T) {
t.Parallel()
if _, err := RunRandomScenario(seed, 60); err != nil {
t.Fatal(err)
}
})
}
}
// itoa64 is a dependency-free equivalent of strconv.FormatInt(v, 10).
func itoa64(v int64) string {
if v == 0 {
return "0"
}
neg := v < 0
if neg {
v = -v
}
buf := make([]byte, 0, 20)
for v > 0 {
buf = append(buf, byte('0'+v%10))
v /= 10
}
if neg {
buf = append(buf, '-')
}
for i, j := 0, len(buf)-1; i < j; i, j = i+1, j-1 {
buf[i], buf[j] = buf[j], buf[i]
}
return string(buf)
}

95
sw-block/prototype/distsim/reference.go

@@ -0,0 +1,95 @@
package distsim
type Write struct {
LSN uint64
Block uint64
Value uint64
}
type Snapshot struct {
LSN uint64
State map[uint64]uint64
}
type Reference struct {
writes []Write
snapshots map[uint64]Snapshot
}
func NewReference() *Reference {
return &Reference{snapshots: map[uint64]Snapshot{}}
}
func (r *Reference) Apply(w Write) {
r.writes = append(r.writes, w)
}
// StateAt folds all writes with LSN <= lsn into a block->value map,
// assuming writes were applied in ascending LSN order.
func (r *Reference) StateAt(lsn uint64) map[uint64]uint64 {
state := make(map[uint64]uint64)
for _, w := range r.writes {
if w.LSN > lsn {
break
}
state[w.Block] = w.Value
}
return state
}
func cloneMap(in map[uint64]uint64) map[uint64]uint64 {
out := make(map[uint64]uint64, len(in))
for k, v := range in {
out[k] = v
}
return out
}
func (r *Reference) TakeSnapshot(lsn uint64) Snapshot {
s := Snapshot{LSN: lsn, State: cloneMap(r.StateAt(lsn))}
r.snapshots[lsn] = s
return s
}
func (r *Reference) SnapshotAt(lsn uint64) (Snapshot, bool) {
s, ok := r.snapshots[lsn]
return s, ok
}
type Node struct {
Extent map[uint64]uint64
}
func NewNode() *Node {
return &Node{Extent: map[uint64]uint64{}}
}
func (n *Node) ApplyWrite(w Write) {
n.Extent[w.Block] = w.Value
}
func (n *Node) LoadSnapshot(s Snapshot) {
n.Extent = cloneMap(s.State)
}
// ReplayFromWrites applies every write with LSN in (startExclusive, endInclusive],
// assuming the slice is sorted by ascending LSN.
func (n *Node) ReplayFromWrites(writes []Write, startExclusive, endInclusive uint64) {
for _, w := range writes {
if w.LSN <= startExclusive {
continue
}
if w.LSN > endInclusive {
break
}
n.ApplyWrite(w)
}
}
func EqualState(a, b map[uint64]uint64) bool {
if len(a) != len(b) {
return false
}
for k, v := range a {
if b[k] != v {
return false
}
}
return true
}

66
sw-block/prototype/distsim/reference_test.go

@@ -0,0 +1,66 @@
package distsim
import "testing"
func TestWALReplayPreservesHistoricalValue(t *testing.T) {
ref := NewReference()
ref.Apply(Write{LSN: 10, Block: 7, Value: 10})
ref.Apply(Write{LSN: 12, Block: 7, Value: 12})
node := NewNode()
node.ReplayFromWrites(ref.writes, 0, 10)
want := ref.StateAt(10)
if !EqualState(node.Extent, want) {
t.Fatalf("replay mismatch: got=%v want=%v", node.Extent, want)
}
}
func TestCurrentExtentCannotRecoverOldLSN(t *testing.T) {
ref := NewReference()
ref.Apply(Write{LSN: 10, Block: 7, Value: 10})
ref.Apply(Write{LSN: 12, Block: 7, Value: 12})
primary := NewNode()
for _, w := range ref.writes {
primary.ApplyWrite(w)
}
wantOld := ref.StateAt(10)
if EqualState(primary.Extent, wantOld) {
t.Fatalf("latest extent should not equal old LSN state: latest=%v old=%v", primary.Extent, wantOld)
}
}
func TestSnapshotAtCpLSNRecoversCorrectHistoricalValue(t *testing.T) {
ref := NewReference()
ref.Apply(Write{LSN: 10, Block: 7, Value: 10})
snap := ref.TakeSnapshot(10)
ref.Apply(Write{LSN: 12, Block: 7, Value: 12})
node := NewNode()
node.LoadSnapshot(snap)
want := ref.StateAt(10)
if !EqualState(node.Extent, want) {
t.Fatalf("snapshot mismatch: got=%v want=%v", node.Extent, want)
}
}
func TestSnapshotPlusTrailingReplayReachesTargetLSN(t *testing.T) {
ref := NewReference()
ref.Apply(Write{LSN: 10, Block: 7, Value: 10})
ref.Apply(Write{LSN: 11, Block: 2, Value: 11})
snap := ref.TakeSnapshot(11)
ref.Apply(Write{LSN: 12, Block: 7, Value: 12})
ref.Apply(Write{LSN: 13, Block: 9, Value: 13})
node := NewNode()
node.LoadSnapshot(snap)
node.ReplayFromWrites(ref.writes, 11, 13)
want := ref.StateAt(13)
if !EqualState(node.Extent, want) {
t.Fatalf("snapshot+replay mismatch: got=%v want=%v", node.Extent, want)
}
}

581
sw-block/prototype/distsim/simulator.go

@@ -0,0 +1,581 @@
package distsim
import (
"container/heap"
"fmt"
"math/rand"
"strings"
)
// --- Event types ---
type EventKind int
const (
EvWriteStart EventKind = iota // client writes to primary
EvShipEntry // primary sends WAL entry to replica
EvShipDeliver // entry arrives at replica
EvBarrierSend // primary sends barrier to replica
EvBarrierDeliver // barrier arrives at replica
EvBarrierFsync // replica fsync completes
EvBarrierAck // ack arrives back at primary
EvNodeCrash // node crashes
EvNodeRestart // node restarts
EvLinkDown // network link drops
EvLinkUp // network link restores
EvFlusherTick // flusher checkpoint cycle
EvPromote // coordinator promotes a node
EvLockAcquire // thread tries to acquire lock
EvLockRelease // thread releases lock
)
func (k EventKind) String() string {
names := [...]string{
"WriteStart", "ShipEntry", "ShipDeliver",
"BarrierSend", "BarrierDeliver", "BarrierFsync", "BarrierAck",
"NodeCrash", "NodeRestart", "LinkDown", "LinkUp",
"FlusherTick", "Promote",
"LockAcquire", "LockRelease",
}
if int(k) < len(names) {
return names[k]
}
return fmt.Sprintf("Event(%d)", k)
}
type Event struct {
Time uint64
ID uint64 // unique, for stable ordering
Kind EventKind
NodeID string
Payload EventPayload
}
type EventPayload struct {
Write Write // for WriteStart, ShipEntry, ShipDeliver
TargetLSN uint64 // for barriers
FromNode string // for delivered messages
ToNode string
LockName string // for lock events
ThreadID string
PromoteID string // for EvPromote
}
// --- Priority queue ---
type eventHeap []Event
func (h eventHeap) Len() int { return len(h) }
func (h eventHeap) Swap(i, j int) { h[i], h[j] = h[j], h[i] }
func (h eventHeap) Less(i, j int) bool {
if h[i].Time != h[j].Time {
return h[i].Time < h[j].Time
}
return h[i].ID < h[j].ID // stable tie-break
}
func (h *eventHeap) Push(x interface{}) { *h = append(*h, x.(Event)) }
func (h *eventHeap) Pop() interface{} {
old := *h
n := len(old)
e := old[n-1]
*h = old[:n-1]
return e
}
// --- Lock model ---
type lockState struct {
held bool
holder string // threadID
waiting []Event // parked EvLockAcquire events
}
// --- Trace ---
type TraceEntry struct {
Time uint64
Event Event
Note string
}
// --- Simulator ---
type Simulator struct {
Cluster *Cluster
rng *rand.Rand
queue eventHeap
nextID uint64
locks map[string]*lockState // lockName -> state
trace []TraceEntry
Errors []string
maxTime uint64
jitterMax uint64 // max random delay added to message delivery
// Config
FaultRate float64 // probability of injecting a fault per step [0,1]
MaxEvents int // stop after this many events
eventsRun int
}
func NewSimulator(cluster *Cluster, seed int64) *Simulator {
return &Simulator{
Cluster: cluster,
rng: rand.New(rand.NewSource(seed)),
locks: map[string]*lockState{},
maxTime: 100000,
jitterMax: 3,
FaultRate: 0.05,
MaxEvents: 5000,
}
}
// Enqueue adds an event to the priority queue.
func (s *Simulator) Enqueue(e Event) {
s.nextID++
e.ID = s.nextID
heap.Push(&s.queue, e)
}
// EnqueueAt is a convenience for enqueueing at a specific time.
func (s *Simulator) EnqueueAt(time uint64, kind EventKind, nodeID string, payload EventPayload) {
s.Enqueue(Event{Time: time, Kind: kind, NodeID: nodeID, Payload: payload})
}
// jitter returns a random delay in [1, jitterMax].
func (s *Simulator) jitter() uint64 {
if s.jitterMax <= 1 {
return 1
}
return 1 + uint64(s.rng.Int63n(int64(s.jitterMax)))
}
// --- Main loop ---
// Step executes the next event. It returns false when the queue is empty,
// a time or event limit is reached, or an invariant has been violated.
// When multiple events share the same timestamp, one is chosen randomly
// to explore different interleavings across runs with different seeds.
func (s *Simulator) Step() bool {
if s.queue.Len() == 0 || s.eventsRun >= s.MaxEvents {
return false
}
// Collect all events at the earliest timestamp.
earliest := s.queue[0].Time
if earliest > s.maxTime {
return false
}
var ready []Event
for s.queue.Len() > 0 && s.queue[0].Time == earliest {
ready = append(ready, heap.Pop(&s.queue).(Event))
}
// Shuffle to randomize interleaving of equal-time events.
s.rng.Shuffle(len(ready), func(i, j int) { ready[i], ready[j] = ready[j], ready[i] })
// Execute the first, re-enqueue the rest.
e := ready[0]
for _, r := range ready[1:] {
heap.Push(&s.queue, r)
}
s.Cluster.Now = e.Time
s.eventsRun++
s.execute(e)
s.checkInvariants(e)
return len(s.Errors) == 0
}
// Run executes until queue empty, limit reached, or invariant violated.
func (s *Simulator) Run() {
for s.Step() {
}
}
// --- Event execution ---
func (s *Simulator) execute(e Event) {
node := s.Cluster.Nodes[e.NodeID]
switch e.Kind {
case EvWriteStart:
s.executeWriteStart(e)
case EvShipEntry:
// Primary ships entry to a replica. Enqueue delivery with jitter.
if node != nil && node.Running {
deliverTime := s.Cluster.Now + s.jitter()
s.EnqueueAt(deliverTime, EvShipDeliver, e.Payload.ToNode, EventPayload{
Write: e.Payload.Write,
FromNode: e.NodeID,
})
s.record(e, fmt.Sprintf("ship LSN=%d to %s, deliver@%d", e.Payload.Write.LSN, e.Payload.ToNode, deliverTime))
}
case EvShipDeliver:
if node != nil && node.Running && node.Epoch == s.Cluster.Coordinator.Epoch {
if s.Cluster.Links[e.Payload.FromNode] != nil && s.Cluster.Links[e.Payload.FromNode][e.NodeID] {
node.Storage.AppendWrite(e.Payload.Write)
s.record(e, fmt.Sprintf("deliver LSN=%d on %s, receivedLSN=%d", e.Payload.Write.LSN, e.NodeID, node.Storage.ReceivedLSN))
} else {
s.record(e, fmt.Sprintf("drop LSN=%d to %s (link down)", e.Payload.Write.LSN, e.NodeID))
}
}
case EvBarrierSend:
if node != nil && node.Running {
deliverTime := s.Cluster.Now + s.jitter()
s.EnqueueAt(deliverTime, EvBarrierDeliver, e.Payload.ToNode, EventPayload{
TargetLSN: e.Payload.TargetLSN,
FromNode: e.NodeID,
})
s.record(e, fmt.Sprintf("barrier LSN=%d to %s", e.Payload.TargetLSN, e.Payload.ToNode))
}
case EvBarrierDeliver:
if node != nil && node.Running && node.Epoch == s.Cluster.Coordinator.Epoch {
if s.Cluster.Links[e.Payload.FromNode] != nil && s.Cluster.Links[e.Payload.FromNode][e.NodeID] {
if node.Storage.ReceivedLSN >= e.Payload.TargetLSN {
// Can fsync now. Enqueue fsync completion with small delay.
s.EnqueueAt(s.Cluster.Now+1, EvBarrierFsync, e.NodeID, EventPayload{
TargetLSN: e.Payload.TargetLSN,
FromNode: e.Payload.FromNode,
})
s.record(e, fmt.Sprintf("barrier deliver LSN=%d, fsync scheduled", e.Payload.TargetLSN))
} else {
// Not enough entries yet. Re-enqueue barrier with delay (retry).
s.EnqueueAt(s.Cluster.Now+1, EvBarrierDeliver, e.NodeID, e.Payload)
s.record(e, fmt.Sprintf("barrier LSN=%d waiting (received=%d)", e.Payload.TargetLSN, node.Storage.ReceivedLSN))
}
}
}
case EvBarrierFsync:
if node != nil && node.Running {
node.Storage.AdvanceFlush(e.Payload.TargetLSN)
// Send ack back to primary.
deliverTime := s.Cluster.Now + s.jitter()
s.EnqueueAt(deliverTime, EvBarrierAck, e.Payload.FromNode, EventPayload{
TargetLSN: e.Payload.TargetLSN,
FromNode: e.NodeID,
})
s.record(e, fmt.Sprintf("fsync LSN=%d on %s, flushedLSN=%d", e.Payload.TargetLSN, e.NodeID, node.Storage.FlushedLSN))
}
case EvBarrierAck:
// Only process acks on running nodes in the current epoch.
// After crash+promote, stale acks for the old primary must not advance commits.
if node != nil && node.Running && node.Epoch == s.Cluster.Coordinator.Epoch {
if pending := s.Cluster.Pending[e.Payload.TargetLSN]; pending != nil {
pending.DurableOn[e.Payload.FromNode] = true
s.Cluster.refreshCommits()
s.record(e, fmt.Sprintf("ack LSN=%d from %s, durable=%d", e.Payload.TargetLSN, e.Payload.FromNode, s.Cluster.durableAckCount(pending)))
}
} else {
s.record(e, fmt.Sprintf("ack LSN=%d from %s DROPPED (node down or stale epoch)", e.Payload.TargetLSN, e.Payload.FromNode))
}
case EvNodeCrash:
if node != nil {
node.Running = false
// Drop all pending events for this node.
s.dropEventsForNode(e.NodeID)
s.record(e, fmt.Sprintf("CRASH %s", e.NodeID))
}
case EvNodeRestart:
if node != nil {
node.Running = true
node.Epoch = s.Cluster.Coordinator.Epoch
s.record(e, fmt.Sprintf("RESTART %s epoch=%d", e.NodeID, node.Epoch))
}
case EvLinkDown:
s.Cluster.Disconnect(e.Payload.FromNode, e.Payload.ToNode)
s.Cluster.Disconnect(e.Payload.ToNode, e.Payload.FromNode)
s.record(e, fmt.Sprintf("LINK DOWN %s <-> %s", e.Payload.FromNode, e.Payload.ToNode))
case EvLinkUp:
s.Cluster.Connect(e.Payload.FromNode, e.Payload.ToNode)
s.Cluster.Connect(e.Payload.ToNode, e.Payload.FromNode)
s.record(e, fmt.Sprintf("LINK UP %s <-> %s", e.Payload.FromNode, e.Payload.ToNode))
case EvFlusherTick:
if node != nil && node.Running {
node.Storage.AdvanceCheckpoint(node.Storage.FlushedLSN)
s.record(e, fmt.Sprintf("flusher tick %s checkpoint=%d", e.NodeID, node.Storage.CheckpointLSN))
}
case EvPromote:
if err := s.Cluster.Promote(e.Payload.PromoteID); err != nil {
s.record(e, fmt.Sprintf("promote %s FAILED: %v", e.Payload.PromoteID, err))
} else {
s.record(e, fmt.Sprintf("PROMOTE %s epoch=%d", e.Payload.PromoteID, s.Cluster.Coordinator.Epoch))
}
case EvLockAcquire:
s.executeLockAcquire(e)
case EvLockRelease:
s.executeLockRelease(e)
}
}
func (s *Simulator) executeWriteStart(e Event) {
c := s.Cluster
primary := c.Primary()
if primary == nil || !primary.Running || primary.Epoch != c.Coordinator.Epoch {
s.record(e, "write rejected: no valid primary")
return
}
c.nextLSN++
w := Write{LSN: c.nextLSN, Block: e.Payload.Write.Block, Value: c.nextLSN}
primary.Storage.AppendWrite(w)
primary.Storage.AdvanceFlush(w.LSN)
c.Reference.Apply(w)
c.Pending[w.LSN] = &PendingCommit{
Write: w,
DurableOn: map[string]bool{primary.ID: true},
}
c.refreshCommits()
// Ship to each replica with jitter.
for _, rid := range c.replicaIDs() {
shipTime := s.Cluster.Now + s.jitter()
s.EnqueueAt(shipTime, EvShipEntry, primary.ID, EventPayload{
Write: w,
ToNode: rid,
})
}
// Barrier after ship.
for _, rid := range c.replicaIDs() {
barrierTime := s.Cluster.Now + s.jitter() + 2
s.EnqueueAt(barrierTime, EvBarrierSend, primary.ID, EventPayload{
TargetLSN: w.LSN,
ToNode: rid,
})
}
s.record(e, fmt.Sprintf("write block=%d LSN=%d", w.Block, w.LSN))
}
func (s *Simulator) executeLockAcquire(e Event) {
name := e.Payload.LockName
ls, ok := s.locks[name]
if !ok {
ls = &lockState{}
s.locks[name] = ls
}
if !ls.held {
ls.held = true
ls.holder = e.Payload.ThreadID
s.record(e, fmt.Sprintf("lock %s acquired by %s", name, e.Payload.ThreadID))
} else {
// Park — will be released when current holder releases.
ls.waiting = append(ls.waiting, e)
s.record(e, fmt.Sprintf("lock %s BLOCKED %s (held by %s)", name, e.Payload.ThreadID, ls.holder))
}
}
func (s *Simulator) executeLockRelease(e Event) {
name := e.Payload.LockName
ls := s.locks[name]
if ls == nil || !ls.held {
return
}
// Validate: only the holder can release.
if ls.holder != e.Payload.ThreadID {
s.record(e, fmt.Sprintf("lock %s release REJECTED: %s is not holder (held by %s)", name, e.Payload.ThreadID, ls.holder))
return
}
s.record(e, fmt.Sprintf("lock %s released by %s", name, ls.holder))
ls.held = false
ls.holder = ""
// Grant to next waiter (random pick among waiters for interleaving exploration).
if len(ls.waiting) > 0 {
idx := s.rng.Intn(len(ls.waiting))
next := ls.waiting[idx]
ls.waiting = append(ls.waiting[:idx], ls.waiting[idx+1:]...)
ls.held = true
ls.holder = next.Payload.ThreadID
s.record(next, fmt.Sprintf("lock %s granted to %s (was waiting)", name, next.Payload.ThreadID))
}
}
func (s *Simulator) dropEventsForNode(nodeID string) {
var kept eventHeap
for _, e := range s.queue {
if e.NodeID != nodeID {
kept = append(kept, e)
}
}
s.queue = kept
heap.Init(&s.queue)
}
// --- Invariant checking ---
func (s *Simulator) checkInvariants(after Event) {
// 1. Commit safety: committed LSN must be durable on policy-required nodes.
for lsn := uint64(1); lsn <= s.Cluster.Coordinator.CommittedLSN; lsn++ {
p := s.Cluster.Pending[lsn]
if p == nil {
continue
}
if !s.Cluster.commitSatisfied(p) {
s.addError(after, fmt.Sprintf("committed LSN %d not durable per policy", lsn))
}
}
// 2. No false commit on promoted node.
primary := s.Cluster.Primary()
if primary != nil && primary.Running {
committedLSN := s.Cluster.Coordinator.CommittedLSN
for lsn := committedLSN + 1; lsn <= s.Cluster.nextLSN; lsn++ {
p := s.Cluster.Pending[lsn]
if p != nil && !p.Committed && p.DurableOn[primary.ID] {
// Uncommitted-but-durable is expected on the original primary and only
// suspect after a promotion; this walk documents the condition but
// deliberately asserts nothing yet.
}
}
}
// 3. Data correctness: primary state matches reference at the LSN it actually has.
// After promotion, the new primary may not have all writes the old primary committed.
// Verify correctness only up to what the current primary has durably received.
if primary != nil && primary.Running {
checkLSN := primary.Storage.FlushedLSN
if checkLSN > s.Cluster.Coordinator.CommittedLSN {
checkLSN = s.Cluster.Coordinator.CommittedLSN
}
if checkLSN > 0 {
refState := s.Cluster.Reference.StateAt(checkLSN)
nodeState := primary.Storage.StateAt(checkLSN)
if !EqualState(refState, nodeState) {
s.addError(after, fmt.Sprintf("data divergence on primary %s at LSN=%d",
primary.ID, checkLSN))
}
}
}
// 4. Epoch fencing: no node has accepted a stale epoch.
for id, node := range s.Cluster.Nodes {
if node.Running && node.Epoch > s.Cluster.Coordinator.Epoch {
s.addError(after, fmt.Sprintf("node %s has future epoch %d > coordinator %d", id, node.Epoch, s.Cluster.Coordinator.Epoch))
}
}
// 5. Lock consistency: a held lock must always have a recorded holder.
for name, ls := range s.locks {
if ls.held && ls.holder == "" {
s.addError(after, fmt.Sprintf("lock %s held but no holder", name))
}
}
}
func (s *Simulator) addError(after Event, msg string) {
s.Errors = append(s.Errors, fmt.Sprintf("t=%d after %s on %s: %s",
after.Time, after.Kind, after.NodeID, msg))
}
func (s *Simulator) record(e Event, note string) {
s.trace = append(s.trace, TraceEntry{Time: e.Time, Event: e, Note: note})
}
// --- Random fault injection ---
// InjectRandomFault schedules a random fault (crash, partition, heal)
// at a random future time within [Now+1, Now+spread).
func (s *Simulator) InjectRandomFault() {
s.InjectRandomFaultWithin(30)
}
// InjectRandomFaultWithin schedules a random fault at a random time
// within [Now+1, Now+spread).
func (s *Simulator) InjectRandomFaultWithin(spread uint64) {
if s.rng.Float64() > s.FaultRate {
return
}
members := s.Cluster.Coordinator.Members
if len(members) == 0 {
return
}
faultTime := s.Cluster.Now + 1 + uint64(s.rng.Int63n(int64(spread)))
switch s.rng.Intn(3) {
case 0: // crash a random node
id := members[s.rng.Intn(len(members))]
s.EnqueueAt(faultTime, EvNodeCrash, id, EventPayload{})
case 1: // drop a link
from := members[s.rng.Intn(len(members))]
to := members[s.rng.Intn(len(members))]
if from != to {
s.EnqueueAt(faultTime, EvLinkDown, from, EventPayload{FromNode: from, ToNode: to})
}
case 2: // restore a link
from := members[s.rng.Intn(len(members))]
to := members[s.rng.Intn(len(members))]
if from != to {
s.EnqueueAt(faultTime, EvLinkUp, from, EventPayload{FromNode: from, ToNode: to})
}
}
}
// --- Scenario helpers ---
// ScheduleWrites enqueues n writes at random times in [start, start+spread).
func (s *Simulator) ScheduleWrites(n int, start, spread uint64) {
for i := 0; i < n; i++ {
t := start + uint64(s.rng.Int63n(int64(spread)))
block := uint64(s.rng.Intn(16))
s.EnqueueAt(t, EvWriteStart, s.Cluster.Coordinator.PrimaryID, EventPayload{
Write: Write{Block: block},
})
}
}
// ScheduleCrashAndPromote enqueues a primary crash at crashTime and promotes promoteID at promoteTime.
func (s *Simulator) ScheduleCrashAndPromote(crashTime uint64, promoteID string, promoteTime uint64) {
s.EnqueueAt(crashTime, EvNodeCrash, s.Cluster.Coordinator.PrimaryID, EventPayload{})
s.EnqueueAt(promoteTime, EvPromote, "", EventPayload{PromoteID: promoteID})
}
// ScheduleFlusherTicks enqueues periodic flusher ticks for a node.
func (s *Simulator) ScheduleFlusherTicks(nodeID string, start, interval uint64, count int) {
for i := 0; i < count; i++ {
s.EnqueueAt(start+uint64(i)*interval, EvFlusherTick, nodeID, EventPayload{})
}
}
// --- Output ---
// TraceString returns the full trace as a string.
func (s *Simulator) TraceString() string {
var sb strings.Builder
for _, te := range s.trace {
fmt.Fprintf(&sb, "[t=%d] %s on %s: %s\n", te.Time, te.Event.Kind, te.Event.NodeID, te.Note)
}
return sb.String()
}
// ErrorString returns all errors.
func (s *Simulator) ErrorString() string {
return strings.Join(s.Errors, "\n")
}
// AssertCommittedDataCorrect checks that the current primary's state matches the reference.
func (s *Simulator) AssertCommittedDataCorrect() error {
primary := s.Cluster.Primary()
if primary == nil {
return fmt.Errorf("no primary")
}
committedLSN := s.Cluster.Coordinator.CommittedLSN
if committedLSN == 0 {
return nil
}
refState := s.Cluster.Reference.StateAt(committedLSN)
nodeState := primary.Storage.StateAt(committedLSN)
if !EqualState(refState, nodeState) {
return fmt.Errorf("data divergence on %s at LSN=%d: ref=%v node=%v",
primary.ID, committedLSN, refState, nodeState)
}
return nil
}

285
sw-block/prototype/distsim/simulator_test.go

@@ -0,0 +1,285 @@
package distsim
import (
"fmt"
"strings"
"testing"
)
// --- Fixed scenarios ---
func TestSim_BasicWriteAndCommit(t *testing.T) {
c := NewCluster(CommitSyncQuorum, "p", "r1", "r2")
sim := NewSimulator(c, 42)
sim.ScheduleWrites(3, 1, 5)
sim.Run()
if c.Coordinator.CommittedLSN < 1 {
t.Fatalf("expected at least 1 committed write, got %d", c.Coordinator.CommittedLSN)
}
if err := sim.AssertCommittedDataCorrect(); err != nil {
t.Fatal(err)
}
if len(sim.Errors) > 0 {
t.Fatalf("invariant violations:\n%s", sim.ErrorString())
}
}
func TestSim_CrashAfterCommit_DataSurvives(t *testing.T) {
c := NewCluster(CommitSyncQuorum, "p", "r1", "r2")
sim := NewSimulator(c, 99)
// Write, let it commit, then crash primary, promote r1.
sim.ScheduleWrites(5, 1, 3)
sim.ScheduleCrashAndPromote(20, "r1", 22)
sim.Run()
if len(sim.Errors) > 0 {
t.Fatalf("invariant violations:\n%s\nTrace:\n%s", sim.ErrorString(), sim.TraceString())
}
if err := sim.AssertCommittedDataCorrect(); err != nil {
t.Fatal(err)
}
}
func TestSim_PartitionThenHeal(t *testing.T) {
c := NewCluster(CommitSyncQuorum, "p", "r1", "r2")
sim := NewSimulator(c, 777)
// Write some data.
sim.ScheduleWrites(3, 1, 3)
// Partition r2 at time 5.
sim.EnqueueAt(5, EvLinkDown, "p", EventPayload{FromNode: "p", ToNode: "r2"})
// Write more during partition.
sim.ScheduleWrites(3, 8, 3)
// Heal at time 15.
sim.EnqueueAt(15, EvLinkUp, "p", EventPayload{FromNode: "p", ToNode: "r2"})
// Write after heal.
sim.ScheduleWrites(2, 18, 3)
sim.Run()
if len(sim.Errors) > 0 {
t.Fatalf("invariant violations:\n%s", sim.ErrorString())
}
if err := sim.AssertCommittedDataCorrect(); err != nil {
t.Fatal(err)
}
}
func TestSim_SyncAll_UncommittedNotVisible(t *testing.T) {
c := NewCluster(CommitSyncAll, "p", "r1")
sim := NewSimulator(c, 123)
// Partition r1 so nothing can commit under sync_all.
sim.EnqueueAt(0, EvLinkDown, "p", EventPayload{FromNode: "p", ToNode: "r1"})
sim.ScheduleWrites(3, 1, 3)
sim.Run()
// Nothing should be committed.
if c.Coordinator.CommittedLSN != 0 {
t.Fatalf("sync_all with partitioned replica should not commit, got %d", c.Coordinator.CommittedLSN)
}
if len(sim.Errors) > 0 {
t.Fatalf("invariant violations:\n%s", sim.ErrorString())
}
}
func TestSim_MessageReorderingDoesNotBreakSafety(t *testing.T) {
c := NewCluster(CommitSyncQuorum, "p", "r1", "r2")
sim := NewSimulator(c, 555)
sim.jitterMax = 8 // high jitter to force reordering
sim.ScheduleWrites(10, 1, 5)
sim.Run()
if len(sim.Errors) > 0 {
t.Fatalf("invariant violations with high jitter:\n%s", sim.ErrorString())
}
if err := sim.AssertCommittedDataCorrect(); err != nil {
t.Fatal(err)
}
}
// --- Randomized property-based testing ---
func TestSim_Randomized_CommitSafety(t *testing.T) {
const numSeeds = 500
const numWrites = 20
failures := 0
for seed := int64(0); seed < numSeeds; seed++ {
c := NewCluster(CommitSyncQuorum, "p", "r1", "r2")
sim := NewSimulator(c, seed)
sim.MaxEvents = 2000
// Random writes.
sim.ScheduleWrites(numWrites, 1, 30)
// Random crash + promote somewhere in the middle.
crashTime := uint64(sim.rng.Intn(25) + 5)
sim.ScheduleCrashAndPromote(crashTime, "r1", crashTime+3)
sim.Run()
if len(sim.Errors) > 0 {
t.Errorf("seed %d: invariant violation:\n%s\nTrace (last 20):\n%s",
seed, sim.ErrorString(), lastN(sim.trace, 20))
failures++
if failures >= 3 {
t.Fatal("too many failures, stopping")
}
}
}
t.Logf("randomized: %d/%d seeds passed", numSeeds-failures, numSeeds)
}
func TestSim_Randomized_WithFaults(t *testing.T) {
const numSeeds = 300
failures := 0
for seed := int64(0); seed < numSeeds; seed++ {
c := NewCluster(CommitSyncQuorum, "p", "r1", "r2")
sim := NewSimulator(c, seed)
sim.FaultRate = 0.08
sim.MaxEvents = 1500
sim.jitterMax = 5
// Interleave writes and random faults.
for i := 0; i < 15; i++ {
t := uint64(i*3 + 1)
sim.EnqueueAt(t, EvWriteStart, "p", EventPayload{
Write: Write{Block: uint64(sim.rng.Intn(8))},
})
sim.InjectRandomFault()
}
sim.Run()
if len(sim.Errors) > 0 {
t.Errorf("seed %d: invariant violation:\n%s", seed, sim.ErrorString())
failures++
if failures >= 3 {
t.Fatal("too many failures, stopping")
}
}
}
t.Logf("randomized+faults: %d/%d seeds passed", numSeeds-failures, numSeeds)
}
func TestSim_Randomized_SyncAll(t *testing.T) {
const numSeeds = 200
failures := 0
for seed := int64(0); seed < numSeeds; seed++ {
c := NewCluster(CommitSyncAll, "p", "r1")
sim := NewSimulator(c, seed)
sim.MaxEvents = 1000
sim.ScheduleWrites(10, 1, 20)
// Random partition/heal.
if sim.rng.Float64() < 0.5 {
pTime := uint64(sim.rng.Intn(15) + 1)
sim.EnqueueAt(pTime, EvLinkDown, "p", EventPayload{FromNode: "p", ToNode: "r1"})
sim.EnqueueAt(pTime+uint64(sim.rng.Intn(10)+3), EvLinkUp, "p", EventPayload{FromNode: "p", ToNode: "r1"})
}
sim.Run()
if len(sim.Errors) > 0 {
t.Errorf("seed %d: invariant violation:\n%s", seed, sim.ErrorString())
failures++
if failures >= 3 {
t.Fatal("too many failures, stopping")
}
}
}
t.Logf("sync_all randomized: %d/%d seeds passed", numSeeds-failures, numSeeds)
}
// --- Lock contention tests ---
func TestSim_LockContention_NoDoubleHold(t *testing.T) {
c := NewCluster(CommitSyncQuorum, "p", "r1", "r2")
sim := NewSimulator(c, 42)
// Two threads try to acquire the same lock at the same time.
sim.EnqueueAt(5, EvLockAcquire, "p", EventPayload{LockName: "shipMu", ThreadID: "writer-1"})
sim.EnqueueAt(5, EvLockAcquire, "p", EventPayload{LockName: "shipMu", ThreadID: "writer-2"})
// First release.
sim.EnqueueAt(8, EvLockRelease, "p", EventPayload{LockName: "shipMu", ThreadID: "writer-1"})
// Second release (whoever got granted after writer-1 releases).
sim.EnqueueAt(11, EvLockRelease, "p", EventPayload{LockName: "shipMu", ThreadID: "writer-2"})
sim.Run()
if len(sim.Errors) > 0 {
t.Fatalf("lock invariant violated:\n%s\nTrace:\n%s", sim.ErrorString(), sim.TraceString())
}
// Verify the trace shows one blocked, one granted.
trace := sim.TraceString()
if !containsStr(trace, "BLOCKED") {
t.Fatal("expected one thread to be BLOCKED on lock contention")
}
if !containsStr(trace, "granted to") {
t.Fatal("expected blocked thread to be granted after release")
}
}
func TestSim_LockContention_Randomized(t *testing.T) {
// Run many seeds with concurrent lock acquires at the same time.
// The simulator should pick a random winner each time (seed-dependent).
winners := map[string]int{}
for seed := int64(0); seed < 100; seed++ {
c := NewCluster(CommitSyncQuorum, "p", "r1", "r2")
sim := NewSimulator(c, seed)
sim.EnqueueAt(1, EvLockAcquire, "p", EventPayload{LockName: "mu", ThreadID: "A"})
sim.EnqueueAt(1, EvLockAcquire, "p", EventPayload{LockName: "mu", ThreadID: "B"})
sim.EnqueueAt(3, EvLockRelease, "p", EventPayload{LockName: "mu", ThreadID: "A"})
sim.EnqueueAt(3, EvLockRelease, "p", EventPayload{LockName: "mu", ThreadID: "B"})
sim.Run()
if len(sim.Errors) > 0 {
t.Fatalf("seed %d: %s", seed, sim.ErrorString())
}
// Check who got the lock first by looking at the trace.
for _, te := range sim.trace {
if te.Event.Kind == EvLockAcquire && containsStr(te.Note, "acquired") {
winners[te.Event.Payload.ThreadID]++
break
}
}
}
// Both threads should win at least some seeds (randomization works).
if winners["A"] == 0 || winners["B"] == 0 {
t.Fatalf("lock winner not randomized: A=%d B=%d", winners["A"], winners["B"])
}
t.Logf("lock winner distribution: A=%d B=%d", winners["A"], winners["B"])
}
// containsStr reports whether substr occurs within s.
func containsStr(s, substr string) bool {
	return strings.Contains(s, substr)
}
// --- Helpers ---
func lastN(trace []TraceEntry, n int) string {
	start := len(trace) - n
	if start < 0 {
		start = 0
	}
	var b strings.Builder
	for _, te := range trace[start:] {
		fmt.Fprintf(&b, "[t=%d] %s on %s: %s\n", te.Time, te.Event.Kind, te.Event.NodeID, te.Note)
	}
	return b.String()
}

129
sw-block/prototype/distsim/storage.go

@ -0,0 +1,129 @@
package distsim
import "sort"
type SnapshotState struct {
ID string
LSN uint64
State map[uint64]uint64
}
type Storage struct {
WAL []Write
Extent map[uint64]uint64
ReceivedLSN uint64
FlushedLSN uint64
CheckpointLSN uint64
Snapshots map[string]SnapshotState
BaseSnapshot *SnapshotState
}
func NewStorage() *Storage {
return &Storage{
Extent: map[uint64]uint64{},
Snapshots: map[string]SnapshotState{},
}
}
func (s *Storage) AppendWrite(w Write) {
	// Insert in LSN order (handles out-of-order delivery from jitter).
	inserted := false
	for i, existing := range s.WAL {
		if w.LSN == existing.LSN {
			return // duplicate, skip
		}
		if w.LSN < existing.LSN {
			s.WAL = append(s.WAL[:i], append([]Write{w}, s.WAL[i:]...)...)
			inserted = true
			break
		}
	}
	if !inserted {
		s.WAL = append(s.WAL, w)
	}
	if w.LSN >= s.ReceivedLSN {
		// In-order arrival: apply directly.
		s.ReceivedLSN = w.LSN
		s.Extent[w.Block] = w.Value
		return
	}
	// Out-of-order arrival: replay the block's newest value from the
	// (now ordered) WAL so an older write cannot clobber newer data.
	for i := len(s.WAL) - 1; i >= 0; i-- {
		if s.WAL[i].Block == w.Block {
			s.Extent[w.Block] = s.WAL[i].Value
			break
		}
	}
}
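The ordered-insert loop is the core of the duplicate and jitter handling. A standalone sketch (with a minimal hypothetical `Write` and helper `appendOrdered` mirroring the insertion logic, not the real simulator types) shows LSN order restored and duplicates dropped:

```go
package main

import "fmt"

// Write is a minimal stand-in for the simulator's WAL record.
type Write struct{ LSN, Block, Value uint64 }

// appendOrdered inserts w into an LSN-sorted WAL, skipping duplicate LSNs.
// This mirrors the insertion loop in Storage.AppendWrite.
func appendOrdered(wal []Write, w Write) []Write {
	for i, existing := range wal {
		if w.LSN == existing.LSN {
			return wal // duplicate, skip
		}
		if w.LSN < existing.LSN {
			// Inner append copies wal[i:] into a fresh slice first,
			// so the outer append cannot alias over it.
			return append(wal[:i], append([]Write{w}, wal[i:]...)...)
		}
	}
	return append(wal, w)
}

func main() {
	var wal []Write
	// Deliver out of order: 2, 1, 3, then a duplicate of LSN 2.
	for _, w := range []Write{{2, 0, 20}, {1, 0, 10}, {3, 1, 30}, {2, 0, 99}} {
		wal = appendOrdered(wal, w)
	}
	fmt.Println(wal) // → [{1 0 10} {2 0 20} {3 1 30}]
}
```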
func (s *Storage) AdvanceFlush(lsn uint64) {
if lsn > s.ReceivedLSN {
lsn = s.ReceivedLSN
}
if lsn > s.FlushedLSN {
s.FlushedLSN = lsn
}
}
func (s *Storage) AdvanceCheckpoint(lsn uint64) {
if lsn > s.FlushedLSN {
lsn = s.FlushedLSN
}
if lsn > s.CheckpointLSN {
s.CheckpointLSN = lsn
}
}
func (s *Storage) StateAt(lsn uint64) map[uint64]uint64 {
state := map[uint64]uint64{}
	if s.BaseSnapshot != nil {
		if s.BaseSnapshot.LSN > lsn {
			// The base snapshot is already past the requested LSN; its
			// state is the closest reconstructable view.
			return cloneMap(s.BaseSnapshot.State)
		}
		state = cloneMap(s.BaseSnapshot.State)
	}
for _, w := range s.WAL {
if w.LSN > lsn {
break
}
if s.BaseSnapshot != nil && w.LSN <= s.BaseSnapshot.LSN {
continue
}
state[w.Block] = w.Value
}
return state
}
func (s *Storage) TakeSnapshot(id string, lsn uint64) SnapshotState {
	snap := SnapshotState{
		ID:    id,
		LSN:   lsn,
		State: s.StateAt(lsn), // StateAt already returns a fresh map
	}
s.Snapshots[id] = snap
return snap
}
func (s *Storage) LoadSnapshot(snap SnapshotState) {
s.Extent = cloneMap(snap.State)
s.FlushedLSN = snap.LSN
s.ReceivedLSN = snap.LSN
s.CheckpointLSN = snap.LSN
s.BaseSnapshot = &SnapshotState{
ID: snap.ID,
LSN: snap.LSN,
State: cloneMap(snap.State),
}
s.WAL = nil
}
func (s *Storage) ReplaceWAL(writes []Write) {
s.WAL = append([]Write(nil), writes...)
sort.Slice(s.WAL, func(i, j int) bool { return s.WAL[i].LSN < s.WAL[j].LSN })
s.Extent = s.StateAt(s.ReceivedLSN)
}
// writesInRange returns the writes with startExclusive < LSN <= endInclusive.
// It assumes the input is sorted ascending by LSN and stops at the first
// write past endInclusive.
func writesInRange(writes []Write, startExclusive, endInclusive uint64) []Write {
out := make([]Write, 0)
for _, w := range writes {
if w.LSN <= startExclusive {
continue
}
if w.LSN > endInclusive {
break
}
out = append(out, w)
}
return out
}

64
sw-block/prototype/enginev2/assignment.go

@ -0,0 +1,64 @@
package enginev2
// AssignmentIntent represents a coordinator-driven assignment update.
// It specifies the desired replica set and which replicas need recovery.
type AssignmentIntent struct {
Endpoints map[string]Endpoint // desired replica set
Epoch uint64 // current epoch
RecoveryTargets map[string]SessionKind // replicas that need recovery (nil = no recovery)
}
// AssignmentResult records what the SenderGroup did in response to an assignment.
type AssignmentResult struct {
Added []string // new senders created
Removed []string // old senders stopped
SessionsCreated []string // fresh recovery sessions attached
SessionsSuperseded []string // existing sessions superseded by new ones
SessionsFailed []string // recovery sessions that couldn't be created
}
// ApplyAssignment processes a coordinator assignment intent:
// 1. Reconcile endpoints — add/remove/update senders
// 2. For each recovery target, create a recovery session on the sender
//
// Epoch fencing: if intent.Epoch < sender.Epoch for any target, that target
// is rejected. Stale assignment intent cannot create live sessions.
func (sg *SenderGroup) ApplyAssignment(intent AssignmentIntent) AssignmentResult {
var result AssignmentResult
// Step 1: reconcile topology.
result.Added, result.Removed = sg.Reconcile(intent.Endpoints, intent.Epoch)
// Step 2: create recovery sessions for designated targets.
if intent.RecoveryTargets == nil {
return result
}
sg.mu.RLock()
defer sg.mu.RUnlock()
for replicaID, kind := range intent.RecoveryTargets {
sender, ok := sg.senders[replicaID]
if !ok {
result.SessionsFailed = append(result.SessionsFailed, replicaID)
continue
}
// Reject stale assignment: intent epoch must not be older than the sender's.
if intent.Epoch < sender.Epoch {
result.SessionsFailed = append(result.SessionsFailed, replicaID)
continue
}
_, err := sender.AttachSession(intent.Epoch, kind)
if err != nil {
// Session already active at current epoch — supersede it.
sess := sender.SupersedeSession(kind, "assignment_intent")
if sess != nil {
result.SessionsSuperseded = append(result.SessionsSuperseded, replicaID)
} else {
result.SessionsFailed = append(result.SessionsFailed, replicaID)
}
continue
}
result.SessionsCreated = append(result.SessionsCreated, replicaID)
}
return result
}
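The two failure rules in step 2 (unknown replica, stale epoch) can be modeled in isolation. A minimal sketch, with a hypothetical `group` and `admitTargets` standing in for the real SenderGroup and ApplyAssignment:

```go
package main

import "fmt"

// group is a hypothetical stand-in for SenderGroup: all that matters
// for fencing is each replica's current sender epoch.
type group struct{ epochs map[string]uint64 }

// admitTargets applies ApplyAssignment's per-target failure rules:
// a target fails if the replica has no sender, or if the intent's
// epoch is older than that sender's epoch.
func (g *group) admitTargets(intentEpoch uint64, targets []string) (created, failed []string) {
	for _, id := range targets {
		senderEpoch, ok := g.epochs[id]
		if !ok || intentEpoch < senderEpoch {
			failed = append(failed, id)
			continue
		}
		created = append(created, id)
	}
	return created, failed
}

func main() {
	g := &group{epochs: map[string]uint64{"r1:9333": 2}}
	// A stale epoch-1 intent against an epoch-2 sender fails;
	// so does a target the group has never seen.
	created, failed := g.admitTargets(1, []string{"r1:9333", "r99:9333"})
	fmt.Println(created, failed) // → [] [r1:9333 r99:9333]
}
```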

420
sw-block/prototype/enginev2/execution_test.go

@ -0,0 +1,420 @@
package enginev2
import "testing"
// ============================================================
// Phase 04 P1: Session execution and sender-group orchestration
// ============================================================
// --- Execution API: full lifecycle ---
func TestExec_FullRecoveryLifecycle(t *testing.T) {
s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", Version: 1}, 1)
sess, _ := s.AttachSession(1, SessionCatchUp)
id := sess.ID
// init → connecting
if err := s.BeginConnect(id); err != nil {
t.Fatalf("BeginConnect: %v", err)
}
if s.State != StateConnecting {
t.Fatalf("state=%s, want connecting", s.State)
}
// connecting → handshake
if err := s.RecordHandshake(id, 5, 20); err != nil {
t.Fatalf("RecordHandshake: %v", err)
}
if sess.StartLSN != 5 || sess.TargetLSN != 20 {
t.Fatalf("range: start=%d target=%d", sess.StartLSN, sess.TargetLSN)
}
// handshake → catchup
if err := s.BeginCatchUp(id); err != nil {
t.Fatalf("BeginCatchUp: %v", err)
}
if s.State != StateCatchingUp {
t.Fatalf("state=%s, want catching_up", s.State)
}
// progress
if err := s.RecordCatchUpProgress(id, 15); err != nil {
t.Fatalf("progress to 15: %v", err)
}
if err := s.RecordCatchUpProgress(id, 20); err != nil {
t.Fatalf("progress to 20: %v", err)
}
if !sess.Converged() {
t.Fatal("should be converged at 20/20")
}
// complete
if !s.CompleteSessionByID(id) {
t.Fatal("completion should succeed")
}
if s.State != StateInSync {
t.Fatalf("state=%s, want in_sync", s.State)
}
}
// --- Stale sessionID rejection across all execution APIs ---
func TestExec_StaleID_AllAPIsReject(t *testing.T) {
s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", Version: 1}, 1)
sess1, _ := s.AttachSession(1, SessionCatchUp)
oldID := sess1.ID
// Supersede with new session.
s.UpdateEpoch(2)
sess2, _ := s.AttachSession(2, SessionCatchUp)
_ = sess2
// All APIs must reject oldID.
if err := s.BeginConnect(oldID); err == nil {
t.Fatal("BeginConnect should reject stale ID")
}
if err := s.RecordHandshake(oldID, 0, 10); err == nil {
t.Fatal("RecordHandshake should reject stale ID")
}
if err := s.BeginCatchUp(oldID); err == nil {
t.Fatal("BeginCatchUp should reject stale ID")
}
if err := s.RecordCatchUpProgress(oldID, 5); err == nil {
t.Fatal("RecordCatchUpProgress should reject stale ID")
}
if s.CompleteSessionByID(oldID) {
t.Fatal("CompleteSessionByID should reject stale ID")
}
}
// --- Phase ordering enforcement ---
func TestExec_WrongPhaseOrder_Rejected(t *testing.T) {
s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", Version: 1}, 1)
sess, _ := s.AttachSession(1, SessionCatchUp)
id := sess.ID
// Skip connecting → go directly to handshake: rejected.
if err := s.RecordHandshake(id, 0, 10); err == nil {
t.Fatal("handshake from init should be rejected")
}
// Skip to catch-up from init: rejected.
if err := s.BeginCatchUp(id); err == nil {
t.Fatal("catch-up from init should be rejected")
}
// Progress from init: rejected (not in catch-up phase).
if err := s.RecordCatchUpProgress(id, 5); err == nil {
t.Fatal("progress from init should be rejected")
}
// Correct path: init → connecting.
s.BeginConnect(id)
// Now try catch-up from connecting: rejected (must handshake first).
if err := s.BeginCatchUp(id); err == nil {
t.Fatal("catch-up from connecting should be rejected")
}
}
// --- Progress regression rejection ---
func TestExec_ProgressRegression_Rejected(t *testing.T) {
s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", Version: 1}, 1)
sess, _ := s.AttachSession(1, SessionCatchUp)
id := sess.ID
s.BeginConnect(id)
s.RecordHandshake(id, 0, 100)
s.BeginCatchUp(id)
s.RecordCatchUpProgress(id, 50)
// Regression: 30 < 50.
if err := s.RecordCatchUpProgress(id, 30); err == nil {
t.Fatal("progress regression should be rejected")
}
// Same value: 50 = 50.
if err := s.RecordCatchUpProgress(id, 50); err == nil {
t.Fatal("non-advancing progress should be rejected")
}
// Advance: 60 > 50.
if err := s.RecordCatchUpProgress(id, 60); err != nil {
t.Fatalf("valid progress should succeed: %v", err)
}
}
// --- Epoch bump during execution ---
func TestExec_EpochBumpDuringExecution_InvalidatesAuthority(t *testing.T) {
s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", Version: 1}, 1)
sess, _ := s.AttachSession(1, SessionCatchUp)
id := sess.ID
s.BeginConnect(id)
s.RecordHandshake(id, 0, 100)
s.BeginCatchUp(id)
s.RecordCatchUpProgress(id, 50)
// Epoch bumps mid-execution.
s.UpdateEpoch(2)
// All further execution on old session rejected.
if err := s.RecordCatchUpProgress(id, 60); err == nil {
t.Fatal("progress after epoch bump should be rejected")
}
if s.CompleteSessionByID(id) {
t.Fatal("completion after epoch bump should be rejected")
}
// Sender is disconnected, ready for new session.
if s.State != StateDisconnected {
t.Fatalf("state=%s, want disconnected", s.State)
}
}
// --- Endpoint change during execution ---
func TestExec_EndpointChangeDuringExecution_InvalidatesAuthority(t *testing.T) {
s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", CtrlAddr: "r1:9334", Version: 1}, 1)
sess, _ := s.AttachSession(1, SessionCatchUp)
id := sess.ID
s.BeginConnect(id)
s.RecordHandshake(id, 0, 50)
s.BeginCatchUp(id)
// Endpoint changes mid-execution.
s.UpdateEndpoint(Endpoint{DataAddr: "r1:9444", CtrlAddr: "r1:9445", Version: 2})
// All further execution rejected.
if err := s.RecordCatchUpProgress(id, 10); err == nil {
t.Fatal("progress after endpoint change should be rejected")
}
if s.CompleteSessionByID(id) {
t.Fatal("completion after endpoint change should be rejected")
}
}
// --- Completion authority enforcement ---
func TestExec_CompletionRejected_FromInit(t *testing.T) {
s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", Version: 1}, 1)
sess, _ := s.AttachSession(1, SessionCatchUp)
if s.CompleteSessionByID(sess.ID) {
t.Fatal("completion from PhaseInit should be rejected")
}
}
func TestExec_CompletionRejected_FromConnecting(t *testing.T) {
s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", Version: 1}, 1)
sess, _ := s.AttachSession(1, SessionCatchUp)
s.BeginConnect(sess.ID)
if s.CompleteSessionByID(sess.ID) {
t.Fatal("completion from PhaseConnecting should be rejected")
}
}
func TestExec_CompletionRejected_FromHandshakeWithGap(t *testing.T) {
s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", Version: 1}, 1)
sess, _ := s.AttachSession(1, SessionCatchUp)
s.BeginConnect(sess.ID)
s.RecordHandshake(sess.ID, 5, 20) // gap exists: 5 → 20
if s.CompleteSessionByID(sess.ID) {
t.Fatal("completion from PhaseHandshake with gap should be rejected")
}
}
func TestExec_CompletionAllowed_FromHandshakeZeroGap(t *testing.T) {
// Fast path: handshake shows replica already at target (zero gap).
s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", Version: 1}, 1)
sess, _ := s.AttachSession(1, SessionCatchUp)
s.BeginConnect(sess.ID)
s.RecordHandshake(sess.ID, 10, 10) // zero gap: start == target
if !s.CompleteSessionByID(sess.ID) {
t.Fatal("completion from handshake with zero gap should be allowed")
}
if s.State != StateInSync {
t.Fatalf("state=%s, want in_sync", s.State)
}
}
func TestExec_CompletionRejected_FromCatchUpNotConverged(t *testing.T) {
s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", Version: 1}, 1)
sess, _ := s.AttachSession(1, SessionCatchUp)
s.BeginConnect(sess.ID)
s.RecordHandshake(sess.ID, 0, 100)
s.BeginCatchUp(sess.ID)
s.RecordCatchUpProgress(sess.ID, 50) // not converged (50 < 100)
if s.CompleteSessionByID(sess.ID) {
t.Fatal("completion before convergence should be rejected")
}
}
func TestExec_HandshakeInvalidRange_Rejected(t *testing.T) {
s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", Version: 1}, 1)
sess, _ := s.AttachSession(1, SessionCatchUp)
s.BeginConnect(sess.ID)
if err := s.RecordHandshake(sess.ID, 20, 5); err == nil {
t.Fatal("handshake with target < start should be rejected")
}
}
// --- SenderGroup orchestration ---
func TestOrch_RepeatedReconnectCycles_PreserveSenderIdentity(t *testing.T) {
sg := NewSenderGroup()
sg.Reconcile(map[string]Endpoint{
"r1:9333": {DataAddr: "r1:9333", Version: 1},
}, 1)
s := sg.Sender("r1:9333")
original := s // save pointer
// 5 reconnect cycles — sender identity preserved.
for cycle := 0; cycle < 5; cycle++ {
sess, err := s.AttachSession(1, SessionCatchUp)
if err != nil {
t.Fatalf("cycle %d attach: %v", cycle, err)
}
s.BeginConnect(sess.ID)
s.RecordHandshake(sess.ID, 0, 10)
s.BeginCatchUp(sess.ID)
s.RecordCatchUpProgress(sess.ID, 10)
s.CompleteSessionByID(sess.ID)
if s.State != StateInSync {
t.Fatalf("cycle %d: state=%s, want in_sync", cycle, s.State)
}
}
// Same pointer — identity preserved.
if sg.Sender("r1:9333") != original {
t.Fatal("sender identity should be preserved across cycles")
}
}
func TestOrch_EndpointUpdateSupersedesActiveSession(t *testing.T) {
sg := NewSenderGroup()
sg.Reconcile(map[string]Endpoint{
"r1:9333": {DataAddr: "r1:9333", CtrlAddr: "r1:9334", Version: 1},
}, 1)
s := sg.Sender("r1:9333")
sess, _ := s.AttachSession(1, SessionCatchUp)
s.BeginConnect(sess.ID)
// Endpoint update via reconcile — session invalidated.
sg.Reconcile(map[string]Endpoint{
"r1:9333": {DataAddr: "r1:9333", CtrlAddr: "r1:9334", Version: 2},
}, 1)
if sess.Active() {
t.Fatal("session should be invalidated by endpoint update")
}
// Sender preserved, session gone.
if sg.Sender("r1:9333") != s {
t.Fatal("sender identity should be preserved")
}
if s.Session() != nil {
t.Fatal("session should be nil after endpoint invalidation")
}
}
func TestOrch_ReconcileMixedAddRemoveUpdate(t *testing.T) {
sg := NewSenderGroup()
sg.Reconcile(map[string]Endpoint{
"r1:9333": {DataAddr: "r1:9333", Version: 1},
"r2:9333": {DataAddr: "r2:9333", Version: 1},
"r3:9333": {DataAddr: "r3:9333", Version: 1},
}, 1)
r1 := sg.Sender("r1:9333")
r2 := sg.Sender("r2:9333")
// Attach sessions to r1 and r2.
r1Sess, _ := r1.AttachSession(1, SessionCatchUp)
r2Sess, _ := r2.AttachSession(1, SessionCatchUp)
// Reconcile: keep r1, remove r2, update r3, add r4.
added, removed := sg.Reconcile(map[string]Endpoint{
"r1:9333": {DataAddr: "r1:9333", Version: 1}, // kept
"r3:9333": {DataAddr: "r3:9333", Version: 2}, // updated
"r4:9333": {DataAddr: "r4:9333", Version: 1}, // added
}, 1)
if len(added) != 1 || added[0] != "r4:9333" {
t.Fatalf("added=%v", added)
}
if len(removed) != 1 || removed[0] != "r2:9333" {
t.Fatalf("removed=%v", removed)
}
// r1: preserved with active session.
if sg.Sender("r1:9333") != r1 {
t.Fatal("r1 should be preserved")
}
if !r1Sess.Active() {
t.Fatal("r1 session should still be active")
}
// r2: stopped and removed.
if sg.Sender("r2:9333") != nil {
t.Fatal("r2 should be removed")
}
	if !r2.Stopped() {
		t.Fatal("r2 should be stopped")
	}
if r2Sess.Active() {
t.Fatal("r2 session should be invalidated (sender stopped)")
}
// r4: new sender, no session.
if sg.Sender("r4:9333") == nil {
t.Fatal("r4 should exist")
}
}
func TestOrch_EpochBumpInvalidatesExecutingSessions(t *testing.T) {
sg := NewSenderGroup()
sg.Reconcile(map[string]Endpoint{
"r1:9333": {DataAddr: "r1:9333", Version: 1},
"r2:9333": {DataAddr: "r2:9333", Version: 1},
}, 1)
r1 := sg.Sender("r1:9333")
r2 := sg.Sender("r2:9333")
sess1, _ := r1.AttachSession(1, SessionCatchUp)
r1.BeginConnect(sess1.ID)
r1.RecordHandshake(sess1.ID, 0, 50)
r1.BeginCatchUp(sess1.ID)
r1.RecordCatchUpProgress(sess1.ID, 25) // mid-execution
sess2, _ := r2.AttachSession(1, SessionCatchUp)
r2.BeginConnect(sess2.ID)
// Epoch bump.
count := sg.InvalidateEpoch(2)
if count != 2 {
t.Fatalf("should invalidate 2 sessions, got %d", count)
}
// Both sessions dead.
if sess1.Active() || sess2.Active() {
t.Fatal("both sessions should be invalidated")
}
// r1's mid-execution progress cannot continue.
if err := r1.RecordCatchUpProgress(sess1.ID, 30); err == nil {
t.Fatal("progress on invalidated session should be rejected")
}
}

3
sw-block/prototype/enginev2/go.mod

@ -0,0 +1,3 @@
module github.com/seaweedfs/seaweedfs/sw-block/prototype/enginev2
go 1.23.0

39
sw-block/prototype/enginev2/outcome.go

@ -0,0 +1,39 @@
package enginev2
// HandshakeResult captures what the reconnect handshake reveals about a
// replica's state relative to the primary's lineage-safe boundary.
type HandshakeResult struct {
ReplicaFlushedLSN uint64 // highest LSN durably persisted on replica
CommittedLSN uint64 // lineage-safe recovery target (committed prefix)
RetentionStartLSN uint64 // oldest LSN still available in primary WAL
}
// RecoveryOutcome classifies the gap between replica and primary.
type RecoveryOutcome string
const (
OutcomeZeroGap RecoveryOutcome = "zero_gap" // replica has full committed prefix
OutcomeCatchUp RecoveryOutcome = "catchup" // gap within WAL retention
OutcomeNeedsRebuild RecoveryOutcome = "needs_rebuild" // gap exceeds retention
)
// ClassifyRecoveryOutcome determines the recovery path from handshake data.
//
// Uses CommittedLSN (not WAL head) as the target boundary. This is the
// lineage-safe recovery point — only acknowledged data counts. A replica
// with FlushedLSN > CommittedLSN has divergent/uncommitted tail that must
// NOT be treated as "already in sync."
//
// Decision matrix (matches CP13-5 gap analysis):
//   - ReplicaFlushedLSN >= CommittedLSN → zero gap, has full committed prefix
//   - RetentionStartLSN == 0 (nothing trimmed yet) or ReplicaFlushedLSN+1 >=
//     RetentionStartLSN → recoverable via WAL catch-up
//   - otherwise → gap too large, needs rebuild
func ClassifyRecoveryOutcome(result HandshakeResult) RecoveryOutcome {
if result.ReplicaFlushedLSN >= result.CommittedLSN {
return OutcomeZeroGap
}
if result.RetentionStartLSN == 0 || result.ReplicaFlushedLSN+1 >= result.RetentionStartLSN {
return OutcomeCatchUp
}
return OutcomeNeedsRebuild
}

482
sw-block/prototype/enginev2/p2_test.go

@ -0,0 +1,482 @@
package enginev2
import "testing"
// ============================================================
// Phase 04 P2: Outcome branching, assignment intent, end-to-end
// ============================================================
// --- Recovery outcome classification ---
func TestOutcome_ZeroGap(t *testing.T) {
o := ClassifyRecoveryOutcome(HandshakeResult{
ReplicaFlushedLSN: 100,
CommittedLSN: 100,
RetentionStartLSN: 50,
})
if o != OutcomeZeroGap {
t.Fatalf("got %s, want zero_gap", o)
}
}
func TestOutcome_ZeroGap_ReplicaAheadOfCommitted(t *testing.T) {
	// Replica is ahead of the committed prefix: still zero gap.
	// Its uncommitted tail beyond CommittedLSN is handled by truncation,
	// not by recovery classification.
	o := ClassifyRecoveryOutcome(HandshakeResult{
		ReplicaFlushedLSN: 120,
		CommittedLSN:      100,
		RetentionStartLSN: 50,
	})
	if o != OutcomeZeroGap {
		t.Fatalf("got %s, want zero_gap", o)
	}
}
func TestOutcome_CatchUp(t *testing.T) {
o := ClassifyRecoveryOutcome(HandshakeResult{
ReplicaFlushedLSN: 80,
CommittedLSN: 100,
RetentionStartLSN: 50,
})
if o != OutcomeCatchUp {
t.Fatalf("got %s, want catchup", o)
}
}
func TestOutcome_CatchUp_ExactBoundary(t *testing.T) {
// ReplicaFlushedLSN+1 == RetentionStartLSN → recoverable (just barely).
o := ClassifyRecoveryOutcome(HandshakeResult{
ReplicaFlushedLSN: 49,
CommittedLSN: 100,
RetentionStartLSN: 50,
})
if o != OutcomeCatchUp {
t.Fatalf("got %s, want catchup (exact boundary)", o)
}
}
func TestOutcome_NeedsRebuild(t *testing.T) {
o := ClassifyRecoveryOutcome(HandshakeResult{
ReplicaFlushedLSN: 10,
CommittedLSN: 100,
RetentionStartLSN: 50,
})
if o != OutcomeNeedsRebuild {
t.Fatalf("got %s, want needs_rebuild", o)
}
}
func TestOutcome_NeedsRebuild_OffByOne(t *testing.T) {
// ReplicaFlushedLSN+1 < RetentionStartLSN → unrecoverable.
o := ClassifyRecoveryOutcome(HandshakeResult{
ReplicaFlushedLSN: 48,
CommittedLSN: 100,
RetentionStartLSN: 50,
})
if o != OutcomeNeedsRebuild {
t.Fatalf("got %s, want needs_rebuild (off-by-one)", o)
}
}
// --- RecordHandshakeWithOutcome execution ---
func TestExec_HandshakeOutcome_ZeroGap_FastComplete(t *testing.T) {
s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", Version: 1}, 1)
sess, _ := s.AttachSession(1, SessionCatchUp)
s.BeginConnect(sess.ID)
outcome, err := s.RecordHandshakeWithOutcome(sess.ID, HandshakeResult{
ReplicaFlushedLSN: 100,
CommittedLSN: 100,
RetentionStartLSN: 50,
})
if err != nil {
t.Fatal(err)
}
if outcome != OutcomeZeroGap {
t.Fatalf("outcome=%s, want zero_gap", outcome)
}
// Zero-gap: can complete directly from handshake phase.
if !s.CompleteSessionByID(sess.ID) {
t.Fatal("zero-gap fast completion should succeed")
}
if s.State != StateInSync {
t.Fatalf("state=%s, want in_sync", s.State)
}
}
func TestExec_HandshakeOutcome_CatchUp_NormalPath(t *testing.T) {
s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", Version: 1}, 1)
sess, _ := s.AttachSession(1, SessionCatchUp)
s.BeginConnect(sess.ID)
outcome, err := s.RecordHandshakeWithOutcome(sess.ID, HandshakeResult{
ReplicaFlushedLSN: 80,
CommittedLSN: 100,
RetentionStartLSN: 50,
})
if err != nil {
t.Fatal(err)
}
if outcome != OutcomeCatchUp {
t.Fatalf("outcome=%s, want catchup", outcome)
}
// Must catch up before completing.
if s.CompleteSessionByID(sess.ID) {
t.Fatal("completion should be rejected before catch-up")
}
s.BeginCatchUp(sess.ID)
s.RecordCatchUpProgress(sess.ID, 100)
if !s.CompleteSessionByID(sess.ID) {
t.Fatal("completion should succeed after convergence")
}
}
func TestExec_HandshakeOutcome_NeedsRebuild_InvalidatesSession(t *testing.T) {
s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", Version: 1}, 1)
sess, _ := s.AttachSession(1, SessionCatchUp)
s.BeginConnect(sess.ID)
outcome, err := s.RecordHandshakeWithOutcome(sess.ID, HandshakeResult{
ReplicaFlushedLSN: 10,
CommittedLSN: 100,
RetentionStartLSN: 50,
})
if err != nil {
t.Fatal(err)
}
if outcome != OutcomeNeedsRebuild {
t.Fatalf("outcome=%s, want needs_rebuild", outcome)
}
// Session invalidated, sender at NeedsRebuild.
if sess.Active() {
t.Fatal("session should be invalidated")
}
if s.State != StateNeedsRebuild {
t.Fatalf("state=%s, want needs_rebuild", s.State)
}
if s.Session() != nil {
t.Fatal("session should be nil after NeedsRebuild")
}
}
// --- Assignment-intent orchestration ---
func TestAssignment_CreatesSessionsForTargets(t *testing.T) {
sg := NewSenderGroup()
result := sg.ApplyAssignment(AssignmentIntent{
Endpoints: map[string]Endpoint{
"r1:9333": {DataAddr: "r1:9333", Version: 1},
"r2:9333": {DataAddr: "r2:9333", Version: 1},
},
Epoch: 1,
RecoveryTargets: map[string]SessionKind{
"r1:9333": SessionCatchUp,
},
})
if len(result.Added) != 2 {
t.Fatalf("added=%d, want 2", len(result.Added))
}
if len(result.SessionsCreated) != 1 || result.SessionsCreated[0] != "r1:9333" {
t.Fatalf("sessions created=%v", result.SessionsCreated)
}
// r1 has session, r2 does not.
r1 := sg.Sender("r1:9333")
if r1.Session() == nil {
t.Fatal("r1 should have a session")
}
r2 := sg.Sender("r2:9333")
if r2.Session() != nil {
t.Fatal("r2 should not have a session")
}
}
func TestAssignment_SupersedesExistingSession(t *testing.T) {
sg := NewSenderGroup()
// First assignment with catch-up session.
sg.ApplyAssignment(AssignmentIntent{
Endpoints: map[string]Endpoint{
"r1:9333": {DataAddr: "r1:9333", Version: 1},
},
Epoch: 1,
RecoveryTargets: map[string]SessionKind{"r1:9333": SessionCatchUp},
})
oldSess := sg.Sender("r1:9333").Session()
// Second assignment with rebuild session — supersedes.
result := sg.ApplyAssignment(AssignmentIntent{
Endpoints: map[string]Endpoint{
"r1:9333": {DataAddr: "r1:9333", Version: 1},
},
Epoch: 1,
RecoveryTargets: map[string]SessionKind{"r1:9333": SessionRebuild},
})
newSess := sg.Sender("r1:9333").Session()
if oldSess.Active() {
t.Fatal("old session should be invalidated")
}
if !newSess.Active() {
t.Fatal("new session should be active")
}
if newSess.Kind != SessionRebuild {
t.Fatalf("new session kind=%s, want rebuild", newSess.Kind)
}
if len(result.SessionsSuperseded) != 1 || result.SessionsSuperseded[0] != "r1:9333" {
t.Fatalf("superseded=%v, want [r1:9333]", result.SessionsSuperseded)
}
}
func TestAssignment_FailsForUnknownReplica(t *testing.T) {
sg := NewSenderGroup()
result := sg.ApplyAssignment(AssignmentIntent{
Endpoints: map[string]Endpoint{
"r1:9333": {DataAddr: "r1:9333", Version: 1},
},
Epoch: 1,
RecoveryTargets: map[string]SessionKind{"r99:9333": SessionCatchUp},
})
if len(result.SessionsFailed) != 1 || result.SessionsFailed[0] != "r99:9333" {
t.Fatalf("sessions failed=%v, want [r99:9333]", result.SessionsFailed)
}
}
func TestAssignment_StaleEpoch_Rejected(t *testing.T) {
sg := NewSenderGroup()
// Epoch 2 assignment.
sg.ApplyAssignment(AssignmentIntent{
Endpoints: map[string]Endpoint{
"r1:9333": {DataAddr: "r1:9333", Version: 1},
},
Epoch: 2,
})
// Stale epoch 1 assignment with recovery — must be rejected.
result := sg.ApplyAssignment(AssignmentIntent{
Endpoints: map[string]Endpoint{
"r1:9333": {DataAddr: "r1:9333", Version: 1},
},
Epoch: 1,
RecoveryTargets: map[string]SessionKind{"r1:9333": SessionCatchUp},
})
if len(result.SessionsFailed) != 1 || result.SessionsFailed[0] != "r1:9333" {
t.Fatalf("stale epoch should fail: failed=%v created=%v", result.SessionsFailed, result.SessionsCreated)
}
if sg.Sender("r1:9333").Session() != nil {
t.Fatal("stale intent must not create a session")
}
}
// --- End-to-end prototype recovery flows ---
func TestE2E_CatchUpRecovery_FullFlow(t *testing.T) {
sg := NewSenderGroup()
// Step 1: Assignment creates replicas + recovery intent.
sg.ApplyAssignment(AssignmentIntent{
Endpoints: map[string]Endpoint{
"r1:9333": {DataAddr: "r1:9333", Version: 1},
"r2:9333": {DataAddr: "r2:9333", Version: 1},
},
Epoch: 1,
RecoveryTargets: map[string]SessionKind{"r1:9333": SessionCatchUp},
})
r1 := sg.Sender("r1:9333")
sess := r1.Session()
// Step 2: Execute recovery.
r1.BeginConnect(sess.ID)
outcome, _ := r1.RecordHandshakeWithOutcome(sess.ID, HandshakeResult{
ReplicaFlushedLSN: 80,
CommittedLSN: 100,
RetentionStartLSN: 50,
})
if outcome != OutcomeCatchUp {
t.Fatalf("outcome=%s", outcome)
}
r1.BeginCatchUp(sess.ID)
r1.RecordCatchUpProgress(sess.ID, 90)
r1.RecordCatchUpProgress(sess.ID, 100) // converged
// Step 3: Complete.
if !r1.CompleteSessionByID(sess.ID) {
t.Fatal("completion should succeed")
}
// Step 4: Verify final state.
if r1.State != StateInSync {
t.Fatalf("r1 state=%s, want in_sync", r1.State)
}
if r1.Session() != nil {
t.Fatal("session should be nil after completion")
}
t.Logf("e2e catch-up: assignment → connect → handshake(catchup) → progress → complete → InSync")
}
func TestE2E_NeedsRebuild_Escalation(t *testing.T) {
sg := NewSenderGroup()
// Step 1: Assignment with catch-up intent.
sg.ApplyAssignment(AssignmentIntent{
Endpoints: map[string]Endpoint{
"r1:9333": {DataAddr: "r1:9333", Version: 1},
},
Epoch: 1,
RecoveryTargets: map[string]SessionKind{"r1:9333": SessionCatchUp},
})
r1 := sg.Sender("r1:9333")
sess := r1.Session()
// Step 2: Connect + handshake → unrecoverable gap.
r1.BeginConnect(sess.ID)
outcome, _ := r1.RecordHandshakeWithOutcome(sess.ID, HandshakeResult{
ReplicaFlushedLSN: 10,
CommittedLSN: 100,
RetentionStartLSN: 50,
})
if outcome != OutcomeNeedsRebuild {
t.Fatalf("outcome=%s", outcome)
}
// Step 3: Sender is at NeedsRebuild, session dead.
if r1.State != StateNeedsRebuild {
t.Fatalf("state=%s", r1.State)
}
// Step 4: New assignment with rebuild intent.
sg.ApplyAssignment(AssignmentIntent{
Endpoints: map[string]Endpoint{
"r1:9333": {DataAddr: "r1:9333", Version: 1},
},
Epoch: 1,
RecoveryTargets: map[string]SessionKind{"r1:9333": SessionRebuild},
})
rebuildSess := r1.Session()
if rebuildSess == nil || rebuildSess.Kind != SessionRebuild {
t.Fatal("should have rebuild session")
}
// Step 5: Execute rebuild recovery (simulated).
r1.BeginConnect(rebuildSess.ID)
r1.RecordHandshake(rebuildSess.ID, 0, 100) // full rebuild range
r1.BeginCatchUp(rebuildSess.ID)
r1.RecordCatchUpProgress(rebuildSess.ID, 100)
if !r1.CompleteSessionByID(rebuildSess.ID) {
t.Fatal("rebuild completion should succeed")
}
if r1.State != StateInSync {
t.Fatalf("after rebuild: state=%s, want in_sync", r1.State)
}
t.Logf("e2e rebuild: catch-up→NeedsRebuild→rebuild assignment→recover→InSync")
}
func TestE2E_ZeroGap_FastPath(t *testing.T) {
sg := NewSenderGroup()
sg.ApplyAssignment(AssignmentIntent{
Endpoints: map[string]Endpoint{
"r1:9333": {DataAddr: "r1:9333", Version: 1},
},
Epoch: 1,
RecoveryTargets: map[string]SessionKind{"r1:9333": SessionCatchUp},
})
r1 := sg.Sender("r1:9333")
sess := r1.Session()
r1.BeginConnect(sess.ID)
outcome, _ := r1.RecordHandshakeWithOutcome(sess.ID, HandshakeResult{
ReplicaFlushedLSN: 100,
CommittedLSN: 100,
RetentionStartLSN: 50,
})
if outcome != OutcomeZeroGap {
t.Fatalf("outcome=%s", outcome)
}
// Fast path: complete directly from handshake.
if !r1.CompleteSessionByID(sess.ID) {
t.Fatal("zero-gap fast completion should succeed")
}
if r1.State != StateInSync {
t.Fatalf("state=%s, want in_sync", r1.State)
}
t.Logf("e2e zero-gap: assignment → connect → handshake(zero_gap) → complete → InSync")
}
func TestE2E_EpochBump_MidRecovery_FullCycle(t *testing.T) {
sg := NewSenderGroup()
// Epoch 1: start recovery.
sg.ApplyAssignment(AssignmentIntent{
Endpoints: map[string]Endpoint{
"r1:9333": {DataAddr: "r1:9333", Version: 1},
},
Epoch: 1,
RecoveryTargets: map[string]SessionKind{"r1:9333": SessionCatchUp},
})
r1 := sg.Sender("r1:9333")
sess1 := r1.Session()
r1.BeginConnect(sess1.ID)
// Epoch bumps mid-recovery.
sg.InvalidateEpoch(2)
// Must also update sender epoch for the new assignment.
r1.UpdateEpoch(2)
// Old session dead.
if sess1.Active() {
t.Fatal("epoch-1 session should be invalidated")
}
// Epoch 2: new assignment, new session.
sg.ApplyAssignment(AssignmentIntent{
Endpoints: map[string]Endpoint{
"r1:9333": {DataAddr: "r1:9333", Version: 1},
},
Epoch: 2,
RecoveryTargets: map[string]SessionKind{"r1:9333": SessionCatchUp},
})
sess2 := r1.Session()
if sess2 == nil || sess2.Epoch != 2 {
t.Fatal("should have new session at epoch 2")
}
// Complete at epoch 2.
r1.BeginConnect(sess2.ID)
r1.RecordHandshakeWithOutcome(sess2.ID, HandshakeResult{
ReplicaFlushedLSN: 100,
CommittedLSN: 100,
RetentionStartLSN: 50,
})
r1.CompleteSessionByID(sess2.ID)
if r1.State != StateInSync {
t.Fatalf("state=%s", r1.State)
}
t.Logf("e2e epoch bump: epoch1 recovery → bump → epoch2 recovery → InSync")
}

347
sw-block/prototype/enginev2/sender.go

@@ -0,0 +1,347 @@
// Package enginev2 implements V2 per-replica sender/session ownership.
//
// Each replica has exactly one Sender that owns its identity (canonical address)
// and at most one active RecoverySession per epoch. The Sender survives topology
// changes; the session does not survive epoch bumps.
package enginev2
import (
"fmt"
"sync"
)
// ReplicaState tracks the per-replica replication state machine.
type ReplicaState string
const (
StateDisconnected ReplicaState = "disconnected"
StateConnecting ReplicaState = "connecting"
StateCatchingUp ReplicaState = "catching_up"
StateInSync ReplicaState = "in_sync"
StateDegraded ReplicaState = "degraded"
StateNeedsRebuild ReplicaState = "needs_rebuild"
)
// Endpoint represents a replica's network identity.
type Endpoint struct {
DataAddr string
CtrlAddr string
Version uint64 // bumped on address change
}
// Sender owns the replication channel to one replica. It is identified
// by ReplicaID (canonical data address at creation time) and survives
// topology changes as long as the replica stays in the set.
//
// A Sender holds at most one active RecoverySession. Normal in-sync
// operation does not require a session — Ship/Barrier work directly.
type Sender struct {
mu sync.Mutex
ReplicaID string // canonical identity — stable across reconnects
Endpoint Endpoint // current network address (may change via UpdateEndpoint)
Epoch uint64 // current epoch
State ReplicaState
session *RecoverySession // nil when in-sync or disconnected without recovery
stopped bool
}
// NewSender creates a sender for a replica at the given endpoint and epoch.
func NewSender(replicaID string, endpoint Endpoint, epoch uint64) *Sender {
return &Sender{
ReplicaID: replicaID,
Endpoint: endpoint,
Epoch: epoch,
State: StateDisconnected,
}
}
// UpdateEpoch updates the sender's epoch. If a recovery session is active
// at a stale epoch, it is invalidated.
func (s *Sender) UpdateEpoch(epoch uint64) {
s.mu.Lock()
defer s.mu.Unlock()
if s.stopped || epoch <= s.Epoch {
return
}
oldEpoch := s.Epoch
s.Epoch = epoch
if s.session != nil && s.session.Epoch < epoch {
s.session.invalidate(fmt.Sprintf("epoch_advanced_%d_to_%d", oldEpoch, epoch))
s.session = nil
s.State = StateDisconnected
}
}
// UpdateEndpoint updates the sender's target address after a control-plane
// assignment refresh. If a recovery session is active and the address changed,
// the session is invalidated (the new address needs a fresh session).
func (s *Sender) UpdateEndpoint(ep Endpoint) {
s.mu.Lock()
defer s.mu.Unlock()
if s.stopped {
return
}
addrChanged := s.Endpoint.DataAddr != ep.DataAddr || s.Endpoint.CtrlAddr != ep.CtrlAddr || s.Endpoint.Version != ep.Version
s.Endpoint = ep
if addrChanged && s.session != nil {
s.session.invalidate("endpoint_changed")
s.session = nil
s.State = StateDisconnected
}
}
// AttachSession creates and attaches a new recovery session for this sender.
// The session epoch must match the sender's current epoch — stale or future
// epoch sessions are rejected. Returns an error if a session is already active,
// the sender is stopped, or the epoch doesn't match.
func (s *Sender) AttachSession(epoch uint64, kind SessionKind) (*RecoverySession, error) {
s.mu.Lock()
defer s.mu.Unlock()
if s.stopped {
return nil, fmt.Errorf("sender stopped")
}
if epoch != s.Epoch {
return nil, fmt.Errorf("epoch mismatch: sender=%d session=%d", s.Epoch, epoch)
}
if s.session != nil && s.session.Active() {
return nil, fmt.Errorf("session already active (epoch=%d kind=%s)", s.session.Epoch, s.session.Kind)
}
sess := newRecoverySession(s.ReplicaID, epoch, kind)
s.session = sess
// Ownership established but execution not started.
// BeginConnect() is the first execution-state transition.
return sess, nil
}
// SupersedeSession invalidates the current session (if any) and attaches
// a new one at the sender's current epoch. Used when an assignment change
// requires a fresh recovery path. The old session is invalidated with the
// given reason. Always uses s.Epoch — does not accept an epoch parameter
// to prevent epoch coherence drift.
//
// Establishes ownership only — does not mutate sender state.
// BeginConnect() starts execution.
func (s *Sender) SupersedeSession(kind SessionKind, reason string) *RecoverySession {
s.mu.Lock()
defer s.mu.Unlock()
if s.stopped {
return nil
}
if s.session != nil {
s.session.invalidate(reason)
}
sess := newRecoverySession(s.ReplicaID, s.Epoch, kind)
s.session = sess
return sess
}
// Session returns the current recovery session, or nil if none.
func (s *Sender) Session() *RecoverySession {
s.mu.Lock()
defer s.mu.Unlock()
return s.session
}
// CompleteSessionByID marks the session as completed and transitions the
// sender to InSync. Requires:
// - sessionID matches the current active session
// - session is in PhaseCatchUp and has Converged (normal path)
// - OR session is in PhaseHandshake and gap is zero (fast path: already in sync)
//
// Returns false if any check fails (stale ID, wrong phase, not converged).
func (s *Sender) CompleteSessionByID(sessionID uint64) bool {
s.mu.Lock()
defer s.mu.Unlock()
if err := s.checkSessionAuthority(sessionID); err != nil {
return false
}
sess := s.session
switch sess.Phase {
case PhaseCatchUp:
if !sess.Converged() {
return false // not converged yet
}
case PhaseHandshake:
if sess.TargetLSN != sess.StartLSN {
return false // has a gap — must catch up first
}
// Zero-gap fast path: handshake showed replica already at target.
default:
return false // not at a completion-ready phase
}
sess.complete()
s.session = nil
s.State = StateInSync
return true
}
// === Execution APIs — sender-owned authority gate ===
//
// All execution APIs validate the sessionID against the current active session.
// This prevents stale results from old/superseded sessions from mutating state.
// The sender is the authority boundary, not the session object.
// BeginConnect transitions the session from init to connecting.
// Mutates: session.Phase → PhaseConnecting. Sender.State → StateConnecting.
// Rejects: wrong sessionID, stopped sender, session not in PhaseInit.
func (s *Sender) BeginConnect(sessionID uint64) error {
s.mu.Lock()
defer s.mu.Unlock()
if err := s.checkSessionAuthority(sessionID); err != nil {
return err
}
if !s.session.Advance(PhaseConnecting) {
return fmt.Errorf("cannot begin connect: session phase=%s", s.session.Phase)
}
s.State = StateConnecting
return nil
}
// RecordHandshake records a successful handshake result and sets the catch-up range.
// Mutates: session.Phase → PhaseHandshake, session.StartLSN/TargetLSN.
// Rejects: wrong sessionID, wrong phase, invalid range.
func (s *Sender) RecordHandshake(sessionID uint64, startLSN, targetLSN uint64) error {
s.mu.Lock()
defer s.mu.Unlock()
if err := s.checkSessionAuthority(sessionID); err != nil {
return err
}
if targetLSN < startLSN {
return fmt.Errorf("invalid handshake range: target=%d < start=%d", targetLSN, startLSN)
}
if !s.session.Advance(PhaseHandshake) {
return fmt.Errorf("cannot record handshake: session phase=%s", s.session.Phase)
}
s.session.SetRange(startLSN, targetLSN)
return nil
}
// RecordHandshakeWithOutcome records the handshake AND classifies the recovery
// outcome. This is the preferred handshake API — it determines the recovery
// path in one step:
// - OutcomeZeroGap: sets zero range, ready for fast completion
// - OutcomeCatchUp: sets catch-up range, ready for BeginCatchUp
// - OutcomeNeedsRebuild: invalidates session, transitions sender to NeedsRebuild
//
// Returns the outcome. On NeedsRebuild, the session is dead and the caller
// should not attempt further execution.
func (s *Sender) RecordHandshakeWithOutcome(sessionID uint64, result HandshakeResult) (RecoveryOutcome, error) {
outcome := ClassifyRecoveryOutcome(result)
s.mu.Lock()
defer s.mu.Unlock()
if err := s.checkSessionAuthority(sessionID); err != nil {
return outcome, err
}
// Must be in PhaseConnecting — require valid execution entry point.
if s.session.Phase != PhaseConnecting {
return outcome, fmt.Errorf("handshake requires PhaseConnecting, got %s", s.session.Phase)
}
if outcome == OutcomeNeedsRebuild {
s.session.invalidate("gap_exceeds_retention")
s.session = nil
s.State = StateNeedsRebuild
return outcome, nil
}
if !s.session.Advance(PhaseHandshake) {
return outcome, fmt.Errorf("cannot record handshake: session phase=%s", s.session.Phase)
}
switch outcome {
case OutcomeZeroGap:
s.session.SetRange(result.ReplicaFlushedLSN, result.ReplicaFlushedLSN)
case OutcomeCatchUp:
s.session.SetRange(result.ReplicaFlushedLSN, result.CommittedLSN)
}
return outcome, nil
}
// BeginCatchUp transitions the session from handshake to catch-up phase.
// Mutates: session.Phase → PhaseCatchUp. Sender.State → StateCatchingUp.
// Rejects: wrong sessionID, wrong phase.
func (s *Sender) BeginCatchUp(sessionID uint64) error {
s.mu.Lock()
defer s.mu.Unlock()
if err := s.checkSessionAuthority(sessionID); err != nil {
return err
}
if !s.session.Advance(PhaseCatchUp) {
return fmt.Errorf("cannot begin catch-up: session phase=%s", s.session.Phase)
}
s.State = StateCatchingUp
return nil
}
// RecordCatchUpProgress records catch-up progress (highest LSN recovered).
// Mutates: session.RecoveredTo (monotonic only).
// Rejects: wrong sessionID, wrong phase, progress regression, invalidated session.
func (s *Sender) RecordCatchUpProgress(sessionID uint64, recoveredTo uint64) error {
s.mu.Lock()
defer s.mu.Unlock()
if err := s.checkSessionAuthority(sessionID); err != nil {
return err
}
if s.session.Phase != PhaseCatchUp {
return fmt.Errorf("cannot record progress: session phase=%s, want catchup", s.session.Phase)
}
if recoveredTo <= s.session.RecoveredTo {
return fmt.Errorf("progress regression: current=%d proposed=%d", s.session.RecoveredTo, recoveredTo)
}
s.session.UpdateProgress(recoveredTo)
return nil
}
// checkSessionAuthority validates that the sender has an active session
// matching the given ID. Must be called with s.mu held.
func (s *Sender) checkSessionAuthority(sessionID uint64) error {
if s.stopped {
return fmt.Errorf("sender stopped")
}
if s.session == nil {
return fmt.Errorf("no active session")
}
if s.session.ID != sessionID {
return fmt.Errorf("session ID mismatch: active=%d requested=%d", s.session.ID, sessionID)
}
if !s.session.Active() {
return fmt.Errorf("session %d is no longer active (phase=%s)", sessionID, s.session.Phase)
}
return nil
}
// InvalidateSession invalidates the current session with a reason.
// Transitions the sender to the given target state.
func (s *Sender) InvalidateSession(reason string, targetState ReplicaState) {
s.mu.Lock()
defer s.mu.Unlock()
if s.session != nil {
s.session.invalidate(reason)
s.session = nil
}
s.State = targetState
}
// Stop shuts down the sender and any active session.
func (s *Sender) Stop() {
s.mu.Lock()
defer s.mu.Unlock()
if s.stopped {
return
}
s.stopped = true
if s.session != nil {
s.session.invalidate("sender_stopped")
s.session = nil
}
}
// Stopped returns true if the sender has been stopped.
func (s *Sender) Stopped() bool {
s.mu.Lock()
defer s.mu.Unlock()
return s.stopped
}

119
sw-block/prototype/enginev2/sender_group.go

@@ -0,0 +1,119 @@
package enginev2
import (
"sort"
"sync"
)
// SenderGroup manages per-replica Senders with identity-preserving reconciliation.
// It is the V2 equivalent of ShipperGroup.
type SenderGroup struct {
mu sync.RWMutex
senders map[string]*Sender // keyed by ReplicaID
}
// NewSenderGroup creates an empty SenderGroup.
func NewSenderGroup() *SenderGroup {
return &SenderGroup{
senders: map[string]*Sender{},
}
}
// Reconcile diffs the current sender set against newEndpoints.
// Matching senders (same ReplicaID) are preserved with all state.
// Removed senders are stopped. New senders are created at the given epoch.
// Returns lists of added and removed ReplicaIDs.
func (sg *SenderGroup) Reconcile(newEndpoints map[string]Endpoint, epoch uint64) (added, removed []string) {
sg.mu.Lock()
defer sg.mu.Unlock()
// Stop and remove senders not in the new set.
for id, s := range sg.senders {
if _, keep := newEndpoints[id]; !keep {
s.Stop()
delete(sg.senders, id)
removed = append(removed, id)
}
}
// Add new senders; update endpoints and epoch for existing.
for id, ep := range newEndpoints {
if existing, ok := sg.senders[id]; ok {
existing.UpdateEndpoint(ep)
existing.UpdateEpoch(epoch)
} else {
sg.senders[id] = NewSender(id, ep, epoch)
added = append(added, id)
}
}
sort.Strings(added)
sort.Strings(removed)
return added, removed
}
// Sender returns the sender for a ReplicaID, or nil.
func (sg *SenderGroup) Sender(replicaID string) *Sender {
sg.mu.RLock()
defer sg.mu.RUnlock()
return sg.senders[replicaID]
}
// All returns all senders in deterministic order (sorted by ReplicaID).
func (sg *SenderGroup) All() []*Sender {
sg.mu.RLock()
defer sg.mu.RUnlock()
out := make([]*Sender, 0, len(sg.senders))
for _, s := range sg.senders {
out = append(out, s)
}
sort.Slice(out, func(i, j int) bool {
return out[i].ReplicaID < out[j].ReplicaID
})
return out
}
// Len returns the number of senders.
func (sg *SenderGroup) Len() int {
sg.mu.RLock()
defer sg.mu.RUnlock()
return len(sg.senders)
}
// StopAll stops all senders.
func (sg *SenderGroup) StopAll() {
sg.mu.Lock()
defer sg.mu.Unlock()
for _, s := range sg.senders {
s.Stop()
}
}
// InSyncCount returns the number of senders in StateInSync.
func (sg *SenderGroup) InSyncCount() int {
sg.mu.RLock()
defer sg.mu.RUnlock()
count := 0
for _, s := range sg.senders {
if s.State == StateInSync {
count++
}
}
return count
}
// InvalidateEpoch invalidates all active sessions that are bound to
// a stale epoch. Called after promotion/epoch bump.
func (sg *SenderGroup) InvalidateEpoch(currentEpoch uint64) int {
sg.mu.RLock()
defer sg.mu.RUnlock()
count := 0
for _, s := range sg.senders {
sess := s.Session()
if sess != nil && sess.Epoch < currentEpoch && sess.Active() {
s.InvalidateSession("epoch_bump", StateDisconnected)
count++
}
}
return count
}

203
sw-block/prototype/enginev2/sender_group_test.go

@@ -0,0 +1,203 @@
package enginev2
import "testing"
// === SenderGroup reconciliation ===
func TestSenderGroup_Reconcile_AddNew(t *testing.T) {
sg := NewSenderGroup()
eps := map[string]Endpoint{
"r1:9333": {DataAddr: "r1:9333", Version: 1},
"r2:9333": {DataAddr: "r2:9333", Version: 1},
}
added, removed := sg.Reconcile(eps, 1)
if len(added) != 2 || len(removed) != 0 {
t.Fatalf("added=%v removed=%v", added, removed)
}
if sg.Len() != 2 {
t.Fatalf("len: got %d, want 2", sg.Len())
}
}
func TestSenderGroup_Reconcile_RemoveStale(t *testing.T) {
sg := NewSenderGroup()
sg.Reconcile(map[string]Endpoint{
"r1:9333": {DataAddr: "r1:9333", Version: 1},
"r2:9333": {DataAddr: "r2:9333", Version: 1},
}, 1)
// Remove r2, keep r1.
_, removed := sg.Reconcile(map[string]Endpoint{
"r1:9333": {DataAddr: "r1:9333", Version: 1},
}, 1)
if len(removed) != 1 || removed[0] != "r2:9333" {
t.Fatalf("removed=%v, want [r2:9333]", removed)
}
if sg.Sender("r2:9333") != nil {
t.Fatal("r2 should be removed")
}
if sg.Sender("r1:9333") == nil {
t.Fatal("r1 should be preserved")
}
}
func TestSenderGroup_Reconcile_PreservesState(t *testing.T) {
sg := NewSenderGroup()
sg.Reconcile(map[string]Endpoint{
"r1:9333": {DataAddr: "r1:9333", Version: 1},
}, 1)
// Attach session and advance.
s := sg.Sender("r1:9333")
sess, _ := s.AttachSession(1, SessionCatchUp)
sess.SetRange(0, 100)
sess.UpdateProgress(50)
// Reconcile with same address — sender preserved.
sg.Reconcile(map[string]Endpoint{
"r1:9333": {DataAddr: "r1:9333", Version: 1},
}, 1)
s2 := sg.Sender("r1:9333")
if s2 != s {
t.Fatal("reconcile should preserve the same sender object")
}
if s2.Session() != sess {
t.Fatal("reconcile should preserve the session")
}
if !sess.Active() {
t.Fatal("session should still be active after same-address reconcile")
}
}
func TestSenderGroup_Reconcile_MixedUpdate(t *testing.T) {
sg := NewSenderGroup()
sg.Reconcile(map[string]Endpoint{
"r1:9333": {DataAddr: "r1:9333", Version: 1},
"r2:9333": {DataAddr: "r2:9333", Version: 1},
}, 1)
// Keep r1, remove r2, add r3.
added, removed := sg.Reconcile(map[string]Endpoint{
"r1:9333": {DataAddr: "r1:9333", Version: 1},
"r3:9333": {DataAddr: "r3:9333", Version: 1},
}, 1)
if len(added) != 1 || added[0] != "r3:9333" {
t.Fatalf("added=%v, want [r3:9333]", added)
}
if len(removed) != 1 || removed[0] != "r2:9333" {
t.Fatalf("removed=%v, want [r2:9333]", removed)
}
if sg.Len() != 2 {
t.Fatalf("len=%d, want 2", sg.Len())
}
}
func TestSenderGroup_Reconcile_EndpointChange_InvalidatesSession(t *testing.T) {
sg := NewSenderGroup()
sg.Reconcile(map[string]Endpoint{
"r1:9333": {DataAddr: "r1:9333", Version: 1},
}, 1)
s := sg.Sender("r1:9333")
sess, _ := s.AttachSession(1, SessionCatchUp)
// Same ReplicaID but new endpoint version.
sg.Reconcile(map[string]Endpoint{
"r1:9333": {DataAddr: "r1:9333", Version: 2},
}, 1)
if sess.Active() {
t.Fatal("endpoint version change should invalidate session")
}
if s.Session() != nil {
t.Fatal("session should be nil after endpoint change")
}
}
// === Epoch invalidation ===
func TestSenderGroup_InvalidateEpoch(t *testing.T) {
sg := NewSenderGroup()
sg.Reconcile(map[string]Endpoint{
"r1:9333": {DataAddr: "r1:9333", Version: 1},
"r2:9333": {DataAddr: "r2:9333", Version: 1},
}, 1)
// Both have sessions at epoch 1.
s1 := sg.Sender("r1:9333")
s2 := sg.Sender("r2:9333")
sess1, _ := s1.AttachSession(1, SessionCatchUp)
sess2, _ := s2.AttachSession(1, SessionCatchUp)
// Epoch bumps to 2. Both sessions stale.
count := sg.InvalidateEpoch(2)
if count != 2 {
t.Fatalf("should invalidate 2 sessions, got %d", count)
}
if sess1.Active() || sess2.Active() {
t.Fatal("both sessions should be invalidated")
}
if s1.State != StateDisconnected || s2.State != StateDisconnected {
t.Fatal("senders should be disconnected after epoch invalidation")
}
}
func TestSenderGroup_InvalidateEpoch_SkipsCurrentEpoch(t *testing.T) {
sg := NewSenderGroup()
sg.Reconcile(map[string]Endpoint{
"r1:9333": {DataAddr: "r1:9333", Version: 1},
}, 2)
s := sg.Sender("r1:9333")
sess, _ := s.AttachSession(2, SessionCatchUp) // epoch 2 session
// Invalidate epoch 2 — session AT epoch 2 should NOT be invalidated.
count := sg.InvalidateEpoch(2)
if count != 0 {
t.Fatalf("should not invalidate current-epoch session, got %d", count)
}
if !sess.Active() {
t.Fatal("current-epoch session should remain active")
}
}
func TestSenderGroup_StopAll(t *testing.T) {
sg := NewSenderGroup()
sg.Reconcile(map[string]Endpoint{
"r1:9333": {DataAddr: "r1:9333", Version: 1},
"r2:9333": {DataAddr: "r2:9333", Version: 1},
}, 1)
sg.StopAll()
for _, s := range sg.All() {
if !s.Stopped() {
t.Fatalf("%s should be stopped", s.ReplicaID)
}
}
}
func TestSenderGroup_All_DeterministicOrder(t *testing.T) {
sg := NewSenderGroup()
sg.Reconcile(map[string]Endpoint{
"r3:9333": {DataAddr: "r3:9333", Version: 1},
"r1:9333": {DataAddr: "r1:9333", Version: 1},
"r2:9333": {DataAddr: "r2:9333", Version: 1},
}, 1)
all := sg.All()
if len(all) != 3 {
t.Fatalf("len=%d, want 3", len(all))
}
expected := []string{"r1:9333", "r2:9333", "r3:9333"}
for i, exp := range expected {
if all[i].ReplicaID != exp {
t.Fatalf("all[%d]=%s, want %s", i, all[i].ReplicaID, exp)
}
}
}

407
sw-block/prototype/enginev2/sender_test.go

@@ -0,0 +1,407 @@
package enginev2
import "testing"
// === Sender lifecycle ===
func TestSender_NewSender_Disconnected(t *testing.T) {
s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", CtrlAddr: "r1:9334", Version: 1}, 1)
if s.State != StateDisconnected {
t.Fatalf("new sender should be Disconnected, got %s", s.State)
}
if s.Session() != nil {
t.Fatal("new sender should have no session")
}
}
func TestSender_AttachSession_Success(t *testing.T) {
s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", Version: 1}, 1)
sess, err := s.AttachSession(1, SessionCatchUp)
if err != nil {
t.Fatal(err)
}
if sess.Kind != SessionCatchUp {
t.Fatalf("session kind: got %s, want catchup", sess.Kind)
}
if sess.Epoch != 1 {
t.Fatalf("session epoch: got %d, want 1", sess.Epoch)
}
if !sess.Active() {
t.Fatal("session should be active")
}
// AttachSession is ownership-only — sender stays Disconnected until BeginConnect.
if s.State != StateDisconnected {
t.Fatalf("sender state after attach: got %s, want disconnected (ownership-only)", s.State)
}
}
func TestSender_AttachSession_RejectsDouble(t *testing.T) {
s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", Version: 1}, 1)
_, err := s.AttachSession(1, SessionCatchUp)
if err != nil {
t.Fatal(err)
}
_, err = s.AttachSession(1, SessionBootstrap)
if err == nil {
t.Fatal("should reject second attach while session active")
}
}
func TestSender_CompleteSession_TransitionsInSync(t *testing.T) {
s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", Version: 1}, 1)
sess, _ := s.AttachSession(1, SessionCatchUp)
// Must execute full lifecycle before completing.
s.BeginConnect(sess.ID)
s.RecordHandshake(sess.ID, 5, 10)
s.BeginCatchUp(sess.ID)
s.RecordCatchUpProgress(sess.ID, 10) // converged
if !s.CompleteSessionByID(sess.ID) {
t.Fatal("completion should succeed when converged")
}
if s.State != StateInSync {
t.Fatalf("after complete: got %s, want in_sync", s.State)
}
if s.Session() != nil {
t.Fatal("session should be nil after complete")
}
if sess.Active() {
t.Fatal("completed session should not be active")
}
}
func TestSender_SupersedeSession(t *testing.T) {
s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", Version: 1}, 1)
old, _ := s.AttachSession(1, SessionCatchUp)
s.UpdateEpoch(2) // epoch bumps — old session invalidated by UpdateEpoch
fresh := s.SupersedeSession(SessionReassign, "explicit_supersede")
if old.Active() {
t.Fatal("old session should be invalidated")
}
// Invalidated by UpdateEpoch, not by SupersedeSession (already dead).
if old.InvalidateReason == "" {
t.Fatal("old session should have invalidation reason")
}
if !fresh.Active() {
t.Fatal("new session should be active")
}
if fresh.Epoch != 2 {
t.Fatalf("new session epoch: got %d, want 2", fresh.Epoch)
}
}
func TestSender_UpdateEndpoint_InvalidatesSession(t *testing.T) {
s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", Version: 1}, 1)
sess, _ := s.AttachSession(1, SessionCatchUp)
s.UpdateEndpoint(Endpoint{DataAddr: "r1:9444", Version: 2})
if sess.Active() {
t.Fatal("session should be invalidated after endpoint change")
}
if sess.InvalidateReason != "endpoint_changed" {
t.Fatalf("invalidation reason: got %q", sess.InvalidateReason)
}
if s.State != StateDisconnected {
t.Fatalf("sender should be disconnected after endpoint change, got %s", s.State)
}
if s.Session() != nil {
t.Fatal("session should be nil after endpoint change")
}
}
func TestSender_UpdateEndpoint_SameAddr_PreservesSession(t *testing.T) {
s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", Version: 1}, 1)
sess, _ := s.AttachSession(1, SessionCatchUp)
s.UpdateEndpoint(Endpoint{DataAddr: "r1:9333", Version: 1})
if !sess.Active() {
t.Fatal("same-address update should preserve session")
}
}
func TestSender_UpdateEndpoint_CtrlAddrOnly_InvalidatesSession(t *testing.T) {
s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", CtrlAddr: "r1:9334", Version: 1}, 1)
sess, _ := s.AttachSession(1, SessionCatchUp)
s.UpdateEndpoint(Endpoint{DataAddr: "r1:9333", CtrlAddr: "r1:9444", Version: 1})
if sess.Active() {
t.Fatal("CtrlAddr-only change should invalidate session")
}
if s.State != StateDisconnected {
t.Fatalf("sender should be disconnected, got %s", s.State)
}
}
func TestSender_Stop_InvalidatesSession(t *testing.T) {
s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", Version: 1}, 1)
sess, _ := s.AttachSession(1, SessionCatchUp)
s.Stop()
if sess.Active() {
t.Fatal("session should be invalidated after stop")
}
if !s.Stopped() {
t.Fatal("sender should be stopped")
}
// Attach after stop fails.
_, err := s.AttachSession(1, SessionBootstrap)
if err == nil {
t.Fatal("attach after stop should fail")
}
}
func TestSender_InvalidateSession_TargetState(t *testing.T) {
s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", Version: 1}, 1)
sess, _ := s.AttachSession(1, SessionCatchUp)
s.InvalidateSession("timeout", StateNeedsRebuild)
if sess.Active() {
t.Fatal("session should be invalidated")
}
if s.State != StateNeedsRebuild {
t.Fatalf("sender state: got %s, want needs_rebuild", s.State)
}
}
// === Session lifecycle ===
func TestSession_Advance_ValidTransitions(t *testing.T) {
sess := newRecoverySession("r1", 1, SessionCatchUp)
if !sess.Advance(PhaseConnecting) {
t.Fatal("init → connecting should succeed")
}
if !sess.Advance(PhaseHandshake) {
t.Fatal("connecting → handshake should succeed")
}
if !sess.Advance(PhaseCatchUp) {
t.Fatal("handshake → catchup should succeed")
}
if !sess.Advance(PhaseCompleted) {
t.Fatal("catchup → completed should succeed")
}
}
func TestSession_Advance_RejectsInvalidJump(t *testing.T) {
sess := newRecoverySession("r1", 1, SessionCatchUp)
// init → catchup is not valid (must go through connecting, handshake)
if sess.Advance(PhaseCatchUp) {
t.Fatal("init → catchup should be rejected")
}
// init → completed is not valid
if sess.Advance(PhaseCompleted) {
t.Fatal("init → completed should be rejected")
}
}
func TestSession_Advance_StopsOnInvalidate(t *testing.T) {
sess := newRecoverySession("r1", 1, SessionCatchUp)
sess.Advance(PhaseConnecting)
sess.Advance(PhaseHandshake)
sess.invalidate("test")
if sess.Advance(PhaseCatchUp) {
t.Fatal("advance after invalidate should fail")
}
}
func TestSender_AttachSession_RejectsEpochMismatch(t *testing.T) {
s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", Version: 1}, 1)
_, err := s.AttachSession(2, SessionCatchUp)
if err == nil {
t.Fatal("should reject session at epoch 2 when sender is at epoch 1")
}
}
func TestSender_UpdateEpoch_InvalidatesStaleSession(t *testing.T) {
s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", Version: 1}, 1)
sess, _ := s.AttachSession(1, SessionCatchUp)
s.UpdateEpoch(2)
if sess.Active() {
t.Fatal("session at epoch 1 should be invalidated after UpdateEpoch(2)")
}
if s.Epoch != 2 {
t.Fatalf("sender epoch should be 2, got %d", s.Epoch)
}
if s.State != StateDisconnected {
t.Fatalf("sender should be disconnected after epoch bump, got %s", s.State)
}
// Can now attach at epoch 2.
sess2, err := s.AttachSession(2, SessionCatchUp)
if err != nil {
t.Fatalf("attach at new epoch should succeed: %v", err)
}
if sess2.Epoch != 2 {
t.Fatalf("new session epoch: got %d, want 2", sess2.Epoch)
}
}
func TestSession_Progress_StopsOnComplete(t *testing.T) {
sess := newRecoverySession("r1", 1, SessionCatchUp)
sess.SetRange(0, 100)
sess.UpdateProgress(50)
if sess.Converged() {
t.Fatal("should not converge at 50/100")
}
sess.complete()
if sess.UpdateProgress(100) {
t.Fatal("update after complete should return false")
}
}
func TestSession_Converged(t *testing.T) {
sess := newRecoverySession("r1", 1, SessionCatchUp)
sess.SetRange(0, 10)
sess.UpdateProgress(9)
if sess.Converged() {
t.Fatal("9 < 10: not converged")
}
sess.UpdateProgress(10)
if !sess.Converged() {
t.Fatal("10 >= 10: should be converged")
}
}
// === Bridge tests: ownership invariants matching distsim scenarios ===
func TestBridge_StaleCompletion_AfterSupersede_HasNoEffect(t *testing.T) {
// Matches distsim TestP04a_StaleCompletion_AfterSupersede_Rejected.
s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", Version: 1}, 1)
// First session.
sess1, _ := s.AttachSession(1, SessionCatchUp)
sess1.Advance(PhaseConnecting)
sess1.Advance(PhaseHandshake)
sess1.Advance(PhaseCatchUp)
// Supersede with new session.
s.UpdateEpoch(2)
sess2, _ := s.AttachSession(2, SessionCatchUp)
// Old session: advance/complete has no effect (already invalidated).
if sess1.Advance(PhaseCompleted) {
t.Fatal("stale session should not advance to completed")
}
if sess1.Active() {
t.Fatal("old session should be inactive")
}
// New session: still active and owns the sender.
if !sess2.Active() {
t.Fatal("new session should be active")
}
if s.Session() != sess2 {
t.Fatal("sender should own the new session")
}
// Stale completion by OLD session ID — REJECTED by identity check.
if s.CompleteSessionByID(sess1.ID) {
t.Fatal("stale completion with old session ID must be rejected")
}
// Sender must NOT have moved to InSync.
if s.State == StateInSync {
t.Fatal("sender must not be InSync after stale completion")
}
// New session must still be active.
if !sess2.Active() {
t.Fatal("new session must still be active after stale completion rejected")
}
// Correct completion by NEW session ID — requires full execution path.
s.BeginConnect(sess2.ID)
s.RecordHandshake(sess2.ID, 0, 10)
s.BeginCatchUp(sess2.ID)
s.RecordCatchUpProgress(sess2.ID, 10)
if !s.CompleteSessionByID(sess2.ID) {
t.Fatal("completion with correct session ID should succeed after convergence")
}
if s.State != StateInSync {
t.Fatalf("sender should be InSync after correct completion, got %s", s.State)
}
}
func TestBridge_EpochBump_RejectedCompletion(t *testing.T) {
// Matches distsim TestP04a_EpochBumpDuringCatchup_InvalidatesSession.
s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", Version: 1}, 1)
sess, _ := s.AttachSession(1, SessionCatchUp)
sess.Advance(PhaseConnecting)
// Epoch bumps — session invalidated.
s.UpdateEpoch(2)
// Attempting to advance the old session fails.
if sess.Advance(PhaseHandshake) {
t.Fatal("stale session should not advance after epoch bump")
}
// Attempting to attach at old epoch fails.
_, err := s.AttachSession(1, SessionCatchUp)
if err == nil {
t.Fatal("attach at stale epoch should fail")
}
// Attach at new epoch succeeds.
sess2, err := s.AttachSession(2, SessionCatchUp)
if err != nil {
t.Fatalf("attach at new epoch should succeed: %v", err)
}
if sess2.Epoch != 2 {
t.Fatalf("new session epoch=%d, want 2", sess2.Epoch)
}
}
func TestBridge_EndpointChange_InvalidatesAndAllowsNewSession(t *testing.T) {
// Matches distsim TestP04a_EndpointChangeDuringCatchup_InvalidatesSession.
s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", Version: 1}, 1)
sess, _ := s.AttachSession(1, SessionCatchUp)
// Endpoint changes.
s.UpdateEndpoint(Endpoint{DataAddr: "r1:9444", Version: 2})
// Old session dead.
if sess.Active() {
t.Fatal("session should be invalidated")
}
// New session can be attached (same epoch, new endpoint).
sess2, err := s.AttachSession(1, SessionCatchUp)
if err != nil {
t.Fatalf("new session after endpoint change: %v", err)
}
if !sess2.Active() {
t.Fatal("new session should be active")
}
}
func TestSession_DoubleInvalidate_Safe(t *testing.T) {
sess := newRecoverySession("r1", 1, SessionCatchUp)
sess.invalidate("first")
sess.invalidate("second") // should not panic or change reason
if sess.InvalidateReason != "first" {
t.Fatalf("reason should be first, got %q", sess.InvalidateReason)
}
}

151
sw-block/prototype/enginev2/session.go

@@ -0,0 +1,151 @@
package enginev2
import (
"sync"
"sync/atomic"
)
// SessionKind identifies how the recovery session was created.
type SessionKind string
const (
SessionBootstrap SessionKind = "bootstrap" // fresh replica, no prior state
SessionCatchUp SessionKind = "catchup" // WAL gap recovery
SessionRebuild SessionKind = "rebuild" // full extent + WAL rebuild
SessionReassign SessionKind = "reassign" // address change recovery
)
// SessionPhase tracks progress within a recovery session.
type SessionPhase string
const (
PhaseInit SessionPhase = "init"
PhaseConnecting SessionPhase = "connecting"
PhaseHandshake SessionPhase = "handshake"
PhaseCatchUp SessionPhase = "catchup"
PhaseCompleted SessionPhase = "completed"
PhaseInvalidated SessionPhase = "invalidated"
)
// sessionIDCounter generates unique session IDs across all senders.
var sessionIDCounter atomic.Uint64
// RecoverySession represents one recovery attempt for a specific replica
// at a specific epoch. It is owned by a Sender and has exclusive authority
// to transition the replica through connecting → handshake → catchup → complete.
//
// Each session has a unique ID. Stale completions are rejected by ID, not
// by pointer comparison. This prevents old sessions from mutating state
// even if they retain a reference to the sender.
//
// Lifecycle rules:
// - At most one active session per Sender
// - Session is bound to an epoch; epoch bump invalidates it
// - Session is bound to an endpoint; address change invalidates it
// - Completed sessions release ownership back to the Sender
// - Invalidated sessions are dead and cannot be reused
type RecoverySession struct {
mu sync.Mutex
ID uint64 // unique, monotonic, never reused
ReplicaID string
Epoch uint64
Kind SessionKind
Phase SessionPhase
InvalidateReason string // non-empty when invalidated
// Progress tracking.
StartLSN uint64 // gap start (exclusive)
TargetLSN uint64 // gap end (inclusive)
RecoveredTo uint64 // highest LSN recovered so far
}
func newRecoverySession(replicaID string, epoch uint64, kind SessionKind) *RecoverySession {
return &RecoverySession{
ID: sessionIDCounter.Add(1),
ReplicaID: replicaID,
Epoch: epoch,
Kind: kind,
Phase: PhaseInit,
}
}
// Active returns true if the session has not been completed or invalidated.
func (rs *RecoverySession) Active() bool {
rs.mu.Lock()
defer rs.mu.Unlock()
return rs.Phase != PhaseCompleted && rs.Phase != PhaseInvalidated
}
// validTransitions defines the allowed phase transitions.
// Each phase maps to the set of phases it can transition to.
var validTransitions = map[SessionPhase]map[SessionPhase]bool{
PhaseInit: {PhaseConnecting: true, PhaseInvalidated: true},
PhaseConnecting: {PhaseHandshake: true, PhaseInvalidated: true},
PhaseHandshake: {PhaseCatchUp: true, PhaseCompleted: true, PhaseInvalidated: true},
PhaseCatchUp: {PhaseCompleted: true, PhaseInvalidated: true},
}
// Advance moves the session to the next phase. Returns false if the
// transition is not valid (wrong source phase, already terminal, or
// illegal jump). Enforces the lifecycle:
//
// init → connecting → handshake → catchup → completed
// ↘ invalidated (from any non-terminal)
func (rs *RecoverySession) Advance(phase SessionPhase) bool {
rs.mu.Lock()
defer rs.mu.Unlock()
if rs.Phase == PhaseCompleted || rs.Phase == PhaseInvalidated {
return false
}
allowed := validTransitions[rs.Phase]
if !allowed[phase] {
return false
}
rs.Phase = phase
return true
}
// UpdateProgress records catch-up progress. Returns false if the session
// is already terminal (completed or invalidated).
func (rs *RecoverySession) UpdateProgress(recoveredTo uint64) bool {
rs.mu.Lock()
defer rs.mu.Unlock()
if rs.Phase == PhaseCompleted || rs.Phase == PhaseInvalidated {
return false
}
if recoveredTo > rs.RecoveredTo {
rs.RecoveredTo = recoveredTo
}
return true
}
// SetRange sets the recovery LSN range.
func (rs *RecoverySession) SetRange(start, target uint64) {
rs.mu.Lock()
defer rs.mu.Unlock()
rs.StartLSN = start
rs.TargetLSN = target
}
// Converged returns true if recovery has reached the target.
func (rs *RecoverySession) Converged() bool {
rs.mu.Lock()
defer rs.mu.Unlock()
return rs.TargetLSN > 0 && rs.RecoveredTo >= rs.TargetLSN
}
func (rs *RecoverySession) complete() {
rs.mu.Lock()
defer rs.mu.Unlock()
rs.Phase = PhaseCompleted
}
func (rs *RecoverySession) invalidate(reason string) {
rs.mu.Lock()
defer rs.mu.Unlock()
if rs.Phase == PhaseCompleted || rs.Phase == PhaseInvalidated {
return
}
rs.Phase = PhaseInvalidated
rs.InvalidateReason = reason
}

162
sw-block/prototype/fsmv2/apply.go

@ -0,0 +1,162 @@
package fsmv2
func (f *FSM) Apply(evt Event) ([]Action, error) {
switch evt.Kind {
case EventEpochChanged:
if evt.Epoch <= f.Epoch {
return nil, nil
}
f.Epoch = evt.Epoch
switch f.State {
case StateInSync:
f.State = StateLagging
f.clearCatchup()
f.clearRebuild()
return []Action{ActionRevokeSyncEligibility}, nil
case StateCatchingUp, StatePromotionHold, StateRebuilding, StateCatchUpAfterBuild:
f.State = StateLagging
f.clearCatchup()
f.clearRebuild()
return []Action{ActionAbortRecovery, ActionRevokeSyncEligibility}, nil
default:
f.clearCatchup()
f.clearRebuild()
return nil, nil
}
case EventFatal:
f.State = StateFailed
f.clearCatchup()
f.clearRebuild()
return []Action{ActionFailReplica, ActionRevokeSyncEligibility}, nil
}
switch f.State {
case StateBootstrapping:
switch evt.Kind {
case EventBootstrapComplete:
f.ReplicaFlushedLSN = evt.ReplicaFlushedLSN
f.State = StateInSync
return []Action{ActionGrantSyncEligibility}, nil
case EventDisconnect:
f.State = StateLagging
return nil, nil
}
case StateInSync:
switch evt.Kind {
case EventDurableProgress:
if evt.ReplicaFlushedLSN < f.ReplicaFlushedLSN {
return nil, invalid(f.State, evt.Kind)
}
f.ReplicaFlushedLSN = evt.ReplicaFlushedLSN
return nil, nil
case EventDisconnect:
f.State = StateLagging
return []Action{ActionRevokeSyncEligibility}, nil
}
case StateLagging:
switch evt.Kind {
case EventReconnectCatchup:
f.ReplicaFlushedLSN = evt.ReplicaFlushedLSN
f.CatchupStartLSN = evt.ReplicaFlushedLSN
f.CatchupTargetLSN = evt.TargetLSN
f.PromotionBarrierLSN = evt.TargetLSN
f.RecoveryReservationID = evt.ReservationID
f.ReservationExpiry = evt.ReservationTTL
f.State = StateCatchingUp
return []Action{ActionStartCatchup}, nil
case EventReconnectRebuild:
f.State = StateNeedsRebuild
return []Action{ActionRevokeSyncEligibility}, nil
}
case StateCatchingUp:
switch evt.Kind {
case EventCatchupProgress:
if evt.ReplicaFlushedLSN < f.ReplicaFlushedLSN {
return nil, invalid(f.State, evt.Kind)
}
f.ReplicaFlushedLSN = evt.ReplicaFlushedLSN
if evt.ReplicaFlushedLSN >= f.PromotionBarrierLSN {
f.State = StatePromotionHold
f.PromotionHoldUntil = evt.PromotionHoldTill
return []Action{ActionEnterPromotionHold}, nil
}
return nil, nil
case EventRetentionLost, EventCatchupTimeout:
f.State = StateNeedsRebuild
f.clearCatchup()
return []Action{ActionAbortRecovery}, nil
case EventDisconnect:
f.State = StateLagging
f.clearCatchup()
return []Action{ActionAbortRecovery}, nil
}
case StatePromotionHold:
switch evt.Kind {
case EventDurableProgress:
if evt.ReplicaFlushedLSN < f.ReplicaFlushedLSN {
return nil, invalid(f.State, evt.Kind)
}
f.ReplicaFlushedLSN = evt.ReplicaFlushedLSN
return nil, nil
case EventPromotionHealthy:
if evt.Now < f.PromotionHoldUntil {
return nil, nil
}
f.State = StateInSync
f.clearCatchup()
return []Action{ActionGrantSyncEligibility}, nil
case EventDisconnect:
f.State = StateLagging
f.clearCatchup()
return []Action{ActionRevokeSyncEligibility}, nil
}
case StateNeedsRebuild:
switch evt.Kind {
case EventStartRebuild:
f.State = StateRebuilding
f.SnapshotID = evt.SnapshotID
f.SnapshotCpLSN = evt.SnapshotCpLSN
f.RecoveryReservationID = evt.ReservationID
f.ReservationExpiry = evt.ReservationTTL
return []Action{ActionStartRebuild}, nil
}
case StateRebuilding:
switch evt.Kind {
case EventRebuildBaseApplied:
f.State = StateCatchUpAfterBuild
f.ReplicaFlushedLSN = f.SnapshotCpLSN
f.CatchupStartLSN = f.SnapshotCpLSN
f.CatchupTargetLSN = evt.TargetLSN
f.PromotionBarrierLSN = evt.TargetLSN
return []Action{ActionStartCatchup}, nil
case EventRetentionLost, EventRebuildTooSlow, EventDisconnect:
f.State = StateNeedsRebuild
f.clearCatchup()
f.clearRebuild()
return []Action{ActionAbortRecovery}, nil
}
case StateCatchUpAfterBuild:
switch evt.Kind {
case EventCatchupProgress:
if evt.ReplicaFlushedLSN < f.ReplicaFlushedLSN {
return nil, invalid(f.State, evt.Kind)
}
f.ReplicaFlushedLSN = evt.ReplicaFlushedLSN
if evt.ReplicaFlushedLSN >= f.PromotionBarrierLSN {
f.State = StatePromotionHold
f.PromotionHoldUntil = evt.PromotionHoldTill
return []Action{ActionEnterPromotionHold}, nil
}
return nil, nil
case EventRetentionLost, EventCatchupTimeout, EventDisconnect:
f.State = StateNeedsRebuild
f.clearCatchup()
f.clearRebuild()
return []Action{ActionAbortRecovery}, nil
}
case StateFailed:
return nil, nil
}
return nil, invalid(f.State, evt.Kind)
}

37
sw-block/prototype/fsmv2/events.go

@ -0,0 +1,37 @@
package fsmv2
type EventKind string
const (
EventBootstrapComplete EventKind = "BootstrapComplete"
EventDisconnect EventKind = "Disconnect"
EventReconnectCatchup EventKind = "ReconnectCatchup"
EventReconnectRebuild EventKind = "ReconnectRebuild"
EventDurableProgress EventKind = "DurableProgress"
EventCatchupProgress EventKind = "CatchupProgress"
EventPromotionHealthy EventKind = "PromotionHealthy"
EventStartRebuild EventKind = "StartRebuild"
EventRebuildBaseApplied EventKind = "RebuildBaseApplied"
EventRetentionLost EventKind = "RetentionLost"
EventCatchupTimeout EventKind = "CatchupTimeout"
EventRebuildTooSlow EventKind = "RebuildTooSlow"
EventEpochChanged EventKind = "EpochChanged"
EventFatal EventKind = "Fatal"
)
type Event struct {
Kind EventKind
Epoch uint64
Now uint64
ReplicaFlushedLSN uint64
TargetLSN uint64
PromotionHoldTill uint64
SnapshotID string
SnapshotCpLSN uint64
ReservationID string
ReservationTTL uint64
}

73
sw-block/prototype/fsmv2/fsm.go

@ -0,0 +1,73 @@
package fsmv2
import "fmt"
type State string
const (
StateBootstrapping State = "Bootstrapping"
StateInSync State = "InSync"
StateLagging State = "Lagging"
StateCatchingUp State = "CatchingUp"
StatePromotionHold State = "PromotionHold"
StateNeedsRebuild State = "NeedsRebuild"
StateRebuilding State = "Rebuilding"
StateCatchUpAfterBuild State = "CatchUpAfterRebuild"
StateFailed State = "Failed"
)
type Action string
const (
ActionNone Action = "None"
ActionGrantSyncEligibility Action = "GrantSyncEligibility"
ActionRevokeSyncEligibility Action = "RevokeSyncEligibility"
ActionStartCatchup Action = "StartCatchup"
ActionEnterPromotionHold Action = "EnterPromotionHold"
ActionStartRebuild Action = "StartRebuild"
ActionAbortRecovery Action = "AbortRecovery"
ActionFailReplica Action = "FailReplica"
)
type FSM struct {
State State
Epoch uint64
ReplicaFlushedLSN uint64
CatchupStartLSN uint64
CatchupTargetLSN uint64
PromotionBarrierLSN uint64
PromotionHoldUntil uint64
SnapshotID string
SnapshotCpLSN uint64
RecoveryReservationID string
ReservationExpiry uint64
}
func New(epoch uint64) *FSM {
return &FSM{State: StateBootstrapping, Epoch: epoch}
}
func (f *FSM) IsSyncEligible() bool {
return f.State == StateInSync
}
func (f *FSM) clearCatchup() {
f.CatchupStartLSN = 0
f.CatchupTargetLSN = 0
f.PromotionBarrierLSN = 0
f.PromotionHoldUntil = 0
f.RecoveryReservationID = ""
f.ReservationExpiry = 0
}
func (f *FSM) clearRebuild() {
f.SnapshotID = ""
f.SnapshotCpLSN = 0
}
func invalid(state State, kind EventKind) error {
return fmt.Errorf("fsmv2: invalid event %s in state %s", kind, state)
}

95
sw-block/prototype/fsmv2/fsm_test.go

@ -0,0 +1,95 @@
package fsmv2
import "testing"
func mustApply(t *testing.T, f *FSM, evt Event) []Action {
t.Helper()
actions, err := f.Apply(evt)
if err != nil {
t.Fatalf("apply %s: %v", evt.Kind, err)
}
return actions
}
func TestFSMBootstrapToInSync(t *testing.T) {
f := New(7)
mustApply(t, f, Event{Kind: EventBootstrapComplete, ReplicaFlushedLSN: 10})
if f.State != StateInSync || !f.IsSyncEligible() || f.ReplicaFlushedLSN != 10 {
t.Fatalf("unexpected bootstrap result: state=%s eligible=%v lsn=%d", f.State, f.IsSyncEligible(), f.ReplicaFlushedLSN)
}
}
func TestFSMCatchupPromotionHoldFlow(t *testing.T) {
f := New(3)
mustApply(t, f, Event{Kind: EventBootstrapComplete, ReplicaFlushedLSN: 5})
mustApply(t, f, Event{Kind: EventDisconnect})
mustApply(t, f, Event{Kind: EventReconnectCatchup, ReplicaFlushedLSN: 5, TargetLSN: 20, ReservationID: "r1", ReservationTTL: 100})
if f.State != StateCatchingUp {
t.Fatalf("expected catching up, got %s", f.State)
}
mustApply(t, f, Event{Kind: EventCatchupProgress, ReplicaFlushedLSN: 20, PromotionHoldTill: 30})
if f.State != StatePromotionHold {
t.Fatalf("expected promotion hold, got %s", f.State)
}
mustApply(t, f, Event{Kind: EventPromotionHealthy, Now: 29})
if f.State != StatePromotionHold {
t.Fatalf("hold exited too early: %s", f.State)
}
mustApply(t, f, Event{Kind: EventPromotionHealthy, Now: 30})
if f.State != StateInSync || !f.IsSyncEligible() {
t.Fatalf("expected insync after hold, got %s eligible=%v", f.State, f.IsSyncEligible())
}
}
func TestFSMRebuildFlow(t *testing.T) {
f := New(11)
mustApply(t, f, Event{Kind: EventDisconnect})
mustApply(t, f, Event{Kind: EventReconnectRebuild})
if f.State != StateNeedsRebuild {
t.Fatalf("expected needs rebuild, got %s", f.State)
}
mustApply(t, f, Event{Kind: EventStartRebuild, SnapshotID: "snap-1", SnapshotCpLSN: 100, ReservationID: "rr", ReservationTTL: 200})
if f.State != StateRebuilding {
t.Fatalf("expected rebuilding, got %s", f.State)
}
mustApply(t, f, Event{Kind: EventRebuildBaseApplied, TargetLSN: 140})
if f.State != StateCatchUpAfterBuild || f.ReplicaFlushedLSN != 100 {
t.Fatalf("unexpected rebuild-base state=%s lsn=%d", f.State, f.ReplicaFlushedLSN)
}
mustApply(t, f, Event{Kind: EventCatchupProgress, ReplicaFlushedLSN: 140, PromotionHoldTill: 150})
mustApply(t, f, Event{Kind: EventPromotionHealthy, Now: 150})
if f.State != StateInSync || f.SnapshotID != "snap-1" {
t.Fatalf("expected insync after rebuild, got state=%s snapshot=%q", f.State, f.SnapshotID)
}
}
func TestFSMEpochChangeAbortsRecovery(t *testing.T) {
f := New(1)
mustApply(t, f, Event{Kind: EventBootstrapComplete, ReplicaFlushedLSN: 1})
mustApply(t, f, Event{Kind: EventDisconnect})
mustApply(t, f, Event{Kind: EventReconnectCatchup, ReplicaFlushedLSN: 1, TargetLSN: 5, ReservationID: "r1", ReservationTTL: 99})
mustApply(t, f, Event{Kind: EventEpochChanged, Epoch: 2})
if f.State != StateLagging || f.RecoveryReservationID != "" || f.IsSyncEligible() {
t.Fatalf("unexpected state after epoch change: state=%s reservation=%q eligible=%v", f.State, f.RecoveryReservationID, f.IsSyncEligible())
}
}
func TestFSMReservationLostNeedsRebuild(t *testing.T) {
f := New(5)
mustApply(t, f, Event{Kind: EventBootstrapComplete, ReplicaFlushedLSN: 9})
mustApply(t, f, Event{Kind: EventDisconnect})
mustApply(t, f, Event{Kind: EventReconnectCatchup, ReplicaFlushedLSN: 9, TargetLSN: 15, ReservationID: "r2", ReservationTTL: 80})
mustApply(t, f, Event{Kind: EventRetentionLost})
if f.State != StateNeedsRebuild {
t.Fatalf("expected needs rebuild after reservation lost, got %s", f.State)
}
}
func TestFSMDurableProgressWhileInSync(t *testing.T) {
f := New(2)
mustApply(t, f, Event{Kind: EventBootstrapComplete, ReplicaFlushedLSN: 4})
mustApply(t, f, Event{Kind: EventDurableProgress, ReplicaFlushedLSN: 8})
if f.ReplicaFlushedLSN != 8 || f.State != StateInSync {
t.Fatalf("unexpected in-sync durable progress: state=%s lsn=%d", f.State, f.ReplicaFlushedLSN)
}
}

BIN
sw-block/prototype/fsmv2/fsmv2.test.exe

37
sw-block/prototype/run-tests.ps1

@ -0,0 +1,37 @@
param(
[string[]]$Packages = @(
'./sw-block/prototype/fsmv2',
'./sw-block/prototype/volumefsm',
'./sw-block/prototype/distsim'
)
)
$ErrorActionPreference = 'Stop'
$root = Split-Path -Parent (Split-Path -Parent $PSScriptRoot)
Set-Location $root
$cacheDir = Join-Path $root '.gocache_v2'
$tmpDir = Join-Path $root '.gotmp_v2'
New-Item -ItemType Directory -Force -Path $cacheDir,$tmpDir | Out-Null
$env:GOCACHE = $cacheDir
$env:GOTMPDIR = $tmpDir
foreach ($pkg in $Packages) {
$name = Split-Path $pkg -Leaf
$out = Join-Path $root ("sw-block\prototype\{0}\{0}.test.exe" -f $name)
Write-Host "==> building $pkg"
go test -c -o $out $pkg
if (!(Test-Path $out)) {
throw "go test -c build failed for $pkg"
}
if ($LASTEXITCODE -ne 0) {
Write-Warning "go test -c reported a non-zero exit code for $pkg, but the test binary was produced. Continuing."
}
Write-Host "==> running $out"
cmd /c "cd /d $root && $out -test.v -test.count=1"
if ($LASTEXITCODE -ne 0) {
throw "test binary failed for $pkg"
}
}
Write-Host "Done."

148
sw-block/prototype/volumefsm/events.go

@ -0,0 +1,148 @@
package volumefsm
import fsmv2 "github.com/seaweedfs/seaweedfs/sw-block/prototype/fsmv2"
type EventKind string
const (
EventWriteCommitted EventKind = "WriteCommitted"
EventCheckpointAdvanced EventKind = "CheckpointAdvanced"
EventBarrierCompleted EventKind = "BarrierCompleted"
EventBootstrapReplica EventKind = "BootstrapReplica"
EventReplicaDisconnect EventKind = "ReplicaDisconnect"
EventReplicaReconnect EventKind = "ReplicaReconnect"
EventReplicaNeedsRebuild EventKind = "ReplicaNeedsRebuild"
EventReplicaCatchupProgress EventKind = "ReplicaCatchupProgress"
EventReplicaPromotionHealthy EventKind = "ReplicaPromotionHealthy"
EventReplicaStartRebuild EventKind = "ReplicaStartRebuild"
EventReplicaRebuildBaseApplied EventKind = "ReplicaRebuildBaseApplied"
EventReplicaReservationLost EventKind = "ReplicaReservationLost"
EventReplicaCatchupTimeout EventKind = "ReplicaCatchupTimeout"
EventReplicaRebuildTooSlow EventKind = "ReplicaRebuildTooSlow"
EventPrimaryLeaseLost EventKind = "PrimaryLeaseLost"
EventPromoteReplica EventKind = "PromoteReplica"
)
type Event struct {
Kind EventKind
ReplicaID string
LSN uint64
CheckpointLSN uint64
ReplicaFlushedLSN uint64
TargetLSN uint64
Now uint64
HoldUntil uint64
SnapshotID string
SnapshotCpLSN uint64
ReservationID string
ReservationTTL uint64
}
func (m *Model) Apply(evt Event) error {
switch evt.Kind {
case EventWriteCommitted:
// Advance the head to the caller-supplied LSN; if the event does not
// carry an LSN ahead of the head (e.g. LSN zero), auto-assign the next.
if evt.LSN > m.HeadLSN {
m.HeadLSN = evt.LSN
} else {
m.HeadLSN++
}
return nil
case EventCheckpointAdvanced:
if evt.CheckpointLSN > m.CheckpointLSN {
m.CheckpointLSN = evt.CheckpointLSN
}
return nil
case EventPrimaryLeaseLost:
m.PrimaryState = PrimaryLost
m.Epoch++
for _, r := range m.Replicas {
_, err := r.FSM.Apply(fsmv2.Event{Kind: fsmv2.EventEpochChanged, Epoch: m.Epoch})
if err != nil {
return err
}
}
return nil
case EventPromoteReplica:
m.PrimaryID = evt.ReplicaID
m.PrimaryState = PrimaryServing
m.Epoch++
for _, r := range m.Replicas {
_, err := r.FSM.Apply(fsmv2.Event{Kind: fsmv2.EventEpochChanged, Epoch: m.Epoch})
if err != nil {
return err
}
}
return nil
}
r := m.Replica(evt.ReplicaID)
if r == nil {
return nil
}
var fEvt fsmv2.Event
switch evt.Kind {
case EventBarrierCompleted:
fEvt = fsmv2.Event{Kind: fsmv2.EventDurableProgress, ReplicaFlushedLSN: evt.ReplicaFlushedLSN}
case EventBootstrapReplica:
fEvt = fsmv2.Event{Kind: fsmv2.EventBootstrapComplete, ReplicaFlushedLSN: evt.ReplicaFlushedLSN}
case EventReplicaDisconnect:
fEvt = fsmv2.Event{Kind: fsmv2.EventDisconnect}
case EventReplicaReconnect:
if evt.ReservationID != "" {
fEvt = fsmv2.Event{Kind: fsmv2.EventReconnectCatchup, ReplicaFlushedLSN: evt.ReplicaFlushedLSN, TargetLSN: evt.TargetLSN, ReservationID: evt.ReservationID, ReservationTTL: evt.ReservationTTL}
} else {
fEvt = fsmv2.Event{Kind: fsmv2.EventReconnectRebuild}
}
case EventReplicaNeedsRebuild:
fEvt = fsmv2.Event{Kind: fsmv2.EventReconnectRebuild}
case EventReplicaCatchupProgress:
fEvt = fsmv2.Event{Kind: fsmv2.EventCatchupProgress, ReplicaFlushedLSN: evt.ReplicaFlushedLSN, PromotionHoldTill: evt.HoldUntil}
case EventReplicaPromotionHealthy:
fEvt = fsmv2.Event{Kind: fsmv2.EventPromotionHealthy, Now: evt.Now}
case EventReplicaStartRebuild:
fEvt = fsmv2.Event{Kind: fsmv2.EventStartRebuild, SnapshotID: evt.SnapshotID, SnapshotCpLSN: evt.SnapshotCpLSN, ReservationID: evt.ReservationID, ReservationTTL: evt.ReservationTTL}
case EventReplicaRebuildBaseApplied:
fEvt = fsmv2.Event{Kind: fsmv2.EventRebuildBaseApplied, TargetLSN: evt.TargetLSN}
case EventReplicaReservationLost:
fEvt = fsmv2.Event{Kind: fsmv2.EventRetentionLost}
case EventReplicaCatchupTimeout:
fEvt = fsmv2.Event{Kind: fsmv2.EventCatchupTimeout}
case EventReplicaRebuildTooSlow:
fEvt = fsmv2.Event{Kind: fsmv2.EventRebuildTooSlow}
default:
return nil
}
_, err := r.FSM.Apply(fEvt)
return err
}
func (m *Model) EvaluateReconnect(replicaID string, flushedLSN, targetLSN uint64) (RecoveryDecision, error) {
decision := m.Planner.PlanReconnect(replicaID, flushedLSN, targetLSN)
r := m.Replica(replicaID)
if r == nil {
return decision, nil
}
switch decision.Disposition {
case RecoveryCatchup:
err := m.Apply(Event{
Kind: EventReplicaReconnect,
ReplicaID: replicaID,
ReplicaFlushedLSN: flushedLSN,
TargetLSN: targetLSN,
ReservationID: decision.ReservationID,
ReservationTTL: decision.ReservationTTL,
})
return decision, err
default:
if r.FSM.State == fsmv2.StateNeedsRebuild {
return decision, nil
}
err := m.Apply(Event{
Kind: EventReplicaNeedsRebuild,
ReplicaID: replicaID,
})
return decision, err
}
}

38
sw-block/prototype/volumefsm/format.go

@ -0,0 +1,38 @@
package volumefsm
import (
"fmt"
"sort"
"strings"
)
func FormatSnapshot(s Snapshot) string {
ids := make([]string, 0, len(s.Replicas))
for id := range s.Replicas {
ids = append(ids, id)
}
sort.Strings(ids)
parts := []string{
fmt.Sprintf("step=%s", s.Step),
fmt.Sprintf("epoch=%d", s.Epoch),
fmt.Sprintf("primary=%s/%s", s.PrimaryID, s.PrimaryState),
fmt.Sprintf("head=%d", s.HeadLSN),
fmt.Sprintf("write=%t:%s", s.WriteGate.Allowed, s.WriteGate.Reason),
fmt.Sprintf("ack=%t:%s", s.AckGate.Allowed, s.AckGate.Reason),
}
for _, id := range ids {
r := s.Replicas[id]
parts = append(parts, fmt.Sprintf("%s=%s@%d", id, r.State, r.FlushedLSN))
}
return strings.Join(parts, " ")
}
func FormatTrace(trace []Snapshot) string {
lines := make([]string, 0, len(trace))
for _, s := range trace {
lines = append(lines, FormatSnapshot(s))
}
return strings.Join(lines, "\n")
}

142
sw-block/prototype/volumefsm/model.go

@ -0,0 +1,142 @@
package volumefsm
import fsmv2 "github.com/seaweedfs/seaweedfs/sw-block/prototype/fsmv2"
type Mode string
const (
ModeBestEffort Mode = "best_effort"
ModeSyncAll Mode = "sync_all"
ModeSyncQuorum Mode = "sync_quorum"
)
type PrimaryState string
const (
PrimaryServing PrimaryState = "serving"
PrimaryDraining PrimaryState = "draining"
PrimaryLost PrimaryState = "lost"
)
type Replica struct {
ID string
FSM *fsmv2.FSM
}
type Model struct {
Epoch uint64
PrimaryID string
PrimaryState PrimaryState
Mode Mode
HeadLSN uint64
CheckpointLSN uint64
RequiredReplicaIDs []string
Replicas map[string]*Replica
Planner RecoveryPlanner
}
func New(primaryID string, mode Mode, epoch uint64, replicaIDs ...string) *Model {
m := &Model{
Epoch: epoch,
PrimaryID: primaryID,
PrimaryState: PrimaryServing,
Mode: mode,
Replicas: make(map[string]*Replica, len(replicaIDs)),
Planner: StaticRecoveryPlanner{},
}
for _, id := range replicaIDs {
m.Replicas[id] = &Replica{ID: id, FSM: fsmv2.New(epoch)}
m.RequiredReplicaIDs = append(m.RequiredReplicaIDs, id)
}
return m
}
func (m *Model) Replica(id string) *Replica {
return m.Replicas[id]
}
func (m *Model) SyncEligibleCount() int {
count := 0
for _, id := range m.RequiredReplicaIDs {
r := m.Replicas[id]
if r != nil && r.FSM.IsSyncEligible() {
count++
}
}
return count
}
func (m *Model) DurableReplicaCount(targetLSN uint64) int {
count := 0
for _, id := range m.RequiredReplicaIDs {
r := m.Replicas[id]
if r != nil && r.FSM.IsSyncEligible() && r.FSM.ReplicaFlushedLSN >= targetLSN {
count++
}
}
return count
}
func (m *Model) Quorum() int {
// Replication factor counts the primary plus all required replicas;
// a majority of that set is rf/2 + 1.
rf := len(m.RequiredReplicaIDs) + 1
return rf/2 + 1
}
func (m *Model) CanServeWrite() bool {
return m.WriteAdmission().Allowed
}
type AdmissionDecision struct {
Allowed bool
Reason string
}
func (m *Model) WriteAdmission() AdmissionDecision {
if m.PrimaryState != PrimaryServing {
return AdmissionDecision{Allowed: false, Reason: "primary_not_serving"}
}
switch m.Mode {
case ModeBestEffort:
return AdmissionDecision{Allowed: true, Reason: "best_effort_local_durable"}
case ModeSyncAll:
if m.SyncEligibleCount() == len(m.RequiredReplicaIDs) {
return AdmissionDecision{Allowed: true, Reason: "all_replicas_sync_eligible"}
}
return AdmissionDecision{Allowed: false, Reason: "required_replica_not_in_sync"}
case ModeSyncQuorum:
if 1+m.SyncEligibleCount() >= m.Quorum() {
return AdmissionDecision{Allowed: true, Reason: "quorum_sync_eligible"}
}
return AdmissionDecision{Allowed: false, Reason: "quorum_not_available"}
default:
return AdmissionDecision{Allowed: false, Reason: "unknown_mode"}
}
}
func (m *Model) CanAcknowledgeLSN(targetLSN uint64) bool {
return m.AckAdmission(targetLSN).Allowed
}
func (m *Model) AckAdmission(targetLSN uint64) AdmissionDecision {
if m.PrimaryState != PrimaryServing {
return AdmissionDecision{Allowed: false, Reason: "primary_not_serving"}
}
switch m.Mode {
case ModeBestEffort:
return AdmissionDecision{Allowed: true, Reason: "best_effort_local_durable"}
case ModeSyncAll:
if m.DurableReplicaCount(targetLSN) == len(m.RequiredReplicaIDs) {
return AdmissionDecision{Allowed: true, Reason: "all_replicas_durable"}
}
return AdmissionDecision{Allowed: false, Reason: "required_replica_not_durable"}
case ModeSyncQuorum:
if 1+m.DurableReplicaCount(targetLSN) >= m.Quorum() {
return AdmissionDecision{Allowed: true, Reason: "quorum_durable"}
}
return AdmissionDecision{Allowed: false, Reason: "durable_quorum_not_available"}
default:
return AdmissionDecision{Allowed: false, Reason: "unknown_mode"}
}
}

421
sw-block/prototype/volumefsm/model_test.go

@ -0,0 +1,421 @@
package volumefsm
import (
"strings"
"testing"
fsmv2 "github.com/seaweedfs/seaweedfs/sw-block/prototype/fsmv2"
)
type scriptedPlanner struct {
decision RecoveryDecision
}
func (s scriptedPlanner) PlanReconnect(replicaID string, flushedLSN, targetLSN uint64) RecoveryDecision {
return s.decision
}
func mustApply(t *testing.T, m *Model, evt Event) {
t.Helper()
if err := m.Apply(evt); err != nil {
t.Fatalf("apply %s: %v", evt.Kind, err)
}
}
func TestModelSyncAllBlocksOnLaggingReplica(t *testing.T) {
m := New("p1", ModeSyncAll, 1, "r1", "r2")
mustApply(t, m, Event{Kind: EventBootstrapReplica, ReplicaID: "r1", ReplicaFlushedLSN: 1})
mustApply(t, m, Event{Kind: EventBootstrapReplica, ReplicaID: "r2", ReplicaFlushedLSN: 1})
if !m.CanServeWrite() {
t.Fatal("sync_all should serve when all replicas are in sync")
}
mustApply(t, m, Event{Kind: EventReplicaDisconnect, ReplicaID: "r2"})
if m.CanServeWrite() {
t.Fatal("sync_all should block when one required replica lags")
}
}
func TestModelSyncQuorumSurvivesOneLaggingReplica(t *testing.T) {
m := New("p1", ModeSyncQuorum, 1, "r1", "r2")
mustApply(t, m, Event{Kind: EventBootstrapReplica, ReplicaID: "r1", ReplicaFlushedLSN: 1})
mustApply(t, m, Event{Kind: EventBootstrapReplica, ReplicaID: "r2", ReplicaFlushedLSN: 1})
mustApply(t, m, Event{Kind: EventReplicaDisconnect, ReplicaID: "r2"})
if !m.CanServeWrite() {
t.Fatal("sync_quorum should still serve with primary + one in-sync replica")
}
}
func TestModelCatchupFlowRestoresEligibility(t *testing.T) {
m := New("p1", ModeSyncAll, 1, "r1")
mustApply(t, m, Event{Kind: EventBootstrapReplica, ReplicaID: "r1", ReplicaFlushedLSN: 1})
mustApply(t, m, Event{Kind: EventReplicaDisconnect, ReplicaID: "r1"})
mustApply(t, m, Event{Kind: EventWriteCommitted, LSN: 10})
mustApply(t, m, Event{Kind: EventReplicaReconnect, ReplicaID: "r1", ReplicaFlushedLSN: 1, TargetLSN: 10, ReservationID: "res-1", ReservationTTL: 100})
if got := m.Replica("r1").FSM.State; got != fsmv2.StateCatchingUp {
t.Fatalf("expected catching up, got %s", got)
}
mustApply(t, m, Event{Kind: EventReplicaCatchupProgress, ReplicaID: "r1", ReplicaFlushedLSN: 10, HoldUntil: 20})
mustApply(t, m, Event{Kind: EventReplicaPromotionHealthy, ReplicaID: "r1", Now: 20})
if got := m.Replica("r1").FSM.State; got != fsmv2.StateInSync {
t.Fatalf("expected in sync, got %s", got)
}
if !m.CanServeWrite() {
t.Fatal("sync_all should serve after replica returns to in-sync")
}
}
func TestModelLongGapRebuildFlow(t *testing.T) {
m := New("p1", ModeBestEffort, 1, "r1")
mustApply(t, m, Event{Kind: EventReplicaDisconnect, ReplicaID: "r1"})
mustApply(t, m, Event{Kind: EventReplicaNeedsRebuild, ReplicaID: "r1"})
mustApply(t, m, Event{Kind: EventReplicaStartRebuild, ReplicaID: "r1", SnapshotID: "snap1", SnapshotCpLSN: 100, ReservationID: "rebuild-1", ReservationTTL: 200})
mustApply(t, m, Event{Kind: EventReplicaRebuildBaseApplied, ReplicaID: "r1", TargetLSN: 130})
mustApply(t, m, Event{Kind: EventReplicaCatchupProgress, ReplicaID: "r1", ReplicaFlushedLSN: 130, HoldUntil: 150})
mustApply(t, m, Event{Kind: EventReplicaPromotionHealthy, ReplicaID: "r1", Now: 150})
if got := m.Replica("r1").FSM.State; got != fsmv2.StateInSync {
t.Fatalf("expected in sync after rebuild, got %s", got)
}
}
func TestModelPrimaryLeaseLostFencesRecovery(t *testing.T) {
m := New("p1", ModeSyncAll, 1, "r1")
mustApply(t, m, Event{Kind: EventBootstrapReplica, ReplicaID: "r1", ReplicaFlushedLSN: 1})
mustApply(t, m, Event{Kind: EventReplicaDisconnect, ReplicaID: "r1"})
mustApply(t, m, Event{Kind: EventReplicaReconnect, ReplicaID: "r1", ReplicaFlushedLSN: 1, TargetLSN: 5, ReservationID: "res-2", ReservationTTL: 100})
mustApply(t, m, Event{Kind: EventPrimaryLeaseLost})
if m.PrimaryState != PrimaryLost {
t.Fatalf("expected lost primary, got %s", m.PrimaryState)
}
if got := m.Replica("r1").FSM.State; got != fsmv2.StateLagging {
t.Fatalf("expected lagging after fencing, got %s", got)
}
}
func TestModelPromoteReplicaChangesEpoch(t *testing.T) {
m := New("p1", ModeSyncQuorum, 1, "r1", "r2")
mustApply(t, m, Event{Kind: EventBootstrapReplica, ReplicaID: "r1", ReplicaFlushedLSN: 10})
oldEpoch := m.Epoch
mustApply(t, m, Event{Kind: EventPromoteReplica, ReplicaID: "r1"})
if m.PrimaryID != "r1" {
t.Fatalf("expected promoted primary r1, got %s", m.PrimaryID)
}
if m.Epoch != oldEpoch+1 {
t.Fatalf("expected epoch increment, got %d want %d", m.Epoch, oldEpoch+1)
}
}
func TestModelSyncQuorumWithThreeReplicasMixedStates(t *testing.T) {
m := New("p1", ModeSyncQuorum, 1, "r1", "r2", "r3")
mustApply(t, m, Event{Kind: EventBootstrapReplica, ReplicaID: "r1", ReplicaFlushedLSN: 1})
mustApply(t, m, Event{Kind: EventBootstrapReplica, ReplicaID: "r2", ReplicaFlushedLSN: 1})
mustApply(t, m, Event{Kind: EventBootstrapReplica, ReplicaID: "r3", ReplicaFlushedLSN: 1})
mustApply(t, m, Event{Kind: EventReplicaDisconnect, ReplicaID: "r2"})
if !m.CanServeWrite() {
t.Fatal("sync_quorum should serve with primary + two in-sync replicas out of RF=4")
}
}
func TestModelFailoverFencesMixedReplicaStates(t *testing.T) {
	m := New("p1", ModeSyncQuorum, 10, "r1", "r2", "r3")
	mustApply(t, m, Event{Kind: EventBootstrapReplica, ReplicaID: "r1", ReplicaFlushedLSN: 8})
	mustApply(t, m, Event{Kind: EventBootstrapReplica, ReplicaID: "r2", ReplicaFlushedLSN: 8})
	mustApply(t, m, Event{Kind: EventBootstrapReplica, ReplicaID: "r3", ReplicaFlushedLSN: 6})
	mustApply(t, m, Event{Kind: EventReplicaDisconnect, ReplicaID: "r2"})
	mustApply(t, m, Event{Kind: EventReplicaReconnect, ReplicaID: "r2", ReplicaFlushedLSN: 8, TargetLSN: 12, ReservationID: "catch-r2", ReservationTTL: 100})
	mustApply(t, m, Event{Kind: EventReplicaDisconnect, ReplicaID: "r3"})
	mustApply(t, m, Event{Kind: EventReplicaNeedsRebuild, ReplicaID: "r3"})
	mustApply(t, m, Event{Kind: EventReplicaStartRebuild, ReplicaID: "r3", SnapshotID: "snap-x", SnapshotCpLSN: 6, ReservationID: "rebuild-r3", ReservationTTL: 200})
	mustApply(t, m, Event{Kind: EventPrimaryLeaseLost})
	mustApply(t, m, Event{Kind: EventPromoteReplica, ReplicaID: "r1"})
	if m.PrimaryID != "r1" {
		t.Fatalf("expected r1 promoted, got %s", m.PrimaryID)
	}
	if got := m.Replica("r2").FSM.State; got != fsmv2.StateLagging {
		t.Fatalf("expected r2 fenced back to lagging, got %s", got)
	}
	if got := m.Replica("r3").FSM.State; got != fsmv2.StateLagging {
		t.Fatalf("expected r3 fenced back to lagging, got %s", got)
	}
}

func TestModelRebuildInterruptedByEpochChange(t *testing.T) {
	m := New("p1", ModeBestEffort, 1, "r1")
	mustApply(t, m, Event{Kind: EventReplicaDisconnect, ReplicaID: "r1"})
	mustApply(t, m, Event{Kind: EventReplicaNeedsRebuild, ReplicaID: "r1"})
	mustApply(t, m, Event{Kind: EventReplicaStartRebuild, ReplicaID: "r1", SnapshotID: "snap-2", SnapshotCpLSN: 100, ReservationID: "rebuild-2", ReservationTTL: 200})
	if got := m.Replica("r1").FSM.State; got != fsmv2.StateRebuilding {
		t.Fatalf("expected rebuilding, got %s", got)
	}
	mustApply(t, m, Event{Kind: EventPromoteReplica, ReplicaID: "r1"})
	if got := m.Replica("r1").FSM.State; got != fsmv2.StateLagging {
		t.Fatalf("expected lagging after epoch change fencing, got %s", got)
	}
}

func TestModelReservationLostDuringCatchupAfterRebuild(t *testing.T) {
	m := New("p1", ModeBestEffort, 1, "r1")
	mustApply(t, m, Event{Kind: EventReplicaDisconnect, ReplicaID: "r1"})
	mustApply(t, m, Event{Kind: EventReplicaNeedsRebuild, ReplicaID: "r1"})
	mustApply(t, m, Event{Kind: EventReplicaStartRebuild, ReplicaID: "r1", SnapshotID: "snap-3", SnapshotCpLSN: 50, ReservationID: "rebuild-3", ReservationTTL: 200})
	mustApply(t, m, Event{Kind: EventReplicaRebuildBaseApplied, ReplicaID: "r1", TargetLSN: 80})
	if got := m.Replica("r1").FSM.State; got != fsmv2.StateCatchUpAfterBuild {
		t.Fatalf("expected catch-up-after-rebuild, got %s", got)
	}
	mustApply(t, m, Event{Kind: EventReplicaReservationLost, ReplicaID: "r1"})
	if got := m.Replica("r1").FSM.State; got != fsmv2.StateNeedsRebuild {
		t.Fatalf("expected needs rebuild after reservation loss, got %s", got)
	}
}

func TestModelSyncAllBarrierAcknowledgeTargetLSN(t *testing.T) {
	m := New("p1", ModeSyncAll, 1, "r1", "r2")
	mustApply(t, m, Event{Kind: EventBootstrapReplica, ReplicaID: "r1", ReplicaFlushedLSN: 5})
	mustApply(t, m, Event{Kind: EventBootstrapReplica, ReplicaID: "r2", ReplicaFlushedLSN: 5})
	mustApply(t, m, Event{Kind: EventWriteCommitted, LSN: 10})
	if m.CanAcknowledgeLSN(10) {
		t.Fatal("sync_all should not acknowledge target LSN before barriers advance replica durability")
	}
	mustApply(t, m, Event{Kind: EventBarrierCompleted, ReplicaID: "r1", ReplicaFlushedLSN: 10})
	if m.CanAcknowledgeLSN(10) {
		t.Fatal("sync_all should still wait for second replica durability")
	}
	mustApply(t, m, Event{Kind: EventBarrierCompleted, ReplicaID: "r2", ReplicaFlushedLSN: 10})
	if !m.CanAcknowledgeLSN(10) {
		t.Fatal("sync_all should acknowledge once all required replicas are durable at target LSN")
	}
}

func TestModelSyncQuorumBarrierAcknowledgeTargetLSN(t *testing.T) {
	m := New("p1", ModeSyncQuorum, 1, "r1", "r2", "r3")
	mustApply(t, m, Event{Kind: EventBootstrapReplica, ReplicaID: "r1", ReplicaFlushedLSN: 5})
	mustApply(t, m, Event{Kind: EventBootstrapReplica, ReplicaID: "r2", ReplicaFlushedLSN: 5})
	mustApply(t, m, Event{Kind: EventBootstrapReplica, ReplicaID: "r3", ReplicaFlushedLSN: 5})
	mustApply(t, m, Event{Kind: EventWriteCommitted, LSN: 9})
	if m.CanAcknowledgeLSN(9) {
		t.Fatal("sync_quorum should not acknowledge before any replica reaches target durability")
	}
	mustApply(t, m, Event{Kind: EventBarrierCompleted, ReplicaID: "r1", ReplicaFlushedLSN: 9})
	if m.CanAcknowledgeLSN(9) {
		t.Fatal("sync_quorum should still wait because RF=4 quorum needs primary + two durable replicas")
	}
	mustApply(t, m, Event{Kind: EventBarrierCompleted, ReplicaID: "r2", ReplicaFlushedLSN: 9})
	if !m.CanAcknowledgeLSN(9) {
		t.Fatal("sync_quorum should acknowledge with primary + two durable replicas in RF=4")
	}
}

func TestModelWriteAdmissionReasons(t *testing.T) {
	m := New("p1", ModeSyncAll, 1, "r1")
	dec := m.WriteAdmission()
	if dec.Allowed || dec.Reason != "required_replica_not_in_sync" {
		t.Fatalf("unexpected admission before bootstrap: %+v", dec)
	}
	mustApply(t, m, Event{Kind: EventBootstrapReplica, ReplicaID: "r1", ReplicaFlushedLSN: 1})
	dec = m.WriteAdmission()
	if !dec.Allowed || dec.Reason != "all_replicas_sync_eligible" {
		t.Fatalf("unexpected admission after bootstrap: %+v", dec)
	}
}

func TestModelEvaluateReconnectUsesPlanner(t *testing.T) {
	m := New("p1", ModeBestEffort, 1, "r1")
	mustApply(t, m, Event{Kind: EventBootstrapReplica, ReplicaID: "r1", ReplicaFlushedLSN: 2})
	mustApply(t, m, Event{Kind: EventReplicaDisconnect, ReplicaID: "r1"})
	decision, err := m.EvaluateReconnect("r1", 2, 8)
	if err != nil {
		t.Fatalf("evaluate reconnect: %v", err)
	}
	if decision.Disposition != RecoveryCatchup {
		t.Fatalf("expected catchup decision, got %+v", decision)
	}
	if got := m.Replica("r1").FSM.State; got != fsmv2.StateCatchingUp {
		t.Fatalf("expected catching up, got %s", got)
	}
}

func TestModelEvaluateReconnectNeedsRebuildFromPlanner(t *testing.T) {
	m := New("p1", ModeBestEffort, 1, "r1")
	m.Planner = scriptedPlanner{decision: RecoveryDecision{
		Disposition: RecoveryNeedsRebuild,
		Reason:      "payload_not_resolvable",
	}}
	mustApply(t, m, Event{Kind: EventBootstrapReplica, ReplicaID: "r1", ReplicaFlushedLSN: 2})
	mustApply(t, m, Event{Kind: EventReplicaDisconnect, ReplicaID: "r1"})
	decision, err := m.EvaluateReconnect("r1", 2, 8)
	if err != nil {
		t.Fatalf("evaluate reconnect: %v", err)
	}
	if decision.Disposition != RecoveryNeedsRebuild || decision.Reason != "payload_not_resolvable" {
		t.Fatalf("unexpected decision: %+v", decision)
	}
	if got := m.Replica("r1").FSM.State; got != fsmv2.StateNeedsRebuild {
		t.Fatalf("expected needs rebuild, got %s", got)
	}
}

func TestModelEvaluateReconnectCarriesRecoveryClasses(t *testing.T) {
	m := New("p1", ModeBestEffort, 1, "r1")
	m.Planner = scriptedPlanner{decision: RecoveryDecision{
		Disposition:    RecoveryCatchup,
		ReservationID:  "extent-resv",
		ReservationTTL: 42,
		Reason:         "extent_payload_resolvable",
		Classes:        []RecoveryClass{RecoveryClassExtentReferenced},
	}}
	mustApply(t, m, Event{Kind: EventBootstrapReplica, ReplicaID: "r1", ReplicaFlushedLSN: 3})
	mustApply(t, m, Event{Kind: EventReplicaDisconnect, ReplicaID: "r1"})
	decision, err := m.EvaluateReconnect("r1", 3, 9)
	if err != nil {
		t.Fatalf("evaluate reconnect: %v", err)
	}
	if len(decision.Classes) != 1 || decision.Classes[0] != RecoveryClassExtentReferenced {
		t.Fatalf("unexpected recovery classes: %+v", decision.Classes)
	}
	if got := m.Replica("r1").FSM.State; got != fsmv2.StateCatchingUp {
		t.Fatalf("expected catching up, got %s", got)
	}
	if got := m.Replica("r1").FSM.RecoveryReservationID; got != "extent-resv" {
		t.Fatalf("expected reservation extent-resv, got %q", got)
	}
}

func TestModelEvaluateReconnectCanChangeOverTime(t *testing.T) {
	m := New("p1", ModeBestEffort, 1, "r1")
	mustApply(t, m, Event{Kind: EventBootstrapReplica, ReplicaID: "r1", ReplicaFlushedLSN: 4})
	mustApply(t, m, Event{Kind: EventReplicaDisconnect, ReplicaID: "r1"})
	m.Planner = scriptedPlanner{decision: RecoveryDecision{
		Disposition:    RecoveryCatchup,
		ReservationID:  "resv-1",
		ReservationTTL: 10,
		Reason:         "temporarily_recoverable",
		Classes:        []RecoveryClass{RecoveryClassWALInline},
	}}
	decision, err := m.EvaluateReconnect("r1", 4, 12)
	if err != nil {
		t.Fatalf("first evaluate reconnect: %v", err)
	}
	if decision.Disposition != RecoveryCatchup {
		t.Fatalf("expected catchup on first evaluation, got %+v", decision)
	}
	mustApply(t, m, Event{Kind: EventReplicaReservationLost, ReplicaID: "r1"})
	if got := m.Replica("r1").FSM.State; got != fsmv2.StateNeedsRebuild {
		t.Fatalf("expected needs rebuild after reservation loss, got %s", got)
	}
	m.Planner = scriptedPlanner{decision: RecoveryDecision{
		Disposition: RecoveryNeedsRebuild,
		Reason:      "recoverability_expired",
	}}
	decision, err = m.EvaluateReconnect("r1", 4, 12)
	if err != nil {
		t.Fatalf("second evaluate reconnect: %v", err)
	}
	if decision.Reason != "recoverability_expired" {
		t.Fatalf("unexpected second decision: %+v", decision)
	}
}

func TestRunScenarioProducesStateTrace(t *testing.T) {
	m := New("p1", ModeSyncAll, 1, "r1")
	trace, err := RunScenario(m, []ScenarioStep{
		{Name: "bootstrap", Event: Event{Kind: EventBootstrapReplica, ReplicaID: "r1", ReplicaFlushedLSN: 1}},
		{Name: "write10", Event: Event{Kind: EventWriteCommitted, LSN: 10}},
		{Name: "barrier10", Event: Event{Kind: EventBarrierCompleted, ReplicaID: "r1", ReplicaFlushedLSN: 10}},
	})
	if err != nil {
		t.Fatalf("run scenario: %v", err)
	}
	if len(trace) != 4 {
		t.Fatalf("expected 4 snapshots, got %d", len(trace))
	}
	last := trace[len(trace)-1]
	if last.HeadLSN != 10 {
		t.Fatalf("expected head 10, got %d", last.HeadLSN)
	}
	if got := last.Replicas["r1"].FlushedLSN; got != 10 {
		t.Fatalf("expected replica flushed 10, got %d", got)
	}
	if !last.AckGate.Allowed {
		t.Fatalf("expected ack gate allowed at final step, got %+v", last.AckGate)
	}
}

func TestScriptedRecoveryPlannerChangesDecisionOverTime(t *testing.T) {
	m := New("p1", ModeBestEffort, 1, "r1")
	m.Planner = &ScriptedRecoveryPlanner{
		Decisions: []RecoveryDecision{
			{
				Disposition:    RecoveryCatchup,
				ReservationID:  "resv-a",
				ReservationTTL: 10,
				Reason:         "recoverable_now",
				Classes:        []RecoveryClass{RecoveryClassWALInline},
			},
			{
				Disposition: RecoveryNeedsRebuild,
				Reason:      "recoverability_expired",
			},
		},
	}
	mustApply(t, m, Event{Kind: EventBootstrapReplica, ReplicaID: "r1", ReplicaFlushedLSN: 4})
	mustApply(t, m, Event{Kind: EventReplicaDisconnect, ReplicaID: "r1"})
	first, err := m.EvaluateReconnect("r1", 4, 12)
	if err != nil {
		t.Fatalf("first reconnect: %v", err)
	}
	if first.Disposition != RecoveryCatchup {
		t.Fatalf("unexpected first decision: %+v", first)
	}
	mustApply(t, m, Event{Kind: EventReplicaReservationLost, ReplicaID: "r1"})
	second, err := m.EvaluateReconnect("r1", 4, 12)
	if err != nil {
		t.Fatalf("second reconnect: %v", err)
	}
	if second.Disposition != RecoveryNeedsRebuild || second.Reason != "recoverability_expired" {
		t.Fatalf("unexpected second decision: %+v", second)
	}
}

func TestFormatTraceIncludesReplicaStatesAndGates(t *testing.T) {
	m := New("p1", ModeSyncAll, 1, "r1")
	trace, err := RunScenario(m, []ScenarioStep{
		{Name: "bootstrap", Event: Event{Kind: EventBootstrapReplica, ReplicaID: "r1", ReplicaFlushedLSN: 1}},
		{Name: "write10", Event: Event{Kind: EventWriteCommitted, LSN: 10}},
		{Name: "barrier10", Event: Event{Kind: EventBarrierCompleted, ReplicaID: "r1", ReplicaFlushedLSN: 10}},
	})
	if err != nil {
		t.Fatalf("run scenario: %v", err)
	}
	got := FormatTrace(trace)
	wantParts := []string{
		"step=bootstrap",
		"write=true:all_replicas_sync_eligible",
		"step=barrier10",
		"ack=true:all_replicas_durable",
		"r1=InSync@10",
	}
	for _, part := range wantParts {
		if !strings.Contains(got, part) {
			t.Fatalf("trace missing %q:\n%s", part, got)
		}
	}
}

sw-block/prototype/volumefsm/recovery.go

@@ -0,0 +1,70 @@
package volumefsm

type RecoveryClass string

const (
	RecoveryClassWALInline        RecoveryClass = "wal_inline"
	RecoveryClassExtentReferenced RecoveryClass = "extent_referenced"
)

type RecoveryDisposition string

const (
	RecoveryCatchup      RecoveryDisposition = "catchup"
	RecoveryNeedsRebuild RecoveryDisposition = "needs_rebuild"
)

type RecoveryDecision struct {
	Disposition    RecoveryDisposition
	ReservationID  string
	ReservationTTL uint64
	Reason         string
	Classes        []RecoveryClass
}

type RecoveryPlanner interface {
	PlanReconnect(replicaID string, flushedLSN, targetLSN uint64) RecoveryDecision
}

// StaticRecoveryPlanner is the minimal default planner for the prototype.
// If targetLSN > flushedLSN (the caller provided a real target ahead of the
// replica's durable progress), reconnect is treated as catch-up; otherwise
// rebuild is required.
type StaticRecoveryPlanner struct{}

func (StaticRecoveryPlanner) PlanReconnect(replicaID string, flushedLSN, targetLSN uint64) RecoveryDecision {
	if targetLSN > flushedLSN {
		return RecoveryDecision{
			Disposition:    RecoveryCatchup,
			ReservationID:  replicaID + "-resv",
			ReservationTTL: 100,
			Reason:         "static_recoverable_window",
			Classes:        []RecoveryClass{RecoveryClassWALInline},
		}
	}
	return RecoveryDecision{
		Disposition: RecoveryNeedsRebuild,
		Reason:      "static_no_recoverable_window",
	}
}

// ScriptedRecoveryPlanner returns pre-seeded reconnect decisions in order.
// Once the scripted list is exhausted, the last decision is reused.
type ScriptedRecoveryPlanner struct {
	Decisions []RecoveryDecision
	index     int
}

func (s *ScriptedRecoveryPlanner) PlanReconnect(replicaID string, flushedLSN, targetLSN uint64) RecoveryDecision {
	if len(s.Decisions) == 0 {
		return RecoveryDecision{
			Disposition: RecoveryNeedsRebuild,
			Reason:      "scripted_no_decision",
		}
	}
	if s.index >= len(s.Decisions) {
		return s.Decisions[len(s.Decisions)-1]
	}
	d := s.Decisions[s.index]
	s.index++
	return d
}

sw-block/prototype/volumefsm/scenario.go

@@ -0,0 +1,61 @@
package volumefsm

import (
	"fmt"

	fsmv2 "github.com/seaweedfs/seaweedfs/sw-block/prototype/fsmv2"
)

type ScenarioStep struct {
	Name  string
	Event Event
}

type ReplicaSnapshot struct {
	State      fsmv2.State
	FlushedLSN uint64
}

type Snapshot struct {
	Step         string
	Epoch        uint64
	PrimaryID    string
	PrimaryState PrimaryState
	HeadLSN      uint64
	WriteGate    AdmissionDecision
	AckGate      AdmissionDecision
	Replicas     map[string]ReplicaSnapshot
}

func (m *Model) Snapshot(step string) Snapshot {
	replicas := make(map[string]ReplicaSnapshot, len(m.Replicas))
	for id, r := range m.Replicas {
		replicas[id] = ReplicaSnapshot{
			State:      r.FSM.State,
			FlushedLSN: r.FSM.ReplicaFlushedLSN,
		}
	}
	return Snapshot{
		Step:         step,
		Epoch:        m.Epoch,
		PrimaryID:    m.PrimaryID,
		PrimaryState: m.PrimaryState,
		HeadLSN:      m.HeadLSN,
		WriteGate:    m.WriteAdmission(),
		AckGate:      m.AckAdmission(m.HeadLSN),
		Replicas:     replicas,
	}
}

func RunScenario(m *Model, steps []ScenarioStep) ([]Snapshot, error) {
	trace := make([]Snapshot, 0, len(steps)+1)
	trace = append(trace, m.Snapshot("initial"))
	for _, step := range steps {
		if err := m.Apply(step.Event); err != nil {
			return trace, fmt.Errorf("scenario step %q: %w", step.Name, err)
		}
		trace = append(trace, m.Snapshot(step.Name))
	}
	return trace, nil
}

BIN
sw-block/prototype/volumefsm/volumefsm.test.exe

sw-block/test/README.md

@@ -0,0 +1,17 @@
# V2 Test Reference

This directory holds V2-facing test reference material copied from the project test database.

Files:

- `test_db.md`
  - copied from `learn/projects/sw-block/test/test_db.md`
  - full block-service test inventory
- `v2_selected.md`
  - V2-focused working subset
  - includes the currently selected simulator-relevant cases and the 4 Phase 13 V2-boundary tests

Use:

- `learn/projects/sw-block/test/test_db.md` as the project-wide source inventory
- `sw-block/test/v2_selected.md` as the active V2 reference/worklist

sw-block/test/test_db.md
File diff suppressed because it is too large

sw-block/test/test_db_v2.md

@@ -0,0 +1,105 @@
# V2 Test Database
Date: 2026-03-27
Status: working subset
## Purpose
This is the V2-focused review subset derived from:
- `sw-block/test/test_db.md`
- `learn/projects/sw-block/phases/phase13_test.md`
- `learn/projects/sw-block/phases/phase-13-v2-boundary-tests.md`
Use this file to review and track the tests that most directly inform:
- V2 protocol design
- simulator coverage
- V1 / V1.5 / V2 comparison
- V2 acceptance boundaries
This is intentionally much smaller than the full `test_db.md`.
## Review Codes
### Status
- `picked`
- `reviewed`
- `mapped`
### Sim
- `sim_core`
- `sim_reduced`
- `real_only`
- `v2_boundary`
- `sim_not_needed_yet`
## V2 Boundary Tests
| # | Test Name | File | Line | Level | Status | Sim | Notes |
|---|---|---|---|---|---|---|---|
| 1 | `TestAdversarial_ReconnectUsesHandshakeNotBootstrap` | `sync_all_adversarial_test.go` | | `unit` | `picked` | `v2_boundary` | V1/V1.5 sender identity loss; should become V2 acceptance case |
| 2 | `TestAdversarial_NeedsRebuildBlocksAllPaths` | `sync_all_adversarial_test.go` | | `unit` | `picked` | `v2_boundary` | `NeedsRebuild` must remain sticky under stable per-replica sender identity |
| 3 | `TestAdversarial_CatchupDoesNotOverwriteNewerData` | `sync_all_adversarial_test.go` | | `unit` | `picked` | `v2_boundary` | Catch-up correctness depends on identity continuity and proper recovery ownership |
| 4 | `TestAdversarial_CatchupMultipleDisconnects` | `sync_all_adversarial_test.go` | | `unit` | `picked` | `v2_boundary` | Multiple reconnect cycles are a V2 sender-loop / recovery-session acceptance target |
## Core Protocol Tests
| # | Test Name | File | Line | Level | Status | Sim | Notes |
|---|---|---|---|---|---|---|---|
| 1 | `TestRecovery` | `recovery_test.go` | | `unit` | `picked` | `sim_core` | Crash recovery correctness is fundamental to block protocol reasoning |
| 2 | `TestReplicaProgress_BarrierUsesFlushedLSN` | `sync_all_protocol_test.go` | | `unit` | `picked` | `sim_core` | Durable-progress truth; barrier must count flushed progress, not send progress |
| 3 | `TestReplicaProgress_FlushedLSNMonotonicWithinEpoch` | `sync_all_protocol_test.go` | | `unit` | `picked` | `sim_core` | Progress monotonicity invariant |
| 4 | `TestBarrier_RejectsReplicaNotInSync` | `sync_all_protocol_test.go` | | `unit` | `picked` | `sim_core` | Only eligible replica states count for strict durability |
| 5 | `TestBarrier_EpochMismatchRejected` | `sync_all_protocol_test.go` | | `unit` | `picked` | `sim_core` | Epoch fencing on barrier path |
| 6 | `TestBug1_SyncAll_WriteDuringDegraded_SyncCacheMustFail` | `sync_all_bug_test.go` | | `unit` | `picked` | `sim_core` | `sync_all` strictness during degraded state |
| 7 | `TestSyncAll_FullRoundTrip_WriteAndFlush` | `sync_all_bug_test.go` | | `unit` | `picked` | `sim_core` | End-to-end strict replication contract |
| 8 | `TestBestEffort_FlushSucceeds_ReplicaDown` | `sync_all_protocol_test.go` | | `unit` | `picked` | `sim_core` | Contrasts best_effort vs strict modes |
| 9 | `TestShip_DegradedDoesNotSilentlyCountAsHealthy` | `sync_all_protocol_test.go` | | `unit` | `picked` | `sim_core` | No false durability from degraded shipper |
| 10 | `TestDistSync_SyncAll_AllDegraded_Fails` | `dist_group_commit_test.go` | | `unit` | `picked` | `sim_core` | Availability semantics under strict mode |
| 11 | `TestBug2_SyncAll_SyncCache_AfterDegradedShipperRecovers` | `sync_all_bug_test.go` | | `unit` | `picked` | `sim_core` | Recoverability after degraded shipper |
| 12 | `TestReconnect_CatchupFromRetainedWal` | `sync_all_protocol_test.go` | | `unit` | `picked` | `sim_core` | Short-gap catch-up |
| 13 | `TestReconnect_GapBeyondRetainedWal_NeedsRebuild` | `sync_all_protocol_test.go` | | `unit` | `picked` | `sim_core` | Catch-up vs rebuild boundary |
| 14 | `TestReconnect_EpochChangeDuringCatchup_Aborts` | `sync_all_protocol_test.go` | | `unit` | `picked` | `sim_core` | Recovery fencing during catch-up |
| 15 | `TestCatchupReplay_DataIntegrity_AllBlocksMatch` | `sync_all_protocol_test.go` | | `unit` | `picked` | `sim_core` | Recovery data correctness |
| 16 | `TestCatchupReplay_DuplicateEntry_Idempotent` | `sync_all_protocol_test.go` | | `unit` | `picked` | `sim_core` | Replay idempotence |
| 17 | `TestBarrier_DuringCatchup_Rejected` | `sync_all_protocol_test.go` | | `unit` | `picked` | `sim_core` | State-machine correctness during recovery |
| 18 | `TestReplicaState_RebuildComplete_ReentersInSync` | `rebuild_v1_test.go` | | `unit` | `picked` | `sim_core` | Rebuild lifecycle closure |
| 19 | `TestRebuild_AbortOnEpochChange` | `rebuild_v1_test.go` | | `unit` | `picked` | `sim_core` | Rebuild fencing |
| 20 | `TestRebuild_MissingTailRestartsOrFailsCleanly` | `rebuild_v1_test.go` | | `unit` | `picked` | `sim_core` | Safe rebuild failure behavior |
| 21 | `TestWalRetention_RequiredReplicaBlocksReclaim` | `sync_all_protocol_test.go` | | `unit` | `picked` | `sim_core` | Retention rule under lag |
| 22 | `TestWalRetention_TimeoutTriggersNeedsRebuild` | `sync_all_protocol_test.go` | | `unit` | `picked` | `sim_core` | Retention timeout boundary |
| 23 | `TestWalRetention_MaxBytesTriggersNeedsRebuild` | `sync_all_protocol_test.go` | | `unit` | `picked` | `sim_core` | Retention budget boundary |
| 24 | `TestComponent_FailoverPromote` | `component_test.go` | | `component` | `picked` | `sim_core` | Core failover baseline |
| 25 | `TestCP13_SyncAll_FailoverPromotesReplica` | `cp13_protocol_test.go` | | `component` | `picked` | `sim_core` | Strict-mode failover |
| 26 | `TestCP13_SyncAll_ReplicaRestart_Rejoin` | `cp13_protocol_test.go` | | `component` | `picked` | `sim_core` | Restart/rejoin lifecycle |
| 27 | `TestQA_LSNLag_StaleReplicaSkipped` | `qa_block_edge_cases_test.go` | | `unit` | `picked` | `sim_core` | Promotion safety and stale candidate rejection |
| 28 | `TestQA_CascadeFailover_RF3_EpochChain` | `qa_block_edge_cases_test.go` | | `unit` | `picked` | `sim_core` | Multi-promotion lineage |
| 29 | `TestDurabilityMode_Validate_SyncQuorum_RF2_Rejected` | `durability_mode_test.go` | | `unit` | `picked` | `sim_core` | Mode normalization |
| 30 | `TestCP13_BestEffort_SurvivesReplicaDeath` | `cp13_protocol_test.go` | | `component` | `picked` | `sim_core` | Best-effort contract |
| 31 | `CP13-8 T4a: sync_all blocks during outage` | `manual` | | `integration` | `picked` | `sim_core` | Strict outage semantics |
## Reduced / Supporting Tests
| # | Test Name | File | Line | Level | Status | Sim | Notes |
|---|---|---|---|---|---|---|---|
| 1 | `testRecoverExtendedScanPastStaleHead` | `recovery_test.go` | | `unit` | `picked` | `sim_reduced` | Advisory WAL-head recovery shape |
| 2 | `testRecoverNoSuperblockPersist` | `recovery_test.go` | | `unit` | `picked` | `sim_reduced` | Recovery despite optimized persist behavior |
| 3 | `TestQAGroupCommitter` | `blockvol_qa_test.go` | | `unit` | `picked` | `sim_reduced` | Commit batching semantics |
| 4 | `TestQA_Admission_WriteLBAIntegration` | `qa_wal_admission_test.go` | | `unit` | `picked` | `sim_reduced` | Backpressure behavior |
| 5 | `TestSyncAll_MultipleFlush_NoWritesBetween` | `sync_all_bug_test.go` | | `unit` | `picked` | `sim_reduced` | Idempotent flush shape |
| 6 | `TestRebuild_PostRebuild_FlushedLSN_IsCheckpoint` | `rebuild_v1_test.go` | | `unit` | `picked` | `sim_reduced` | Progress initialization after rebuild |
| 7 | `TestComponent_ManualPromote` | `component_test.go` | | `component` | `picked` | `sim_reduced` | Manual control-path shape |
| 8 | `TestHeartbeat_ReportsPerReplicaState` | `rebuild_v1_test.go` | | `unit` | `picked` | `sim_reduced` | Heartbeat observability |
| 9 | `TestHeartbeat_ReportsNeedsRebuild` | `rebuild_v1_test.go` | | `unit` | `picked` | `sim_reduced` | Control-plane visibility |
| 10 | `TestComponent_ExpandThenFailover` | `component_test.go` | | `component` | `picked` | `sim_reduced` | State continuity across operations |
| 11 | `TestCP13_DurabilityModeDefault` | `cp13_protocol_test.go` | | `component` | `picked` | `sim_reduced` | Default mode behavior |
| 12 | `CP13-8 T4b: recovery after restart` | `manual` | | `integration` | `picked` | `sim_reduced` | Recovery-time shape and control-plane/local-reconnect interaction |
## Notes
- This file is the actionable V2 subset, not the master inventory.
- If `tester` later finalizes a broader 70-case picked set, expand this file from that selection.
- The 4 V2-boundary tests must remain present even if they fail on V1/V1.5.

sw-block/test/v2_selected.md

@@ -0,0 +1,115 @@
# V2-Selected Test Worklist
Date: 2026-03-27
Status: working
## Purpose
This is the V2-facing subset of the larger block-service test database.
Sources:
- `sw-block/test/test_db.md`
- `learn/projects/sw-block/phases/phase13_test.md`
- `learn/projects/sw-block/phases/phase-13-v2-boundary-tests.md`
This file is for:
- tests that should help V2 design and simulator work
- explicit inclusion of the 4 Phase 13 V2-boundary failures
- a working set that `tester`, `sw`, and design can refine further
## Current Inclusion Rule
Include tests that are:
- `sim_core`
- `sim_reduced`
- `v2_boundary`
Prefer tests that directly inform:
- barriers and durability truth
- catch-up vs rebuild
- failover / promotion
- WAL retention / tail-chasing
- mode semantics
- endpoint / identity / reassignment behavior
## Phase 13 V2-Boundary Tests
These must stay visible in the V2 worklist:
| Test | File | Why It Matters To V2 |
|---|---|---|
| `TestAdversarial_ReconnectUsesHandshakeNotBootstrap` | `sync_all_adversarial_test.go` | Sender identity and reconnect ownership |
| `TestAdversarial_NeedsRebuildBlocksAllPaths` | `sync_all_adversarial_test.go` | `NeedsRebuild` must remain sticky and identity-safe |
| `TestAdversarial_CatchupDoesNotOverwriteNewerData` | `sync_all_adversarial_test.go` | Catch-up must preserve data correctness under identity continuity |
| `TestAdversarial_CatchupMultipleDisconnects` | `sync_all_adversarial_test.go` | Multiple reconnect cycles require stable per-replica sender ownership |
## High-Value V2 Working Set
This is the current distilled working set from `phase13_test.md`.
| Test | File | Current Result | Mapping | Why It Helps V2 |
|---|---|---|---|---|
| `TestRecovery` | `recovery_test.go` | PASS | `sim_core` | Crash recovery correctness |
| `TestReplicaProgress_BarrierUsesFlushedLSN` | `sync_all_protocol_test.go` | PASS | `sim_core` | Barrier truth / durable progress |
| `TestReplicaProgress_FlushedLSNMonotonicWithinEpoch` | `sync_all_protocol_test.go` | PASS | `sim_core` | Monotonic progress invariant |
| `TestBarrier_RejectsReplicaNotInSync` | `sync_all_protocol_test.go` | PASS | `sim_core` | State-gated strict durability |
| `TestBarrier_EpochMismatchRejected` | `sync_all_protocol_test.go` | PASS | `sim_core` | Epoch fencing |
| `TestBug1_SyncAll_WriteDuringDegraded_SyncCacheMustFail` | `sync_all_bug_test.go` | PASS | `sim_core` | `sync_all` strictness during outage |
| `TestSyncAll_FullRoundTrip_WriteAndFlush` | `sync_all_bug_test.go` | PASS | `sim_core` | End-to-end strict replication |
| `TestBestEffort_FlushSucceeds_ReplicaDown` | `sync_all_protocol_test.go` | PASS | `sim_core` | Mode difference vs strict sync |
| `TestShip_DegradedDoesNotSilentlyCountAsHealthy` | `sync_all_protocol_test.go` | PASS | `sim_core` | No false durability |
| `TestDistSync_SyncAll_AllDegraded_Fails` | `dist_group_commit_test.go` | PASS | `sim_core` | Availability semantics |
| `TestBug2_SyncAll_SyncCache_AfterDegradedShipperRecovers` | `sync_all_bug_test.go` | PASS | `sim_core` | Recoverability after degraded shipper |
| `TestReconnect_CatchupFromRetainedWal` | `sync_all_protocol_test.go` | PASS | `sim_core` | Short-gap catch-up |
| `TestReconnect_GapBeyondRetainedWal_NeedsRebuild` | `sync_all_protocol_test.go` | PASS | `sim_core` | Catch-up vs rebuild boundary |
| `TestReconnect_EpochChangeDuringCatchup_Aborts` | `sync_all_protocol_test.go` | PASS | `sim_core` | Recovery fencing |
| `TestCatchupReplay_DataIntegrity_AllBlocksMatch` | `sync_all_protocol_test.go` | PASS | `sim_core` | Recovery data correctness |
| `TestCatchupReplay_DuplicateEntry_Idempotent` | `sync_all_protocol_test.go` | PASS | `sim_core` | Replay idempotence |
| `TestBarrier_DuringCatchup_Rejected` | `sync_all_protocol_test.go` | PASS | `sim_core` | State-machine correctness |
| `TestReplicaState_RebuildComplete_ReentersInSync` | `rebuild_v1_test.go` | PASS | `sim_core` | Rebuild lifecycle |
| `TestRebuild_AbortOnEpochChange` | `rebuild_v1_test.go` | PASS | `sim_core` | Rebuild fencing |
| `TestRebuild_MissingTailRestartsOrFailsCleanly` | `rebuild_v1_test.go` | PASS | `sim_core` | No partial/unsafe rebuild success |
| `TestWalRetention_RequiredReplicaBlocksReclaim` | `sync_all_protocol_test.go` | PASS | `sim_core` | Retention rule |
| `TestWalRetention_TimeoutTriggersNeedsRebuild` | `sync_all_protocol_test.go` | PASS | `sim_core` | Retention timeout boundary |
| `TestWalRetention_MaxBytesTriggersNeedsRebuild` | `sync_all_protocol_test.go` | PASS | `sim_core` | Retention budget boundary |
| `TestComponent_FailoverPromote` | `component_test.go` | PASS | `sim_core` | Failover baseline |
| `TestCP13_SyncAll_FailoverPromotesReplica` | `cp13_protocol_test.go` | PASS | `sim_core` | Strict-mode failover |
| `TestCP13_SyncAll_ReplicaRestart_Rejoin` | `cp13_protocol_test.go` | PASS | `sim_core` | Restart/rejoin lifecycle |
| `TestQA_LSNLag_StaleReplicaSkipped` | `qa_block_edge_cases_test.go` | PASS | `sim_core` | Promotion safety |
| `TestQA_CascadeFailover_RF3_EpochChain` | `qa_block_edge_cases_test.go` | PASS | `sim_core` | Multi-promotion lineage |
| `TestDurabilityMode_Validate_SyncQuorum_RF2_Rejected` | `durability_mode_test.go` | PASS | `sim_core` | Mode normalization |
| `TestCP13_BestEffort_SurvivesReplicaDeath` | `cp13_protocol_test.go` | PASS | `sim_core` | Best-effort contract |
| `CP13-8 T4a: sync_all blocks during outage` | `manual` | PASS | `sim_core` | Strict outage semantics |
| `CP13-8 T4b: recovery after restart` | `manual` | PASS | `sim_reduced` | Recovery-time shape |
## Reduced / Supporting Cases To Keep In View
| Test | File | Current Result | Mapping | Why It Helps V2 |
|---|---|---|---|---|
| `testRecoverExtendedScanPastStaleHead` | `recovery_test.go` | PASS | `sim_reduced` | Advisory WAL-head recovery shape |
| `testRecoverNoSuperblockPersist` | `recovery_test.go` | PASS | `sim_reduced` | Recoverability despite optimized persist behavior |
| `TestQAGroupCommitter` | `blockvol_qa_test.go` | PASS | `sim_reduced` | Commit batching semantics |
| `TestQA_Admission_WriteLBAIntegration` | `qa_wal_admission_test.go` | PASS | `sim_reduced` | Backpressure behavior |
| `TestSyncAll_MultipleFlush_NoWritesBetween` | `sync_all_bug_test.go` | PASS | `sim_reduced` | Idempotent flush shape |
| `TestRebuild_PostRebuild_FlushedLSN_IsCheckpoint` | `rebuild_v1_test.go` | PASS | `sim_reduced` | Progress initialization |
| `TestComponent_ManualPromote` | `component_test.go` | PASS | `sim_reduced` | Manual control-path shape |
| `TestHeartbeat_ReportsPerReplicaState` | `rebuild_v1_test.go` | PASS | `sim_reduced` | Heartbeat observability |
| `TestHeartbeat_ReportsNeedsRebuild` | `rebuild_v1_test.go` | PASS | `sim_reduced` | Control-plane visibility |
| `TestComponent_ExpandThenFailover` | `component_test.go` | PASS | `sim_reduced` | Cross-operation state continuity |
| `TestCP13_DurabilityModeDefault` | `cp13_protocol_test.go` | PASS | `sim_reduced` | Default mode behavior |
## Working Note
`phase13_test.md` currently contains the mapped subset from the real test inventory.
This V2 copy is intentionally narrower:
- preserve the core tests that define the protocol story
- preserve the 4 V2-boundary tests explicitly
- keep a smaller reduced set for supporting invariants
If `tester` finalizes a broader 70-case working set, extend this file rather than editing the full copied database directly.